Key Note Speech

PURPOSE: To identify best practices for creating and maintaining reliable and sustainable software for use at HPC centers.

AUDIENCE: HPC center managers and key staff responsible for HPC software development, deployment, and maintenance, and managers representing DOE science programs.

The HPC science community is increasingly benefitting from and dependent upon shared software stacks. Examples include libraries enabling advancements in multiple scientific domains, workflow tool consolidation among sites facilitating portability, and the increasing ubiquity of Linux. While this trend presents valuable scientific and economic advantages, it is increasingly important that we examine best practices for building solid and reusable software foundations for HPC centers.

This workshop is intended to facilitate collaborative progress on questions such as:

  • How do we develop, maintain, and share software elements which together form reliable HPC systems to advance wide ranges of science?
  • What are best practices and top challenges at each layer and as a complete infrastructure?
  • How do we make software which is effective for multiple generations of users and systems, at multiple sites?
  • What are the gaps and unmet needs in current HPC software?
  • How do HPC centers monitor/measure software usage and forecast needs?
  • How do we identify the programmatic costs associated with HPC software?
  • How do HPC centers identify critical software and establish support priorities?
  • What support models are in use?
  • How would education/training affect improvement of software quality?

The third in a series of best-practices workshops, this event brings together HPC library, tool, and system experts to examine these issues for their specific infrastructure layer and for the interactions among layers. It also brings together managers who, facing budgetary constraints, must prioritize the most effective ways to support the most useful software for their users and center operations. Topics include processes, standards, and policies throughout software lifecycles, from origin to long-term pervasive use.

GOALS:

  • Foster a shared understanding of software reliability and sustainability in the context of HPC centers.
  • Understand the landscape of relevant workgroups (including SHAIP, IESP, ACTS, HEC FSIO, Resilience (NSF), and Resilience (DOE)).
  • Identify top challenges and open issues.
  • Share best practices and lessons learned.
  • Discuss which practices apply to which software layers (application, middleware, and core) and lifecycle stages.
  • Establish communication paths for managerial and technical staff at multiple sites to continue discussion on these topics.
  • Discuss roles and benefits of stakeholders in software lifecycles.
  • Present findings to DOE and other stakeholders to improve the reliability and sustainability of HPC software stacks.


 

DRAFT AGENDA

Monday September 28
7:30–8:15 Breakfast and registration
8:15–8:25 Plenary opening session: David Skinner
8:25–8:45 Welcome and HQ Overview (Sander Lee and Yukiko Sekine)
Welcome and introductions:
8:45–9:30 "Perspectives on HPC Software" -- Rusty Lusk
9:30–10:15 "Earth System Grid" -- Dean Williams
10:15–10:45 Break
10:45–11:00 Report from Second HPC Best Practices Workshop: Risk Management -- Terri Quinn
Track 1 breakout charter: (Separate to breakout rooms)
11:00–12:30 Day 1 breakouts
12:30–1:30 Lunch
1:30–3:30 Track 1 breakouts (cont.)
3:30–4:00 Break
4:00–6:00 Day 1 breakout reports and discussion
6:00–6:30 Break
6:30 Working dinner
Dinner panel: Software Inventory
Day 2
7:30–8:00 Breakfast
8:00–9:15 Plenary panel: Landscape
SHAIP - Osni Marques
IESP - Bill Kramer
ACTS - Tony Drummond
HEC-FSIO - Dan Hitchcock
Resilience DOE - Nathan DeBardelben
Resilience NSF - Deb Agarwal
(each panel member has hard 10-minute/3-slide limit,
then 5 minutes for group discussion, leaving 15 minutes of margin)
9:15–9:30 Charter for Track 2 breakouts and separate to rooms:
9:30–12:30 Day 2 breakouts
(Breakouts can break around 10:30, as schedule permits)
12:30–1:30 Lunch
1:30–3:30 Track 2 breakouts reports:
3:30–3:45 Break
3:45–4:45 Plenary wrap-up session:
Workshop summary, report (discussion)
Next steps: Survey forms (like in 2008)


Breakout Sessions


Context: high-quality, wide-use, and long-use HPC center software

Cross-cutting questions (to be reported on by all groups):

  • What are the best practices and tools, both inside and outside HPC?
  • What are the top challenges?
  • What new technologies are needed?

DAY 1: Software Layers – Chair: Becky Springmeyer (LLNL)

Tools David Skinner (NERSC) and Chris Atwood (DOD)

    Modern HPC architecture trends have aggressively pushed parallelism to new extremes. How does this trend impact the usability, or even applicability, of HPC tools? What are tools used for and how do we derive the most valuable methods or use-cases for which tools enable computing at new scales? What is the taxonomy of tools in use and where are there gaps or redundancies?

  • Debugging (TV, DDT, gdb)
  • Application Profiling (e.g. Tau, PAPI, IPM)
  • Data movement (GridFTP, hopper, hsi)
  • Compilers (PGI, Intel, UPC, XL, Cray, whither Pathscale?)

Libraries Ken Alvin (SNL) and Tony Drummond (NERSC)

  • Numerical (e.g. Trilinos, PETSc, ScaLAPACK, SuperLU, FFTW, Metis, Zoltan)
  • Data movement and communication (e.g. Globus, Global Arrays, MPICH)
  • Data management (e.g. netCDF, HDF, MPI-IO)

System Management Alain Roy (OSG) and William Allcock (ANL)

  • Jobs (e.g. SLURM, PBS, Torque)
  • Node health and testing (INCA, RSV, NAGIOS, CACTI, Cerebro)
  • Change control (CFengine, RPM)

System Software Shane Canon (LBNL) and Sue Kelly (SNL)

  • OS, I/O subsystem, etc. (e.g. BLCR, FastOS, Catamount, Lustre)

OS and I/O software for HPC systems runs the gamut from open source (e.g. Linux and Lustre) to proprietary (e.g. Catamount and GPFS). The complexity also varies widely, from lightweight to full-featured. This session will identify the life cycle development and maintenance practices that are key to successful deployments of these software components. We will begin with such questions as: What system software do you rely on for your operation, and how critical is it to your mission? Are you purely a consumer, or does your organization play a role in supporting and maintaining the software? With that foundation, we will consider the processes used to select, develop/port/purchase, integrate, and maintain the software. Finally, we will look at planning for critical software components that may be close to end of life or will not meet anticipated future needs.

DAY 2: Software Stages – Chair: Susan Coghlan (ANL)

Planning (strategic management, procurement, funding) – Mark Gary (LLNL) and Craig Tull (LBNL)

  • Measuring use and forecasting needs
  • Assessing criticality and establishing support priorities
  • Identifying and managing cost
  • Strategies for interagency and international cooperation
  • Strategic choices in software licenses

HPC software products often have lives that span multiple decades while serving many generations of machines and operating environments. Careful project planning is the foundation upon which these projects are built. From requirements gathering and cost estimation to collaboration and team building, deliberate and realistic planning is the key to product usefulness and longevity. But how do HPC software projects differ from typical software development projects? Do HPC requirements or the HPC community introduce impediments to successful planning or successful collaboration? Are we in the HPC community successfully leveraging non-HPC methodologies? This session will address these questions, investigate the facets of good software planning, and explore alternative planning approaches.

Development (from prototype to widely used) Deb Agarwal (LBNL) and Paul Iwanchuk (LANL)

  • Testing, Tracking (results, bugs, dependencies)
  • Continuous Integration, Agile
  • Validation and Verification

Software engineering best practices typically include a thorough regimen of testing, bug tracking, documentation, and release. Software design practices such as agile development and continuous integration are widely employed in developing code. An HPC environment brings several unique aspects, including the fact that software for HPC systems is often developed concurrently with the maturation of the target system. HPC software includes the applications, the libraries, and the operating system, as well as software for testing the software and hardware environment. Validation and verification play a central role, along with regression testing and the tracking and documenting of results. Similarly, there is a concerted effort to assure key applications are "ready" for the new architecture.

This session will focus on the life cycle development and maintenance practices that are key to successful deployment and operation of these software components. We will follow software practices in the maturation of a typical HPC system from procurement through production to end of life. This session will address best practices at these stages, rather than addressing software components. Questions include: How do you assure production readiness? What is your reliance on in-house development vs. vendor support? Is your custom environment helping or hindering end use? What is your ability to use other DOE institutional resources to complete your mission? What are the barriers? What about DSTs and updates? What role is fault tolerance and resilience playing in your future?

Integration (more than the sum of its parts) Pam Hamilton (LLNL) and Vicky White (ORNL)

  • Modularity, interoperability
  • Packaging, distribution
  • System-stabilization: co-development with vendors between system assembly and production use

Fielding an HPC system is more than standing up a pretty supercomputer with impressive racks of blinking lights. It requires a huge deployment of system and security software, networking infrastructure, a development environment, and user applications, inevitably from different sources. We, the DOE lab customer, are stuck with putting these pieces together and making them work. How do you work with the vendor to refine your requirements and then verify they are met through system acceptance testing, especially when some of the work may be done at the vendor and then on site? Come hear about the solutions other labs have found useful and bring your own to share.

Sustainment (long term issues) Charles Bacon (ANL) and David Montoya (LANL)

  • Support models, upgrade paths
  • From prototype to facility software
  • What is open-source support?

This session will identify approaches and efforts to better provide a sustainable environment for the software we use. HPC software comes from a mix of commercial vendors, open-source communities, and in-house development. Regardless of the source, if the support structure for the software disappears, the users are faced with serious consequences.

As software and tool developers, what models do we use to interact with the open-source communities to help maintain a long-term support structure? How does a group's support model need to change over time as software moves from a development/research state to one that runs in a production environment? What are the concerns and issues from organizations that are responsible for providing a stable production environment? What approaches have organizations used for sustainability, and how can the user community participate in a way that leads to greater sustainability?


 
 

Hotel Information:

A block of guest rooms has been reserved for this workshop at the Hotel Nikko San Francisco located at 222 Mason Street, San Francisco, California. All workshop sessions will be held in this hotel. Hotel Nikko is ideally located in the heart of San Francisco, just steps away from bustling Union Square.

Sleeping Room Block:

The negotiated single/double occupancy rate at this hotel is $109+ per night. A $30 per night charge will apply for any additional adult (18 years and older). Occupancy tax in San Francisco is currently 15.565% (subject to change without notice).

As a courtesy, this negotiated rate will be valid three days prior and three days after the workshop dates, subject to availability.

Please note: The sleeping room block will be released on September 14 at 5:00 PM Pacific Time. After this date, prevailing room rates will apply.

Making Your Reservation:

  • Online reservations - click here
  • By phone: Call 800.248.3308 (toll-free, within the US only) or 415.394.1111 and reference the group name "Lawrence Berkeley Lab - 3rd Workshop on HPC Best Practices"

Early Departure Fee:

In the event that you decide to check out of the hotel prior to your reserved checkout date, the hotel reserves the right to charge an early departure fee of $50 to your individual account. To avoid this fee, you must advise the hotel at or prior to check-in of any change in the scheduled length of your stay.

Cancellation Policy:

Cancellation of individual reservations less than 72 hours prior to the day of arrival will incur a charge equal to one night’s room and tax to the individual’s credit card on file. If canceling, be sure to obtain a cancellation number.


 

Sponsored by U.S. Department of Energy

Vince Dattoria and Yukiko Sekine
Facilities Division, The Office of Advanced Scientific Computing Research (ASCR), Office of Science, U.S. Department of Energy

Robert Meisner and Sander Lee
Advanced Simulation and Computing, National Nuclear Security Administration, U.S. Department of Energy



Contact: help@outreach.scidac.gov