2016 SDSC Summer Institute


SDSC Summer Institute 2016: HPC and the Long Tail of Science

Monday - Friday, August 1 – 5, 2016

San Diego Supercomputer Center (SDSC) on the University of California, San Diego (UCSD) campus
Monday registration: 8:00 AM
Monday - Thursday: 8:30 AM - 5:00 PM
Friday: 8:30 AM - Noon
Light refreshments and lunch provided throughout

Social evening events
Monday: West Coast Sunset Reception
Thursday: Dinner at the Beach

The SDSC Summer Institute will use a flexible format designed to help attendees get the most out of their week. The first half of the SI will consist of plenary sessions covering the skills considered essential for anyone who works with big data. Topics include data management, running jobs on SDSC resources, reproducibility, database systems, characteristics of big data, techniques for turning data into knowledge, software version control, and making effective use of hardware. This will be followed by a series of parallel sessions that allow attendees to dive deeper into specialized material relevant to their research projects, covering topics including Spark, Parallel Computing, Performance Optimization, Predictive Analytics, Scalable Data Management, Visualization, Workflow Management, GPUs/CUDA, and Python for Scientific Computing.

Summer Institutes are designed to be hands-on, so participants are expected to bring a laptop computer to follow along with demos and hands-on instruction throughout the program.

Plenary Sessions (Monday, Tuesday)
How do I launch and manage jobs on the system?
How do I manage my data on the file system?
How do I know I'm making effective use of the machine?
How do I automate my job pipeline to ensure reproducibility?
How do I manage my software?
What are the benefits of using Science Gateways?
When can virtualization help HPC?
Half-day in-depth Parallel Sessions (Wednesday, Thursday)
Machine Learning
Parallel Computing using MPI & OpenMP
Performance Optimization
Python for HPC

Parallel Sessions In-Depth…

GPU computing and programming: This session provides an introduction to massively parallel computing with graphics processing units (GPUs). The use of GPUs is becoming increasingly popular across all scientific domains since GPUs can significantly accelerate time to solution for many problems. Participants will be introduced to the essential background of GPU chip architecture and will learn how to program GPUs using libraries, OpenACC compiler directives, and CUDA. The session will incorporate hands-on exercises for participants to acquire the skills to use and develop GPU-aware applications.

Machine Learning: Machine Learning is an interdisciplinary field focused on the study and construction of computer systems that can learn from data without being explicitly programmed. This track provides an introduction to the machine learning algorithms and techniques used to explore, analyze, and leverage data to construct data-driven solutions applicable to any domain. The morning session will cover the machine learning process, R/RStudio, data exploration, and data preparation. The afternoon session will cover classification, cluster analysis, and tools and procedures to scale up machine learning techniques on Comet. Hands-on exercises and demonstrations will be done in R, and in Python with Spark.
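Classification, one of the afternoon topics, can be illustrated in a few lines. The sketch below implements a 1-nearest-neighbor classifier in plain Python; the session itself uses R and Python with Spark, and the data points here are made up purely for illustration:

```python
# Minimal classification sketch: a 1-nearest-neighbor classifier in
# plain Python. Labels are predicted from the closest training example.
import math

def predict(train, point):
    # train is a list of (features, label) pairs; classify the new
    # point by the label of its nearest neighbor (Euclidean distance).
    nearest = min(train, key=lambda ex: math.dist(ex[0], point))
    return nearest[1]

# Hypothetical two-class training data (two clusters of 2-D points).
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((4.8, 5.3), "b")]

print(predict(train, (1.1, 0.9)))  # a
print(predict(train, (5.0, 5.2)))  # b
```

Real workflows replace this with library implementations (e.g., in R or Spark MLlib), but the idea — learn a decision rule from labeled examples rather than programming it explicitly — is the same.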

Parallel Computing using MPI & OpenMP: This session is targeted at attendees who are looking for a hands-on introduction to parallel computing using MPI and OpenMP programming. The session will start with an introduction and basic information for getting started with MPI. It will provide an overview of the common MPI routines that are useful for beginner MPI programmers, including MPI environment setup, point-to-point communications, and collective communication routines. Simple examples illustrating distributed-memory computing with common MPI routines will be covered. The OpenMP section will provide an overview of constructs and directives for specifying parallel regions, work sharing, synchronization, and data scope. Simple examples will be used to illustrate the OpenMP shared-memory programming model and important run-time environment variables. Hands-on exercises for both MPI and OpenMP will be done in C and Fortran.
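The session's exercises use real MPI in C and Fortran. As a rough illustration of the point-to-point send/receive idea only, here is a sketch using Python's standard-library multiprocessing module (not MPI itself) to exchange a message between two processes:

```python
# Stand-in sketch for MPI point-to-point communication using the
# Python standard library: one "worker rank" receives data, transforms
# it, and sends a reply back to the "root rank".
from multiprocessing import Process, Pipe

def double_all(conn):
    # Worker side: receive a message, transform it, send the result back.
    data = conn.recv()                    # analogous to MPI_Recv
    conn.send([x * 2 for x in data])      # analogous to MPI_Send
    conn.close()

def exchange(data):
    # Root side: send work to the worker process and collect the reply.
    parent_conn, child_conn = Pipe()
    p = Process(target=double_all, args=(child_conn,))
    p.start()
    parent_conn.send(data)
    result = parent_conn.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(exchange([1, 2, 3]))  # [2, 4, 6]
```

Real MPI generalizes this to many ranks on many nodes, with collective operations (broadcast, scatter, gather, reduce) built on the same send/receive foundation.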

Performance Optimization: This session is targeted at attendees who both do their own code development and need their calculations to finish as quickly as possible. We'll cover the effective use of cache, loop-level optimizations, strength reduction, optimizing compilers and their limitations, short circuiting, time-space tradeoffs, and more. Exercises will be done mostly in C, but the emphasis will be on general techniques that can be applied in any language.
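Two of the loop-level techniques mentioned above can be sketched in a few lines; Python is used here for brevity even though the session's exercises are in C. Both functions compute the same result, but the second hoists the loop-invariant division out of the loop and strength-reduces the exponentiation to a multiply:

```python
# Loop-invariant code motion and strength reduction, illustrated.
def scale_naive(values, a, b):
    # a / b and the exponent v ** 2 are re-evaluated on every iteration.
    return [(a / b) * v ** 2 for v in values]

def scale_tuned(values, a, b):
    c = a / b                             # loop-invariant code motion
    return [c * (v * v) for v in values]  # strength reduction: v*v, not v**2

print(scale_tuned([1, 2, 3], 1, 2))  # [0.5, 2.0, 4.5]
```

An optimizing C compiler will often perform these transformations automatically, but knowing them matters when the compiler cannot prove they are safe (for example, when the "invariant" involves a function call).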

Python for HPC: Python is rapidly becoming more widely adopted in the High Performance Computing world. In this session, we will introduce four key technologies in the Python ecosystem that provide significant benefits for scientific applications run in supercomputing environments. Previous Python experience is not required.

(1) IPython Notebook allows users to execute code on a single compute node or a cluster and access the web interface from a local browser for interactive data exploration and visualization. IPython Notebook supports live Python code, explanatory text, LaTeX equations, and plots in the same document.

(2) IPython Parallel provides a simple, flexible and scalable way of running thousands of Python serial jobs by spawning IPython kernels (namely engines) on any HPC batch scheduler. It also allows interactive control of the engines from an IPython Notebook session along with the ability to submit more Python tasks to the engines.
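The pattern of farming out many independent serial Python tasks can be sketched with the standard library's concurrent.futures, shown here as a stand-in since running IPython Parallel engines requires an HPC batch scheduler; the simulate function is a hypothetical placeholder for one serial job:

```python
# Many independent serial tasks spread over worker processes; IPython
# Parallel applies the same pattern with engines on a batch scheduler.
from concurrent.futures import ProcessPoolExecutor

def simulate(seed):
    # Placeholder for one serial job, e.g. one point of a parameter sweep.
    return seed * seed

def run_sweep(n):
    # Map the task over n inputs using a small pool of worker processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(simulate, range(n)))

if __name__ == "__main__":
    print(run_sweep(8))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

IPython Parallel adds what the stdlib pool lacks at HPC scale: engines spanning many nodes, interactive control from a notebook, and the ability to keep submitting tasks to live engines.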

(3) Numba makes it possible to run pure Python code on GPUs simply by decorating functions with the data types of the input and output arguments. Pure Python prototype code can be gradually optimized by pushing the most computationally intensive functions to the GPU without the need to implement code in CUDA or OpenCL.

(4) PyTrilinos is a Python wrapper for Trilinos, a C++ distributed linear algebra library developed by Sandia National Laboratories. It provides a high-level interface that transparently handles complex MPI point-to-point communication strategies for operations involving both dense and sparse matrices and vectors whose data are distributed across an arbitrary number of nodes.

Spark for Scientific Computing: Apache Spark is a cluster computing framework extensively used in industry to process large amounts of data (up to 1 PB) distributed across thousands of nodes. It was designed as a successor to Hadoop, with a focus on performance and usability, and provides interfaces in Python, Scala, and Java. This session will provide an overview of the capabilities of Spark and how they can be leveraged to solve problems in scientific computing. Next, it will feature a hands-on introduction to Spark, from batch and interactive usage on Comet to running a sample map/reduce example in Python. The final part will be devoted to two key libraries in the Spark ecosystem: Spark SQL, a general-purpose query engine that can interface with SQL databases or JSON files, and Spark MLlib, a scalable machine learning library.
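The map/reduce model that Spark builds on can be sketched in plain Python without a cluster; this stand-alone word-count sketch uses only the standard library, whereas the same pipeline in PySpark would use flatMap, map, and reduceByKey on an RDD:

```python
# Word count as explicit map and reduce steps, in plain Python.
from collections import Counter
from functools import reduce

lines = ["spark makes big data simple", "big data big insight"]

# "map" step: turn each line into (word, 1) pairs.
mapped = [(w, 1) for line in lines for w in line.split()]

# "reduce" step: sum the counts per key, which is what Spark's
# reduceByKey does in parallel across partitions.
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                mapped, Counter())

print(counts["big"])   # 3
print(counts["data"])  # 2
```

Spark's contribution is running exactly this kind of pipeline on data too large for one machine, with the shuffle between the map and reduce steps handled automatically across nodes.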

Visualization: Visualization is largely understood and used as an excellent communication tool by researchers. This narrow view often keeps scientists from fully using and developing their visualization skill set. This tutorial will provide a "from the ground up" understanding of visualization and its utility in error diagnostics and exploration of data for scientific insight. When used effectively, visualization can provide a complementary and effective toolset for data analysis, which is one of the most challenging problems in computational domains. In this tutorial we plan to bridge these gaps by providing end users with fundamental visualization concepts, execution tools, customization, and usage examples. Finally, a short introduction to SeedMe.org will be provided, where users will learn how to share their visualization results ubiquitously.