Programming for the future: scaling forward with cores and vectors

H. Pabst

slides (pdf)

This tutorial assumes a scientific computing audience; hence the focus is on OpenMP, an industry standard that now goes beyond shared-memory programming (there will be enough time during the event to discuss other programming models and standards). OpenMP 4.0, the latest revision, introduced not only SIMD directives but also support for attached devices, including GPUs, FPGAs, and coprocessors. We will focus on Intel Xeon Phi coprocessors, which are based on the x86 architecture; this makes porting straightforward, and code tuning benefits both multicore and manycore systems.
Implicit and explicit SIMD vectorization will be the main focus throughout the tutorial. Code samples will illustrate the content, although the focus is on concepts rather than syntax details (code samples will be available for download). Attendees will learn to use compiler optimization reports to get around vectorization blockers (aliasing, etc.), learn about beneficial tuning steps such as aligned memory accesses, loop blocking, and streaming stores, and see some of the more complex changes that touch the memory layout, i.e., arrays of structures, structures of arrays, and hybrid layouts. Finally, important compiler switches, choices within the Intel Math Functions (IMF), and choices when using the Intel Math Kernel Library (Intel MKL) are presented. Those choices often achieve "performance for free" by making informed tradeoffs, e.g., by relaxing the floating-point (FP) accuracy or by using mixed precision. To encourage an active audience, the main part finishes with some interactive demonstrations that further illustrate the content presented.
In the closing part, the hybrid programming perspective becomes more concrete through a summary of all "performance dimensions" available today. This attempts to give an outlook on what will be beneficial with the "next big leap". In particular, MPI and OpenMP, including SIMD as well as offloading to attached devices, will be put into perspective. The tutorial closes with an outlook on what comes next and beyond coprocessors by hinting at how the tuning efforts needed for performance will map to codes that are fit for the future.
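As a rough illustration of the memory-layout change named in the abstract (array of structures versus structure of arrays), the following minimal Python sketch regroups interleaved records into one contiguous array per field. It is not taken from the tutorial material: the particle fields and the helper names `aos_to_soa` and `scale_x` are hypothetical, and Python itself does not vectorize the loop. Only the data layout is shown.

```python
from array import array

# Array of structures (AoS): one record per particle; x, y, z are interleaved.
# The particle fields are a hypothetical example, not data from the tutorial.
particles_aos = [
    {"x": 0.0, "y": 1.0, "z": 2.0},
    {"x": 3.0, "y": 4.0, "z": 5.0},
    {"x": 6.0, "y": 7.0, "z": 8.0},
]

def aos_to_soa(records):
    """Regroup the records into one contiguous double-precision array per field."""
    return {
        field: array("d", (rec[field] for rec in records))
        for field in ("x", "y", "z")
    }

def scale_x(soa, factor):
    """Unit-stride sweep over a single field, the access pattern SIMD loops favor."""
    xs = soa["x"]
    for i in range(len(xs)):
        xs[i] *= factor

# Structure of arrays (SoA): all x values are adjacent in memory.
particles_soa = aos_to_soa(particles_aos)
scale_x(particles_soa, 2.0)
```

In a compiled language the same regrouping lets a loop over one field read consecutive memory locations, which is what makes aligned, unit-stride vector loads possible.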

An Introduction to OpenCL for Scientific Computing

F. Rudolf, K. Rupp, J. Weinbub
(TU Wien)

slides (pdf, local)

slides (pdf, github)

Learn how to use OpenCL, the open standard for parallel programming of heterogeneous systems, for cross-platform scientific computations on CPUs, GPUs, and Intel's many integrated core architecture. We introduce the OpenCL runtime and show how to write and run compute kernels. Advantages and disadvantages of OpenCL in comparison with other approaches, in particular OpenMP, OpenACC, and CUDA, are discussed. Finally, techniques and best-practices for obtaining portable performance on devices from different vendors are discussed.

NVIDIA CUDA

T. Stich

slides (pdf)

CUDA is NVIDIA's parallel computing platform and programming model. This tutorial will cover the key features of major programming language solutions, libraries and development tools for GPU computing that are available today. CUDA 6 dramatically increases developer productivity with the introduction of Unified Memory, which simplifies memory management by automatically migrating data between the CPU and GPU. Unified Memory and other new features in CUDA tools and libraries make GPU computing easier than ever before. In this talk you'll hear about these features and get insight into the philosophy driving the development of CUDA and how it will take advantage of current and future GPUs. You will learn about NVIDIA's vision for CUDA and the challenges for the future of parallel software.

An adaptive, load-balanced MPI/GPU-Code for calculating the gain in High Power Laser media

C. Eckert
(Helmholtz-Zentrum Dresden-Rossendorf)

We present an adaptive Monte Carlo approach for computing the amplified spontaneous emission (ASE) flux in a laser medium pumped by pulsed lasers. For high-energy laser systems with large apertures, sufficient spatial resolution requires high computational power. We have developed an adaptive multi-node GPU algorithm with load balancing that shows close to perfect strong scaling and allows for large speedups compared to previously existing CPU implementations. This code will make it possible to calculate the ASE flux in the large gain media that will be used in the upcoming generation of high-power laser systems.

Running GADGET2 on GPUs: Optimizing Tree-search Algorithms by Detailed Profiling of GPU Code

C. E. Frigaard

slides (pdf)

GADGET2 is a massively parallel cosmological N-body/SPH solver. It has recently been extended with GPU solving capabilities for the short-range direct-force calculation, speeding up some simulations as much as 10:1.
The two-level heterogeneous parallel programming model of GADGET2, now consisting of i) the MPI layer and ii) the GPU hardware-supported threading layer, places a serious burden on the algorithm designer: the hardware constraints of all the sub-systems now have to be taken into account for efficient overall algorithm performance, and many of these constraints are mutually opposing.
The tedious, time-consuming, trial-and-error process of manually microbenchmarking and optimizing GPU algorithms can be eliminated by using the presented tool, GPUPROF, for profiling GPU kernels: a tool that allows for automated, non-intrusive performance profiling of GPU code, as known from existing CPU-domain profilers (like Valgrind).
The profiler will also allow a faster, automated search for an optimal GPU algorithm that takes the full set of (possibly opposing) GPU hardware constraints into account.
The core algorithm of GADGET2, a tree-walking method, will be discussed in the context of this GPU-profiler, and GADGET2's portability from the CPU-domain to the GPU-domain will be analyzed using the full set of GPU-hardware constraints.

Interpolation with Radial Basis Functions on GPGPUs using CUDA

D. Martin, G. Haase, G. Offner
(VRVis Research Center Vienna, University of Graz, AVL List GmbH Graz)

slides (pdf)

Various applications, such as cardiovascular simulation or CFD in a combustion chamber, have to handle changing geometries. The same holds for shape optimization problems, in which the computational domain changes due to the modified geometry. Instead of fully remeshing the domain (which leads to discontinuous gradients in the optimization problem), the existing computational mesh is deformed.
We use interpolation with radial basis functions (RBF) to interpolate the boundary deformation, given a set of points $X=\{x_i\}_{i=1}^{N}$ with associated real function values $f_i = f(x_i)$. The function $f$ is usually unknown, but its existence is postulated so that the interpolation task is meaningful. We seek an approximating function $s:\Omega\to\mathbb{R}$ by interpolation. Restricted to the set $X$, we require the interpolation condition
$s(x_i) = f_i$, $i = 1,\dots,N$.
In the context of RBF interpolation we seek an interpolant of the form
$s(x)=\sum_{i=1}^{N} \lambda_i\,\varphi(\|x-x_i\|) + p(x)$, $\lambda_i \in \mathbb{R}$, $p \in P_M$.
Our application uses $\varphi(r)=\sqrt{r^2+c^2}$, $r:=\|x-x_i\|$, with $c\approx\min_{x_i,x_j \in X} \|x_i - x_j\|$ as kernel and hence requires a constant polynomial $p(x)=\alpha$.
The calculation of an interpolant leads to a system of linear equations with a dense system matrix. The problem sizes in industrial applications often prohibit the direct solution of the linear system; therefore iterative methods such as the FPG method, a special Krylov subspace method, are used. Therein the highest computational cost originates in the calculation of matrix-vector products with the dense system matrix.
We will present our methodology and the results for the GPGPU computation of an RBF interpolation. Depending on the problem, the accuracy, and the hardware, we achieved speedups of up to 25 on a desktop system (4+4(HT) CPU cores and GTX 680) and up to 12 on a server system (2 * (12+12(HT)) CPU cores and Tesla C2070).
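To make the structure of the dense linear system concrete, here is a minimal Python sketch of multiquadric RBF interpolation with a constant polynomial term. It is an illustration under stated assumptions, not the authors' implementation: it solves a tiny system directly by Gaussian elimination instead of the Krylov subspace method named in the abstract, it takes c as the smallest distance between distinct points, it adds the usual side condition that the coefficients sum to zero, and the names `fit_rbf` and `solve_dense` are hypothetical.

```python
import math

def multiquadric(r, c):
    """Multiquadric kernel sqrt(r^2 + c^2)."""
    return math.sqrt(r * r + c * c)

def solve_dense(a, b):
    """Gaussian elimination with partial pivoting (stand-in for an iterative solver)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def fit_rbf(points, values):
    """Return s(x) = sum_i lambda_i * phi(|x - x_i|) + alpha with s(x_i) = f_i."""
    n = len(points)
    # Assumption: c is the smallest distance between two distinct points.
    c = min(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
    # Dense kernel matrix, augmented by the constant polynomial column and
    # by the side condition sum_i lambda_i = 0 in the last row.
    system = [[multiquadric(math.dist(p, q), c) for q in points] + [1.0] for p in points]
    system.append([1.0] * n + [0.0])
    solution = solve_dense(system, list(values) + [0.0])
    lam, alpha = solution[:n], solution[n]

    def s(x):
        return sum(l * multiquadric(math.dist(x, p), c) for l, p in zip(lam, points)) + alpha

    return s

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
values = [x + 2.0 * y for x, y in points]
s = fit_rbf(points, values)
```

Every evaluation of `s`, and every matrix-vector product inside an iterative solver, touches all N points. This is the dense cost the abstract identifies as the dominant one.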

CudaMat - a toolbox for Cuda computations

R. Heintzmann
(University Jena, IPHT)

CudaMat (www.nanoimaging.de/CudaMat/) is a toolbox developed to allow fast computations in Matlab. In some respects this is now also possible with the Parallel Computing Toolbox, but the performance differs. In this talk the inner workings of CudaMat will be discussed.

Scaling Plasma Simulations to more than 18,000 GPUs

A. Hübl
(Helmholtz-Zentrum Dresden-Rossendorf)

We present results on simulating the relativistic Kelvin-Helmholtz Instability (KHI) on the Oak Ridge National Laboratory supercomputer TITAN. This simulation has been the largest kinetic simulation of the KHI yet and allowed for calculating the radiation spectra from billions of particles. We discuss how such synthetic diagnostics will gain importance in comparison of theory to experiment. We furthermore explain the main algorithmic building blocks needed to utilize 18,000 GPUs and present an outlook on next steps towards Exascale scalability.

Building blocks for sparse linear algebra on heterogeneous hardware

M. Kreutzer
(University Erlangen, RRZE)

slides (pdf)

It is a widely accepted presumption that future compute platforms will be more and more heterogeneous. Our goal is to provide building blocks for sparse linear algebra with a strong focus on efficient and resilient computation on modern hardware.
By determining a unified, high-performance sparse matrix storage format for all relevant architectures, we can easily and flexibly utilize different architectures concurrently for computing key operations like sparse matrix-vector multiplication. In order to address fault tolerance and to exploit new levels of parallelism, we suggest a generic and affinity-aware interface for defining and controlling asynchronous tasks. Within our prototype implementation we are able to provide a manageable but highly useful set of functions that ease the programmer's burden of implementing asynchronous communication, functional parallelism, and low-overhead scalable checkpointing.
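For readers unfamiliar with the key operation mentioned above, the following minimal Python sketch shows a sparse matrix-vector multiplication. The abstract does not name its unified storage format, so the sketch assumes the common compressed sparse row (CSR) layout as a stand-in; the function name `csr_matvec` and the example matrix are hypothetical.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i + 1] is the slice of nonzeros in row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# Example 3x3 matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The indirect access `x[col_idx[k]]` is what makes this kernel sensitive to the storage format: how the nonzeros are ordered and padded decides whether the loop maps well onto wide SIMD units or GPU threads.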

A Concurrent Algorithm for Computing the Flow Complex

L. Kühne
(University Jena, theoret. Informatik)

We present a concurrent algorithm and its implementation for computing the entire Hasse diagram of the flow complex of a point cloud in Euclidean space. Our algorithm avoids computing a geometric realization of the underlying complex and computes only the Hasse diagram that is augmented with enough information to allow a topological multi-scale analysis of point cloud data. We show experimental results for medium dimensions and discuss practical challenges concerning multicore scalability on the Intel Xeon Phi coprocessor.

Scalable Finite Volume Computations in Heterogeneous Systems

J. Langguth
(Simula, Oslo)

Finite volume methods are widely used numerical strategies for solving partial differential equations. Advantages of using finite volumes include built-in support for conservation laws and suitability for unstructured meshes. We study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs, GPUs, and Xeon Phi accelerators. Our focus lies on demonstrating how performance depends on the bandwidth of the communication bottlenecks between different processing elements and nodes. Consequently, these bottlenecks must be taken into account when distributing the problem in a heterogeneous environment, thus giving rise to a new combined approach to data partitioning.

Optimal Control of the Schrödinger Equation on Many-Core Architectures

M. Liebmann
(University of Graz)

slides (pdf)

We investigate efficient parallel algorithms for the optimal control problem of the time-dependent Schrödinger equation on modern many-core architectures. We consider controls for multi-component Schrödinger systems with external laser fields:
$\min\; \bigl(1-|\langle\varphi,\phi(T)\rangle|^2\bigr)+\gamma_1\|\varepsilon\|^2+\gamma_2\|\dot\varepsilon\|^2$
subject to $i\partial_t \phi = \bigl(-\tfrac{1}{2m}\Delta+V+\varepsilon(t)B\bigr)\phi$, $\phi(0)=\phi_0$,
where $\varphi\in L^2(\mathbb{R}^d,\mathbb{C}^N)$ is the desired target state to be achieved at time $T$. The external laser field amplitude $\varepsilon:[0,T]\to\mathbb{R}$ is the control. The regularizing terms, which penalize overall field energy or overall field fluctuation, are measured in suitable norms and weighted by the regularization parameters $\gamma_1,\gamma_2>0$. The time-dependent Schrödinger equation describes the effective quantum motion of nuclei with average mass $0<m\le 1$ and general initial conditions $\phi_0\in L^2(\mathbb{R}^d,\mathbb{C}^N)$. The dipole operator $B$ initiates the transfer between the potential energy surfaces of the matrix-valued potential $V:\mathbb{R}^d\to\mathbb{C}^{N\times N}$.

Parallel and simultaneous computation of eikonal and transport equations by taking full advantage of GPU computer architecture

M. Noack
(Simula, Oslo)

We present a novel method to calculate the Green's function of the Helmholtz equation. The method is designed to take full advantage of parallel computer architectures, especially GPUs. New stencil types for travel time and amplitude computations are used to update the nodes of an entire plane in parallel. Implemented in CUDA, the computation has such a low computational cost that other factors, like memory access, become a bottleneck. CUDA shared memory is used to mitigate memory and latency problems.

Scalable, interactive 3D in-situ visualization of large-scale Simulations

R. Pausch
(Helmholtz-Zentrum Dresden-Rossendorf)

We present scalable, in-situ visualization of large-scale plasma simulations that allows for remote live visualization. We discuss the GPU rendering implementation, its interface to the simulation, scalable image composition on large clusters, and the use of low-power visualization clients attached to a server located at the HPC system. Such a setup challenges current HPC visualization paradigms and will potentially allow for explorative simulation surveys of large parameter spaces with a strongly reduced storage footprint.

Computational Challenges for Visual Recognition with Deep Learning Architectures

E. Rodner
(University Jena, Informatik)

home page

Our research in the visual recognition group aims at developing recognition systems that are able to autonomously learn new object categories with minimal supervision and that are efficient in the sense that learning is done in an incremental, adaptive, and active manner without learning from scratch every time. We are currently working with deep learning methods that have shown impressive success for image categorization tasks. However, these methods require a large degree of parallelization and a high performance architecture to learn from millions of images within a reasonable amount of time. In my talk, I will sketch some of the current and future challenges in this field and present some recent results in the area of object part discovery.

Lessons Learned in Developing the Linear Algebra Library ViennaCL

F. Rudolf, K. Rupp, J. Weinbub
(TU Wien)

slides (pdf, local)

slides (pdf, github)

GPU and many integrated core architectures pose new challenges for the way scientific software is designed, implemented, tuned, and tested. We start with a short introduction to the free open source linear algebra library ViennaCL, which enables the use of OpenMP, CUDA, and OpenCL with a unified high-level programming interface. In the main part of the talk we address our strategies for achieving high performance on a broad range of hardware from different vendors. Finally, we share experiences in engaging as library developers with the scientific community, discuss the role of vendor libraries, and present future directions.

Implementing the Radon Transform using Advanced Techniques on GPGPUs

R. Seidler
(University Jena, Informatik)

The Radon transform is an image transformation technique mostly known from image reconstruction in medical applications like CT or MRI. Recent developments in GPGPU architectures and programmability, like dynamic parallelism or new shuffle operations, show massive potential for improving existing algorithms by eliminating synchronization between GPU and CPU and by removing the need for inter-thread synchronization altogether. Starting with a straightforward implementation, we show the impact of cache-aware memory accesses and then investigate how the new features of CUDA-capable GPUs can help to improve the speed of the Radon transform even further.
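The kind of straightforward implementation such work starts from can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: it uses nearest-neighbor sampling on a small square image, rotates about the image center, and the function name `radon` is hypothetical. Each output value is the sum of the image along one ray, so all (angle, offset) pairs are independent, which is what makes the transform attractive for GPUs.

```python
import math

def radon(image, angles):
    """Naive Radon transform of a square image with nearest-neighbor sampling.

    Returns one projection (list of ray sums) per angle in radians.
    """
    n = len(image)
    center = (n - 1) / 2.0
    sinogram = []
    for theta in angles:
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        projection = []
        for i in range(n):              # detector offset t
            t = i - center
            total = 0.0
            for j in range(n):          # position u along the ray
                u = j - center
                x = round(t * cos_t - u * sin_t + center)
                y = round(t * sin_t + u * cos_t + center)
                if 0 <= x < n and 0 <= y < n:
                    total += image[y][x]
            projection.append(total)
        sinogram.append(projection)
    return sinogram

image = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
# Angle 0 sums along columns, angle pi/2 sums along rows.
sinogram = radon(image, [0.0, math.pi / 2])
```

For oblique angles the rays cut across rows and columns, so neighboring rays read scattered memory locations. That access pattern is the reason cache-aware memory accesses matter for this transform.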

An Optimized Intra-Node Communication Scheme Using Multiple CUDA Streams and OpenMP Threads

M. Sourouri
(Simula, Oslo)

In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Between each pair of GPUs, multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. Moreover, we combine our scheme with GPUDirect to provide GPU-GPU data transfers without CPU involvement. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare the performance of our new scheme with an MPI-based state-of-the-art scheme. Results show that our approach outperforms the state-of-the-art scheme, being up to 1.85× faster.

A parallel functional language for high performance finite difference stencil codes

G. Zumbusch
(University Jena)

slides (pdf)

Current CPU (processor) and GPU (graphics) architectures heavily use data and instruction parallelism at different levels. Finite difference algorithms on these systems tend to be memory-bandwidth limited, like many other explicit numerical schemes. In order to tune several finite difference kernels, we will discuss cache-aware algorithms and vectorization strategies, including non-standard memory layouts. Furthermore, a small domain-specific functional language is introduced. It allows for a short and hardware-independent way to express the numerical schemes and for automatic code generation and optimization.
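The idea of expressing a numerical scheme separately from the loop that executes it can be hinted at with a minimal Python sketch. This is not the language presented in the talk: the stencil is simply described as a mapping from grid offsets to weights, and the name `apply_stencil` is hypothetical. A real code generator would turn such a description into tuned, vectorized kernels instead of interpreting it.

```python
def apply_stencil(stencil, grid):
    """Apply a 1D stencil given as {offset: weight} to all interior points.

    Points closer to the boundary than the stencil reach are copied unchanged.
    """
    reach = max(abs(offset) for offset in stencil)
    out = list(grid)
    for i in range(reach, len(grid) - reach):
        out[i] = sum(weight * grid[i + offset] for offset, weight in stencil.items())
    return out

# Second-order central difference for the second derivative, unit grid spacing.
laplacian = {-1: 1.0, 0: -2.0, 1: 1.0}

u = [float(i * i) for i in range(6)]    # samples of u(x) = x^2
d2u = apply_stencil(laplacian, u)       # interior values approximate u'' = 2
```

Each output point needs three loads for three floating-point operations, which illustrates why such kernels tend to be limited by memory bandwidth and not by arithmetic.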