HPC: Fundamentals of HPC systems and parallel execution.
Introduction to HPC
High-Performance Computing (HPC) refers to the practice of aggregating computing power to deliver much higher performance than a typical desktop computer or workstation to solve large problems in science, engineering, or business.
HPC is a technology that uses clusters of powerful processors that work in parallel to process massive, multidimensional data sets and solve complex problems at extremely high speeds.
Standard Computing vs. Parallel Computing
Standard Computing System:
Solves problems (primarily) by using serial computing
Divides the workload into a sequence of tasks
Runs the tasks one after the other on the same processor
Uses only a single processor at a time
Parallel Computing:
Runs multiple tasks simultaneously on numerous computer servers or processors
Large compute problems are broken down into smaller problems that can be solved by multiple processors
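The break-down-and-combine pattern can be sketched in a few lines (an illustrative sketch only; `parallel_sum` and the chunking scheme are hypothetical, and Python workers are used purely to show the structure):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Solve one small sub-problem independently of the others."""
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Break the large problem into smaller, independent sub-problems.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Workers process chunks concurrently; partial results are combined.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```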
HPC Clusters
Computer Clusters (HPC Clusters) consist of multiple high-speed computer servers (nodes) networked with a centralized scheduler that manages the parallel computing workload.
High Performance Components
All computing resources in an HPC Cluster are high speed and high throughput:
Networking
Memory
Storage
File systems
[ HPC Background ]
HPC.002
Parallelism
Purpose of Parallelism
Cost Reduction: Serial computing uses a single processor, which can take an extremely long time on complex problems
Complex Problem Solving: Systems need to perform thousands to millions of tasks in an instant; performed sequentially, these create many bottlenecks
Increased Efficiency: Resources can be used efficiently. Multiple tasks can be performed concurrently
Types of Parallel Computing
Bit-Level Parallelism: Processor word size is increased, which decreases the number of instructions the processor must run to solve the problem
Instruction-Level Parallelism: The processor decides at run time which independent instructions to execute simultaneously; processors are built to overlap certain operations to improve resource utilization and increase throughput
Task Parallelism: Code is parallelized across several processors that simultaneously run different tasks on the same data; reduces serial time by running tasks concurrently
Superword-Level Parallelism: A vectorization technique applied to straight-line code; completes multiple similar operations at once, saving time and resources
Concurrency
Understanding Concurrency
Concurrency refers to managing multiple tasks and is about dealing with multiple things at once. This is distinct from parallelism, which is about doing multiple things at once.
Multiple computations are happening at the same time across:
Multiple computers on a network
Multiple applications running on one computer
Multiple processors in a computer (or on one chip)
Multiprocessing
Goal: the illusion of a dedicated CPU for each process
In reality, we have multiprocessing with executions interleaved
Multicore processors:
Multiple CPUs on a single chip
Share main memory and some of the caches
Each can execute a separate process
Scheduling of processes done by kernel
Concurrent Processes: Flows overlap in time; processes whose flows do not overlap are sequential
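The overlap of concurrent flows can be sketched with two threads updating shared state, where a lock provides the synchronization that interleaved execution requires (a minimal illustrative sketch):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    """One concurrent flow; its execution interleaves with the other thread's."""
    global counter
    for _ in range(increments):
        with lock:               # serialize the read-modify-write step
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()                    # flows now overlap in time
for t in threads:
    t.join()                     # counter is exactly 20_000 thanks to the lock
```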
Scaling & Speedup
Scalability
Scalability refers to the ability to handle increased workload by adding resources.
Types of Scaling
Strong Scaling: Fixed global problem size with increasing processors (number of cores)
Fix global problem size
Increase n (core count)
Weak Scaling: Increase problem size with processors (easier to accomplish)
Increase computational work proportional to the number of cores
As you increase the number of cores, the amount of work each core does stays roughly consistent
Parallel Speedup
Performance improvement from parallel execution is measured as:
S(n) = T1/Tn
Where:
- S(n): the speedup achieved on n processors (cores)
- T1: execution time on one processor
- Tn: execution time on n processors
"Perfect Scaling" occurs when S(n) = n
Parallel Efficiency
E(n) = S(n)/n
- E=1 is "Perfect Scaling"
- Runs from 0 to 1
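The two formulas translate directly into code (a minimal sketch; the function names are illustrative):

```python
def speedup(t1, tn):
    """S(n) = T1 / Tn: speedup of the n-processor run over the serial run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n; ranges from 0 to 1, with E = 1 being perfect scaling."""
    return speedup(t1, tn) / n

# A job taking 100 s serially and 25 s on 4 cores scales perfectly:
# speedup(100, 25) -> 4.0, efficiency(100, 25, 4) -> 1.0
```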
Amdahl's Law
Theoretical Speedup Limits on multiple processors:
S(n) = 1 / ((1 - p) + p/n)
Where:
p = portion of the program that can be parallelized
n = number of processors
1 - p = portion that cannot be parallelized
"The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used"
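Amdahl's Law as a function (a direct transcription of the formula above; the function name is illustrative):

```python
def amdahl_speedup(p, n):
    """Theoretical upper bound on speedup: parallel fraction p, n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with unlimited processors, a 90%-parallel program cannot exceed
# 10x speedup, because the serial 10% always remains: as n grows,
# amdahl_speedup(0.9, n) approaches 1 / (1 - 0.9) = 10.
```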
Parallel Architectures
Parallel Computing Architectures
Shared Memory: Computers rely on multiple processors to contact the same shared memory resource
Distributed Memory: Multiple processors with their own memory resources are linked over a network
Hybrid Memory: Combines shared memory computers on distributed memory networks; connected CPUs can access their local shared memory and exchange data with tasks assigned to other units on the same network
Multicore CPUs
Every core owns all of the data (visible to all cores), which allows for updating and viewing data in parallel.
Core synchronization is important for proper usage
Message passing (by contrast): a core asks other cores for data it doesn't own
Processors don't work in isolation; they have a memory hierarchy, and data must be moved to the processor before operations can be performed
In Parallel: Cache coherency is important. Synchronization between caches and main memory is required
[ HPC Topics ]
HPC.003
Memory Models
UMA and NUMA
Uniform Memory Access (UMA): Each core can read and write the entire memory space, and does so with roughly similar latency
Non-Uniform Memory Access (NUMA): Memory access time depends on the memory's location relative to the processor; there are affinities between groups of CPUs and certain subsets of memory
Both architectures are used in symmetric multiprocessing (SMP) systems - multiple processors share a common memory pool.
Shared vs. Distributed Memory
Shared Memory: Multiple processors access the same memory space
Distributed Memory: Each processor has its own private memory
Each core sees only its own piece of memory (shared-memory models such as OpenMP do not apply)
Processors need to know boundary information from neighboring processors
This is done through message passing; MPI is the most commonly used standard
Interior Process vs. Boundary Process
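The boundary exchange between neighboring processors can be sketched in a single process (a hypothetical simulation, not real MPI; in practice each slice would live on a different rank and the copies would be MPI sends and receives):

```python
def exchange_boundaries(slices):
    """Return (left_ghost, right_ghost) boundary values for each rank's slice.

    Simulates a halo exchange: each rank receives the edge value of its
    neighbors' local data; ranks at the ends have no neighbor on one side.
    """
    halos = []
    last = len(slices) - 1
    for rank in range(len(slices)):
        left = slices[rank - 1][-1] if rank > 0 else None     # "recv" from rank-1
        right = slices[rank + 1][0] if rank < last else None  # "recv" from rank+1
        halos.append((left, right))
    return halos
```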
RDMA Networking
Remote Direct Memory Access (RDMA) enables one networked computer to access another networked computer's memory without involving either computer's operating system or interrupting either computer's processing.
Helps minimize latency and maximize throughput
Reduces memory bandwidth bottlenecks
Emerging high-performance RDMA fabrics, including InfiniBand, Virtual Interface Architecture, and RDMA over Converged Ethernet (RoCE), make cloud-based HPC possible
Memory Hierarchy
Why Memory Hierarchy Matters
Memory is organized to minimize access time by optimizing the available memory in the computer. This organization follows the principle of locality of references: same data or nearby data is likely to be accessed repeatedly.
Memory Types and Organization
Each level in the hierarchy has its own size, cost, and performance characteristics:
Registers: Smallest, fastest memory units directly in the CPU (16-64 bits)
Cache Memory (L1, L2, L3): Small, fast memory close to the CPU for frequently accessed data
L1 Cache: Accessed in one cycle, very small but extremely fast
L2 Cache: Larger but slightly slower
L3 Cache: Shared among cores, larger and slower than L1/L2
Main Memory (RAM): Primary memory with larger capacity but slower access
Static RAM (SRAM): Stores binary information in flip flops, faster but more expensive
Dynamic RAM (DRAM): Stores binary info as charge on capacitors, higher density
External Storage: Disk drives, SSDs, and other persistent storage
Characteristics and Trade-offs
Capacity: Global volume of information the memory can store
Access Time: Time interval between the read/write request and availability of the data
Performance: Frequently accessed data is stored in faster memory
Cost per Bit: Cost increases as you move up the hierarchy (internal memory is costlier than external)
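Locality of reference can be illustrated with two traversals of the same 2-D data (an illustrative sketch; the effect on cache behavior applies most directly to contiguous row-major storage such as C arrays or NumPy buffers):

```python
# Both functions compute the same sum; only the memory-access pattern
# differs. With row-major storage, the row-order walk touches consecutive
# addresses and reuses cache lines, while the column-order walk strides
# across rows and causes more cache misses.
def sum_row_major(grid):
    total = 0
    for row in grid:                  # visit elements in storage order
        for value in row:
            total += value
    return total

def sum_col_major(grid):
    total = 0
    for col in range(len(grid[0])):   # stride across rows: poor locality
        for row in grid:
            total += row[col]
    return total
```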
[ Memory ]
HPC.004
Low-Precision Arithmetic
Floating Point Representation
Real numbers are represented in computers as floating-point numbers (a form of scientific notation), with three components:
Fraction (mantissa): the significant digits of the value
Exponent: the scale
Sign: positive or negative
Low-Precision Benefits
FP16 (Half-Precision):
16-bit floating-point format
Faster computation, less memory
Used in NVIDIA's Tensor Cores
INT8 Quantization:
8-bit integer format
Further memory reduction
Common in inference deployments
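INT8 quantization can be sketched with a simple symmetric scheme (an assumption for illustration; real deployments use several variants, and the function names are hypothetical):

```python
# Hypothetical symmetric INT8 quantization sketch: map the largest
# magnitude to 127 and store 8-bit integer codes plus one FP32 scale
# factor (4x smaller than storing FP32 values). Assumes at least one
# nonzero input value.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    codes = [round(v / scale) for v in values]    # integers in [-127, 127]
    return codes, scale

def dequantize_int8(codes, scale):
    # Recover approximate real values; error is bounded by about scale/2.
    return [c * scale for c in codes]
```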
Mixed Precision Training
Combining FP16 and FP32:
FP16 for most operations
FP32 for critical accumulations
Loss scaling to prevent underflow
This approach provides significant performance gains while maintaining numerical stability.
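The underflow problem and the loss-scaling fix can be demonstrated with Python's built-in half-precision codec (`struct` format `'e'`); this is an illustrative sketch of the idea, not a training loop:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through 16-bit (FP16) storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                       # below FP16's smallest subnormal
lost = to_fp16(tiny_grad)              # underflows to 0.0: gradient vanishes

scale = 1024.0                         # loss scale, applied in FP32
kept = to_fp16(tiny_grad * scale)      # now representable in FP16 (nonzero)
recovered = kept / scale               # unscale in FP32 before the update
```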
Performance
Performance Considerations
If performance is neglected, computations could run for extremely long periods. Understanding system capacity is crucial:
FLOPS: Floating point operations per second
Multiple CPUs make complex computations more manageable
Common Bottlenecks
Memory Bandwidth: The rate at which data can be read from or stored into memory
Compute Utilization: How efficiently computational resources are used
Data Loading: The time to load data from storage to memory
Communication Overhead: Time spent transferring data between nodes
Latency and Bandwidth
Latency: Delay before a transfer of data begins following an instruction
Bandwidth: How much you can transfer per unit of time
When training neural networks, bandwidth (for weights) is typically more important than latency.
Hiding Latency
Strategies for managing latency include:
Prefetching: Request data before it's needed
Concurrency: Use pipelining to switch between computations as data is coming in
Note that bandwidth is fundamental and can't be "hidden" - it represents a hard limit on system performance.
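Prefetching can be sketched with a background worker that loads the next item while the current one is being processed (a hypothetical `process_with_prefetch` helper; `load` and `compute` stand in for real I/O and computation):

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_prefetch(load, keys, compute):
    """Load keys[i+1] in the background while computing on keys[i]."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load, keys[0])        # prefetch the first item
        for nxt in list(keys[1:]) + [None]:
            item = future.result()                 # wait only if not ready yet
            if nxt is not None:
                future = pool.submit(load, nxt)    # request the next item early
            results.append(compute(item))          # overlap compute with loading
    return results
```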