HIGH-PERFORMANCE COMPUTING

AKA: HPC
Fundamentals of HPC systems and parallel execution.


Introduction to HPC

High-Performance Computing (HPC) refers to the practice of aggregating computing power to deliver much higher performance than a typical desktop computer or workstation to solve large problems in science, engineering, or business.

Why HPC Matters

HPC is beneficial for:


  • Scientific simulations (weather forecasting, molecular modeling)
  • Engineering design and analysis
  • Big data processing and analytics
  • Artificial intelligence and machine learning training
  • Financial modeling and risk analysis
[Figure: performance scaling of traditional computing resources vs. HPC]


HPC.001

HPC Background

What is HPC?

HPC is a technology that uses clusters of powerful processors that work in parallel to process massive, multidimensional data sets and solve complex problems at extremely high speeds.

Standard Computing vs. Parallel Computing

Standard Computing System:

  • Solves problems primarily by serial computing
  • Divides the workload into a sequence of tasks
  • Runs the tasks one after the other on the same processor
  • Uses only one processor at a time

Parallel Computing:

  • Runs multiple tasks simultaneously on numerous computer servers or processors
  • Large compute problems are broken down into smaller problems that can be solved by multiple processors
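The decomposition described above can be sketched with Python's standard library; the chunk size, worker count, and summation workload are illustrative choices, not from the text:

```python
# A minimal sketch of breaking one large problem into smaller problems
# that are solved simultaneously by multiple processors.
from multiprocessing import Pool

def partial_sum(chunk):
    """Solve one small sub-problem: sum a slice of the data."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Break the large problem into roughly equal chunks...
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...and solve them in parallel on multiple processors.
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_001))))  # same result as a serial sum
```

The combine step (the outer `sum`) is itself serial; as discussed under Amdahl's Law later, such serial portions limit the achievable speedup.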

HPC Clusters

Computer Clusters (HPC Clusters) consist of multiple high-speed computer servers (nodes) networked with a centralized scheduler that manages the parallel computing workload.

High Performance Components

All computing resources in an HPC Cluster are high speed and high throughput:

  • Networking
  • Memory
  • Storage
  • File systems
HPC.002

Parallelism

Purpose of Parallelism

  • Cost Reduction: Serial computing uses a single processor, which can take impractically long for complex problems
  • Complex Problem Solving: Systems need to perform thousands to millions of tasks quickly; performing them sequentially creates many bottlenecks
  • Increased Efficiency: Multiple tasks can be performed concurrently, so resources are used efficiently

Types of Parallel Computing

  1. Bit-Level Parallelism: The processor word size is increased, decreasing the number of instructions the processor must run to solve a problem
  2. Instruction-Level Parallelism: The processor decides which instructions to overlap; processors are built to perform certain operations simultaneously to improve resource utilization and increase throughput
  3. Task Parallelism: Code is parallelized across several processors that simultaneously run tasks on the same data, reducing serial time by running tasks concurrently
  4. Superword-Level Parallelism: A vectorization technique applied to straight-line (inline) code that completes multiple similar operations at once, saving time and resources
[Figure: sequential execution of tasks A-C vs. parallel execution across cores 1-4 with a sync point]

Concurrency

Understanding Concurrency

Concurrency refers to managing multiple tasks and is about dealing with multiple things at once. This is distinct from parallelism, which is about doing multiple things at once.

Multiple computations are happening at the same time across:

  • Multiple computers on a network
  • Multiple applications running on one computer
  • Multiple processors in a computer (or on one chip)

Multiprocessing

  • Goal: give each process the illusion of having its own CPU
  • In reality, we have multiprocessing with executions interleaved
  • Multicore processors:
    • Multiple CPUs on a single chip
    • Share main memory and some of the caches
    • Each can execute a separate process
    • Scheduling of processes done by kernel
  • Concurrent processes: flows overlap in time; otherwise they are sequential
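The interleaved execution described above can be illustrated with threads sharing one counter; this is a sketch, with the thread and increment counts chosen arbitrarily:

```python
# Four concurrent flows update one shared counter. Their executions
# overlap in time; the lock serializes only the critical section so
# that interleaved updates are not lost.
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:           # without this, interleavings could drop updates
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```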

Scaling & Speedup

Scalability

Scalability refers to the ability to handle increased workload by adding resources.

Types of Scaling

  • Strong Scaling: Fixed global problem size with increasing processors (number of cores)
    • Fix global problem size
    • Increase n (core count)
  • Weak Scaling: Increase problem size with processors (easier to accomplish)
    • Increase computational work proportional to the number of cores
    • As you increase the number of cores, the amount of work each core does stays roughly consistent

Parallel Speedup

Performance improvement from parallel execution is measured as:

    S(n) = T1/Tn

    Where:
    - S(n): speedup on n processors (cores)
    - T1: execution time on one processor
    - Tn: execution time on n processors

"Perfect Scaling" occurs when S(n) = n

Parallel Efficiency

    E(n) = S(n)/n

    - E = 1 is "Perfect Scaling"
    - E runs from 0 to 1
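The speedup and efficiency formulas translate directly into code; the timings below are made-up illustrative numbers:

```python
def speedup(t1, tn):
    """S(n) = T1 / Tn: how much faster n processors are than one."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n, running from 0 (poor) to 1 (perfect scaling)."""
    return speedup(t1, tn) / n

# Perfect scaling: a job taking 100 s on 1 core finishes in 25 s on 4 cores.
print(speedup(100, 25))        # 4.0, i.e. S(n) = n
print(efficiency(100, 25, 4))  # 1.0, "Perfect Scaling"
```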

Amdahl's Law

Theoretical Speedup Limits on multiple processors:


    Speedup = 1 / ((1 - p) + p/n)

    Where:
    p = portion of the program that can be parallelized
    n = number of processors
    1 - p = portion that cannot be parallelized
    

"The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used"

[Figure: Amdahl's Law, speedup vs. number of processors (1 to 64) for 50%, 75%, 90%, and 95% parallel fractions]

Parallel Architectures

Parallel Computing Architectures

  • Shared Memory: Multiple processors access the same shared memory resource
  • Distributed Memory: Multiple processors, each with its own memory resources, are linked over a network
  • Hybrid Memory: Combines shared-memory computers on distributed-memory networks; connected CPUs can access shared memory and exchange tasks with other units on the same network

Multicore CPUs

In a shared-memory multicore CPU, all data is visible to every core, which allows cores to view and update it in parallel.

  • Core synchronization is important for correct usage
  • In a message-passing model, by contrast, a core must ask other cores for data it doesn't own
  • Processors don't work in isolation; they have a memory hierarchy, and data must be moved to the processor before operations can be performed on it
  • In parallel execution, cache coherency is important: caches must stay synchronized with each other and with main memory
HPC.003

Memory Models

UMA and NUMA

  • Uniform Memory Access (UMA): Each core can read and write the entire memory space with roughly similar latency
  • Non-Uniform Memory Access (NUMA): Memory access time depends on the memory location relative to the processor; there are affinities between groups of CPUs and certain subsets of memory

Both architectures are used in symmetric multiprocessing (SMP) systems - multiple processors share a common memory pool.

Shared vs. Distributed Memory

  • Shared Memory: Multiple processors access the same memory space
    • All memory is visible to all cores (threading)
    • Uniform Memory Access: SMP-symmetric multiprocessors
    • Non-Uniform Memory Access
  • Distributed Memory: Each processor has its own private memory
    • Every core only sees its subpiece of memory (no OpenMP)
    • Processors need to know boundary information from neighboring processors
    • Can be done through message passing. MPI is most commonly used
    • Interior Process vs. Boundary Process
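The boundary-exchange idea above can be sketched in plain Python. Real codes do this with MPI across separate processes; here the "processors" are just lists in one process, so only the data pattern is simulated:

```python
# Each "processor" gets its subpiece of a 1-D array plus halo copies of
# its neighbors' boundary values, so interior updates need no further
# communication. A single-process illustration, not real message passing.
def split_with_halos(data, nproc):
    size = len(data) // nproc
    pieces = [data[i * size:(i + 1) * size] for i in range(nproc)]
    halos = []
    for rank, piece in enumerate(pieces):
        left = pieces[rank - 1][-1] if rank > 0 else None        # from left neighbor
        right = pieces[rank + 1][0] if rank < nproc - 1 else None  # from right neighbor
        halos.append((left, piece, right))
    return halos

print(split_with_halos([1, 2, 3, 4, 5, 6], 3))
# [(None, [1, 2], 3), (2, [3, 4], 5), (4, [5, 6], None)]
```

Each tuple holds (left halo, local subpiece, right halo): interior points use only local data, while boundary points also use the halo values received from neighbors.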

RDMA Networking

Remote Direct Memory Access (RDMA) enables one networked computer to access another networked computer's memory without involving either computer's operating system or interrupting either computer's processing.

  • Helps minimize latency and maximize throughput
  • Reduces memory bandwidth bottlenecks
  • Emerging high-performance RDMA fabrics, including InfiniBand, Virtual Interface Architecture, and RDMA over Converged Ethernet (RoCE), make cloud-based HPC possible

Memory Hierarchy

Why Memory Hierarchy Matters

Memory is organized to minimize access time by optimizing the available memory in the computer. This organization follows the principle of locality of references: same data or nearby data is likely to be accessed repeatedly.

Memory Types and Organization

Each level in the hierarchy has its own size, cost, and performance characteristics:

  1. Registers: Smallest, fastest memory units directly in the CPU (16-64 bits)
  2. Cache Memory (L1, L2, L3): Small, fast memory close to the CPU for frequently accessed data
    • L1 Cache: Accessed in one cycle, very small but extremely fast
    • L2 Cache: Larger but slightly slower
    • L3 Cache: Shared among cores, larger and slower than L1/L2
  3. Main Memory (RAM): Primary memory with larger capacity but slower access
    • Static RAM (SRAM): Stores binary information in flip flops, faster but more expensive
    • Dynamic RAM (DRAM): Stores binary info as charge on capacitors, higher density
  4. External Storage: Disk drives, SSDs, and other persistent storage
[Figure: memory hierarchy pyramid from registers through L1/L2/L3 cache and main memory (DRAM) down to storage (HDD/SSD); speed increases toward the top, size toward the bottom]
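Locality of reference can be made concrete with a toy cache model; the line size, cache size, and access patterns below are made-up illustrative numbers:

```python
# A toy direct-mapped cache showing why locality matters: sequential
# access reuses each fetched line, while strided access keeps missing.
def miss_rate(addresses, line_size=8, num_lines=4):
    cache = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_size      # which memory line holds this word
        slot = line % num_lines       # direct-mapped placement
        if cache[slot] != line:
            misses += 1               # miss: fetch the line from slower memory
            cache[slot] = line
    return misses / len(addresses)

sequential = list(range(256))                  # nearby data accessed in order
strided = [i * 32 % 256 for i in range(256)]   # jumps that defeat the cache
print(miss_rate(sequential))  # 0.125: one miss per 8-word line
print(miss_rate(strided))     # 1.0: every access misses
```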

Characteristics and Trade-offs

  • Capacity: Global volume of information the memory can store
  • Access Time: Time interval between the read/write request and availability of the data
  • Performance: Frequently accessed data is stored in faster memory
  • Cost per Bit: Cost increases as you move up the hierarchy (internal memory is costlier than external)
HPC.004

Low-Precision Arithmetic

Floating Point Representation

Real numbers in computers are represented as floating point numbers (a form of scientific notation), with three components:

  • Sign
  • Exponent: the scale
  • Fraction (mantissa): the significant digits of the value

Low-Precision Benefits

  • FP16 (Half-Precision):
    • 16-bit floating-point format
    • Faster computation, less memory
    • Used in NVIDIA's Tensor Cores
  • INT8 Quantization:
    • 8-bit integer format
    • Further memory reduction
    • Common in inference deployments
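The precision/size trade-off can be demonstrated with Python's standard `struct` module, which supports the IEEE 754 half-precision (`'e'`) format; the quantization scale below is an arbitrary illustrative value:

```python
# FP16 keeps only ~11 significand bits, so values lose precision when
# round-tripped through it; INT8 quantization compresses further by
# mapping floats to 8-bit integers via a scale factor.
import struct

def to_fp16(x):
    """Round-trip a Python float through 16-bit half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

def quantize_int8(x, scale):
    """q = round(x / scale), clipped to the signed 8-bit range."""
    return max(-128, min(127, round(x / scale)))

print(to_fp16(0.1))                        # not exactly 0.1: precision is lost
print(to_fp16(2048.0) == to_fp16(2049.0))  # True: FP16 can't tell them apart
print(quantize_int8(100.0, 0.5))           # 127: clipped to the int8 range
```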

Mixed Precision Training

Combining FP16 and FP32:

  • FP16 for most operations
  • FP32 for critical accumulations
  • Loss scaling to prevent underflow

This approach provides significant performance gains while maintaining numerical stability.

Performance

Performance Considerations

If performance is neglected, computations could run for extremely long periods. Understanding system capacity is crucial:

  • FLOPS: Floating point operations per second
  • Multiple CPUs make complex computations more manageable

Common Bottlenecks

  • Memory Bandwidth: The rate at which data can be read from or stored into memory
  • Compute Utilization: How efficiently computational resources are used
  • Data Loading: The time to load data from storage to memory
  • Communication Overhead: Time spent transferring data between nodes

Latency and Bandwidth

  • Latency: Delay before a transfer of data begins following an instruction
  • Bandwidth: How much you can transfer per unit of time

When training neural networks, bandwidth (for weights) is typically more important than latency.

Hiding Latency

Strategies for managing latency include:

  • Prefetching: Request data before it's needed
  • Concurrency: Use pipelining to switch between computations as data is coming in

Note that bandwidth is fundamental and can't be "hidden" - it represents a hard limit on system performance.
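The prefetching strategy above can be sketched with a background loader thread feeding a bounded queue; the simulated I/O delay and batch contents are illustrative:

```python
# Hiding load latency: a loader thread fetches the next batch while the
# main thread computes on the current one, so fetch time overlaps compute.
import queue
import threading
import time

def loader(batches, q):
    for batch in batches:
        time.sleep(0.01)      # simulated slow I/O (illustrative delay)
        q.put(batch)
    q.put(None)               # sentinel: no more data

def train(batches):
    q = queue.Queue(maxsize=2)    # small buffer = the prefetch window
    threading.Thread(target=loader, args=(batches, q), daemon=True).start()
    total = 0
    while (batch := q.get()) is not None:
        total += sum(batch)   # compute overlaps with the next fetch
    return total

print(train([[1, 2], [3, 4], [5, 6]]))  # 21
```

The bounded queue is the key design choice: it limits how far ahead the loader runs, trading a little memory for hidden latency, while bandwidth remains a hard limit that no amount of buffering can hide.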

Optimization Techniques

Memory Management Optimizations

  1. Tensor memory layout optimization
  2. Pinned memory for faster transfers
  3. Memory pool allocators
  4. Gradient checkpointing

Data Loading Optimizations

  1. NVIDIA DALI for efficient data loading
  2. Prefetching and caching
  3. GPU direct memory access
  4. Dataset sharding for multi-GPU

Distributed Training Optimizations

  1. Data parallel training
  2. Model parallel training
  3. Pipeline parallelism
  4. Communication optimization

Kernel Optimizations

  • Kernel Fusion: Combining multiple operations
  • Memory Coalescing: Optimizing memory access patterns
  • Thread Block Optimization: Efficient thread organization
  • Pipeline Parallelism: Overlapping computation and communication

Load Balancing

Purpose of Load Balancing

Load balancing ensures that work is distributed efficiently across computing resources, preventing any single resource from becoming a bottleneck.

Key Components

  • Work Distribution: Dividing tasks among processors
  • Dynamic Load Balancing: Runtime task redistribution
  • Task Scheduling: Organizing execution order
  • Resource Allocation: Assigning computing resources

Strategies

Common load balancing strategies include:

  • Static Distribution: Work divided evenly before execution
  • Dynamic Distribution: Tasks allocated during runtime based on processor availability
  • Work Stealing: Idle processors "steal" work from busy ones
  • Hierarchical Approaches: Combining different strategies at different levels
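The work-stealing strategy can be sketched with per-processor deques; this is a single-threaded simulation of the scheduling pattern, not a real multiprocessor scheduler:

```python
# Each "processor" takes its own work from the front of its deque; when
# idle, it steals from the tail of the most heavily loaded deque.
from collections import deque

def run_with_stealing(queues):
    """queues: one deque of task callables per processor."""
    results = []
    while any(queues):
        for q in queues:
            if q:
                results.append(q.popleft()())   # run own work from the front
            else:
                victim = max(queues, key=len)   # idle: pick the busiest queue
                if victim:
                    results.append(victim.pop()())  # steal from its tail
    return results

tasks = [deque([lambda x=x: x for x in (1, 2, 3, 4)]), deque()]
print(sorted(run_with_stealing(tasks)))  # [1, 2, 3, 4]
```

Stealing from the tail (while owners pop from the front) is the usual design choice: it reduces contention between the owner and the thief and tends to move larger units of work.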
HPC.005

Communication Protocols

MPI (Message Passing Interface)

Standard protocol for distributed computing that allows users to communicate between nodes in a cluster or across a network.

  • Point-to-point Communication: Direct message between two processes
  • Collective Communication: Operations involving all processes
  • Process Management: Handling multiple parallel processes

Hybrid Approach

Many HPC applications use a hybrid approach:

  • Local memory threaded with OpenMP
  • Distributed memory parallelism with MPI

Communication Patterns

Common communication patterns in HPC:

  • Broadcast: One process sends data to all other processes
  • Scatter/Gather: Distributing/collecting data across processes
  • Reduction: Combining results from all processes
  • All-to-All: Every process communicates with every other process
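The collective patterns above can be modeled in plain Python. In real HPC codes these are MPI collectives (MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce); here each "process" is just an entry in a list, so only the data movement is illustrated:

```python
# Plain-Python models of the collective communication patterns.
from functools import reduce as _reduce

def broadcast(value, nprocs):
    return [value] * nprocs                  # root's value reaches every process

def scatter(data, nprocs):
    size = len(data) // nprocs
    return [data[i * size:(i + 1) * size] for i in range(nprocs)]

def gather(pieces):
    return [x for piece in pieces for x in piece]

def reduce_sum(values):
    return _reduce(lambda a, b: a + b, values)  # combine results from all

parts = scatter([1, 2, 3, 4, 5, 6], 3)       # [[1, 2], [3, 4], [5, 6]]
local = [sum(p) for p in parts]              # each process works on its piece
print(reduce_sum(local))                     # 21: reduced result at the root
```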