HPC: Fundamentals of HPC systems and parallel execution.
Introduction to HPC
High-Performance Computing (HPC) refers to the practice of aggregating computing power to deliver much higher performance than a typical desktop computer or workstation to solve large problems in science, engineering, or business.
HPC is a technology that uses clusters of powerful processors that work in parallel to process massive, multidimensional data sets and solve complex problems at extremely high speeds.
Standard Computing vs. Parallel Computing
Standard Computing System:
Solves problems (primarily) by using serial computing
Divides the workload into a sequence of tasks
Runs the tasks one after the other on the same processor
Uses only a single processor at a time
Parallel Computing:
Runs multiple tasks simultaneously on numerous computer servers or processors
Large compute problems are broken down into smaller problems that can be solved by multiple processors
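The break-down-and-combine pattern can be sketched in a few lines (an illustrative sketch only; `parallel_sum` and the chunking scheme are hypothetical, and Python workers are used purely to show the structure):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Solve one small sub-problem independently of the others."""
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Break the large problem into smaller, independent sub-problems.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Workers process chunks concurrently; partial results are combined.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```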
HPC Clusters
Computer Clusters (HPC Clusters) consist of multiple high-speed computer servers (nodes) networked with a centralized scheduler that manages the parallel computing workload.
High Performance Components
All computing resources in an HPC Cluster are high speed and high throughput:
Networking
Memory
Storage
File systems
[ HPC Background ]
HPC.002
Parallelism
Purpose of Parallelism
Cost Reduction: Serial computing uses a single processor, which can take an extremely long time on complex problems
Complex Problem Solving: Systems need to perform thousands to millions of tasks in an instant; performed sequentially, these create many bottlenecks
Increased Efficiency: Resources can be used efficiently. Multiple tasks can be performed concurrently
Types of Parallel Computing
Bit-Level Parallelism: Processor word size is increased, which decreases the number of instructions the processor must run to solve the problem
Instruction-Level Parallelism: The processor decides at run time which independent instructions to execute simultaneously; processors are built to overlap certain operations to improve resource utilization and increase throughput
Task Parallelism: Code is parallelized across several processors that simultaneously run different tasks on the same data; reduces serial time by running tasks concurrently
Superword-Level Parallelism: A vectorization technique applied to straight-line code; completes multiple similar operations at once, saving time and resources
Concurrency
Understanding Concurrency
Concurrency refers to managing multiple tasks and is about dealing with multiple things at once. This is distinct from parallelism, which is about doing multiple things at once.
Multiple computations are happening at the same time across:
Multiple computers on a network
Multiple applications running on one computer
Multiple processors in a computer (or on one chip)
Multiprocessing
Goal: the illusion of a dedicated CPU for each process
In reality, we have multiprocessing with executions interleaved
Multicore processors:
Multiple CPUs on a single chip
Share main memory and some of the caches
Each can execute a separate process
Scheduling of processes done by kernel
Concurrent Processes: Flows overlap in time; processes whose flows do not overlap are sequential
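The overlap of concurrent flows can be sketched with two threads updating shared state, where a lock provides the synchronization that interleaved execution requires (a minimal illustrative sketch):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    """One concurrent flow; its execution interleaves with the other thread's."""
    global counter
    for _ in range(increments):
        with lock:               # serialize the read-modify-write step
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()                    # flows now overlap in time
for t in threads:
    t.join()                     # counter is exactly 20_000 thanks to the lock
```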
Scaling & Speedup
Scalability
Scalability refers to the ability to handle increased workload by adding resources.
Types of Scaling
Strong Scaling: Fixed global problem size with increasing processors (number of cores)
Fix global problem size
Increase n (core count)
Weak Scaling: Increase problem size with processors (easier to accomplish)
Increase computational work proportional to the number of cores
As you increase the number of cores, the amount of work each core does stays roughly consistent
Parallel Speedup
Performance improvement from parallel execution is measured as:
S(n) = T1/Tn
Where:
- S(n): the speedup achieved on n processors (cores)
- T1: execution time on one processor
- Tn: execution time on n processors
"Perfect Scaling" occurs when S(n) = n
Parallel Efficiency
E(n) = S(n)/n
- E=1 is "Perfect Scaling"
- Runs from 0 to 1
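The two formulas translate directly into code (a minimal sketch; the function names are illustrative):

```python
def speedup(t1, tn):
    """S(n) = T1 / Tn: speedup of the n-processor run over the serial run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n; ranges from 0 to 1, with E = 1 being perfect scaling."""
    return speedup(t1, tn) / n

# A job taking 100 s serially and 25 s on 4 cores scales perfectly:
# speedup(100, 25) -> 4.0, efficiency(100, 25, 4) -> 1.0
```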
Amdahl's Law
Theoretical Speedup Limits on multiple processors:
S(n) = 1 / ((1 - p) + p/n)
Where:
p = portion of the program that can be parallelized
n = number of processors
1 - p = portion that cannot be parallelized
"The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used"
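Amdahl's Law as a function (a direct transcription of the formula above; the function name is illustrative):

```python
def amdahl_speedup(p, n):
    """Theoretical upper bound on speedup: parallel fraction p, n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with unlimited processors, a 90%-parallel program cannot exceed
# 10x speedup, because the serial 10% always remains: as n grows,
# amdahl_speedup(0.9, n) approaches 1 / (1 - 0.9) = 10.
```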
Parallel Architectures
Parallel Computing Architectures
Shared Memory: Computers rely on multiple processors to contact the same shared memory resource
Distributed Memory: Multiple processors with their own memory resources are linked over a network
Hybrid Memory: Combines shared memory computers on distributed memory networks; connected CPUs can access their local shared memory and exchange data with tasks assigned to other units on the same network
Multicore CPUs
Every core owns all of the data (visible to all cores), which allows for updating and viewing data in parallel.
Core synchronization is important for proper usage
Message passing (by contrast): a core asks other cores for data it doesn't own
Processors don't work in isolation; they have a memory hierarchy, and data must be moved to the processor before operations can be performed
In Parallel: Cache coherency is important. Synchronization between caches and main memory is required
[ HPC Topics ]
HPC.003
Memory Models
UMA and NUMA
Uniform Memory Access (UMA): Each core can read and write the entire memory space, and does so with roughly similar latency
Non-Uniform Memory Access (NUMA): Memory access time depends on the memory's location relative to the processor; there are affinities between groups of CPUs and certain subsets of memory
Both architectures are used in symmetric multiprocessing (SMP) systems - multiple processors share a common memory pool.
Shared vs. Distributed Memory
Shared Memory: Multiple processors access the same memory space
Distributed Memory: Each processor has its own private memory
Each core sees only its own piece of memory (shared-memory models such as OpenMP do not apply)
Processors need to know boundary information from neighboring processors
This is done through message passing; MPI is the most commonly used standard
Interior Process vs. Boundary Process
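The boundary exchange between neighboring processors can be sketched in a single process (a hypothetical simulation, not real MPI; in practice each slice would live on a different rank and the copies would be MPI sends and receives):

```python
def exchange_boundaries(slices):
    """Return (left_ghost, right_ghost) boundary values for each rank's slice.

    Simulates a halo exchange: each rank receives the edge value of its
    neighbors' local data; ranks at the ends have no neighbor on one side.
    """
    halos = []
    last = len(slices) - 1
    for rank in range(len(slices)):
        left = slices[rank - 1][-1] if rank > 0 else None     # "recv" from rank-1
        right = slices[rank + 1][0] if rank < last else None  # "recv" from rank+1
        halos.append((left, right))
    return halos
```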
RDMA Networking
Remote Direct Memory Access (RDMA) enables one networked computer to access another networked computer's memory without involving either computer's operating system or interrupting either computer's processing.
Helps minimize latency and maximize throughput
Reduces memory bandwidth bottlenecks
Emerging high-performance RDMA fabrics, including InfiniBand, Virtual Interface Architecture, and RDMA over Converged Ethernet (RoCE), make cloud-based HPC possible
Memory Hierarchy
Why Memory Hierarchy Matters
Memory is organized to minimize access time by optimizing the available memory in the computer. This organization follows the principle of locality of references: same data or nearby data is likely to be accessed repeatedly.
Memory Types and Organization
Each level in the hierarchy has its own size, cost, and performance characteristics:
Registers: Smallest, fastest memory units directly in the CPU (16-64 bits)
Cache Memory (L1, L2, L3): Small, fast memory close to the CPU for frequently accessed data
L1 Cache: Accessed in one cycle, very small but extremely fast
L2 Cache: Larger but slightly slower
L3 Cache: Shared among cores, larger and slower than L1/L2
Main Memory (RAM): Primary memory with larger capacity but slower access
Static RAM (SRAM): Stores binary information in flip flops, faster but more expensive
Dynamic RAM (DRAM): Stores binary info as charge on capacitors, higher density
External Storage: Disk drives, SSDs, and other persistent storage
Characteristics and Trade-offs
Capacity: Global volume of information the memory can store
Access Time: Time interval between the read/write request and availability of the data
Performance: Frequently accessed data is stored in faster memory
Cost per Bit: Cost increases as you move up the hierarchy (internal memory is costlier than external)
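Locality of reference can be illustrated with two traversals of the same 2-D data (an illustrative sketch; the effect on cache behavior applies most directly to contiguous row-major storage such as C arrays or NumPy buffers):

```python
# Both functions compute the same sum; only the memory-access pattern
# differs. With row-major storage, the row-order walk touches consecutive
# addresses and reuses cache lines, while the column-order walk strides
# across rows and causes more cache misses.
def sum_row_major(grid):
    total = 0
    for row in grid:                  # visit elements in storage order
        for value in row:
            total += value
    return total

def sum_col_major(grid):
    total = 0
    for col in range(len(grid[0])):   # stride across rows: poor locality
        for row in grid:
            total += row[col]
    return total
```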
[ Memory ]
HPC.004
Low-Precision Arithmetic
Floating Point Representation
Real numbers are represented in computers as floating-point numbers (a form of scientific notation), with three components:
Fraction (mantissa): the significant digits of the value
Exponent: the scale
Sign: positive or negative
Low-Precision Benefits
FP16 (Half-Precision):
16-bit floating-point format
Faster computation, less memory
Used in NVIDIA's Tensor Cores
INT8 Quantization:
8-bit integer format
Further memory reduction
Common in inference deployments
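INT8 quantization can be sketched with a simple symmetric scheme (an assumption for illustration; real deployments use several variants, and the function names are hypothetical):

```python
# Hypothetical symmetric INT8 quantization sketch: map the largest
# magnitude to 127 and store 8-bit integer codes plus one FP32 scale
# factor (4x smaller than storing FP32 values). Assumes at least one
# nonzero input value.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    codes = [round(v / scale) for v in values]    # integers in [-127, 127]
    return codes, scale

def dequantize_int8(codes, scale):
    # Recover approximate real values; error is bounded by about scale/2.
    return [c * scale for c in codes]
```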
Mixed Precision Training
Combining FP16 and FP32:
FP16 for most operations
FP32 for critical accumulations
Loss scaling to prevent underflow
This approach provides significant performance gains while maintaining numerical stability.
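The underflow problem and the loss-scaling fix can be demonstrated with Python's built-in half-precision codec (`struct` format `'e'`); this is an illustrative sketch of the idea, not a training loop:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through 16-bit (FP16) storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                       # below FP16's smallest subnormal
lost = to_fp16(tiny_grad)              # underflows to 0.0: gradient vanishes

scale = 1024.0                         # loss scale, applied in FP32
kept = to_fp16(tiny_grad * scale)      # now representable in FP16 (nonzero)
recovered = kept / scale               # unscale in FP32 before the update
```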
Performance
Performance Considerations
If performance is neglected, computations could run for extremely long periods. Understanding system capacity is crucial:
FLOPS: Floating point operations per second
Multiple CPUs make complex computations more manageable
Common Bottlenecks
Memory Bandwidth: The rate at which data can be read from or stored into memory
Compute Utilization: How efficiently computational resources are used
Data Loading: The time to load data from storage to memory
Communication Overhead: Time spent transferring data between nodes
Latency and Bandwidth
Latency: Delay before a transfer of data begins following an instruction
Bandwidth: How much you can transfer per unit of time
When training neural networks, bandwidth (for weights) is typically more important than latency.
Hiding Latency
Strategies for managing latency include:
Prefetching: Request data before it's needed
Concurrency: Use pipelining to switch between computations as data is coming in
Note that bandwidth is fundamental and can't be "hidden" - it represents a hard limit on system performance.
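Prefetching can be sketched with a background worker that loads the next item while the current one is being processed (a hypothetical `process_with_prefetch` helper; `load` and `compute` stand in for real I/O and computation):

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_prefetch(load, keys, compute):
    """Load keys[i+1] in the background while computing on keys[i]."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load, keys[0])        # prefetch the first item
        for nxt in list(keys[1:]) + [None]:
            item = future.result()                 # wait only if not ready yet
            if nxt is not None:
                future = pool.submit(load, nxt)    # request the next item early
            results.append(compute(item))          # overlap compute with loading
    return results
```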