PARALLEL PROGRAMMING

A crash course on the fundamentals of
concurrent and parallel execution.


PAR.001

Have you ever wondered why modern processors have multiple cores? Or why you need a GPU for your video games?


The answer is simple:

Physical limitations have made it increasingly difficult to make individual cores faster.

Programmers must therefore design software to take advantage of parallel execution.

[Figure: timeline comparing sequential execution (tasks A, B, C one after another) against parallel execution of the same tasks across four cores, ending at a sync point.]
[ 1.0 INTRO ]


ALG.001

Sequential vs. Parallel Programming

Sequential programming:

One sequence of program instructions where each instruction can only start after the previous instructions have finished.

Parallel programming:

Multiple sequences of instructions run in parallel, with synchronization points that introduce dependencies across sequences.
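As a concrete sketch in Java (splitting one loop across two threads; the join calls are the synchronization points that reunite the sequences):

```java
// Two instruction sequences run in parallel, then meet at a sync point.
public class ParallelDemo {
    static long sumRange(long from, long to) { // one sequential task
        long s = 0;
        for (long i = from; i < to; i++) s += i;
        return s;
    }

    static long parallelSum(long n) {
        long[] partial = new long[2];
        // Two sequences of instructions execute in parallel...
        Thread t1 = new Thread(() -> partial[0] = sumRange(0, n / 2));
        Thread t2 = new Thread(() -> partial[1] = sumRange(n / 2, n));
        t1.start();
        t2.start();
        try {
            // ...until a synchronization point introduces a dependency:
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return partial[0] + partial[1];
    }

    public static void main(String[] args) {
        // Same result as the purely sequential sumRange(0, n)
        System.out.println(parallelSum(1_000_000));
    }
}
```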

Core Concepts

Concurrency vs. Parallelism

These terms are often confused but have distinct meanings:


Concurrency:
Dealing with tasks that are logically happening simultaneously. Two tasks are concurrent if they can be logically active at some point in time.

Parallelism:
Dealing with tasks that are physically happening simultaneously on multiple processor cores.


Concurrency provides a way to structure a solution whose tasks may (but need not) be executed in parallel.

[ 2.1 D&C ]
PAR.002

Parallel Hardware Architecture

Modern computers achieve parallelism through multiple layers of hardware:

Multi-core Processors

Multiple processing units on a single chip that can execute instructions independently.

Each core has:

  • Its own set of registers
  • L1 and often L2 cache
  • Shared L3 cache and memory bandwidth with other cores

Hardware Threads

Some architectures (like Intel's Hyperthreading) allow a single physical core to appear as multiple logical cores by:


  • Maintaining separate register states for multiple threads
  • Switching between them when one thread stalls (e.g., on a cache miss)

This typically yields a 20-30% throughput improvement over running a single thread per core.
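In Java, the number of logical cores (physical cores times hardware threads per core) can be queried at runtime; note that on a hyperthreaded machine this reports logical processors, not physical cores:

```java
// Query the number of logical cores (hardware threads) visible to the JVM.
public class CoreCount {
    static int logicalCores() {
        // On hyperthreaded CPUs this counts logical, not physical, cores.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical cores: " + logicalCores());
    }
}
```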

Principles of Mutual Exclusion

When multiple threads access shared resources, we need to implement protection mechanisms:

Critical Sections

A block of code in which multiple threads might access shared resources. Any mechanism protecting a critical section must provide:

  • Mutual Exclusion: Only one thread executes in the critical section at a time
  • Progress: If multiple threads request entry, one must be allowed to proceed
  • Bounded Waiting: A thread cannot be indefinitely prevented from entering its critical section

Synchronization Primitives

Tools for implementing mutual exclusion:

  • Locks/Mutexes: Protect critical sections by allowing only one thread to hold the lock
  • Semaphores: Counters that admit up to N threads at once ("locks with capacity > 1")
  • Atomic Operations: Hardware-supported indivisible operations
  • Condition Variables: Allow threads to wait for specific conditions
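A minimal sketch of a lock protecting a critical section, using Java's ReentrantLock (any mutex would do). Without the lock, the read-modify-write on the shared value could interleave and lose updates:

```java
import java.util.concurrent.locks.ReentrantLock;

// A mutex guarding a critical section: a read-modify-write on shared state.
public class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value = 0;

    void increment() {
        lock.lock();       // mutual exclusion: one thread at a time
        try {
            value++;       // critical section
        } finally {
            lock.unlock(); // always release, even if an exception is thrown
        }
    }

    int get() {
        lock.lock();
        try { return value; } finally { lock.unlock(); }
    }

    static int run(int threads, int perThread) {
        LockedCounter c = new LockedCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c.get();
    }

    public static void main(String[] args) {
        // Every increment survives: no lost updates
        System.out.println(run(4, 100_000));
    }
}
```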

Concurrent Data Structures

Specialized data structures designed to handle concurrent access:

Concurrency Approaches

  • Coarse-grained locking: One lock for the entire data structure
  • Fine-grained locking: Multiple locks for different parts of the structure
  • Lock-free: Using atomic operations to avoid locks completely
  • Wait-free: Guarantees that all operations complete in a bounded number of steps

Common Concurrent Data Structures

  • Concurrent Hash Maps
  • Concurrent Queues
  • Concurrent Linked Lists
  • Thread-safe Stacks

// Example: Thread-safe Counter (Java)
import java.util.concurrent.atomic.AtomicInteger;

class AtomicCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    int increment() {
        return value.incrementAndGet(); // atomic read-modify-write
    }

    int get() {
        return value.get(); // atomic read
    }
}
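Exercising such a counter from many threads shows why atomicity matters; a sketch using Java's AtomicInteger, the standard-library primitive a counter like this would be built on:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Many threads hammer one shared counter; atomic increments prevent lost updates.
public class AtomicStress {
    static int run(int threads, int perThread) {
        AtomicInteger value = new AtomicInteger(0);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) value.incrementAndGet();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // With a plain int and value++, some increments would be lost.
        return value.get();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 100_000));
    }
}
```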

Concurrent Design Patterns

Thread Pools

Reuse a fixed number of threads to execute multiple tasks, avoiding the overhead of thread creation:

// Thread pool via java.util.concurrent's ExecutorService
ExecutorService pool = Executors.newFixedThreadPool(4); // 4 worker threads
pool.submit(task1);
pool.submit(task2);
// Tasks execute concurrently on available worker threads
pool.shutdown(); // accept no new tasks; workers drain the queue, then exit

Fork-Join Pattern

Recursively breaks down a problem into sub-problems that are solved in parallel:

Result solve(Problem p) {
    if (p is small)
        return solveDirectly(p);
    else {
        split p into subproblems p1, p2
        fork task to solve(p1)
        fork task to solve(p2)
        join results
        return combined result
    }
}
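Java's fork-join framework implements this pattern directly; a sketch that sums an array with a RecursiveTask (the threshold for "small" is an arbitrary choice here):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork-join sum: split until small, solve directly, then join and combine.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // "small enough" cutoff (arbitrary)
    private final long[] data;
    private final int from, to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {      // small problem: solve directly
            long s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
        int mid = (from + to) / 2;         // split into subproblems
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                       // solve left half asynchronously
        long r = right.compute();          // solve right half in this thread
        return r + left.join();            // join results and combine
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] xs = new long[10_000];
        for (int i = 0; i < xs.length; i++) xs[i] = i;
        System.out.println(sum(xs)); // 49995000
    }
}
```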

Producer-Consumer Pattern

Separates the tasks of producing and consuming data with a shared buffer:

  • Producer threads generate data and add to a shared queue
  • Consumer threads take items from the queue and process them
  • Queue handles synchronization between producers and consumers
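The three bullets above map directly onto Java's BlockingQueue; a sketch with one producer, one consumer, and a poison-pill value to signal completion (the -1 sentinel is an arbitrary convention here):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer-consumer: a bounded queue handles all the synchronization.
public class ProducerConsumer {
    static long run(int items) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16); // shared buffer
        long[] consumedSum = new long[1];

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= items; i++) queue.put(i); // blocks when full
                queue.put(-1);                                 // poison pill: "done"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    int item = queue.take();  // blocks when empty
                    if (item == -1) break;    // stop on the poison pill
                    consumedSum[0] += item;   // "process" the item
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumedSum[0];
    }

    public static void main(String[] args) {
        System.out.println(run(100)); // 5050: every produced item was consumed
    }
}
```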
[ 2.0 IMPL ]
PAR.003

Performance Considerations

Amdahl's Law

Describes the theoretical speedup of a fixed-size workload when part of it is parallelized across n processors:

Speedup = 1 / ((1 - p) + p/n)

Where:
  p     = fraction of the program that can be parallelized
  n     = number of processors
  1 - p = fraction that cannot be parallelized

This formula gives a ballpark estimate of the speedup we can expect. The key insight: if a significant portion of your program cannot be parallelized, adding more processors yields diminishing returns.
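Plugging numbers into the formula makes the diminishing returns concrete; a small sketch (the choice of p and the processor counts are arbitrary):

```java
// Amdahl's law: speedup = 1 / ((1 - p) + p / n)
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even 95% parallel code caps out: the limit as n grows is 1 / 0.05 = 20x,
        // and 64 processors deliver only about 15.4x.
        for (int n : new int[] {1, 4, 16, 64}) {
            System.out.printf("p=0.95, n=%d -> %.2fx%n", n, speedup(0.95, n));
        }
    }
}
```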

Common Performance Bottlenecks

  • Lock Contention: Multiple threads waiting to acquire the same lock
  • False Sharing: Cache line invalidation due to unrelated data sharing the same cache line
  • Memory Bandwidth: Multiple cores competing for memory access
  • Load Imbalance: Uneven distribution of work across threads

Debugging Parallel Programs

Parallel bugs are notoriously difficult to find and fix because they:

  • May only manifest under specific timing conditions
  • Are often not reproducible on demand
  • Can change behavior when debugging tools are applied

Common Parallel Bugs

  • Race Conditions: Outcome depends on the sequence or timing of uncontrollable events
  • Deadlocks: Two or more threads each waiting for the other to release a resource
  • Livelocks: Threads constantly change their state in response to other threads without making progress
  • Thread Leaks: Threads created but never properly terminated
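Deadlock requires a cycle of threads waiting on each other; the standard defense is to acquire locks in one global order so no cycle can form. A minimal sketch (the two locks and the transfer method are illustrative stand-ins):

```java
// Avoiding deadlock by always acquiring locks in the same global order.
public class LockOrdering {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    // Every thread takes LOCK_A before LOCK_B, so no circular wait can form.
    // (If one thread took B-then-A instead, the two could deadlock.)
    static void transfer() {
        synchronized (LOCK_A) {
            synchronized (LOCK_B) {
                // critical section touching both protected resources
            }
        }
    }

    static boolean run() {
        Thread t1 = new Thread(LockOrdering::transfer);
        Thread t2 = new Thread(LockOrdering::transfer);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true; // reached only because neither thread deadlocked
    }

    public static void main(String[] args) {
        System.out.println(run() ? "completed without deadlock" : "");
    }
}
```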
[Figure: Amdahl's law speedup vs. number of processors (1 to 64) for programs that are 50%, 75%, 90%, and 95% parallel; even the 95% curve levels off near 20x.]
[ 3.0 PERF ]
PAR.004

Conclusion

Parallel programming presents both challenges and opportunities. The transition from sequential to parallel thinking requires:

  • Understanding hardware architecture and its limitations
  • Designing algorithms with parallelism in mind from the beginning
  • Using appropriate synchronization mechanisms to ensure correctness
  • Testing thoroughly for race conditions and other parallel bugs
  • Measuring performance to verify that parallelism is providing the expected benefits

As we continue to approach the physical limits of processor speed, the importance of effective parallel programming will only increase. The fundamental challenge remains: how to divide problems into independent tasks that can be executed simultaneously without excessive communication or synchronization overhead.

Further Reading

  • "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit
  • "Parallel Programming: Concepts and Practice" by Bertil Schmidt et al.
  • "Programming with POSIX Threads" by David R. Butenhof

Practical Implementation Tips

Think Parallel from the Start

Adding parallelism to an existing sequential program is often more difficult than designing for parallelism from the beginning.

Minimize Shared State

The less data shared between threads, the fewer synchronization points required and the more scalable your solution will be.
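One common way to apply this: give each thread its own accumulator and merge only at the end, so the threads share nothing while they work. A sketch:

```java
// Each thread accumulates into its own slot; results merge only at the end.
public class PartialSums {
    static long sum(long[] data, int threads) {
        long[] partial = new long[threads]; // one slot per thread, never contended
        Thread[] ts = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;
        for (int i = 0; i < threads; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, data.length);
                long s = 0;
                for (int j = from; j < to; j++) s += data[j]; // purely local work
                partial[id] = s; // one write per thread; no lock needed
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long total = 0; // the merge is the only step that touches all results
        for (long s : partial) total += s;
        return total;
    }

    public static void main(String[] args) {
        long[] xs = new long[1_000];
        for (int i = 0; i < xs.length; i++) xs[i] = i + 1;
        System.out.println(sum(xs, 4)); // 500500
    }
}
```

Note that the adjacent slots of the partial array may still sit on one cache line, so this sketch can exhibit the false sharing described earlier; padding the slots apart fixes that at the cost of clarity.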

Use High-Level Abstractions

Whenever possible, use well-tested libraries and frameworks that handle the low-level details of parallelism for you:

  • Thread pools and task schedulers
  • Concurrent collections
  • Parallel algorithms
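For example, Java's parallel streams hide the thread pool, the work splitting, and the result combination entirely:

```java
import java.util.stream.LongStream;

// A parallel reduction in one line: the framework splits the range,
// schedules chunks onto the common fork-join pool, and combines results.
public class ParallelStreamSum {
    static long sum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000)); // 500000500000
    }
}
```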

Test on Multiple Cores

Parallel bugs may only appear when actually running on multiple cores. Be sure to test on machines with varying numbers of cores.

Measure, Don't Assume

Always measure the performance of your parallel code. Sometimes, the overhead of parallelism can outweigh the benefits for small workloads.
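A minimal timing harness for such comparisons (deliberately naive: no JIT warm-up or repeated runs, which a real benchmark tool like JMH would handle):

```java
import java.util.stream.LongStream;

// Measure before trusting parallelism: time both versions on your workload.
public class Benchmark {
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long n = 10_000_000;
        long seq = timeMillis(() -> LongStream.rangeClosed(1, n).sum());
        long par = timeMillis(() -> LongStream.rangeClosed(1, n).parallel().sum());
        // For small workloads, the parallel version is often *slower*
        // because of task-splitting and scheduling overhead.
        System.out.println("sequential: " + seq + " ms, parallel: " + par + " ms");
    }
}
```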

// Example: Good practice for thread safety
// Immutable objects are inherently thread-safe
class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() { return x; }
    public int getY() { return y; }

    // Creates a new object rather than modifying
    public Point translate(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}
[ 4.0 TIPS ]