PARALLEL PROGRAMMING

A crash course on the fundamentals of
concurrent and parallel execution.


PAR.001

Have you ever wondered why modern processors have multiple cores? Or why you need a GPU for your video games?


The answer is simple:

Physical limitations have made it increasingly difficult to make individual cores faster.

Programmers must therefore design software to take advantage of parallel execution.

[Figure: timeline comparing sequential execution (tasks A, B, C one after another) against parallel execution of the same tasks across four cores, ending at a sync point.]
[ 1.0 INTRO ]


ALG.001

Sequential vs. Parallel Programming

Sequential programming:

One sequence of program instructions where each instruction can only start after the previous instructions have finished.

Parallel programming:

Multiple sequences of instructions run in parallel, with synchronization points that introduce dependencies across sequences.
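As a concrete sketch in Java (splitting one loop across two threads; the join calls are the synchronization points that reunite the sequences):

```java
// Two instruction sequences run in parallel, then meet at a sync point.
public class ParallelDemo {
    static long sumRange(long from, long to) { // one sequential task
        long s = 0;
        for (long i = from; i < to; i++) s += i;
        return s;
    }

    static long parallelSum(long n) {
        long[] partial = new long[2];
        // Two sequences of instructions execute in parallel...
        Thread t1 = new Thread(() -> partial[0] = sumRange(0, n / 2));
        Thread t2 = new Thread(() -> partial[1] = sumRange(n / 2, n));
        t1.start();
        t2.start();
        try {
            // ...until a synchronization point introduces a dependency:
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return partial[0] + partial[1];
    }

    public static void main(String[] args) {
        // Same result as the purely sequential sumRange(0, n)
        System.out.println(parallelSum(1_000_000));
    }
}
```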

Core Concepts

Concurrency vs. Parallelism

These terms are often confused but have distinct meanings:


Concurrency:
Dealing with tasks that are logically happening simultaneously. Two tasks are concurrent if they can be logically active at some point in time.

Parallelism:
Dealing with tasks that are physically happening simultaneously on multiple processor cores.


Concurrency provides a way to structure a solution whose tasks may (but need not) be executed in parallel.

[ 2.1 D&C ]
PAR.002

Parallel Hardware Architecture

Modern computers achieve parallelism through multiple layers of hardware:

Multi-core Processors

Multiple processing units on a single chip that can execute instructions independently.

Each core has:

  • Its own set of registers
  • L1 and often L2 cache
  • Shared L3 cache and memory bandwidth with other cores

Hardware Threads

Some architectures (like Intel's Hyperthreading) allow a single physical core to appear as multiple logical cores by:


  • Maintaining separate register states for multiple threads
  • Switching between them when one thread stalls (e.g., on a cache miss)

This typically yields a 20-30% throughput improvement over running a single thread per core.
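In Java, the number of logical cores (physical cores times hardware threads per core) can be queried at runtime; note that on a hyperthreaded machine this reports logical processors, not physical cores:

```java
// Query the number of logical cores (hardware threads) visible to the JVM.
public class CoreCount {
    static int logicalCores() {
        // On hyperthreaded CPUs this counts logical, not physical, cores.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical cores: " + logicalCores());
    }
}
```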

Principles of Mutual Exclusion

When multiple threads access shared resources, we need to implement protection mechanisms:

Critical Sections

A block of code in which multiple threads might access shared resources. Any mechanism protecting a critical section must provide:

  • Mutual Exclusion: Only one thread executes in the critical section at a time
  • Progress: If multiple threads request entry, one must be allowed to proceed
  • Bounded Waiting: A thread cannot be indefinitely prevented from entering its critical section

Synchronization Primitives

Tools for implementing mutual exclusion:

  • Locks/Mutexes: Protect critical sections by allowing only one thread to hold the lock
  • Semaphores: Counters that admit up to N threads at once ("locks with capacity > 1")
  • Atomic Operations: Hardware-supported indivisible operations
  • Condition Variables: Allow threads to wait for specific conditions
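A minimal sketch of a lock protecting a critical section, using Java's ReentrantLock (any mutex would do). Without the lock, the read-modify-write on the shared value could interleave and lose updates:

```java
import java.util.concurrent.locks.ReentrantLock;

// A mutex guarding a critical section: a read-modify-write on shared state.
public class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value = 0;

    void increment() {
        lock.lock();       // mutual exclusion: one thread at a time
        try {
            value++;       // critical section
        } finally {
            lock.unlock(); // always release, even if an exception is thrown
        }
    }

    int get() {
        lock.lock();
        try { return value; } finally { lock.unlock(); }
    }

    static int run(int threads, int perThread) {
        LockedCounter c = new LockedCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c.get();
    }

    public static void main(String[] args) {
        // Every increment survives: no lost updates
        System.out.println(run(4, 100_000));
    }
}
```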

Concurrent Data Structures

Specialized data structures designed to handle concurrent access:

Concurrency Approaches

  • Coarse-grained locking: One lock for the entire data structure
  • Fine-grained locking: Multiple locks for different parts of the structure
  • Lock-free: Using atomic operations to avoid locks completely
  • Wait-free: Guarantees that all operations complete in a bounded number of steps

Common Concurrent Data Structures

  • Concurrent Hash Maps
  • Concurrent Queues
  • Concurrent Linked Lists
  • Thread-safe Stacks

// Example: Thread-safe Counter (Java)
import java.util.concurrent.atomic.AtomicInteger;

class AtomicCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    int increment() {
        return value.incrementAndGet(); // atomic read-modify-write
    }

    int get() {
        return value.get(); // atomic read
    }
}
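Exercising such a counter from many threads shows why atomicity matters; a sketch using Java's AtomicInteger, the standard-library primitive a counter like this would be built on:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Many threads hammer one shared counter; atomic increments prevent lost updates.
public class AtomicStress {
    static int run(int threads, int perThread) {
        AtomicInteger value = new AtomicInteger(0);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) value.incrementAndGet();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // With a plain int and value++, some increments would be lost.
        return value.get();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 100_000));
    }
}
```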

Concurrent Design Patterns

Thread Pools

Reuse a fixed number of threads to execute multiple tasks, avoiding the overhead of thread creation:

// Thread pool via java.util.concurrent's ExecutorService
ExecutorService pool = Executors.newFixedThreadPool(4); // 4 worker threads
pool.submit(task1);
pool.submit(task2);
// Tasks execute concurrently on available worker threads
pool.shutdown(); // accept no new tasks; workers drain the queue, then exit

Fork-Join Pattern

Recursively breaks down a problem into sub-problems that are solved in parallel:

Result solve(Problem p) {
    if (p is small)
        return solveDirectly(p);
    else {
        split p into subproblems p1, p2
        fork task to solve(p1)
        fork task to solve(p2)
        join results
        return combined result
    }
}
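Java's fork-join framework implements this pattern directly; a sketch that sums an array with a RecursiveTask (the threshold for "small" is an arbitrary choice here):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork-join sum: split until small, solve directly, then join and combine.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // "small enough" cutoff (arbitrary)
    private final long[] data;
    private final int from, to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {      // small problem: solve directly
            long s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
        int mid = (from + to) / 2;         // split into subproblems
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                       // solve left half asynchronously
        long r = right.compute();          // solve right half in this thread
        return r + left.join();            // join results and combine
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] xs = new long[10_000];
        for (int i = 0; i < xs.length; i++) xs[i] = i;
        System.out.println(sum(xs)); // 49995000
    }
}
```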

Producer-Consumer Pattern

Separates the tasks of producing and consuming data with a shared buffer:

  • Producer threads generate data and add to a shared queue
  • Consumer threads take items from the queue and process them
  • Queue handles synchronization between producers and consumers
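The three bullets above map directly onto Java's BlockingQueue; a sketch with one producer, one consumer, and a poison-pill value to signal completion (the -1 sentinel is an arbitrary convention here):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer-consumer: a bounded queue handles all the synchronization.
public class ProducerConsumer {
    static long run(int items) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16); // shared buffer
        long[] consumedSum = new long[1];

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= items; i++) queue.put(i); // blocks when full
                queue.put(-1);                                 // poison pill: "done"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    int item = queue.take();  // blocks when empty
                    if (item == -1) break;    // stop on the poison pill
                    consumedSum[0] += item;   // "process" the item
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumedSum[0];
    }

    public static void main(String[] args) {
        System.out.println(run(100)); // 5050: every produced item was consumed
    }
}
```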
[ 2.0 IMPL ]
PAR.003

Performance Considerations

Amdahl's Law

Describes the theoretical speedup of a fixed-size workload when part of it is parallelized across n processors:

Speedup = 1 / ((1 - p) + p/n)

Where:
  p     = fraction of the program that can be parallelized
  n     = number of processors
  1 - p = fraction that cannot be parallelized

This formula gives a ballpark estimate of the speedup we can expect. The key insight: if a significant portion of your program cannot be parallelized, adding more processors yields diminishing returns.
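Plugging numbers into the formula makes the diminishing returns concrete; a small sketch (the choice of p and the processor counts are arbitrary):

```java
// Amdahl's law: speedup = 1 / ((1 - p) + p / n)
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even 95% parallel code caps out: the limit as n grows is 1 / 0.05 = 20x,
        // and 64 processors deliver only about 15.4x.
        for (int n : new int[] {1, 4, 16, 64}) {
            System.out.printf("p=0.95, n=%d -> %.2fx%n", n, speedup(0.95, n));
        }
    }
}
```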

Common Performance Bottlenecks

  • Lock Contention: Multiple threads waiting to acquire the same lock
  • False Sharing: Cache line invalidation due to unrelated data sharing the same cache line
  • Memory Bandwidth: Multiple cores competing for memory access
  • Load Imbalance: Uneven distribution of work across threads

Debugging Parallel Programs

Parallel bugs are notoriously difficult to find and fix because they:

  • May only manifest under specific timing conditions
  • Are often not reproducible on demand
  • Can change behavior when debugging tools are applied

Common Parallel Bugs

  • Race Conditions: Outcome depends on the sequence or timing of uncontrollable events
  • Deadlocks: Two or more threads each waiting for the other to release a resource
  • Livelocks: Threads constantly change their state in response to other threads without making progress
  • Thread Leaks: Threads created but never properly terminated
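Deadlock requires a cycle of threads waiting on each other; the standard defense is to acquire locks in one global order so no cycle can form. A minimal sketch (the two locks and the transfer method are illustrative stand-ins):

```java
// Avoiding deadlock by always acquiring locks in the same global order.
public class LockOrdering {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    // Every thread takes LOCK_A before LOCK_B, so no circular wait can form.
    // (If one thread took B-then-A instead, the two could deadlock.)
    static void transfer() {
        synchronized (LOCK_A) {
            synchronized (LOCK_B) {
                // critical section touching both protected resources
            }
        }
    }

    static boolean run() {
        Thread t1 = new Thread(LockOrdering::transfer);
        Thread t2 = new Thread(LockOrdering::transfer);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true; // reached only because neither thread deadlocked
    }

    public static void main(String[] args) {
        System.out.println(run() ? "completed without deadlock" : "");
    }
}
```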
[Figure: Amdahl's law speedup vs. number of processors (1 to 64) for programs that are 50%, 75%, 90%, and 95% parallel; even the 95% curve levels off near 20x.]
[ 3.0 PERF ]
PAR.004

Conclusion

Parallel programming presents both challenges and opportunities. The transition from sequential to parallel thinking requires:

  • Understanding hardware architecture and its limitations
  • Designing algorithms with parallelism in mind from the beginning
  • Using appropriate synchronization mechanisms to ensure correctness
  • Testing thoroughly for race conditions and other parallel bugs
  • Measuring performance to verify that parallelism is providing the expected benefits

As we continue to approach the physical limits of processor speed, the importance of effective parallel programming will only increase. The fundamental challenge remains: how to divide problems into independent tasks that can be executed simultaneously without excessive communication or synchronization overhead.

Further Reading

  • "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit
  • "Parallel Programming: Concepts and Practice" by Bertil Schmidt et al.
  • "Programming with POSIX Threads" by David R. Butenhof

Practical Implementation Tips

Think Parallel from the Start

Adding parallelism to an existing sequential program is often more difficult than designing for parallelism from the beginning.

Minimize Shared State

The less data shared between threads, the fewer synchronization points required and the more scalable your solution will be.
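One common way to apply this: give each thread its own accumulator and merge only at the end, so the threads share nothing while they work. A sketch:

```java
// Each thread accumulates into its own slot; results merge only at the end.
public class PartialSums {
    static long sum(long[] data, int threads) {
        long[] partial = new long[threads]; // one slot per thread, never contended
        Thread[] ts = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;
        for (int i = 0; i < threads; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, data.length);
                long s = 0;
                for (int j = from; j < to; j++) s += data[j]; // purely local work
                partial[id] = s; // one write per thread; no lock needed
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long total = 0; // the merge is the only step that touches all results
        for (long s : partial) total += s;
        return total;
    }

    public static void main(String[] args) {
        long[] xs = new long[1_000];
        for (int i = 0; i < xs.length; i++) xs[i] = i + 1;
        System.out.println(sum(xs, 4)); // 500500
    }
}
```

Note that the adjacent slots of the partial array may still sit on one cache line, so this sketch can exhibit the false sharing described earlier; padding the slots apart fixes that at the cost of clarity.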

Use High-Level Abstractions

Whenever possible, use well-tested libraries and frameworks that handle the low-level details of parallelism for you:

  • Thread pools and task schedulers
  • Concurrent collections
  • Parallel algorithms
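For example, Java's parallel streams hide the thread pool, the work splitting, and the result combination entirely:

```java
import java.util.stream.LongStream;

// A parallel reduction in one line: the framework splits the range,
// schedules chunks onto the common fork-join pool, and combines results.
public class ParallelStreamSum {
    static long sum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000)); // 500000500000
    }
}
```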

Test on Multiple Cores

Parallel bugs may only appear when actually running on multiple cores. Be sure to test on machines with varying numbers of cores.

Measure, Don't Assume

Always measure the performance of your parallel code. Sometimes, the overhead of parallelism can outweigh the benefits for small workloads.
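A minimal timing harness for such comparisons (deliberately naive: no JIT warm-up or repeated runs, which a real benchmark tool like JMH would handle):

```java
import java.util.stream.LongStream;

// Measure before trusting parallelism: time both versions on your workload.
public class Benchmark {
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long n = 10_000_000;
        long seq = timeMillis(() -> LongStream.rangeClosed(1, n).sum());
        long par = timeMillis(() -> LongStream.rangeClosed(1, n).parallel().sum());
        // For small workloads, the parallel version is often *slower*
        // because of task-splitting and scheduling overhead.
        System.out.println("sequential: " + seq + " ms, parallel: " + par + " ms");
    }
}
```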

// Example: Good practice for thread safety
// Immutable objects are inherently thread-safe
class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() { return x; }
    public int getY() { return y; }

    // Creates a new object rather than modifying
    public Point translate(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}
[ 4.0 TIPS ]