
Computer Software and Information Technology Engineering (CSE / IT)

Multi-Core Architectures and Programming - CS6801


An Introduction to Parallel Programming by Peter S. Pacheco

Chapter 1 Why Parallel Computing?


-:- Why Parallel Computing?
-:- Why We Need Ever-Increasing Performance
-:- Why We’re Building Parallel Systems
-:- Why We Need to Write Parallel Programs
-:- How Do We Write Parallel Programs?
-:- Concurrent, Parallel, Distributed

Chapter 2 Parallel Hardware and Parallel Software


-:- Parallel Hardware and Parallel Software
-:- Some Background: The von Neumann Architecture; Processes, Multitasking, and Threads
-:- Modifications to the von Neumann Model
-:- Parallel Hardware
-:- Parallel Software
-:- Input and Output
-:- Performance of Parallel Programming
-:- Parallel Program Design with example
-:- Writing and Running Parallel Programs
-:- Assumptions - Parallel Programming

Chapter 3 Distributed Memory Programming with MPI


-:- Distributed-Memory Programming with MPI
-:- The Trapezoidal Rule in MPI
-:- Dealing with I/O
-:- Collective Communication
-:- MPI Derived Datatypes
-:- Performance Evaluation of MPI Programs
-:- A Parallel Sorting Algorithm

Chapter 4 Shared Memory Programming with Pthreads


-:- Shared-Memory Programming with Pthreads
-:- Processes, Threads, and Pthreads
-:- Pthreads - Hello, World Program
-:- Matrix-Vector Multiplication
-:- Critical Sections
-:- Busy-Waiting
-:- Mutexes
-:- Producer-Consumer Synchronization and Semaphores
-:- Barriers and Condition Variables
-:- Read-Write Locks
-:- Caches, Cache Coherence, and False Sharing
-:- Thread-Safety
-:- Shared-Memory Programming with OpenMP
-:- The Trapezoidal Rule
-:- Scope of Variables
-:- The Reduction Clause
-:- The parallel for Directive
-:- More About Loops in OpenMP: Sorting
-:- Scheduling Loops
-:- Producers and Consumers
-:- Caches, Cache Coherence, and False Sharing
-:- Thread-Safety
-:- Parallel Program Development
-:- Two n-Body Solvers
-:- Parallelizing the basic solver using OpenMP
-:- Parallelizing the reduced solver using OpenMP
-:- Evaluating the OpenMP codes
-:- Parallelizing the solvers using pthreads
-:- Parallelizing the basic solver using MPI
-:- Parallelizing the reduced solver using MPI
-:- Performance of the MPI solvers
-:- Tree Search
-:- Recursive depth-first search
-:- Nonrecursive depth-first search
-:- Data structures for the serial implementations
-:- Performance of the serial implementations
-:- Parallelizing tree search
-:- A static parallelization of tree search using pthreads
-:- A dynamic parallelization of tree search using pthreads
-:- Evaluating the Pthreads tree-search programs
-:- Parallelizing the tree-search programs using OpenMP
-:- Performance of the OpenMP implementations
-:- Implementation of tree search using MPI and static partitioning
-:- Implementation of tree search using MPI and dynamic partitioning
-:- Which API?

Multicore Application Programming: For Windows, Linux, and Oracle Solaris by Darryl Gove

Chapter 1 Hardware, Processes, and Threads


-:- Hardware, Processes, and Threads
-:- Examining the Insides of a Computer
-:- The Motivation for Multicore Processors
-:- Supporting Multiple Threads on a Single Chip
-:- Increasing Instruction Issue Rate with Pipelined Processor Cores
-:- Using Caches to Hold Recently Used Data
-:- Using Virtual Memory to Store Data
-:- Translating from Virtual Addresses to Physical Addresses
-:- The Characteristics of Multiprocessor Systems
-:- How Latency and Bandwidth Impact Performance
-:- The Translation of Source Code to Assembly Language
-:- The Performance of 32-Bit versus 64-Bit Code
-:- Ensuring the Correct Order of Memory Operations
-:- The Differences Between Processes and Threads

Chapter 2 Coding for Performance


-:- Coding for Performance
-:- Defining Performance
-:- Understanding Algorithmic Complexity
-:- Why Algorithmic Complexity Is Important
-:- Using Algorithmic Complexity with Care
-:- How Structure Impacts Performance
-:- Performance and Convenience Trade-Offs in Source Code and Build Structures
-:- Using Libraries to Structure Applications
-:- The Impact of Data Structures on Performance
-:- The Role of the Compiler
-:- The Two Types of Compiler Optimization
-:- Selecting Appropriate Compiler Options
-:- How Cross-File Optimization Can Be Used to Improve Performance
-:- Using Profile Feedback
-:- How Potential Pointer Aliasing Can Inhibit Compiler Optimizations
-:- Identifying Where Time Is Spent Using Profiling
-:- Commonly Available Profiling Tools
-:- How Not to Optimize
-:- Performance by Design

Chapter 3 Identifying Opportunities for Parallelism


-:- Identifying Opportunities for Parallelism
-:- Using Multiple Processes to Improve System Productivity
-:- Multiple Users Utilizing a Single System
-:- Improving Machine Efficiency Through Consolidation
-:- Using Containers to Isolate Applications Sharing a Single System
-:- Hosting Multiple Operating Systems Using Hypervisors
-:- Using Parallelism to Improve the Performance of a Single Task
-:- One Approach to Visualizing Parallel Applications
-:- How Parallelism Can Change the Choice of Algorithms
-:- Amdahl’s Law
-:- Determining the Maximum Practical Threads
-:- How Synchronization Costs Reduce Scaling
-:- Parallelization Patterns
-:- Data Parallelism Using SIMD Instructions
-:- Parallelization Using Processes or Threads
-:- Multiple Independent Tasks
-:- Multiple Loosely Coupled Tasks
-:- Multiple Copies of the Same Task
-:- Single Task Split Over Multiple Threads
-:- Using a Pipeline of Tasks to Work on a Single Item
-:- Division of Work into a Client and a Server
-:- Splitting Responsibility into a Producer and a Consumer
-:- Combining Parallelization Strategies
-:- How Dependencies Influence the Ability to Run Code in Parallel
-:- Antidependencies and Output Dependencies
-:- Using Speculation to Break Dependencies
-:- Critical Paths
-:- Identifying Parallelization Opportunities

Chapter 4 Synchronization and Data Sharing


-:- Synchronization and Data Sharing
-:- Data Races
-:- Using Tools to Detect Data Races
-:- Avoiding Data Races
-:- Synchronization Primitives
-:- Mutexes and Critical Regions
-:- Spin Locks
-:- Semaphores
-:- Readers-Writer Locks
-:- Barriers
-:- Atomic Operations and Lock-Free Code
-:- Deadlocks and Livelocks
-:- Communication Between Threads and Processes
-:- Storing Thread-Private Data

Chapter 5 Using POSIX Threads


-:- Using POSIX Threads
-:- Creating Threads
-:- Compiling Multithreaded Code
-:- Process Termination
-:- Sharing Data Between Threads
-:- Variables and Memory
-:- Multiprocess Programming
-:- Sockets
-:- Reentrant Code and Compiler Flags

Chapter 6 Windows Threading


-:- Windows Threading
-:- Creating Native Windows Threads
-:- Terminating Threads
-:- Creating and Resuming Suspended Threads
-:- Using Handles to Kernel Resources
-:- Methods of Synchronization and Resource Sharing
-:- An Example of Requiring Synchronization Between Threads
-:- Protecting Access to Code with Critical Sections
-:- Protecting Regions of Code with Mutexes
-:- Slim Reader/Writer Locks
-:- Signaling Event Completion to Other Threads or Processes
-:- Wide String Handling in Windows
-:- Creating Processes
-:- Sharing Memory Between Processes
-:- Inheriting Handles in Child Processes
-:- Naming Mutexes and Sharing Them Between Processes
-:- Communicating with Pipes
-:- Communicating Using Sockets
-:- Atomic Updates of Variables
-:- Allocating Thread-Local Storage
-:- Setting Thread Priority

Chapter 7 Using Automatic Parallelization and OpenMP


-:- Using Automatic Parallelization and OpenMP
-:- Using Automatic Parallelization to Produce a Parallel Application
-:- Identifying and Parallelizing Reductions
-:- Automatic Parallelization of Codes Containing Calls
-:- Assisting the Compiler in Automatically Parallelizing Code
-:- Using OpenMP to Produce a Parallel Application
-:- Using OpenMP to Parallelize Loops
-:- Runtime Behavior of an OpenMP Application
-:- Variable Scoping Inside OpenMP Parallel Regions
-:- Parallelizing Reductions Using OpenMP
-:- Accessing Private Data Outside the Parallel Region
-:- Improving Work Distribution Using Scheduling
-:- Using Parallel Sections to Perform Independent Work
-:- Nested Parallelism
-:- Using OpenMP for Dynamically Defined Parallel Tasks
-:- Keeping Data Private to Threads
-:- Controlling the OpenMP Runtime Environment
-:- Waiting for Work to Complete
-:- Restricting the Threads That Execute a Region of Code
-:- Ensuring That Code in a Parallel Region Is Executed in Order
-:- Collapsing Loops to Improve Workload Balance
-:- Enforcing Memory Consistency
-:- An Example of Parallelization

Chapter 8 Hand-Coded Synchronization and Sharing


-:- Hand-Coded Synchronization and Sharing
-:- Atomic Operations
-:- Using Compare and Swap Instructions to Form More Complex Atomic Operations
-:- Enforcing Memory Ordering to Ensure Correct Operation
-:- Compiler Support of Memory-Ordering Directives
-:- Reordering of Operations by the Compiler
-:- Volatile Variables
-:- Operating System–Provided Atomics
-:- Lockless Algorithms
-:- Dekker’s Algorithm
-:- Producer-Consumer with a Circular Buffer
-:- Scaling to Multiple Consumers or Producers
-:- Scaling the Producer-Consumer to Multiple Threads
-:- Modifying the Producer-Consumer Code to Use Atomics
-:- The ABA Problem

Chapter 9 Scaling with Multicore Processors


-:- Scaling with Multicore Processors
-:- Constraints to Application Scaling
-:- Hardware Constraints to Scaling
-:- Bandwidth Sharing Between Cores
-:- False Sharing
-:- Cache Conflict and Capacity
-:- Pipeline Resource Starvation
-:- Operating System Constraints to Scaling
-:- Multicore Processors and Scaling

Chapter 10 Other Parallelization Technologies


-:- Other Parallelization Technologies
-:- GPU-Based Computing
-:- Language Extensions
-:- Alternative Languages
-:- Clustering Technologies
-:- Transactional Memory
-:- Vectorization