chapter 1 why parallel computing? 1
1.1 why we need ever-increasing performance 2
1.2 why we’re building parallel systems 3
1.3 why we need to write parallel programs 3
1.4 how do we write parallel programs? 6
1.5 what we’ll be doing 8
1.6 concurrent, parallel, distributed 9
1.7 the rest of the book 10
1.8 a word of warning 10
1.9 typographical conventions 11
1.10 summary 12
1.11 exercises 12
chapter 2 parallel hardware and parallel software 15
2.1 some background 15
2.1.1 the von neumann architecture 15
2.1.2 processes, multitasking, and threads 17
2.2 modifications to the von neumann model 18
2.2.1 the basics of caching 19
2.2.2 cache mappings 20
2.2.3 caches and programs: an example 22
2.2.4 virtual memory 23
2.2.5 instruction-level parallelism 25
2.2.6 hardware multithreading 28
2.3 parallel hardware 29
2.3.1 simd systems 29
2.3.2 mimd systems 32
2.3.3 interconnection networks 35
2.3.4 cache coherence 43
2.3.5 shared-memory versus distributed-memory 46
2.4 parallel software 47
2.4.1 caveats 47
2.4.2 coordinating the processes/threads 48
2.4.3 shared-memory 49
2.4.4 distributed-memory 53
2.4.5 programming hybrid systems 56
2.5 input and output 56
2.6 performance 58
2.6.1 speedup and efficiency 58
2.6.2 amdahl’s law 61
2.6.3 scalability 62
2.6.4 taking timings 63
2.7 parallel program design 65
2.7.1 an example 66
2.8 writing and running parallel programs 70
2.9 assumptions 70
2.10 summary 71
2.10.1 serial systems 71
2.10.2 parallel hardware 73
2.10.3 parallel software 74
2.10.4 input and output 75
2.10.5 performance 75
2.10.6 parallel program design 76
2.10.7 assumptions 76
2.11 exercises 77
chapter 3 distributed-memory programming with mpi 83
3.1 getting started 84
3.1.1 compilation and execution 84
3.1.2 mpi programs 86
3.1.3 MPI_Init and MPI_Finalize 86
3.1.4 communicators, MPI_Comm_size and MPI_Comm_rank 87
3.1.5 spmd programs 88
3.1.6 communication 88
3.1.7 MPI_Send 88
3.1.8 MPI_Recv 90
3.1.9 message matching 91
3.1.10 the status_p argument 92
3.1.11 semantics of MPI_Send and MPI_Recv 93
3.1.12 some potential pitfalls 94
3.2 the trapezoidal rule in mpi 94
3.2.1 the trapezoidal rule 94
3.2.2 parallelizing the trapezoidal rule 96
3.3 dealing with i/o 97
3.3.1 output 97
3.3.2 input 100
3.4 collective communication 101
3.4.1 tree-structured communication 102
3.4.2 MPI_Reduce 103
3.4.3 collective vs. point-to-point communications 105
3.4.4 MPI_Allreduce 106
3.4.5 broadcast 106
3.4.6 data distributions 109
3.4.7 scatter 110
3.4.8 gather 112
3.4.9 allgather 113
3.5 mpi derived datatypes 116
3.6 performance evaluation of mpi programs 119
3.6.1 taking timings 119
3.6.2 results 122
3.6.3 speedup and efficiency 125
3.6.4 scalability 126
3.7 a parallel sorting algorithm 127
3.7.1 some simple serial sorting algorithms 127
3.7.2 parallel odd-even transposition sort 129
3.7.3 safety in mpi programs 132
3.7.4 final details of parallel odd-even sort 134
3.8 summary 136
3.9 exercises 140
3.10 programming assignments 147
chapter 4 shared-memory programming with pthreads 151
4.1 processes, threads, and pthreads 151
4.2 hello, world 153
4.2.1 execution 153
4.2.2 preliminaries 155
4.2.3 starting the threads 156
4.2.4 running the threads 157
4.2.5 stopping the threads 158
4.2.6 error checking 158
4.2.7 other approaches to thread startup 159
4.3 matrix-vector multiplication 159
4.4 critical sections 162
4.5 busy-waiting 165
4.6 mutexes 168
4.7 producer-consumer synchronization and semaphores 171
4.8 barriers and condition variables 176
4.8.1 busy-waiting and a mutex 177
4.8.2 semaphores 177
4.8.3 condition variables 179
4.8.4 pthreads barriers 181
4.9 read-write locks 181
4.9.1 linked list functions 181
4.9.2 a multi-threaded linked list 183
4.9.3 pthreads read-write locks 187
4.9.4 performance of the various implementations 188
4.9.5 implementing read-write locks 190
4.10 caches, cache coherence, and false sharing 190
4.11 thread-safety 195
4.11.1 incorrect programs can produce correct output 198
4.12 summary 198
4.13 exercises 200
4.14 programming assignments 206
chapter 5 shared-memory programming with openmp 209
5.1 getting started 210
5.1.1 compiling and running openmp programs 211
5.1.2 the program 212
5.1.3 error checking 215
5.2 the trapezoidal rule 216
5.2.1 a first openmp version 216
5.3 scope of variables 220
5.4 the reduction clause 221
5.5 the parallel for directive 224
5.5.1 caveats 225
5.5.2 data dependences 227
5.5.3 finding loop-carried dependences 228
5.5.4 estimating π 229
5.5.5 more on scope 231
5.6 more about loops in openmp: sorting 232
5.6.1 bubble sort 232
5.6.2 odd-even transposition sort 233
5.7 scheduling loops 236
5.7.1 the schedule clause 237
5.7.3 the dynamic and guided schedule types 239
5.7.4 the runtime schedule type 239
5.7.5 which schedule? 241
5.8 producers and consumers 241
5.8.1 queues 241
5.8.2 message-passing 242
5.8.3 sending messages 243
5.8.4 receiving messages 243
5.8.5 termination detection 244
5.8.6 startup 244
5.8.7 the atomic directive 245
5.8.8 critical sections and locks 246
5.8.9 using locks in the message-passing program 248
5.8.10 critical directives, atomic directives, or locks? 249
5.8.11 some caveats 249
5.9 caches, cache coherence, and false sharing 251
5.10 thread-safety 256
5.10.1 incorrect programs can produce correct output 258
5.11 summary 259
5.12 exercises 263
5.13 programming assignments 267
chapter 6 parallel program development 271
6.1 two n-body solvers 271
6.1.1 the problem 271
6.1.2 two serial programs 273
6.1.3 parallelizing the n-body solvers 277
6.1.4 a word about i/o 280
6.1.5 parallelizing the basic solver using openmp 281
6.1.6 parallelizing the reduced solver using openmp 284
6.1.7 evaluating the openmp codes 288
6.1.8 parallelizing the solvers using pthreads 289
6.1.9 parallelizing the basic solver using mpi 290
6.1.10 parallelizing the reduced solver using mpi 292
6.1.11 performance of the mpi solvers 297
6.2 tree search 299
6.2.1 recursive depth-first search 302
6.2.2 nonrecursive depth-first search 303
6.2.3 data structures for the serial implementations 305
6.2.6 a static parallelization of tree search using pthreads 309
6.2.7 a dynamic parallelization of tree search using pthreads 310
6.2.8 evaluating the pthreads tree-search programs 315
6.2.9 parallelizing the tree-search programs using openmp 316
6.2.10 performance of the openmp implementations 318
6.2.11 implementation of tree search using mpi and static partitioning 319
6.2.12 implementation of tree search using mpi and dynamic partitioning 327
6.3 a word of caution 335
6.4 which api? 335
6.5 summary 336
6.5.1 pthreads and openmp 337
6.5.2 mpi 338
6.6 exercises 341
6.7 programming assignments 350
chapter 7 where to go from here 353
references 357
index 361