CS344: Parallel Computing



Introduction


Here is a simple CUDA program:
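The original listing did not survive the post; a minimal sketch of such a program (a kernel that prints from each GPU thread, saved as hello.cu):

#include <cstdio>

// A trivial kernel: each GPU thread prints its own index.
__global__ void hello_kernel()
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    // Launch one block of 4 threads.
    hello_kernel<<<1, 4>>>();
    cudaDeviceSynchronize();   // wait for the kernel (and its printf) to finish
    return 0;
}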

To compile it in Linux:
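The command is missing from the post; with NVIDIA's nvcc it is typically:

nvcc hello.cu -o hello
./hello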

Or in Windows + VS:

Creating a new CUDA project is the easiest way to get started,

or

  • Right-click the project
  • Build Dependencies
  • Build Customizations
  • Check the box for the CUDA version
  • Right-click each .cu file, open its properties, and set the item type to CUDA C/C++
  • Add cudart.lib to Linker -> Input for all configurations and platforms

 

Another simple program from Udacity:
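The embedded code is gone; the course's first example squares an array of 64 floats, along these lines:

#include <cstdio>

// Each thread squares one element of the input array.
__global__ void square(float *d_out, float *d_in)
{
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f;
}

int main()
{
    const int ARRAY_SIZE = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // generate the input array on the host
    float h_in[ARRAY_SIZE], h_out[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        h_in[i] = float(i);

    // allocate GPU memory and copy the input over
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, ARRAY_BYTES);
    cudaMalloc((void **)&d_out, ARRAY_BYTES);
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel: one block of 64 threads
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy the result back and print it
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    for (int i = 0; i < ARRAY_SIZE; i++)
        printf("%f\n", h_out[i]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}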

 

KERNEL <<< GRID OF BLOCKS, BLOCK OF THREADS, shared memory per block in bytes >>> (…)

Each thread knows its threadIdx (its index within the block, e.g. threadIdx.x), blockDim (the size of a block), blockIdx (its block's index within the grid), and gridDim (the size of the grid), so it can work out which element to process without any communication.
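Together these give every thread a unique global index; the standard pattern:

// global 1-D index of this thread across the whole grid
int i = blockIdx.x * blockDim.x + threadIdx.x;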

 

Map

  • Set of elements to process [64 floats]
  • Function to run on each element [“square”]

 

Map (elements, function)

GPUs are good at map:

  • GPUs have many parallel processors
  • GPUs optimize for throughput
 

Other parallel communication patterns: gather, scatter, stencil, transpose.

Synchronization, Barrier

SMs, blocks, threads
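A barrier makes every thread in a block wait until all of them have reached it; in CUDA that is __syncthreads(). A sketch of why it is needed, shifting an array left within one block (hypothetical kernel):

__global__ void shift_left(float *array, int n)
{
    int idx = threadIdx.x;
    float next = 0.0f;
    if (idx < n - 1)
        next = array[idx + 1];   // read the neighbor's element
    __syncthreads();             // barrier: every read completes before any write
    if (idx < n - 1)
        array[idx] = next;       // now the overwrite cannot race with a read
}

Threads in different blocks cannot be synchronized this way; a kernel launch boundary acts as the global barrier.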

 

 

CUDA Code for Image Blur
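The code itself is not preserved in the post; a sketch of a simple unweighted box blur over one channel (the actual homework uses Gaussian weights, but the stencil access pattern is the same; names are hypothetical):

__global__ void box_blur(unsigned char *d_out, const unsigned char *d_in,
                         int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // average the pixels in a (2*radius+1)^2 window, clamping at the borders
    int sum = 0, count = 0;
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            int nx = min(max(x + dx, 0), width - 1);
            int ny = min(max(y + dy, 0), height - 1);
            sum += d_in[ny * width + nx];
            count++;
        }
    }
    d_out[y * width + x] = (unsigned char)(sum / count);
}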

Unit #3

Parallel Reduce

((a+b)+c)+d = (a+b)+(c+d)

Step Complexity of Parallel Reduction

Reducing N elements takes N - 1 summations in total, but they can be organized into log_2(N) parallel steps.

Reduction Using Global And Shared Memory

extern __shared__ float sdata[];

shmem_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_intermediate, d_in);
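A sketch of the kernel this declaration and launch belong to: a tree reduction in shared memory, assuming blockDim.x is a power of two and the input size is a multiple of it. Each block writes one partial sum to d_intermediate; a second, smaller launch then reduces those partial sums.

__global__ void shmem_reduce_kernel(float *d_out, const float *d_in)
{
    extern __shared__ float sdata[];   // size = threads * sizeof(float), set at launch
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = d_in[i];              // each thread loads one element
    __syncthreads();

    // tree-based reduction: halve the active threads each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                      // thread 0 writes this block's sum
        d_out[blockIdx.x] = sdata[0];
}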

The global-memory version generates three times as much global memory traffic as the shared-memory version: across all steps it performs about 2N reads and N writes to global memory, while the shared-memory version reads each input from global memory only once (N reads).

0.651 ms (global) vs 0.451 ms (shared)

Parallel Primitive

Scan

Example

Input: 1, 2, 3, 4

Operation: Add

Output: 1, 3, 6, 10

Scan is used in quicksort, in sparse matrix computation, and in data compression.

Balancing your checkbook is the classic example: the running balance is a scan of the transactions.

  • Identity element I of an operation op, satisfying I op a = a:
    • 0 + a = a
    • min(0xFF, a) = a
    • 1 * a = a
    • 0 or a = a 
    • 1 and a = a

Input: a0, a1, a2, a3, …, aN

Operation: #

Output: I, a0, a0#a1, a0#a1#a2, a0#a1#a2#a3, …, a0#a1#a2#…#aN

Inclusive vs Exclusive Scan

An exclusive scan puts the identity first, so each output excludes its own input element; an inclusive scan ignores the identity and includes it.

Serial Implementation of Inclusive / Exclusive Scan
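The listing is missing; a serial sketch of both variants:

// Serial inclusive scan: out[i] = in[0] + ... + in[i]
void inclusive_scan(float *out, const float *in, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += in[i];    // include the current element ...
        out[i] = acc;    // ... before writing it out
    }
}

// Serial exclusive scan: out[i] = in[0] + ... + in[i-1], out[0] = identity
void exclusive_scan(float *out, const float *in, int n)
{
    float acc = 0.0f;    // identity for +
    for (int i = 0; i < n; i++) {
        out[i] = acc;    // write the running total *before* adding
        acc += in[i];
    }
}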

 

For the naive parallel scan, where each output element is computed by its own reduction, the number of steps is O(log N) (set by the longest reduction), but the total work is about N^2 / 2 operations, i.e. O(N^2).

Hillis Steele – Step Efficient

Hillis, W. Daniel, and Guy L. Steele Jr. “Data parallel algorithms.” Communications of the ACM 29.12 (1986): 1170-1183.

It is similar to the Gaussian pyramid idea: at step k, each element adds in the value 2^(k-1) positions to its left.

Input:          1, 2, 3, 4, 5, 6, 7, 8

After stride 1: 1, 3, 5, 7, 9, 11, 13, 15

After stride 2: 1, 3, 6, 10, 14, 18, 22, 26

After stride 4: 1, 3, 6, 10, 15, 21, 28, 36

Steps: log N (one level per stride, like a log N level Gaussian pyramid)

Work: O(N log N)
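A single-block sketch of a Hillis-Steele inclusive scan in shared memory, double-buffered so the reads of one step never race with its writes (assumes the whole array fits in one block; launch with n threads and 2 * n * sizeof(float) bytes of shared memory):

__global__ void hillis_steele_scan(float *d_out, const float *d_in, int n)
{
    extern __shared__ float temp[];   // 2*n floats: ping-pong buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = d_in[tid]; // load the input (inclusive scan)
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;              // swap the buffers
        pin = 1 - pin;
        if (tid >= offset)            // add the element 'offset' to the left
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else                          // nothing to the left: just copy
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    d_out[tid] = temp[pout * n + tid];
}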

Blelloch Scan – Work Efficient

Upsweep (reduce) phase, pairwise sums at doubling strides:

1, 2, 3, 4, 5, 6, 7, 8

    3       7     11    15

            10            26

                            36

Downsweep operator: given L (the value saved from the upsweep) and R (the value coming down from the parent), pass R down to the left and L + R down to the right.

L   R

   \  |

R   L+R
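A single-block sketch of the classic Blelloch exclusive scan (assumes n is a power of two and fits in one block; launch with n/2 threads and n * sizeof(float) bytes of shared memory):

__global__ void blelloch_scan(float *d_out, const float *d_in, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;

    // each thread loads two elements
    temp[2 * tid]     = d_in[2 * tid];
    temp[2 * tid + 1] = d_in[2 * tid + 1];

    // upsweep (reduce) phase: build partial sums in place
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // clear the last element (the identity for +)
    if (tid == 0)
        temp[n - 1] = 0;

    // downsweep phase: apply the downsweep operator at shrinking strides
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];   // left gets R, the value from above
            temp[bi] += t;         // right gets L + R
        }
    }
    __syncthreads();

    // write out the exclusive scan
    d_out[2 * tid]     = temp[2 * tid];
    d_out[2 * tid + 1] = temp[2 * tid + 1];
}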

 

If there is more work than processors, prefer the work-efficient (Blelloch) scan.

If there are more processors than work, prefer the step-efficient (Hillis-Steele) scan.

Histogram

Implementing Histogram Using Atomics
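The code is missing here; the straightforward version has every thread atomically increment one global bin (the bin mapping below is a hypothetical stand-in):

__global__ void histo_atomic(unsigned int *d_bins, const unsigned int *d_in,
                             int n, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&d_bins[d_in[i] % numBins], 1);   // serializes on hot bins
}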

Implementing Histogram Using Local Memory
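The contended global atomics above can be replaced by a per-block histogram in fast shared memory, merged into the global result once per block; a sketch:

__global__ void histo_shared(unsigned int *d_bins, const unsigned int *d_in,
                             int n, int numBins)
{
    extern __shared__ unsigned int s_bins[];   // numBins * sizeof(unsigned int)

    // zero this block's local histogram
    for (int b = threadIdx.x; b < numBins; b += blockDim.x)
        s_bins[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&s_bins[d_in[i] % numBins], 1);  // cheap shared-memory atomic
    __syncthreads();

    // merge the local histogram into the global one
    for (int b = threadIdx.x; b < numBins; b += blockDim.x)
        atomicAdd(&d_bins[b], s_bins[b]);
}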

 

Calculating Global Histogram Using Reduction

We can use REDUCE to combine per-block local histograms into a global histogram, as sketched below.
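A minimal sketch of the combine step, one thread summing each bin across the per-block histograms (the inner loop could itself be a tree reduce):

// d_local holds numLocal histograms of numBins bins each, row-major
__global__ void combine_histograms(unsigned int *d_global, const unsigned int *d_local,
                                   int numLocal, int numBins)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= numBins)
        return;
    unsigned int sum = 0;
    for (int h = 0; h < numLocal; h++)
        sum += d_local[h * numBins + bin];   // walk this bin down the rows
    d_global[bin] = sum;
}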

Sort Then Reduce By Key
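A sketch of that approach with Thrust: sorting groups equal bin indices together, and reduce_by_key over a constant value of 1 counts each run (keys are assumed to already be bin indices; bins with a zero count simply do not appear in the output):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

void histogram_by_sort(thrust::device_vector<int> &keys,
                       thrust::device_vector<int> &out_bins,
                       thrust::device_vector<int> &out_counts)
{
    thrust::sort(keys.begin(), keys.end());   // group equal bins together
    // counting each run of equal keys yields that bin's count
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::make_constant_iterator(1),
                          out_bins.begin(), out_counts.begin());
}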

 

HDR Tone Mapping Using GPU

Histogram Equalization

  • Map
  • Reduce
    • Minimum brightness
    • Maximum brightness
  • Use atomic add first (to build the histogram)
  • Scatter
  • Scan (to build the CDF; see the sketch below)
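The scan step turns the histogram into a cumulative distribution function; a serial sketch with hypothetical names (on the GPU this is the exclusive Blelloch scan from Unit 3):

// cdf[b] = number of pixels in bins 0 .. b-1 (exclusive scan of the histogram)
void histogram_cdf(unsigned int *cdf, const unsigned int *histo, int numBins)
{
    unsigned int sum = 0;
    for (int b = 0; b < numBins; b++) {
        cdf[b] = sum;       // exclusive: bin b's entry excludes its own count
        sum += histo[b];
    }
}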

 
