CS542200 Homework 3
This assignment helps you manage to solve the all-pairs shortest path problem with CPU threads and then further accelerate the program with CUDA accompanied by Blocked Floyd-Warshall algorithm. In this assignment, you will realize how powerful GPUs can be. Finally, we encourage you to optimize your program by exploring different optimizing strategies for performance points.
2 REQUIREMENTS
● In this assignment, you are asked to implement 3 versions of programs that solve the all-pairs shortest path problem.
- CPU version (hw3-1)
- ◆ You are required to use threading to parallelize the computation in your program.
- ◆ You can choose any threading library or framework you like (pthread, std::thread, OpenMP, Intel TBB, etc).
- ◆ You can choose any algorithm to solve the problem.
- ◆ You must implement the shortest path algorithm yourself. (Do not uselibraries to solve the problem. Ask TA if unsure).
- Single-GPU version (hw3-2)
◆ Should be optimized to get the performance points (20%).
■ Multi-GPU version (hw3-3)
◆ Must use 2 GPUs. Single GPU version is not accepted and will get 0 for
correctness and performance score in hw3-3 (even if you get AC on scoreboard).
3 BLOCKED FLOYD-WARSHALL ALGORITHM
Given an 𝑉 × 𝑉 matrix 𝑊 = [𝑤(𝑖, 𝑗)] where 𝑤(𝑖, 𝑗)≥0 represents the distance (weight of the edge) from a vertex 𝑖 to a vertex 𝑗 in a directed graph with 𝑉 vertices. We define an 𝑉 × 𝑉 matrix 𝐷 = [𝑑(𝑖, 𝑗)] where 𝑑(𝑖, 𝑗) denotes the shortest-path distance from a vertex 𝑖
to a vertex 𝑗. Let 𝐷(𝑘) = [𝑑(𝑘)(𝑖, 𝑗)] be the result which all the intermediate vertices are in the set {0, 1, 2,…, 𝑘 − 1}.
We define 𝑑(𝑘)(𝑖, 𝑗) as the following:
𝑑(𝑘)(𝑖,𝑗)= {𝑤(𝑖,𝑗) 𝑖𝑓𝑘=0;
𝑚𝑖𝑛(𝑑(𝑘−1)(𝑖,𝑗),𝑑(𝑘−1)(𝑖,𝑘−1)+𝑑(𝑘−1)(𝑘−1,𝑗)) 𝑖𝑓 𝑘≥1
The matrix 𝐷(𝑉) = 𝑑(𝑉)(𝑖, 𝑗) gives the answer to the all-pairs shortest path problem.
In the blocked all-pairs shortest path algorithm, we partition 𝐷 into ⌈𝑉/𝐵⌉×⌈𝑉/𝐵⌉ blocks of 𝐵×𝐵 submatrices. The number 𝐵 is called the blocking factor. For instance, in figure 1, we divide a 6×6 matrix into 3×3 submatrices (or blocks) by 𝐵 = 2.
Figure 1: Divide a matrix by B = 2
The blocked version of the Floyd-Warshall algorithm will perform ⌈𝑉/𝐵⌉ rounds, and each round is divided into 3 phases. It performs 𝐵 iterations in each phase.
Assuming a block is identified by its index (𝐼, 𝐽), where 0 ≤ 𝐼, 𝐽 < ⌈𝑉/𝐵⌉. The block with
index (𝐼, 𝐽) is denoted by 𝐷 (𝑘) . (𝐼,𝐽)
In the following explanation, we assume 𝑁 = 6 and 𝐵 = 2. The execution flow is described step by step as follows:
● Phase 1: self-dependent blocks.
In the 𝑘-th round, the first phase is to compute the 𝐵×𝐵 pivot block 𝐷
For instance, in the 1st round, 𝐷 (2) is computed as follows: (0,0)
(𝑘·𝐵) . (𝑘−1,𝑘−1)
𝑑(1)(0, 0) = 𝑚𝑖𝑛(𝑑(0)(0, 0), 𝑑(0)(0, 0) + 𝑑(0)(0, 0)) 𝑑(1)(0, 1) = 𝑚𝑖𝑛(𝑑(0)(0, 1), 𝑑(0)(0, 0) + 𝑑(0)(0, 1)) 𝑑(1)(1, 0) = 𝑚𝑖𝑛(𝑑(0)(1, 0), 𝑑(0)(1, 0) + 𝑑(0)(0, 0)) 𝑑(1)(1, 1) = 𝑚𝑖𝑛(𝑑(0)(1, 1), 𝑑(0)(1, 0) + 𝑑(0)(0, 1)) 𝑑(2)(0, 0) = 𝑚𝑖𝑛(𝑑(1)(0, 0), 𝑑(1)(0, 1) + 𝑑(1)(1, 0)) 𝑑(2)(0, 1) = 𝑚𝑖𝑛(𝑑(1)(0, 1), 𝑑(1)(0, 1) + 𝑑(1)(1, 1)) 𝑑(2)(1, 0) = 𝑚𝑖𝑛(𝑑(1)(1, 0), 𝑑(1)(1, 1) + 𝑑(1)(1, 0)) 𝑑(2)(1, 1) = 𝑚𝑖𝑛(𝑑(1)(1, 1), 𝑑(1)(1, 1) + 𝑑(1)(1, 1))
Note that the result of 𝑑(2) depends on the result of 𝑑(1) and therefore cannot be computed in parallel with the computation of 𝑑(1).
● Phase 2: pivot-row and pivot-column blocks.
In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) and 𝐷(𝑘·𝐵) where h≠𝑘 − 1.
The result of pivot-row / pivot-column blocks depend on the result in phase 1 and itself.
For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2)
and 𝐷(0) : (0,2)
(0,2)
𝑑(1)(0, 4) = 𝑚𝑖𝑛(𝑑(0)(0, 4),𝑑(2)(0, 0) + 𝑑(0)(0, 4))
𝑑(1)(0, 5) = 𝑚𝑖𝑛(𝑑(0)(0, 5),𝑑(2)(0, 0) + 𝑑(0)(0, 5))
𝑑(1)(1, 4) = 𝑚𝑖𝑛(𝑑(0)(1, 4),𝑑(2)(1, 0) + 𝑑(0)(0, 4))
𝑑(1)(1, 5) = 𝑚𝑖𝑛(𝑑(0)(1, 5),𝑑(2)(1, 0) + 𝑑(0)(0, 5))
𝑑(2)(0, 4) = 𝑚𝑖𝑛(𝑑(1)(0, 4),𝑑(2)(0, 1) + 𝑑(1)(1, 4))
𝑑(2)(0, 5) = 𝑚𝑖𝑛(𝑑(1)(0, 5),𝑑(2)(0, 1) + 𝑑(1)(1, 5))
𝑑(2)(1, 4) = 𝑚𝑖𝑛(𝑑(1)(1, 4),𝑑(2)(1, 1) + 𝑑(1)(1, 4))
𝑑(2)(1, 5) = 𝑚𝑖𝑛(𝑑(1)(1, 5),𝑑(2)(1, 1) + 𝑑(1)(1, 5)) Phase 3: other blocks.
In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) where h , h ≠𝑘 − 1. (h1,h2) 1 2
(0,0)
(h,𝑘−1) (𝑘−1,h)
The result of these blocks depends on the result from phase 2 and itself.
For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2) (1,2) (1,0)
𝑑(1)(2, 4) = 𝑚𝑖𝑛(𝑑(0)(2, 4),𝑑(2)(2, 0) + 𝑑(2)(0, 4)) 𝑑(1)(2, 5) = 𝑚𝑖𝑛(𝑑(0)(2, 5),𝑑(2)(2, 0) + 𝑑(2)(0, 5)) 𝑑(1)(3, 4) = 𝑚𝑖𝑛(𝑑(0)(3, 4),𝑑(2)(3, 0) + 𝑑(2)(0, 4)) 𝑑(1)(3, 5) = 𝑚𝑖𝑛(𝑑(0)(3, 5),𝑑(2)(3, 0) + 𝑑(2)(0, 5)) 𝑑(2)(2, 4) = 𝑚𝑖𝑛(𝑑(1)(2, 4),𝑑(2)(2, 1) + 𝑑(2)(1, 4)) 𝑑(2)(2, 5) = 𝑚𝑖𝑛(𝑑(1)(2, 5),𝑑(2)(2, 1) + 𝑑(2)(1, 5)) 𝑑(2)(3, 4) = 𝑚𝑖𝑛(𝑑(1)(3, 4),𝑑(2)(3, 1) + 𝑑(2)(1, 4)) 𝑑(2)(3, 5) = 𝑚𝑖𝑛(𝑑(1)(3, 5),𝑑(2)(3, 1) + 𝑑(2)(1, 5))
and 𝐷(2) : (0,2)
Figure 2: The 3 phases of the blocked FW algorithm in the first round.
Figure 3: The computations of 𝐷(2) , 𝐷(2) and their dependencies in the first round. (0,2) (1,2)
Figure 4: In this particular example where 𝑉 = 6 and 𝐵 = 2, we will require ⌈𝑉/𝐵⌉ = 3 rounds.
4 RUN YOUR PROGRAMS
● Command line specification
# CPU
srun -N1 -n1 -cCPUS ./hw3-1 INPUTFILE OUTPUTFILE
# Single-GPU
srun -N1 -n1 –gres=gpu:1 ./hw3-2 INPUTFILE OUTPUTFILE
# Multi-GPU
srun -N1 -n1 -c2 –gres=gpu:2 ./hw3-3 INPUTFILE OUTPUTFILE
- ○ CPUS: Number of CPUs, specified by TA.
- ○ INPUTFILE: The pathname of the input file. Your program should read theinput graph from this file.
- ○ OUTPUTFILE: The pathname of the output file. Your program should outputthe shortest path distances to this file. CPUS: Number of CPUs, specified by
TA.
● Input specification
- ○ The input is a directed graph with non-negative edge distances.
- ○ The input file is a binary file containing 32-bit integers. You can use the inttype in C/C++.
- ○ The first two integers are the number of vertices (V) and the number of edges(E).
- ○ Then, there are E edges. Each edge consists of 3 integers:
- source vertex id (𝑠𝑟𝑐𝑖)
- destination vertex id (𝑑𝑠𝑡𝑖)
- edge weight (𝑤𝑖)
- ○ The values of vertex indexes & edge indexes start at 0.
- ○ The ranges for the input are:
- ● 2≤𝑉≤6000 (CPU)
- ● 2≤𝑉≤40000 (Single-GPU)
- ● 2≤𝑉≤60000 (Multi-GPU)
- ● 0≤𝐸≤𝑉×(𝑉 − 1)
- ● 0≤𝑠𝑟𝑐𝑖, 𝑑𝑠𝑡𝑖 < 𝑉
- ● 𝑠𝑟𝑐𝑖 ≠ 𝑑𝑠𝑡𝑖
- ● if 𝑠𝑟𝑐𝑖 = 𝑠𝑟𝑐𝑗 then 𝑑𝑠𝑡𝑖 ≠ 𝑑𝑠𝑡𝑗 (there will not be repeated edges)
- ● 0≤𝑤𝑖≤1000 Here’s an example:
|
offset |
type |
decimal value |
description |
|
0000 |
32-bit integer |
3 |
# 𝑣𝑒𝑟𝑡𝑖𝑐𝑒𝑠 (𝑉) |
|
0004 |
32-bit integer |
6 |
# 𝑒𝑑𝑔𝑒𝑠 (𝐸) |
|
0008 |
32-bit integer |
0 |
src id for edge 0 |
|
0012 |
32-bit integer |
1 |
dst id for edge 0 |
|
0016 |
32-bit integer |
3 |
edge 0’s distance |
|
0020 |
32-bit integer |
src id for edge 1 |
|
|
… |
… |
… |
… |
|
0076 |
32-bit integer |
edge 5’s distance |
● Output specification
- ○ The output file is also in binary format.
- ○ For an input file with 𝑉 vertices, you should output an output file containing𝑉2 integers.
- ○ The first 𝑉 integers should be the shortest path distances for starting from edge 0: 𝑑𝑖𝑠𝑡(0, 0), 𝑑𝑖𝑠𝑡(0, 1), 𝑑𝑖𝑠𝑡(0, 2), …, 𝑑𝑖𝑠𝑡(0, 𝑉 − 1); then the following 𝑉integers would be the shortest path distances starting from edge 1:𝑑𝑖𝑠𝑡(1, 0), 𝑑𝑖𝑠𝑡(1, 1), 𝑑𝑖𝑠𝑡(1, 2), …, 𝑑𝑖𝑠𝑡(1, 𝑉 − 1); and so on, totaling 𝑉2
integers.
- ○ 𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 0 where 𝑖 = 𝑗.
- ○ If there is no valid path between𝑖→𝑗, please output with:𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 230 − 1 = 1073741823. Example output file:
|
offset |
type |
decimal value |
description |
|
0000 |
32-bit integer |
0 |
𝑑𝑖𝑠𝑡(0, 0) |
|
0004 |
32-bit integer |
? |
𝑑𝑖𝑠𝑡(0, 1) |
|
0008 |
32-bit integer |
? |
𝑑𝑖𝑠𝑡(0, 2) |
|
… |
… |
… |
… |
|
4𝑉2 − 8 |
32-bit integer |
? |
𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 2) |
|
4𝑉2 − 4 |
32-bit integer |
0 |
𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 1) |
5 REPORT
Answer the questions below. You are recommended to use the same section numbering as they are listed.
1. Implementation
- Which algorithm do you choose in hw3-1?
- How do you divide your data in hw3-2, hw3-3?
- What’s your configuration in hw3-2, hw3-3? And why? (e.g. blocking factor,#blocks, #threads)
- How do you implement the communication in hw3-3?
- Briefly describe your implementations in diagrams, figures or sentences.
- Profiling Results (hw3-2)Provide the profiling results of following metrics on the biggest kernel of your program using NVIDIA profiling tools. NVIDIA Profiler Guide.
- ○ occupancy
- ○ sm efficiency
- ○ shared memory load/store throughput
- ○ global load/store throughput
- Experiment & Analysis
- System Spec
If you didn’t use our hades server for the experiments, please show the CPU, RAM, disk of the system. - Blocking Factor (hw3-2)
Observe what happened with different blocking factors, and plot the trend in terms of Integer GOPS and global/shared memory bandwidth. (You can get the information from profiling tools or manual) (You might want to check nvprof and Metrics Reference)Figure 5: Example chart of performance and global memory bandwidth trend w.r.t. blocking factor
- System Spec
Note:
To run nvprof on hades with flags like –metrics, please run on the slurm
partition prof. e.g. srun -p prof -N1 -n1 –gres=gpu:1 nvprof –metrics gld_throughput ./hw3-2 /home/pp21/share/hw3-2/cases/c01.1 c01.1.out
c. Optimization (hw3-2)
Any optimizations after you port the algorithm on GPU, describe them with sentences and charts. Here are some techniques you can implement:
- Coalesced memory access
- Shared memory
- Handle bank conflict
- CUDA 2D alignment
- Occupancy optimization
- Large blocking factor
- Reduce communication
- StreamingFigure 6: Example chart of performance optimization¶
d. Weak scalability (hw3-3)
Observe weak scalability of the multi-GPU implementations
- Time Distribution (hw3-2) Analyze the time spent in:
- ● computing
- ● communication
- ● memory copy (H2D, D2H)
- ● I/O of your program w.r.t. input size.
- Others
Additional charts with explanation and studies. The more, the better.
4. Experience & conclusion
- What have you learned from this homework?
- Feedback (optional)



