Name: CS542200 Homework 3-All-Pairs Shortest Path Solved
SKU: 89142
Availability: InStock

Rate this product

CS542200 Homework 3

This assignment helps you manage to solve the all-pairs shortest path problem with CPU threads and then further accelerate the program with CUDA accompanied by Blocked Floyd-Warshall algorithm. In this assignment, you will realize how powerful GPUs can be. Finally, we encourage you to optimize your program by exploring different optimizing strategies for performance points.

2 REQUIREMENTS

● In this assignment, you are asked to implement 3 versions of programs that solve the all-pairs shortest path problem.

CPU version (hw3-1)
- ◆ You are required to use threading to parallelize the computation in your program.
- ◆ You can choose any threading library or framework you like (pthread, std::thread, OpenMP, Intel TBB, etc).
- ◆ You can choose any algorithm to solve the problem.
- ◆ You must implement the shortest path algorithm yourself. (Do not uselibraries to solve the problem. Ask TA if unsure).
Single-GPU version (hw3-2)

◆ Should be optimized to get the performance points (20%).

■ Multi-GPU version (hw3-3)
◆ Must use 2 GPUs. Single GPU version is not accepted and will get 0 for

correctness and performance score in hw3-3 (even if you get AC on scoreboard).

3 BLOCKED FLOYD-WARSHALL ALGORITHM

Given an 𝑉 × 𝑉 matrix 𝑊 = [𝑤(𝑖, 𝑗)] where 𝑤(𝑖, 𝑗)≥0 represents the distance (weight of the edge) from a vertex 𝑖 to a vertex 𝑗 in a directed graph with 𝑉 vertices. We define an 𝑉 × 𝑉 matrix 𝐷 = [𝑑(𝑖, 𝑗)] where 𝑑(𝑖, 𝑗) denotes the shortest-path distance from a vertex 𝑖

to a vertex 𝑗. Let 𝐷(𝑘) = [𝑑(𝑘)(𝑖, 𝑗)] be the result which all the intermediate vertices are in the set {0, 1, 2,…, 𝑘 − 1}.

We define 𝑑(𝑘)(𝑖, 𝑗) as the following:
𝑑(𝑘)(𝑖,𝑗)= {𝑤(𝑖,𝑗) 𝑖𝑓𝑘=0;

𝑚𝑖𝑛(𝑑(𝑘−1)(𝑖,𝑗),𝑑(𝑘−1)(𝑖,𝑘−1)+𝑑(𝑘−1)(𝑘−1,𝑗)) 𝑖𝑓 𝑘≥1

The matrix 𝐷(𝑉) = 𝑑(𝑉)(𝑖, 𝑗) gives the answer to the all-pairs shortest path problem.
In the blocked all-pairs shortest path algorithm, we partition 𝐷 into ⌈𝑉/𝐵⌉×⌈𝑉/𝐵⌉ blocks of 𝐵×𝐵 submatrices. The number 𝐵 is called the blocking factor. For instance, in figure 1, we divide a 6×6 matrix into 3×3 submatrices (or blocks) by 𝐵 = 2.

Figure 1: Divide a matrix by B = 2
The blocked version of the Floyd-Warshall algorithm will perform ⌈𝑉/𝐵⌉ rounds, and each round is divided into 3 phases. It performs 𝐵 iterations in each phase.
Assuming a block is identified by its index (𝐼, 𝐽), where 0 ≤ 𝐼, 𝐽 < ⌈𝑉/𝐵⌉. The block with

index (𝐼, 𝐽) is denoted by 𝐷 (𝑘) . (𝐼,𝐽)

In the following explanation, we assume 𝑁 = 6 and 𝐵 = 2. The execution flow is described step by step as follows:

● Phase 1: self-dependent blocks.
In the 𝑘-th round, the first phase is to compute the 𝐵×𝐵 pivot block 𝐷

For instance, in the 1st round, 𝐷 (2) is computed as follows: (0,0)

(𝑘·𝐵) . (𝑘−1,𝑘−1)

𝑑(1)(0, 0) = 𝑚𝑖𝑛(𝑑(0)(0, 0), 𝑑(0)(0, 0) + 𝑑(0)(0, 0)) 𝑑(1)(0, 1) = 𝑚𝑖𝑛(𝑑(0)(0, 1), 𝑑(0)(0, 0) + 𝑑(0)(0, 1)) 𝑑(1)(1, 0) = 𝑚𝑖𝑛(𝑑(0)(1, 0), 𝑑(0)(1, 0) + 𝑑(0)(0, 0)) 𝑑(1)(1, 1) = 𝑚𝑖𝑛(𝑑(0)(1, 1), 𝑑(0)(1, 0) + 𝑑(0)(0, 1)) 𝑑(2)(0, 0) = 𝑚𝑖𝑛(𝑑(1)(0, 0), 𝑑(1)(0, 1) + 𝑑(1)(1, 0)) 𝑑(2)(0, 1) = 𝑚𝑖𝑛(𝑑(1)(0, 1), 𝑑(1)(0, 1) + 𝑑(1)(1, 1)) 𝑑(2)(1, 0) = 𝑚𝑖𝑛(𝑑(1)(1, 0), 𝑑(1)(1, 1) + 𝑑(1)(1, 0)) 𝑑(2)(1, 1) = 𝑚𝑖𝑛(𝑑(1)(1, 1), 𝑑(1)(1, 1) + 𝑑(1)(1, 1))

Note that the result of 𝑑(2) depends on the result of 𝑑(1) and therefore cannot be computed in parallel with the computation of 𝑑(1).

● Phase 2: pivot-row and pivot-column blocks.
In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) and 𝐷(𝑘·𝐵) where h≠𝑘 − 1.

The result of pivot-row / pivot-column blocks depend on the result in phase 1 and itself.

For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2)

and 𝐷(0) : (0,2)

(0,2)

𝑑(1)(0, 4) = 𝑚𝑖𝑛(𝑑(0)(0, 4),𝑑(2)(0, 0) + 𝑑(0)(0, 4))

𝑑(1)(0, 5) = 𝑚𝑖𝑛(𝑑(0)(0, 5),𝑑(2)(0, 0) + 𝑑(0)(0, 5))

𝑑(1)(1, 4) = 𝑚𝑖𝑛(𝑑(0)(1, 4),𝑑(2)(1, 0) + 𝑑(0)(0, 4))

𝑑(1)(1, 5) = 𝑚𝑖𝑛(𝑑(0)(1, 5),𝑑(2)(1, 0) + 𝑑(0)(0, 5))

𝑑(2)(0, 4) = 𝑚𝑖𝑛(𝑑(1)(0, 4),𝑑(2)(0, 1) + 𝑑(1)(1, 4))

𝑑(2)(0, 5) = 𝑚𝑖𝑛(𝑑(1)(0, 5),𝑑(2)(0, 1) + 𝑑(1)(1, 5))

𝑑(2)(1, 4) = 𝑚𝑖𝑛(𝑑(1)(1, 4),𝑑(2)(1, 1) + 𝑑(1)(1, 4))

𝑑(2)(1, 5) = 𝑚𝑖𝑛(𝑑(1)(1, 5),𝑑(2)(1, 1) + 𝑑(1)(1, 5)) Phase 3: other blocks.

In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) where h , h ≠𝑘 − 1. (h1,h2) 1 2

(0,0)

(h,𝑘−1) (𝑘−1,h)

The result of these blocks depends on the result from phase 2 and itself.

For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2) (1,2) (1,0)

𝑑(1)(2, 4) = 𝑚𝑖𝑛(𝑑(0)(2, 4),𝑑(2)(2, 0) + 𝑑(2)(0, 4)) 𝑑(1)(2, 5) = 𝑚𝑖𝑛(𝑑(0)(2, 5),𝑑(2)(2, 0) + 𝑑(2)(0, 5)) 𝑑(1)(3, 4) = 𝑚𝑖𝑛(𝑑(0)(3, 4),𝑑(2)(3, 0) + 𝑑(2)(0, 4)) 𝑑(1)(3, 5) = 𝑚𝑖𝑛(𝑑(0)(3, 5),𝑑(2)(3, 0) + 𝑑(2)(0, 5)) 𝑑(2)(2, 4) = 𝑚𝑖𝑛(𝑑(1)(2, 4),𝑑(2)(2, 1) + 𝑑(2)(1, 4)) 𝑑(2)(2, 5) = 𝑚𝑖𝑛(𝑑(1)(2, 5),𝑑(2)(2, 1) + 𝑑(2)(1, 5)) 𝑑(2)(3, 4) = 𝑚𝑖𝑛(𝑑(1)(3, 4),𝑑(2)(3, 1) + 𝑑(2)(1, 4)) 𝑑(2)(3, 5) = 𝑚𝑖𝑛(𝑑(1)(3, 5),𝑑(2)(3, 1) + 𝑑(2)(1, 5))

and 𝐷(2) : (0,2)

Figure 2: The 3 phases of the blocked FW algorithm in the first round.

Figure 3: The computations of 𝐷(2) , 𝐷(2) and their dependencies in the first round. (0,2) (1,2)

Figure 4: In this particular example where 𝑉 = 6 and 𝐵 = 2, we will require ⌈𝑉/𝐵⌉ = 3 rounds.

4 RUN YOUR PROGRAMS
● Command line specification

# CPU

srun -N1 -n1 -cCPUS ./hw3-1 INPUTFILE OUTPUTFILE

# Single-GPU

srun -N1 -n1 –gres=gpu:1 ./hw3-2 INPUTFILE OUTPUTFILE

# Multi-GPU

srun -N1 -n1 -c2 –gres=gpu:2 ./hw3-3 INPUTFILE OUTPUTFILE

○ CPUS: Number of CPUs, specified by TA.
○ INPUTFILE: The pathname of the input file. Your program should read theinput graph from this file.
○ OUTPUTFILE: The pathname of the output file. Your program should outputthe shortest path distances to this file. CPUS: Number of CPUs, specified by
TA.

● Input specification

○ The input is a directed graph with non-negative edge distances.
○ The input file is a binary file containing 32-bit integers. You can use the inttype in C/C++.
○ The first two integers are the number of vertices (V) and the number of edges(E).

○ Then, there are E edges. Each edge consists of 3 integers:
1. source vertex id (𝑠𝑟𝑐𝑖)
2. destination vertex id (𝑑𝑠𝑡𝑖)
3. edge weight (𝑤𝑖)
○ The values of vertex indexes & edge indexes start at 0.
○ The ranges for the input are:
- ● 2≤𝑉≤6000 (CPU)
- ● 2≤𝑉≤40000 (Single-GPU)
- ● 2≤𝑉≤60000 (Multi-GPU)
- ● 0≤𝐸≤𝑉×(𝑉 − 1)
- ● 0≤𝑠𝑟𝑐𝑖, 𝑑𝑠𝑡𝑖 < 𝑉
- ● 𝑠𝑟𝑐𝑖 ≠ 𝑑𝑠𝑡𝑖
- ● if 𝑠𝑟𝑐𝑖 = 𝑠𝑟𝑐𝑗 then 𝑑𝑠𝑡𝑖 ≠ 𝑑𝑠𝑡𝑗 (there will not be repeated edges)
- ● 0≤𝑤𝑖≤1000 Here’s an example:

offset	type	decimal value	description
0000	32-bit integer	3	# 𝑣𝑒𝑟𝑡𝑖𝑐𝑒𝑠 (𝑉)
0004	32-bit integer	6	# 𝑒𝑑𝑔𝑒𝑠 (𝐸)
0008	32-bit integer	0	src id for edge 0
0012	32-bit integer	1	dst id for edge 0
0016	32-bit integer	3	edge 0’s distance
0020	32-bit integer		src id for edge 1
…	…	…	…
0076	32-bit integer		edge 5’s distance

● Output specification

○ The output file is also in binary format.
○ For an input file with 𝑉 vertices, you should output an output file containing𝑉2 integers.

○ The first 𝑉 integers should be the shortest path distances for starting from edge 0: 𝑑𝑖𝑠𝑡(0, 0), 𝑑𝑖𝑠𝑡(0, 1), 𝑑𝑖𝑠𝑡(0, 2), …, 𝑑𝑖𝑠𝑡(0, 𝑉 − 1); then the following 𝑉integers would be the shortest path distances starting from edge 1:𝑑𝑖𝑠𝑡(1, 0), 𝑑𝑖𝑠𝑡(1, 1), 𝑑𝑖𝑠𝑡(1, 2), …, 𝑑𝑖𝑠𝑡(1, 𝑉 − 1); and so on, totaling 𝑉2
integers.
○ 𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 0 where 𝑖 = 𝑗.
○ If there is no valid path between𝑖→𝑗, please output with:𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 230 − 1 = 1073741823. Example output file:

offset	type	decimal value	description
0000	32-bit integer	0	𝑑𝑖𝑠𝑡(0, 0)
0004	32-bit integer	?	𝑑𝑖𝑠𝑡(0, 1)
0008	32-bit integer	?	𝑑𝑖𝑠𝑡(0, 2)
…	…	…	…
4𝑉2 − 8	32-bit integer	?	𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 2)
4𝑉2 − 4	32-bit integer	0	𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 1)

5 REPORT

Answer the questions below. You are recommended to use the same section numbering as they are listed.
1. Implementation

Which algorithm do you choose in hw3-1?
How do you divide your data in hw3-2, hw3-3?
What’s your configuration in hw3-2, hw3-3? And why? (e.g. blocking factor,#blocks, #threads)
How do you implement the communication in hw3-3?
Briefly describe your implementations in diagrams, figures or sentences.

Profiling Results (hw3-2)Provide the profiling results of following metrics on the biggest kernel of your program using NVIDIA profiling tools. NVIDIA Profiler Guide.
- ○ occupancy
- ○ sm efficiency
- ○ shared memory load/store throughput
- ○ global load/store throughput
Experiment & Analysis
1. System Spec
  If you didn’t use our hades server for the experiments, please show the CPU, RAM, disk of the system.
2. Blocking Factor (hw3-2)
  Observe what happened with different blocking factors, and plot the trend in terms of Integer GOPS and global/shared memory bandwidth. (You can get the information from profiling tools or manual) (You might want to check nvprof and Metrics Reference)Figure 5: Example chart of performance and global memory bandwidth trend w.r.t. blocking factor

Note:
To run nvprof on hades with flags like –metrics, please run on the slurm

partition prof. e.g. srun -p prof -N1 -n1 –gres=gpu:1 nvprof –metrics gld_throughput ./hw3-2 /home/pp21/share/hw3-2/cases/c01.1 c01.1.out

c. Optimization (hw3-2)
Any optimizations after you port the algorithm on GPU, describe them with sentences and charts. Here are some techniques you can implement:

Coalesced memory access
Shared memory
Handle bank conflict
CUDA 2D alignment
Occupancy optimization
Large blocking factor
Reduce communication
StreamingFigure 6: Example chart of performance optimization¶

d. Weak scalability (hw3-3)

Observe weak scalability of the multi-GPU implementations

Time Distribution (hw3-2) Analyze the time spent in:
- ● computing
- ● communication
- ● memory copy (H2D, D2H)
- ● I/O of your program w.r.t. input size.
Others
Additional charts with explanation and studies. The more, the better.

4. Experience & conclusion

What have you learned from this homework?
Feedback (optional)

hw3-n3xibk.zip

[SOLVED] CS542200 Homework 3-All-Pairs Shortest Path

Want to See Past Work First?

[SOLVED] CS542200 Homework 3-All-Pairs Shortest Path

Related products

CS542200 Homework 2-Mandelbrot Set

CS542200 Homework 4-MapReduce

CS542200 Homework 1-Odd-Even Sort

Related in this category

More in this category

CS542200 Homework 4-MapReduce

CS542200 Homework 1-Odd-Even Sort

CS542200 Homework 2-Mandelbrot Set

Want to See Past Work First?