[SOLVED] CS542200 Homework 3-All-Pairs Shortest Path

35.00 $

Category:
Click Category Button to View Your Next Assignment | Homework

You will receive the following solution file(s) instantly after successful payment:

zip file icon hw3-n3xibk.zip (1960.3 KB)
Assignment Instructions Updated Recently? Submit Below and we will provide new Solution!
Submit New Instructions
🔒 Securely Powered by:
Secure Checkout
Rate this product

CS542200 Homework 3

This assignment helps you manage to solve the all-pairs shortest path problem with CPU threads and then further accelerate the program with CUDA accompanied by Blocked Floyd-Warshall algorithm. In this assignment, you will realize how powerful GPUs can be. Finally, we encourage you to optimize your program by exploring different optimizing strategies for performance points.

2 REQUIREMENTS

● In this assignment, you are asked to implement 3 versions of programs that solve the all-pairs shortest path problem.

  • CPU version (hw3-1)
    • ◆  You are required to use threading to parallelize the computation in your program.
    • ◆  You can choose any threading library or framework you like (pthread, std::thread, OpenMP, Intel TBB, etc).
    • ◆  You can choose any algorithm to solve the problem.
    • ◆  You must implement the shortest path algorithm yourself. (Do not uselibraries to solve the problem. Ask TA if unsure).
  • Single-GPU version (hw3-2)

◆ Should be optimized to get the performance points (20%).

■ Multi-GPU version (hw3-3)
◆ Must use 2 GPUs. Single GPU version is not accepted and will get 0 for

correctness and performance score in hw3-3 (even if you get AC on scoreboard).

3 BLOCKED FLOYD-WARSHALL ALGORITHM

Given an 𝑉 × 𝑉 matrix 𝑊 = [𝑤(𝑖, 𝑗)] where 𝑤(𝑖, 𝑗)≥0 represents the distance (weight of the edge) from a vertex 𝑖 to a vertex 𝑗 in a directed graph with 𝑉 vertices. We define an 𝑉 × 𝑉 matrix 𝐷 = [𝑑(𝑖, 𝑗)] where 𝑑(𝑖, 𝑗) denotes the shortest-path distance from a vertex 𝑖

to a vertex 𝑗. Let 𝐷(𝑘) = [𝑑(𝑘)(𝑖, 𝑗)] be the result which all the intermediate vertices are in the set {0, 1, 2,…, 𝑘 − 1}.

We define 𝑑(𝑘)(𝑖, 𝑗) as the following:
𝑑(𝑘)(𝑖,𝑗)= {𝑤(𝑖,𝑗) 𝑖𝑓𝑘=0;

𝑚𝑖𝑛(𝑑(𝑘−1)(𝑖,𝑗),𝑑(𝑘−1)(𝑖,𝑘−1)+𝑑(𝑘−1)(𝑘−1,𝑗)) 𝑖𝑓 𝑘≥1

The matrix 𝐷(𝑉) = 𝑑(𝑉)(𝑖, 𝑗) gives the answer to the all-pairs shortest path problem.
In the blocked all-pairs shortest path algorithm, we partition 𝐷 into ⌈𝑉/𝐵⌉×⌈𝑉/𝐵⌉ blocks of 𝐵×𝐵 submatrices. The number 𝐵 is called the blocking factor. For instance, in figure 1, we divide a 6×6 matrix into 3×3 submatrices (or blocks) by 𝐵 = 2.

Figure 1: Divide a matrix by B = 2
The blocked version of the Floyd-Warshall algorithm will perform ⌈𝑉/𝐵⌉ rounds, and each round is divided into 3 phases. It performs 𝐵 iterations in each phase.
Assuming a block is identified by its index (𝐼, 𝐽), where 0 ≤ 𝐼, 𝐽 < ⌈𝑉/𝐵⌉. The block with

index (𝐼, 𝐽) is denoted by 𝐷 (𝑘) . (𝐼,𝐽)

In the following explanation, we assume 𝑁 = 6 and 𝐵 = 2. The execution flow is described step by step as follows:

● Phase 1: self-dependent blocks.
In the 𝑘-th round, the first phase is to compute the 𝐵×𝐵 pivot block 𝐷

For instance, in the 1st round, 𝐷 (2) is computed as follows: (0,0)

(𝑘·𝐵) . (𝑘−1,𝑘−1)

𝑑(1)(0, 0) = 𝑚𝑖𝑛(𝑑(0)(0, 0), 𝑑(0)(0, 0) + 𝑑(0)(0, 0)) 𝑑(1)(0, 1) = 𝑚𝑖𝑛(𝑑(0)(0, 1), 𝑑(0)(0, 0) + 𝑑(0)(0, 1)) 𝑑(1)(1, 0) = 𝑚𝑖𝑛(𝑑(0)(1, 0), 𝑑(0)(1, 0) + 𝑑(0)(0, 0)) 𝑑(1)(1, 1) = 𝑚𝑖𝑛(𝑑(0)(1, 1), 𝑑(0)(1, 0) + 𝑑(0)(0, 1)) 𝑑(2)(0, 0) = 𝑚𝑖𝑛(𝑑(1)(0, 0), 𝑑(1)(0, 1) + 𝑑(1)(1, 0)) 𝑑(2)(0, 1) = 𝑚𝑖𝑛(𝑑(1)(0, 1), 𝑑(1)(0, 1) + 𝑑(1)(1, 1)) 𝑑(2)(1, 0) = 𝑚𝑖𝑛(𝑑(1)(1, 0), 𝑑(1)(1, 1) + 𝑑(1)(1, 0)) 𝑑(2)(1, 1) = 𝑚𝑖𝑛(𝑑(1)(1, 1), 𝑑(1)(1, 1) + 𝑑(1)(1, 1))

Note that the result of 𝑑(2) depends on the result of 𝑑(1) and therefore cannot be computed in parallel with the computation of 𝑑(1).

● Phase 2: pivot-row and pivot-column blocks.
In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) and 𝐷(𝑘·𝐵) where h≠𝑘 − 1.

The result of pivot-row / pivot-column blocks depend on the result in phase 1 and itself.

For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2)

and 𝐷(0) : (0,2)

(0,2)

𝑑(1)(0, 4) = 𝑚𝑖𝑛(𝑑(0)(0, 4),𝑑(2)(0, 0) + 𝑑(0)(0, 4))

𝑑(1)(0, 5) = 𝑚𝑖𝑛(𝑑(0)(0, 5),𝑑(2)(0, 0) + 𝑑(0)(0, 5))

𝑑(1)(1, 4) = 𝑚𝑖𝑛(𝑑(0)(1, 4),𝑑(2)(1, 0) + 𝑑(0)(0, 4))

𝑑(1)(1, 5) = 𝑚𝑖𝑛(𝑑(0)(1, 5),𝑑(2)(1, 0) + 𝑑(0)(0, 5))

𝑑(2)(0, 4) = 𝑚𝑖𝑛(𝑑(1)(0, 4),𝑑(2)(0, 1) + 𝑑(1)(1, 4))

𝑑(2)(0, 5) = 𝑚𝑖𝑛(𝑑(1)(0, 5),𝑑(2)(0, 1) + 𝑑(1)(1, 5))

𝑑(2)(1, 4) = 𝑚𝑖𝑛(𝑑(1)(1, 4),𝑑(2)(1, 1) + 𝑑(1)(1, 4))

𝑑(2)(1, 5) = 𝑚𝑖𝑛(𝑑(1)(1, 5),𝑑(2)(1, 1) + 𝑑(1)(1, 5)) Phase 3: other blocks.

In the 𝑘-th round, it computes all 𝐷(𝑘·𝐵) where h , h ≠𝑘 − 1. (h1,h2) 1 2

(0,0)

(h,𝑘−1) (𝑘−1,h)

The result of these blocks depends on the result from phase 2 and itself.

For instance, in the 1st round, the result of 𝐷(2) depends on 𝐷(2) (1,2) (1,0)

𝑑(1)(2, 4) = 𝑚𝑖𝑛(𝑑(0)(2, 4),𝑑(2)(2, 0) + 𝑑(2)(0, 4)) 𝑑(1)(2, 5) = 𝑚𝑖𝑛(𝑑(0)(2, 5),𝑑(2)(2, 0) + 𝑑(2)(0, 5)) 𝑑(1)(3, 4) = 𝑚𝑖𝑛(𝑑(0)(3, 4),𝑑(2)(3, 0) + 𝑑(2)(0, 4)) 𝑑(1)(3, 5) = 𝑚𝑖𝑛(𝑑(0)(3, 5),𝑑(2)(3, 0) + 𝑑(2)(0, 5)) 𝑑(2)(2, 4) = 𝑚𝑖𝑛(𝑑(1)(2, 4),𝑑(2)(2, 1) + 𝑑(2)(1, 4)) 𝑑(2)(2, 5) = 𝑚𝑖𝑛(𝑑(1)(2, 5),𝑑(2)(2, 1) + 𝑑(2)(1, 5)) 𝑑(2)(3, 4) = 𝑚𝑖𝑛(𝑑(1)(3, 4),𝑑(2)(3, 1) + 𝑑(2)(1, 4)) 𝑑(2)(3, 5) = 𝑚𝑖𝑛(𝑑(1)(3, 5),𝑑(2)(3, 1) + 𝑑(2)(1, 5))

and 𝐷(2) : (0,2)

Figure 2: The 3 phases of the blocked FW algorithm in the first round.

Figure 3: The computations of 𝐷(2) , 𝐷(2) and their dependencies in the first round. (0,2) (1,2)

Figure 4: In this particular example where 𝑉 = 6 and 𝐵 = 2, we will require ⌈𝑉/𝐵⌉ = 3 rounds.

4 RUN YOUR PROGRAMS
● Command line specification

# CPU

srun -N1 -n1 -cCPUS ./hw3-1 INPUTFILE OUTPUTFILE

# Single-GPU

srun -N1 -n1 –gres=gpu:1 ./hw3-2 INPUTFILE OUTPUTFILE

# Multi-GPU

srun -N1 -n1 -c2 –gres=gpu:2 ./hw3-3 INPUTFILE OUTPUTFILE

  • ○  CPUS: Number of CPUs, specified by TA.
  • ○  INPUTFILE: The pathname of the input file. Your program should read theinput graph from this file.
  • ○  OUTPUTFILE: The pathname of the output file. Your program should outputthe shortest path distances to this file. CPUS: Number of CPUs, specified by

    TA.

● Input specification

  • ○  The input is a directed graph with non-negative edge distances.
  • ○  The input file is a binary file containing 32-bit integers. You can use the inttype in C/C++.
  • ○  The first two integers are the number of vertices (V) and the number of edges(E).
  • ○  Then, there are E edges. Each edge consists of 3 integers:
    1. source vertex id (𝑠𝑟𝑐𝑖)
    2. destination vertex id (𝑑𝑠𝑡𝑖)
    3. edge weight (𝑤𝑖)
  • ○  The values of vertex indexes & edge indexes start at 0.
  • ○  The ranges for the input are:
    • ●  2≤𝑉≤6000 (CPU)
    • ●  2≤𝑉≤40000 (Single-GPU)
    • ●  2≤𝑉≤60000 (Multi-GPU)
    • ●  0≤𝐸≤𝑉×(𝑉 − 1)
    • ●  0≤𝑠𝑟𝑐𝑖, 𝑑𝑠𝑡𝑖 < 𝑉
    • ●  𝑠𝑟𝑐𝑖 ≠ 𝑑𝑠𝑡𝑖
    • ●  if 𝑠𝑟𝑐𝑖 = 𝑠𝑟𝑐𝑗 then 𝑑𝑠𝑡𝑖 ≠ 𝑑𝑠𝑡𝑗 (there will not be repeated edges)
    • ●  0≤𝑤𝑖≤1000 Here’s an example:

offset

type

decimal value

description

0000

32-bit integer

3

# 𝑣𝑒𝑟𝑡𝑖𝑐𝑒𝑠 (𝑉)

0004

32-bit integer

6

# 𝑒𝑑𝑔𝑒𝑠 (𝐸)

0008

32-bit integer

0

src id for edge 0

0012

32-bit integer

1

dst id for edge 0

0016

32-bit integer

3

edge 0’s distance

0020

32-bit integer

src id for edge 1

0076

32-bit integer

edge 5’s distance

● Output specification

  • ○  The output file is also in binary format.
  • ○  For an input file with 𝑉 vertices, you should output an output file containing𝑉2 integers.
  • ○  The first 𝑉 integers should be the shortest path distances for starting from edge 0: 𝑑𝑖𝑠𝑡(0, 0), 𝑑𝑖𝑠𝑡(0, 1), 𝑑𝑖𝑠𝑡(0, 2), …, 𝑑𝑖𝑠𝑡(0, 𝑉 − 1); then the following 𝑉integers would be the shortest path distances starting from edge 1:𝑑𝑖𝑠𝑡(1, 0), 𝑑𝑖𝑠𝑡(1, 1), 𝑑𝑖𝑠𝑡(1, 2), …, 𝑑𝑖𝑠𝑡(1, 𝑉 − 1); and so on, totaling 𝑉2

    integers.

  • ○  𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 0 where 𝑖 = 𝑗.
  • ○  If there is no valid path between𝑖→𝑗, please output with:𝑑𝑖𝑠𝑡(𝑖, 𝑗) = 230 − 1 = 1073741823. Example output file:

offset

type

decimal value

description

0000

32-bit integer

0

𝑑𝑖𝑠𝑡(0, 0)

0004

32-bit integer

?

𝑑𝑖𝑠𝑡(0, 1)

0008

32-bit integer

?

𝑑𝑖𝑠𝑡(0, 2)

4𝑉2 − 8

32-bit integer

?

𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 2)

4𝑉2 − 4

32-bit integer

0

𝑑𝑖𝑠𝑡(𝑉 − 1, 𝑉 − 1)

5 REPORT

Answer the questions below. You are recommended to use the same section numbering as they are listed.
1. Implementation

  1. Which algorithm do you choose in hw3-1?
  2. How do you divide your data in hw3-2, hw3-3?
  3. What’s your configuration in hw3-2, hw3-3? And why? (e.g. blocking factor,#blocks, #threads)
  4. How do you implement the communication in hw3-3?
  5. Briefly describe your implementations in diagrams, figures or sentences.
  1. Profiling Results (hw3-2)Provide the profiling results of following metrics on the biggest kernel of your program using NVIDIA profiling tools. NVIDIA Profiler Guide.
    • ○  occupancy
    • ○  sm efficiency
    • ○  shared memory load/store throughput
    • ○  global load/store throughput
  2. Experiment & Analysis
    1. System Spec
      If you didn’t use our hades server for the experiments, please show the CPU, RAM, disk of the system.
    2. Blocking Factor (hw3-2)
      Observe what happened with different blocking factors, and plot the trend in terms of Integer GOPS and global/shared memory bandwidth. (You can get the information from profiling tools or manual) (You might want to check nvprof and Metrics Reference)Figure 5: Example chart of performance and global memory bandwidth trend w.r.t. blocking factor

Note:
To run nvprof on hades with flags like –metrics, please run on the slurm

partition prof. e.g. srun -p prof -N1 -n1 –gres=gpu:1 nvprof –metrics gld_throughput ./hw3-2 /home/pp21/share/hw3-2/cases/c01.1 c01.1.out

c. Optimization (hw3-2)
Any optimizations after you port the algorithm on GPU, describe them with sentences and charts. Here are some techniques you can implement:

  • Coalesced memory access
  • Shared memory
  • Handle bank conflict
  • CUDA 2D alignment
  • Occupancy optimization
  • Large blocking factor
  • Reduce communication
  • StreamingFigure 6: Example chart of performance optimization¶

d. Weak scalability (hw3-3)

Observe weak scalability of the multi-GPU implementations

  1. Time Distribution (hw3-2) Analyze the time spent in:
    • ●  computing
    • ●  communication
    • ●  memory copy (H2D, D2H)
    • ●  I/O of your program w.r.t. input size.
  2. Others
    Additional charts with explanation and studies. The more, the better.

4. Experience & conclusion

  1. What have you learned from this homework?
  2. Feedback (optional)
  • hw3-n3xibk.zip