CSE512 Assignment 2: Parallel Spatial Join with PostGIS


This assignment requires you to understand and implement two tasks: i) a parallel spatial join on top of an open-source relational database management system (PostgreSQL) using the PostGIS spatial extension, and ii) a spatial join implemented as a map-reduce program on top of Apache Spark using the Apache Sedona spatial extension.

Input Data: The input consists of two spatial datasets: a point dataset and a rectangle dataset. The point dataset (points.csv) contains the latitude and longitude of each point, while the rectangle dataset (rectangles.csv) contains the latitudes and longitudes of two diagonal corners of each rectangle.

Each row in the points.csv file has the format longitude,latitude, while each row in the rectangles.csv file has the format longitude1,latitude1,longitude2,latitude2.
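
For illustration only, rows in the two files might look like the following (made-up coordinates, not taken from the actual datasets):

```
# points.csv: one point per row
-112.0740,33.4484
# rectangles.csv: two diagonal corners per row
-112.1000,33.4000,-112.0000,33.5000
```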

Our goal is to perform a spatial join between the points dataset and the rectangles dataset and return the number of points inside each rectangle (including points on the rectangle boundary).

 

**Required Tasks**

Part A (Parallel Spatial Join with PostGIS)

We load the points and rectangles datasets into two PostgreSQL tables: points and rectangles, respectively. The points table has a geometry-type column geom which denotes a Point geometry. The rectangles table has a geometry-type column geom which denotes an Envelope geometry (rectangle). You are given a Python file Assignment2_Interface.py with an incomplete function parallelJoin. You need to complete this function. It has the following parameters: pointsTable, rectsTable, outputTable, outputPath, and openConnection. The high-level objective is to find the number of points inside each rectangle and save the result to outputTable and outputPath. Perform the following tasks (a sketch of one possible approach follows the list):

  • Create four partitions/fragments of both pointsTable and rectsTable. You should use space partitioning, such that all points or rectangles of a partition lie within the corresponding fragment. Fragments do not need to satisfy the disjointness property.
  • Run four parallel threads. Each thread should perform a spatial join between a fragment of pointsTable and a fragment of rectsTable. The purpose of the join is to find the number of points (pointsTable.geom) inside each rectangle or Envelope (geom) within the corresponding fragment. You must use the ST_Contains method supported by PostGIS.
  • Inside each parallel thread, sort the output of its fragment in ascending order of point counts.
  • Merge the outputs of the four parallel joins into outputTable in ascending order of point counts. You may design the structure of outputTable as you wish, but it must have a column named points_count containing the point counts.
  • Write the point counts to outputPath in ascending order. You do not need to write the rectangle coordinates.
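
The sketch below outlines one possible way to put these steps together. It is a minimal sketch, not the official solution: it assumes openConnection() returns a fresh psycopg2 connection on each call, uses vertical strips over longitude as the space-partitioning scheme, and interpolates table names directly into SQL for brevity.

```python
import threading

NUM_FRAGMENTS = 4

def parallelJoin(pointsTable, rectsTable, outputTable, outputPath, openConnection):
    con = openConnection()
    cur = con.cursor()

    # Partition space into four vertical strips over the x-range of the
    # rectangles' left edges, so every rectangle falls into exactly one strip.
    cur.execute(f"SELECT MIN(ST_XMin(geom)), MAX(ST_XMin(geom)) FROM {rectsTable};")
    xmin, xmax = cur.fetchone()
    step = (xmax - xmin) / NUM_FRAGMENTS

    def joinFragment(i):
        lo = xmin + i * step
        hi = xmin + (i + 1) * step
        hi_op = "<=" if i == NUM_FRAGMENTS - 1 else "<"  # close the last strip on the right
        c = openConnection()  # one connection per thread; connections are not shared
        k = c.cursor()
        # Rectangle fragment i: rectangles whose left edge lies in strip i.
        # Point fragment i: all points with x >= lo. The point fragments
        # overlap (disjointness is not required), which guarantees that every
        # point a fragment's rectangles can contain lies inside that fragment.
        k.execute(f"DROP TABLE IF EXISTS {outputTable}_frag{i};")
        k.execute(f"""
            CREATE TABLE {outputTable}_frag{i} AS
            SELECT COUNT(p.geom) AS points_count
            FROM {rectsTable} r
            JOIN {pointsTable} p ON ST_Contains(r.geom, p.geom)
            WHERE ST_XMin(r.geom) >= {lo} AND ST_XMin(r.geom) {hi_op} {hi}
              AND ST_X(p.geom) >= {lo}
            GROUP BY r.geom
            ORDER BY points_count ASC;""")
        c.commit()
        c.close()

    threads = [threading.Thread(target=joinFragment, args=(i,)) for i in range(NUM_FRAGMENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the four sorted fragment outputs into outputTable, ascending.
    union = " UNION ALL ".join(
        f"SELECT points_count FROM {outputTable}_frag{i}" for i in range(NUM_FRAGMENTS))
    cur.execute(f"DROP TABLE IF EXISTS {outputTable};")
    cur.execute(f"CREATE TABLE {outputTable} AS "
                f"SELECT points_count FROM ({union}) u ORDER BY points_count ASC;")
    con.commit()

    # Write only the counts to outputPath, one per line, in ascending order.
    cur.execute(f"SELECT points_count FROM {outputTable} ORDER BY points_count ASC;")
    with open(outputPath, "w") as f:
        for (count,) in cur.fetchall():
            f.write(f"{count}\n")
    con.close()
```

Note that the inner join drops rectangles containing no points; switch to a LEFT JOIN (with the point predicate moved into the ON clause) if zero counts must appear in the output.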

Setup Instructions for Part A: You should have PostgreSQL >= 13 installed on your machine, with the PostGIS extension installed on top of it.

Naming Conventions to be Followed: You should not change the following naming conventions (an illustrative connection helper follows the list):

  • Database Name: dds_assignment2
  • Postgres Username: postgres
  • Postgres Password: 12345
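
For illustration, a psycopg2 connection consistent with these conventions might look like the sketch below; the host and port are assumptions (a default local installation), not part of the assignment.

```python
import psycopg2

# Hypothetical helper: connects using the required naming conventions.
# host/port assume a default local installation (assumption, not specified).
def openConnection():
    return psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="dds_assignment2",
        user="postgres",
        password="12345",
    )
```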

 

 

Part B (Spatial Join with Map Reduce on Apache Sedona)

Similar to Part A, you need to perform a spatial join between the points dataset and rectangles dataset in order to count the number of points located within each rectangle. This time, you will perform the task with the help of map and reduce functionalities supported by Spark.

Task: You are given an Apache Sedona project named Map-Reduce-Apache-Sedona. Find the Scala file SparkMapReduce.scala and complete the incomplete function runMapReduce(). Instead of using a group-by operation, you must use Spark's map and reduceByKey operations to complete the task. Convert the resultant RDD back to a DataFrame before returning the result. The returned DataFrame should contain the number of points within each rectangle, in ascending order of count values. You do not need to return the rectangles. (A sketch of this pattern follows below.)
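
For orientation, here is a minimal sketch of the map/reduceByKey counting pattern. It assumes a DataFrame (called joinResult here, an illustrative name) that already holds one row per (rectangle, contained point) pair produced by the Sedona spatial join; the actual variable and column names in SparkMapReduce.scala will differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: count points per rectangle with map and reduceByKey, not group by.
def countPerRectangle(spark: SparkSession, joinResult: DataFrame): DataFrame = {
  import spark.implicits._

  val counts = joinResult.rdd
    .map(row => (row.get(0).toString, 1)) // map: emit (rectangle, 1) per contained point
    .reduceByKey(_ + _)                   // reduce: sum the ones per rectangle
    .map { case (_, n) => n }             // drop the rectangles; only counts are returned

  counts.toDF("count").sort("count")      // back to a DataFrame, ascending by count
}
```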

Setting Up Hadoop and Apache Spark Test Environment:

The following setup instructions are specific to the Ubuntu operating system:

  • Install Java version >= 1.8 and set up the JAVA_HOME environment variable. If you don’t know how to set up the JAVA_HOME environment variable, follow this link: https://askubuntu.com/questions/175514/how-to-set-java-home-for-java. To check that the setup is done, run the command echo $JAVA_HOME on your command line; you should see the path.
  • Download Hadoop-2.7.7 from https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz and extract it to your desired location. Next, set up the HADOOP_HOME environment variable. Run the command sudo nano ~/.bashrc, add the line export HADOOP_HOME=<path to the hadoop-2.7.7 folder> to the opened file, then save and close it. Run the command source ~/.bashrc and check the correctness of the setup with echo $HADOOP_HOME.
  • Download Spark-2.4.7 from https://drive.google.com/file/d/1XlRcJNoGTaf0YvVv8pwZzzjrCqGIRA/view?usp=sharing and extract it to your desired location. Next, set up the SPARK_HOME environment variable. Run the command sudo nano ~/.bashrc, add the line export SPARK_HOME=<path to the spark-2.4.7-bin-hadoop2.7 folder> to the opened file, then save and close it. Run the command source ~/.bashrc and check the correctness of the setup with echo $SPARK_HOME. If you browse to SPARK_HOME, you should see spark-submit under the bin folder, which is required for your testing.