CSE6250: Big Data Analytics in Healthcare Homework 4 Solution

40.00 $

Categories: ,

Description

5/5 - (4 votes)

Overview

Patients often exhibit highly complex clinical presentations in the clinic, making it difficult to determine optimal treatment solutions or understand health risks in any particular patient.

Meanwhile, electronic health record systems and health records provide rich information on aspects of patients such as diagnosis and medication histories. These data can be leveraged in order to identify patients that are similar to each other via patient similarity algorithms. The insight from patient similarity may be used for applications such as allocation of resources, determining targeted treatment plans, or constructing cohorts for predictive modeling studies.

There are several strategies for patient similarity, including graph based algorithms. In this homework, you will study related concepts and implement simple algorithms to compute patient similarity. You will be required to implement those algorithms in Spark GraphX using Scala.

1  Heterogeneous patient graph [25 points]

Graphical models are one way to represent patient EHR data. Unlike the traditional approaches for data storage, such as relational databases, graphical models can give us insight into the relations among patients, diagnosis, medications, labs, etc. Not much research has been done on using graphs in healthcare applications, needless to say, there is no existing implementation that uses Spark GraphX to construct a patient graph and perform various analyses using those new big data tools.

Implement code that consumes patient, diagnosis, medication, and lab input extracted from the MIMIC II database and return a GraphX model. You will use this model in subsequent steps to perform additional tasks. Your algorithm should be implemented in the GraphLoader.load function. Please do not modify the function declaration. You will lose points for doing so.

The following files will be provided for you to construct the graph (ensure that those files reside in your data directory):

  • csv: Each line represents a patient with some demographics, such as gender and date of birthday (dob), and date of death (dod).
  • csv: Each line represents a diagnosis for a corresponding patient ID. In addition to the diagnosis and patient ID the file contains other information such as the date and diagnosis sequence (primary, secondary, etc.).
  • csv: Each line represents a medication for a corresponding patient. med name represents the name of the medication.
  • csv: Each line represents a lab result for a patient. The name of the lab, the units for the lab, and the value for the lab are found in specific columns on this file.

Important note: every record in the diagnostic, medication and lab CSV files corresponds to an edge in the graph, representing an event. Therefore, a single patient can have multiple events related to the same diagnosis, medication or lab. Moreover, you will only create a single edge between a patient and diagnosis in the graph using the most recent event information. If there are two same diagnosis given on the same day, just take one of them. The same applies for medications and labs. To make it more clear, suppose we have the sample diagnostic data in the Table 1 below, you will create an edge for the event in the highlighted row only.

Table 1: Sample diagnostic data

subject id date enc id code sequence
3 211574 2075 774.6 1
3 249345 2099 774.6 1
3 507510 2125 774.6 2
. . . . .

Your task is to use the files above to generate a bipartite graph in GraphX containing patient, diagnosis, medication and lab vertices. You will then create edges that will only connect patients to diagnosis, medication and lab. Details about each vertex and edge follows:

Patient vertex: a vertex containing patient related information stored in a PatientProperty class which extends VertexProperty. The PatientProperty class contains the fields:

  • subject id
  • sex
  • dob: date of birth
  • dod: date of death

Diagnostic vertex: a vertex containing diagnosis related information stored in a DiagnosticProperty class which extends VertexProperty. The DiagnosticProperty class contains the follow fields:

  • code: the ICD9 diagnosis code

Lab result vertex: a vertex containing lab result information stored in a LabResultProperty class which extends VertexProperty. The LabResultProperty class contains the fields:

  • lab name: name associated with the lab result

Medication vertex: a vertex containing medication related information stored in a MedicationProperty class which extends VertexProperty. The MedicationProperty class contains the fields:

  • med name: medication name

The graph should contain three types of edges: patient-lab, patient-diagnostic and patientmedication. Similar to the vertices, each of those edges also have properties and are defined as follows:

  • Patient-lab edge: an edge containing information linking a patient to a lab result, which is stored in a PatientLabEdgeProperty class which extends EdgeProperty. The PatientLabEdgeProperty class contains labResult which is of LabResult class defined in models.
  • Patient-diagnostic edge: an edge containing information linking a patient to a diagnosis, which is stored in a PatientDiagnosticEdgeProperty class which extends EdgeProperty. The PatientDiagnosticEdgeProperty class contains diagnostic, which is a Diagnostic class defined in models.
  • Patient-medication edge: an edge containing information linking a patient to a medication, which is stored in a PatientMedicationEdgeProperty class which extends EdgeProperty. The PatientMedicationEdgeProperty class contains medication, which is a Medication class defined in models.

Notice that there are no edges between patients, or between diagnosis, medications and labs.

From this section you are to perform the following tasks:

  • Construct patient heterogeneous graph as discussed above.
  • All edges in the graph should be bi-directional.
  • Make sure for patient vertices you use the patientID as a VertexId and for other types of vertices generate vertex IDs.
  • Please implement your code in load(). DO NOT change the method signature and you are allowed to add any other secondary methods that you can call from these two methods.

2             Compute Jaccard coefficient [25 points]

Jaccard coefficient is one of the simplest approaches for computing similarities among objects. For instance, given two patients each described by a set of diagnosis, medication and lab results such that Pi = {Dx1,Rx3,Lab6…,} and Pj = {Lab3,Dx2,Rx5…,} the Jaccard similarity between the two patients is given by

Two patients are completely similar if sij = 1 and dissimilar if sij = 0.

Using the Jaccard similarity, you are to perform the following tasks:

  • Please implement your code in Jaccard.jaccardSimilarityOneVsAll(). DO NOT change the method signature and you are allowed to add any other secondary methods that you can call from these two methods. Please do not modify the function declaration. You will lose points for doing so.

3             Random walk with restart [25 points]

Random walk with restart (RWR) is a simple variation of PageRank. With PageRank, you start at a graph vertice and move to one of the adjacent vertices at each step. You also have a random probability where you jump to a random vertice instead of one of the adjacent vertice. With RWR, you also have a random jump probability (a.k.a reset probability), but instead of jumping to a random vertice you jump to the vertice you began with.

The RWR algorithm will compute the random walk among all vertices in the graph. If there are n patients, d diagnosis, m medications and l labs, then the output of RWR is a vector of k elements, where k = n+d+m+l is the number of vertices in the graph. Refer to http://www.cs.cmu.edu/~jimeng/papers/icdm05.pdf for more details about RWR [1].

  • Implement RWR by completing the RandomWalk.randomWalkOneVsAll() method in the RandomWalk object. Please implement your RWR on your own. You can refer to the GraphX library but do not directly use the existing function. Your RWR by default should be able to run for 100 iterations using a reset probability of 0.15 and return only the top 10 similar patients ignoring similarities between medications, diagnostics, and labs.

4             Power Iteration Clustering [20 points]

Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a graph given pairwise similarties as edge properties. MLlib includes an implementation of PIC, which takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be nonnegative. PIC assumes that the similarity measure is symmetric. A pair (srcId, dstId) regardless of the ordering should appear at most once in the input data. You may use print statements for debugging but comment any print statements you added before submitting.

  • For this question, your task is computing pairwise similarities between all patients. Please implement your code in jaccardSimilarityAllPatients(). DO NOT change the method signature and you are allowed to add any other secondary methods that you can call from this method. In Main.main you will see how this method is invoked [10 points]
  • Please complete runPIC(). It is just a kind of wrapper to call Spark’s built-in PIC implementation. You need to pass all pair similarities you get from the previous question as input for this function. Then, you can pass it though Spark’s PIC implementation with the proper configuration. Please refer to PIC doc in spark. Use three clusters and 100 for maximum iterations. You have to return the clustering result as RDD[(patientID, clusterLabel)] where the type of variables are patientID: Long and clusterLabel: Int. [10 points]

5             Submission[5 points]

The folder structure of your submission should match the folder structure listed below or your code will not be graded. You can display your folder structure using the tree command. All unrelated files will be discarded during testing. It is your duty to make sure your code can be compiled with the provided SBT.

<your gtid>-<your gt account>-hw4

|– build.sbt

|– project

|                 |– build.properties

|                — plugins.sbt

|– src

|           — main

|                       — scala

|                               — edu

|                                          — gatech

|                                                    — cse6250

|                                                                |– clustering

|                                                          |                      — PowerIterationClustering.scala

|                                                                 |– graphconstruct

|                                                          |                — GraphLoader.scala

|                                                              |– helper

|                                                          |               — CSVHelper.scala

|                                                          |                 — SparkHelper.scala

|                                                              |– jaccard

|                                                          |               — Jaccard.scala

|                                                            |– main

|                                                                       |              — Main.scala
| |– model
|                                                                       |              — models.scala
| — randomwalk
| — Randomwalk.scala

Create a tar archive of the folder above with the following command and submit the tar file.

tar -czvf <your gtid>-<your gt account>-hw4.tar.gz

<your gtid>-<your gt account>-hw4

References

[1] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 8–pp. IEEE, 2005.

  • CSE_8803_Graphs-using-Patient-Data.zip