CS8803 Module2 Solved


In this assignment, you’ll begin the process of exploring relationships in data. You’ll accomplish this task by computing some basic statistical measures on one of three datasets. This is a good time to learn or reboot your Python coding skills.

Step 1: Select one of the datasets for completion of this assignment:

  • [mental-health-in-tech-survey.csv] Mental Health in Tech Survey: Survey on Mental Health in the Tech Workplace in 2014 – https://osmihelp.org/research/

Dependent Variables:

  • treatment: Have you sought treatment for a mental health condition? (Yes/No)
  • mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? (Yes/Maybe/No)
  • phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? (Yes/Maybe/No)
  • [diabetic_data.csv] Diabetes 130 US hospitals for years 1999-2008: Diabetes – readmission – https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Dependent Variables:

  • time_in_hospital: a numeric value representing number of days between admission and discharge
  • readmitted: Days to inpatient readmission – “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.
  • [compas-scores-two-years.csv] COMPAS Recidivism Racial Bias: Racial Bias in inmate COMPAS reoffense risk scores for Florida (ProPublica) – https://github.com/propublica/compasanalysis

Dependent Variables:

  • decile_score: a numeric value between 1 and 10 corresponding to the recidivism risk score generated by the COMPAS software (a smaller number corresponds to a lower risk, a larger number to a higher risk).
  • two_year_recid: a numeric indicator of whether the defendant recidivated within two years of the previous charge (0: no, did not recidivate; 1: yes, did recidivate)

 

Step 2: Explore the data by answering the following questions:

  • Which dataset did you select?
  • How many observations are in the dataset?
  • How many variables are in the dataset?
  • Does this dataset seem to belong to a regulated domain in law as discussed in the lectures? If yes, which one?
  • How many variables in the dataset are associated with a legally recognized protected class? In a table, list the variables associated with a protected class, and for each one identify the protected class and the associated legal precedent/law as discussed in the lectures.
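The first two counts can be read straight from the CSV file. The following is a minimal sketch using only Python's standard library, assuming the file has a single header row that names the variables. The helper name `dataset_shape` and the inline sample rows are hypothetical and stand in for whichever file you selected in Step 1.

```python
import csv
import io

def dataset_shape(csv_file):
    """Return (number of observations, number of variables) for an open CSV file.

    Assumes the first row is a header naming the variables.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    n_observations = sum(1 for row in reader if row)  # skip blank lines
    return n_observations, len(header)

# Hypothetical stand-in data; in practice pass something like
# open("diabetic_data.csv", newline="") instead.
sample = io.StringIO("nationality,pregnant,decision\nUS,N,Y\nMX,Y,N\nUS,N,Y\n")
n_obs, n_vars = dataset_shape(sample)
print(f"Number of Observations: {n_obs:,}")
print(f"Number of Variables: {n_vars}")
```

Identifying the regulated domain and the protected class variables still requires reading the column descriptions by hand; the code only supplies the counts.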

Example Output (associated with a different dataset):

Dataset: Housing Decisions in Metro-Atlanta

Number of Observations: 1,400

Number of Variables: 16

Regulated Domain in Law: Housing (Fair Housing Act)

Number of Protected Class Variables: 2

Variable          Protected Class    Law
nationality       National origin    Civil Rights Act of 1964, 1991
pregnant (y/n)    Pregnancy          Pregnancy Discrimination Act

 

 

Step 3: Determine the relationships between dependent and independent variables

The frequency of a value represents the number of times a value occurs in a data set. Compute the frequency of each value associated with each dependent variable (listed in Step 1) as a function of all of the protected class variables (independent variables) identified in Step 2. Create histogram(s) comparing the frequency values of the dependent variable as a function of the independent variable. Hint: For variables that are continuous, you might consider creating intervals that represent the data. For categorical/ordinal/nominal values, you might consider converting to numerical values.

 

Example Output for One Dependent-Independent Variable Combination:

Independent Variable          Dependent Variable
(Protected Class Variable)    (Housing Decision, Y/N)
Pregnant – Y                  Frequency of Y: 50     Frequency of N: 120
Pregnant – N                  Frequency of Y: 130    Frequency of N: 20
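One way to produce such a frequency table is to count (independent value, dependent value) pairs. The sketch below uses only the standard library and rebuilds the example counts above as individual records. The function names `cross_frequency` and `to_interval` and the column names `pregnant` and `decision` are illustrative assumptions, not part of any dataset in Step 1.

```python
from collections import Counter

def cross_frequency(rows, independent, dependent):
    """Count how often each (independent value, dependent value) pair occurs."""
    return Counter((row[independent], row[dependent]) for row in rows)

def to_interval(value, width):
    """Label the interval of the given width that contains a non-negative number."""
    low = (int(value) // width) * width
    return f"{low}-{low + width - 1}"

# Rebuild the example counts as individual records (illustrative data only).
rows = (
    [{"pregnant": "Y", "decision": "Y"}] * 50
    + [{"pregnant": "Y", "decision": "N"}] * 120
    + [{"pregnant": "N", "decision": "Y"}] * 130
    + [{"pregnant": "N", "decision": "N"}] * 20
)

freq = cross_frequency(rows, "pregnant", "decision")
for group in ("Y", "N"):
    print(f"Pregnant - {group}: "
          f"Frequency of Y: {freq[(group, 'Y')]}  Frequency of N: {freq[(group, 'N')]}")

# Per the hint, a continuous variable can be binned before counting.
print(to_interval(37, 10))  # 30-39
```

The resulting counts can be passed to any plotting tool to draw the histogram; grouping the bars by the independent variable keeps the comparison readable.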

 

 

Step 4: Show how to manipulate data

Select one protected class variable (independent variable) and one dependent variable. 1) Create a graph to support the “fairness” hypothesis: The system is fair. There is no difference in the outcomes. 2) Create a graph to support the bias hypothesis: The system is biased. There is a difference in the outcomes. For each, provide a brief description of your manipulations.

 

Example Output:

 

  • Fair Hypothesis: As seen from this graph, housing decisions are not dependent on the pregnancy status of women. [Manipulations: Used line graph; Increased Scale to ±50; Mapped the ratio of positive Y decisions (i.e., 50/180 versus 130/180); No label on the Y-Axis].

[Figure: Difference in Housing Decisions Based on Pregnancy]

 

  • Bias Hypothesis: As seen from this graph, housing decisions are significantly dependent on the pregnancy status of women. [This hypothesis was easily supported by the data, so it did not require much manipulation: Used stacked bar graph; Reduced Scale; Reworded labels].
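The two framings in the example differ mainly in which quantity is plotted and how wide the axis is. The sketch below computes those numbers from the Step 3 example counts without drawing anything. It assumes that "Increased Scale to ±50" means padding the y-axis by 50 units on each side, which is one reading of the example rather than a stated rule, and all function names are hypothetical.

```python
# Counts from the Step 3 example, as counts[pregnant][decision] (illustrative data only).
counts = {"Y": {"Y": 50, "N": 120}, "N": {"Y": 130, "N": 20}}

def share_of_approvals(counts):
    """Each group's share of all positive decisions (50/180 vs 130/180 here)."""
    total_yes = sum(group["Y"] for group in counts.values())
    return {name: group["Y"] / total_yes for name, group in counts.items()}

def approval_rate(counts):
    """Positive decisions divided by the group's own size."""
    return {name: group["Y"] / (group["Y"] + group["N"])
            for name, group in counts.items()}

def axis_limits(values, padding):
    """Y-axis range around the data; larger padding visually flattens differences."""
    return min(values) - padding, max(values) + padding

shares = share_of_approvals(counts)
rates = approval_rate(counts)

# "Fair" framing: an axis padded by 50 makes the gap between the shares look flat.
fair_ylim = axis_limits(list(shares.values()), padding=50)
# "Bias" framing: a tight axis emphasises the gap between the per-group rates.
bias_ylim = axis_limits(list(rates.values()), padding=0.05)

print("fair framing:", shares, fair_ylim)
print("bias framing:", rates, bias_ylim)
```

With a plotting library such as matplotlib, the limits could be applied through `Axes.set_ylim`. The data are identical in both framings; only the presentation changes, which is the point of this step.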

 

 

 

 

Step 5: Given your selected protected class variable (independent variable), calculate the averages (mean, median, and mode) of the protected class group (Hint: Variables might need to be converted to numerical values). Then use random sampling to select 50% of the data and create a reduced dataset. Recalculate the averages (mean, median, and mode) of the protected class group on the reduced dataset. Indicate whether each average differs between the original dataset and the reduced dataset. Provide all results.

 

Protected Class Variable (Pregnant)    Mean             Median        Mode
Original Data Set                      0 (NO)           0 (NO)        0 (NO)
Reduced Data Set                       0 (NO)           1 (YES)       0 (NO)
Difference                             No Difference    Difference    No Difference
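A minimal sketch of this comparison using Python's `statistics` and `random` modules follows. The data are hypothetical (60 NO and 40 YES responses encoded as 0 and 1), and the helper names `averages` and `random_half` and the fixed seed are assumptions made for reproducibility, not requirements of the assignment.

```python
import random
import statistics

def averages(values):
    """Return the mean, median, and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

def random_half(rows, seed=0):
    """Simple random sample, without replacement, of 50% of the rows."""
    rng = random.Random(seed)  # fixed seed so the reduced dataset is reproducible
    return rng.sample(rows, k=len(rows) // 2)

# Hypothetical data: pregnant encoded as 1 (YES) / 0 (NO).
pregnant = [0] * 60 + [1] * 40

original = averages(pregnant)
reduced = averages(random_half(pregnant))
for name in ("mean", "median", "mode"):
    verdict = "No Difference" if original[name] == reduced[name] else "Difference"
    print(f"{name}: original={original[name]:.3f} "
          f"reduced={reduced[name]:.3f} -> {verdict}")
```

For a binary variable the mean is the proportion of YES values, so it will rarely match exactly after sampling. Whether a small change counts as a "difference" (for example, after rounding to the nearest category as in the table above) is a reporting choice worth stating in the write-up.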

 

Step 6: Given your reduced dataset from Step 5, repeat Step 3 (frequency and histogram) using your selected independent variable as a function of your selected dependent variable (from Step 4). Explain any differences (in no more than 2 sentences). If you used the random sampling method, would members associated with the protected class variable benefit or be harmed? Explain your reasoning (in no more than 2 sentences).
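To judge whether sampling helps or harms a group, it can be useful to compare each group's positive-outcome rate before and after the reduction. The sketch below does this for the housing example counts. The function name `outcome_rates`, the seed, and the (pregnant, decision) pair layout are illustrative assumptions.

```python
import random
from collections import Counter

def outcome_rates(rows):
    """For each protected-class value, the fraction of rows with a positive outcome."""
    totals, positives = Counter(), Counter()
    for group, outcome in rows:
        totals[group] += 1
        positives[group] += outcome == "Y"  # True counts as 1
    return {group: positives[group] / totals[group] for group in totals}

# Illustrative (pregnant, decision) pairs built from the Step 3 example counts.
rows = ([("Y", "Y")] * 50 + [("Y", "N")] * 120
        + [("N", "Y")] * 130 + [("N", "N")] * 20)

rng = random.Random(0)  # hypothetical seed, fixed for reproducibility
reduced = rng.sample(rows, k=len(rows) // 2)

full_rates = outcome_rates(rows)
reduced_rates = outcome_rates(reduced)
for group in full_rates:
    shift = reduced_rates.get(group, float("nan")) - full_rates[group]
    print(f"Pregnant - {group}: original rate={full_rates[group]:.3f} "
          f"shift after sampling={shift:+.3f}")
```

A shift close to zero for every group suggests the sample preserved the original relationship; a larger shift for one group is the kind of difference the two-sentence explanation should address.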

 

Step 7: Turn in a report documenting your outputs.

 
