Name: DATA303 Assignment 1- Inference for Continuous Response Models I Solved


Assignment 1 Questions

Q1. Data on US cancer mortality rates for over 3000 counties are provided in the dataset cancer_reg.csv, available on Blackboard. The data were obtained from the Data World website (https://data.world/nrippner/ols-regression-challenge). Read the dataset into R and use it to answer the questions that follow. We’ll use the subset of variables listed below:

• incidencerate: Mean per capita (100,000) cancer diagnoses¹
• medincome: Median annual income (dollars) per county²
• povertypercent: Percent of county population in poverty²
• studypercap: Per capita number of cancer-related clinical trials per county¹
• medianage: Median age (in years) of county residents²
• pctunemployed16_over: Percent of county residents aged 16 and over that are unemployed²
• pctprivatecoverage: Percent of county residents with private health coverage²
• pctbachdeg25_over: Percent of county residents aged 25 and over with bachelor’s degree as highest education attained²
• target_deathrate: Response variable. Mean per capita (100,000) cancer mortalities¹

¹ Years 2010-2016
² 2013 Census Estimates

a. (6 marks) Create a new dataset called cancer2 that contains only the subset of variables listed above. Based on a summary of the variables in the dataset and the plots below, identify any variable or variables that have obviously incorrect values. For the variables you identify, write and implement code to filter out the incorrect values. Give the number of observations left in the dataset.

[Figure: scatterplots of cancer mortality (y-axis) against each predictor (x-axis). Panels: Mean cancer diagnoses per 100,000; Percent of population in poverty; Median income per county; Number of cancer-related clinical trials per county; Median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor’s degree as highest qualification.]
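
The subset-and-filter step in part (a) could be sketched in base R as follows. This is only an illustration under stated assumptions: the small data frame is invented and stands in for cancer_reg.csv, the column other_col is hypothetical, and the cut-off of 120 years for medianage is an assumption chosen to suit the toy values. Which variables to filter, and at what threshold, should come from your own summary and plots.

```r
# Toy stand-in for cancer_reg.csv (values invented for illustration only).
cancer <- data.frame(
  target_deathrate = c(165.2, 180.4, 199.1, 150.7),
  incidencerate    = c(430.5, 455.0, 489.9, 410.3),
  medianage        = c(39.4, 41.2, 498.0, 36.8),  # 498 is deliberately implausible
  other_col        = c("a", "b", "c", "d")        # hypothetical extra column to drop
)

# Keep only the variables of interest (extend this vector to the full list).
keep <- c("incidencerate", "medianage", "target_deathrate")
cancer2 <- cancer[, keep]

# Inspect the range of each variable to spot impossible values.
print(summary(cancer2))

# Drop rows with an implausible median age (the threshold is an assumption).
cancer2 <- subset(cancer2, medianage <= 120)

# Number of observations remaining after filtering.
n_left <- nrow(cancer2)
print(n_left)
```

With the real data, the first step would be something like `cancer <- read.csv("cancer_reg.csv")`, assuming the file sits in the working directory.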

b. (4 marks) Some data cleaning is done on cancer2 and a new dataset cancer3.csv (available on Blackboard) is created. Construct a scatterplot matrix of all variables in the new dataset. List any key points of note from the scatterplot matrix, including any considerations you might make during a regression analysis.

c. (3 marks) Fit a linear model to the data in cancer3, including all predictors with no transformations or interactions. Present a summary of the model in a table. Give an estimate of σ², the error variance.
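
As a rough illustration of part (c), the sketch below fits a linear model with all predictors and extracts the estimated error variance. The data are simulated (the variable names y, x1 and x2 are placeholders, not columns of cancer3). The usual estimate of σ² is the residual sum of squares divided by the residual degrees of freedom, n − p, where p counts all estimated coefficients including the intercept.

```r
# Simulated data for illustration only; not the cancer3 dataset.
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 450, sd = 50)
x2 <- rnorm(n, mean = 16, sd = 5)
y  <- 50 + 0.2 * x1 + 1.5 * x2 + rnorm(n, sd = 20)
toy <- data.frame(y, x1, x2)

# Fit the model with every other column as a predictor.
fit <- lm(y ~ ., data = toy)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))

# Estimated error variance: the squared residual standard error.
sigma2_hat <- sigma(fit)^2

# The same quantity computed by hand as RSS / (n - p).
rss          <- sum(residuals(fit)^2)
sigma2_check <- rss / df.residual(fit)
print(c(sigma2_hat = sigma2_hat, sigma2_check = sigma2_check))
```

The object coef_table is a plain numeric matrix, so it can be passed to whichever table-formatting function your report uses.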

d. (2 marks) Suppose two counties differ by 1 per 100,000 in mean cancer diagnoses with all else being equal. Based on the model fitted in part (c), what is the difference in expected cancer mortality for these two counties?
e. (2 marks) Does it make practical sense to interpret the intercept for the model in part (c)? Justify your answer.
f. (3 marks) The model fitted in part (c) is to be used to predict cancer mortality for a county with the predictor values below. Obtain 95% confidence and prediction intervals for such a county. Explain briefly why the prediction interval is wider than the confidence interval.

incidencerate: 452
medincome: 23000
povertypercent: 16
studypercap: 150
medianage: 40
pctunemployed16_over: 8
pctprivatecoverage: 70
pctbachdeg25_over: 50
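
A minimal sketch of how confidence and prediction intervals for a new observation can be obtained with predict() in base R. The data are simulated and the single predictor x is a placeholder. With the real model, new_obs would hold one row with the eight predictor values listed above.

```r
# Simulated data for illustration only.
set.seed(2)
n   <- 150
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x, data = toy)

# New observation: column names must match the predictors in the model.
new_obs <- data.frame(x = 5)

# 95% interval for the mean response at x = 5.
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# 95% interval for a single new response at x = 5.
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)

# Compare the interval widths.
width_conf <- conf_int[1, "upr"] - conf_int[1, "lwr"]
width_pred <- pred_int[1, "upr"] - pred_int[1, "lwr"]

# Check whether the new value lies inside the observed range of the predictor.
in_range <- new_obs$x >= min(toy$x) && new_obs$x <= max(toy$x)
print(in_range)
```

Both intervals are centred on the same fitted value. The confidence interval reflects only the uncertainty in the estimated mean response, while the prediction interval adds the error variance σ² of an individual observation, so it is always wider. The range check at the end matters because intervals computed for predictor values outside the observed data rely on extrapolation.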

g. (3 marks) Assuming all regression assumptions hold, are the intervals you obtained in part (f) likely to be valid? Explain your answer briefly.
h. (3 marks) Based on a global usefulness test, is it worth going on to further analyse and interpret a model of target_deathrate against each of the predictors? Carry out the test, give the conclusion and justify your answer.
i. (2 marks) The plots below are constructed from the cleaned dataset cancer3. Which predictors, if any, would you consider applying log or polynomial transformations to? Explain your answer briefly.

[Figure: scatterplots of cancer mortality (y-axis) against each predictor (x-axis) for the cleaned dataset cancer3. Panels: Mean cancer diagnoses per 100,000; Percent of population in poverty; Median income per county; Number of cancer-related clinical trials per county; Median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor’s degree as highest qualification.]
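
The global usefulness test referred to in Q1 is the overall F-test of the null hypothesis that all slope coefficients are zero. The sketch below shows two equivalent ways to obtain it in base R, using simulated data with placeholder variables x1 and x2 rather than cancer3.

```r
# Simulated data for illustration only.
set.seed(3)
n   <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.8 * toy$x1 - 0.5 * toy$x2 + rnorm(n)

# Full model and intercept-only (null) model.
full <- lm(y ~ x1 + x2, data = toy)
null <- lm(y ~ 1, data = toy)

# Method 1: F-test comparing the nested models.
f_tab <- anova(null, full)
print(f_tab)

# Method 2: the F statistic stored in the model summary
# (a named vector with elements value, numdf and dendf).
fs      <- summary(full)$fstatistic
p_value <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
print(fs)
print(p_value)
```

Both routes give the same F statistic. A small p-value is evidence that at least one predictor is linearly related to the response.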

Q2. Francis Galton’s 1866 dataset (cleaned) lists individual observations on height for 899 children. Galton coined the term “regression” following his study of how children’s heights related to heights of their parents. The data are available in the file galton.csv and contain the following variables:

• familyID: Family ID
• father: Height of father
• mother: Height of mother
• gender: Gender of child
• height: Height of child
• kids: Number of children in family
• midparent: Mid-parent height, calculated as (father + 1.08*mother)/2
• adltchld: height if gender = M, otherwise 1.08*height if gender = F

All heights are measured in inches.
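
The two derived variables above can be reproduced directly from their definitions. The sketch below uses a small invented data frame (not galton.csv) to compute midparent and adltchld, and then shows what lm() reports when a predictor is an exact linear combination of other predictors in the model. The kids variable is left out of the toy model for brevity.

```r
# Invented toy data for illustration only; not the Galton dataset.
toy <- data.frame(
  familyID = c(1, 1, 2, 2, 3, 3, 4, 4),
  father   = c(70, 70, 68.5, 68.5, 72, 72, 66, 66),
  mother   = c(64, 64, 66, 66, 63.5, 63.5, 65, 65),
  gender   = c("M", "F", "M", "F", "F", "M", "M", "F"),
  height   = c(71, 65, 69.5, 64, 66, 73, 68, 62.5)
)

# Derived variables, following the definitions in the variable list.
toy$midparent <- (toy$father + 1.08 * toy$mother) / 2
toy$adltchld  <- ifelse(toy$gender == "M", toy$height, 1.08 * toy$height)

# midparent is an exact linear function of father and mother, so the design
# matrix is rank deficient and lm() reports NA for the aliased coefficient.
fit <- lm(height ~ father + mother + gender + midparent, data = toy)
print(coef(fit))

# Number of distinct families in the toy data.
n_families <- length(unique(toy$familyID))
print(n_families)
```

The function alias(fit) lists which coefficients are linearly dependent on the others, which can help when diagnosing this kind of output.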

a. (3 marks) Read the data into R and fit a linear model for height with the variables father, mother, gender, kids and midparent as predictors. Provide a summary of the fitted model. You will notice that estimates for midparent are listed as NA. Why might this be the case and what regression problem does this point to?
b. (2 marks) What action might you take to resolve the problem identified in part (a)?
c. (2 marks) Based on the model fitted in part (a) give an interpretation of the coefficient for genderM.
d. (2 marks) Determine the number of families in the dataset.
e. (3 marks) The problem in part (a) is resolved and a new linear model is fitted. No observations are excluded. The plots below are obtained to investigate regression assumptions for this new model. Based on your answer in part (d) and the plots below, do the data meet all the regression assumptions? Explain your answer briefly.

[Figure: regression diagnostic plots for the new model. Panels: Residuals vs Fitted (x-axis: Fitted values); Normal Q-Q (x-axis: Theoretical Quantiles); Scale-Location (x-axis: Fitted values); Residuals vs Leverage (x-axis: Leverage, with Cook’s distance contours). Labelled observations include 60, 126, 289, 479 and 815.]

Assignment total: 40 marks
