STAT823 Project -Datasets and Instructions Solved

30.00 $

Category:
Click Category Button to View Your Next Assignment | Homework

You'll get a download link with a: zip solution files instantly, after Payment

Securely Powered by: Secure Checkout

Description

Rate this product

Choose only one of the questions and use the provided RMardown Project Template or a similar LATEX Template to write your project report.

For the selected Data (question),

  1. Create a title of your research question from the objective of the study.
  2. The response and exposure variables for each dataset are provided. The identification number (idnum) variable is not part of the covariates of interest.
  3. Fit appropriate statistical model(s). (See provided Hint/suggestions.) Explore the data. You may have to transform the response variable or covariates and/or standardize some covariates if necessary. Check for correlations among the variables. Use all important covariates or perform variable selections using standard statistics methods.
  4. Check for goodness of fit of the models and select the best that fits the data well.
  5. Produce residual plots to check for model assumptions including independence/multicollinearity, equal variance, outliers and normality of residuals. 6. In your report, make sure to produce Tables for Descriptive/summary statistics
  6. Create a Table(s) of inferential statistics from the final model.
  7. Select only a few important graphs (scatterplots/line graphs, boxplots, barchars, etc) and show the relationship between the response and the covariates of interest.
  8. Write your full report (in pdf format) and draw conclusions based on your study objective(s).

You are required to make slides from your final report and record your presentation.

Submit both your project report (in pdf format) and the recorded slides/powerpoint presentation for grading.

Qn 1: IPO Dataset

Model Hint/Suggestion: Multiple logistic regression model

Private companies often go public by issuing shares of stock referred to as initial public offerings (IPOs). A study of 482 IPOs was conducted to determine what are the characteristics of companies that attract venture capital funding. The response of interest is whether or not a company was financed with venture capital funds (Variable: funding). Potential predictors include the face value of the company (in millions), the number of shares offered (in millions), and whether or not the company underwent a leveraged buyout. Each line of the data set has an identification number and provides information on 4 other variables for a single person. The 5 variables are:

Variable Variable Name Description
1 idnum: Identification number 1 482
2 funding: Venture capital funding Presence or absence of venture capital funding: 1 if yes; 0 otherwise
3 facevalue: Face value company Estimated face value of company from prospectus (in Million dollars)
4 shares: Number of shares offered Total number of shares offered (in Millions)
5 buyout: Leverage buyout Presence or absence of leveraged buyout: 1 if yes; 0 otherwise
  idnum funding facevalue shares buyout  
  1 0 1.2 3 0  
  2 0 1.45 1.45 1  
  3 0 1.5 0.3 0  
  4 0 1.53 0.51 0  
   
  479 0 143.24 11.02 1  
  480 0 159.5 7.25 0  
  481 0 165 11 0  
  482 0 234.6 9.2 0  

Qn 2: Prostate Cancer Dataset

Model Hint/Suggestion: Multiple linear regression model

A university medical center urology group was interested in the association between prostate-specific antigen (PSA level) is the response variable and a number of prognostic clinical measurements in men with advanced prostate cancer. Data were collected on 97 men who were about to undergo radical prostectomies. Each line of the data set has an identification number and provides information on 8 other variables for each person. The 9 variables are:

Table 2: Adapted in part from: Hastie, T. J.; R. J. Tibshirani; and J. Friedman. The Elements of Statistical Learning: Data Mining. Inference. and Prediction. New York: Springer-Verlag, 2001.

Variable Variable Name Description
1 idnum: Identification number 1 972
2 psa: PSA level Serum prostate-specific antigen level

(mg/ml)

3 cancerv: Cancer volume Estimate of prostate cancer volume (cc)
4 weight: Weight Prostate weight (gm)
5 age: Age Age of patient (years)
6 hyperplasia: Benign prostatic hyperplasia Amount of benign prostatic hyperplasia

(cm2)

7 seminal: Seminal vesicle invasion Presence Or absence of seminal vesicle invasion: 1 if yes; 0 otherwise
8 capsular: Capsular penetration Degree of capsular penetration (cm)
9 score: Gleason score Pathologically determined grade of disease using total score of two patterns (summed scores were either 6, 7, or 8 with higher scores indicating worse prognosis)
  idnum psa cancerv weight age hyperplasia seminal capsular score  
  1 0.65 0.56 15.96 50 0 0 0 6  
  2 0.85 0.37 27.66 58 0 0 0 7  
  3 0.85 0.6 14.73 74 0 0 0 7  
   
  95 170.72 18.36 29.96 52 0 1 11.7 8  
  96 239.85 17.81 43.38 68 4.76 1 4.76 8  
  97 265.07 32.14 52.98 68 1.55 1 18.17 8  

Qn 3: Website Developer Dataset

Model Hint/Suggestion: Start with a Poisson regression model, check for over-dispersion, if present, consider fitting a Negative Binomial regression model

Recall that for Poisson regression, one of the assumptions for a valid model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more generalized form of distribution used for ‘count’ response data, allowing for greater dispersion or variance of counts. In practice, it is quite common for the variance of the outcome to be larger than the mean. This is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables. When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution.

Management of a company that develops websites was interested in determining which variables have the greatest impact on the number of websites developed and delivered to customers per quarter (Response variable: Websites delivered). Data were collected on website production output for 13 three-person website development teams, from January 2001 through August 2002. Each line of the data set has an identification number and provides information on 6 other variables for thirteen teams over time. The 8 variables are:

Variable Variable Name Description
1 idnum: Identification number 1 73
2 delivered: Websites delivered Number of websites completed and delivered to customers during the quarter
3 backlog: Backlog of orders Number of website orders in backlog at the close of the quarter
4 teamnum: Team number 1 13
5 experience: Team experience Number of months team has been together
6 change: Process change A change in the website development process occurred during the second quarter of 2002: 1 if quarter 2 or 3, 2002; 0 otherwise
7 year: Year 2001 or 2002
8 quarter: Quarter 1,2,3, or 4
  idnum delivered backlog teamnum experience change year quarter  
  1 1 12 1 3 0 2001 1  
  2 2 18 1 6 0 2001 2  
  3 7 26 1 9 0 2001 3  
  4 2 28 1 12 0 2001 4  
   
  70 7 28 13 11 0 2001 4  
  71 7 36 13 14 0 2002 1  
  72 19 37 13 17 1 2002 2  
  73 12 26 13 20 1 2002 3  

Qn 4: Market Share Dataset

Model Hint/Suggestion: Multiple linear regression model

Company executives from a large packaged foods manufacturer wished to determine which factors influence the market share of one of its products (market share is the response variable). Data were collected from a national database (Nielsen) for 36 consecutive months. Each line of the data set has an identification number and provides information on 6 other variables for each month. The data presented here are for September, 1999, through August, 2002. The variables are:

Variable Variable Name Description
1 idnum: Identification number 1 36
2 marketshare: Market share Average monthly market share for product

(percent)

3 price: Price Average monthly price of product (dollars)
4 gnrpoints: Gross Nielson rating points An index of the amount of advertising exposure that the product received
5 discount: Discount price Presence or absence of discount price during period: 1 if discount, 0 otherwise
6 promotion: Package promotion Presence or absence of package promotion during period: 1 if promotion present, 0 otherwise
7 month: Month Month (Jan-Dec)
8 year: Year Year (1999 – 2002)
  idnum marketshare price gnrpoints discount promotion month year  
  1 3.15 2.2 498 1 1 Sep 1999  
  2 2.52 2.19 510 0 0 Oct 1999  
  3 2.64 2.29 422 1 1 Nov 1999  
  4 2.55 2.42 858 0 1 Dec 1999  
  NA  
  33 2.88 2.42 145 1 1 May 2002  
  34 2.8 2.52 270 1 0 Jun 2002  
  35 2.48 2.5 322 0 1 Jul 2002  
  36 2.85 2.78 317 1 1 Aug 2002  

Qn 5: Disease Outbreak Dataset

Model Hint/Suggestion: Multiple logistic regression model source = Book Website

Adapted in part from H.G. Dantes, J.S. Koopman, C.L. Addy, et. al., “Dengue Epidemics on the Pacific Coast of Mexico.” International Journal of Epidemiology 17 (1988), pp. 178 86

The data set below provides information from a study based on 196 persons selected in a probability sample within two sectors in a city. Assume that the response variable (main outcome of interest) is disease: (Disease status) which is coded 1 if the person has a disease or 0 if they do not have a disease. Each line of the dat set has an identification number (id) and provides information on 5 other variables (exposure/independent variables) for each person. The 6 variables are:

Variable Variable Name Description
1 id: Identification number 1 196
2 ageyrs: Age Age of person (in years)
3 ses: Socio-economic status 1 = upper, 2 = middle, 3 = lower
4 sector: Sector Sector within city, where: 1 = sector 1, 2 = sector 2
5 disease: Disease status 1 = with disease, 0 = without disease
6 savings: Savings account status 1 = has savings account, 0 = does not have savings account
  id ageyrs ses sector disease savings  
  1 33 1 1 0 1  
  2 35 1 1 0 1  
  3 6 1 1 0 0  
  4 60 1 1 0 1  
   
  193 10 3 1 0 1  
  194 31 3 1 0 0  
  195 85 3 1 0 1  
  196 24 2 1 0 0  

Qn 6: Mosquito larva infestation Dataset

Model Hint/Suggestion: Multiple Poisson regression and Negative Binomial regression Models

Recall that for Poisson regression, one of the assumptions for a valid model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more generalized form of distribution used for ‘count’ response data, allowing for greater dispersion or variance of counts. In practice, it is quite common for the variance of the outcome to be larger than the mean. This is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables. When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution.

Use the data set DHF99 from the R package epiDisplay. Type library(epiDisplay) then ?DHF99 to see more details about the dataset.

The main outcome of interest (response variable) is counts of water containers infested with mosquito larvae in a field survey. This is variable containers in the data.

library(epiDisplay) data(“DHF99”)

# create a new dataset to manipulate malaria <- DHF99 summ(malaria)

  ##

## No. of observations = 300

##

   
  ##          Var. name obs. mean             median s.d.          min. max.  
  ## 1 houseid                300 174.27 154.5           112.44 1 385  
  ## 2 village                   300 48.56 51                  32.25 1 105  
  ## 3 education 300 2.09                   1                 1.455 1 5  
  ## 4 containers 299 0.35                  0               1.01       0 11  
  ## 5 viltype                300 1.56         1                 0.754 1 3  

codebook(malaria)

##

##

##

## houseid          :       no

## obs. mean                median s.d.            min.       max.

## 300 174.273 154.5                  112.439 1             385

##

## ==================

## village            :         Village

## obs. mean              median s.d.          min.       max.

## 300 48.56 51                         32.253 1              105

##

## ==================

## education              :           Educational level

##                             Frequency Percent

## Primary 168 56.00
## Secondary 36 12.00
## High school 34 11.33
## Bachelor 25 8.33
## Other 37 12.33

##

## ==================

## containers              :           # infested vessels

## obs. mean              median s.d.          min.       max.

## 299 0.351 0                            1.014 0               11

##

## ==================

## viltype            :          Village type

##                 Frequency Percent

## rural           180      60 ## urban     72        24 ## slum           48        16

##

## ==================

  houseid village education containers viltype
1 1 22 Other 3 rural
2 2 22 Primary 1 rural
3 3 22 Primary 0 rural
4 4 22 Primary 0 rural
NA NA
297 382 39 Primary 0 rural
298 383 39 Primary 0 rural
299 384 39 Primary 0 rural
300 385 39 Primary 0 rural

 

  • Project-cyueid.zip