Description

Rate this product

Choose only one of the questions and use the provided RMardown Project Template or a similar L^ATEX Template to write your project report.

For the selected Data (question),

Create a title of your research question from the objective of the study.
The response and exposure variables for each dataset are provided. The identification number (idnum) variable is not part of the covariates of interest.
Fit appropriate statistical model(s). (See provided Hint/suggestions.) Explore the data. You may have to transform the response variable or covariates and/or standardize some covariates if necessary. Check for correlations among the variables. Use all important covariates or perform variable selections using standard statistics methods.
Check for goodness of fit of the models and select the best that fits the data well.
Produce residual plots to check for model assumptions including independence/multicollinearity, equal variance, outliers and normality of residuals. 6. In your report, make sure to produce Tables for Descriptive/summary statistics
Create a Table(s) of inferential statistics from the final model.
Select only a few important graphs (scatterplots/line graphs, boxplots, barchars, etc) and show the relationship between the response and the covariates of interest.
Write your full report (in pdf format) and draw conclusions based on your study objective(s).

You are required to make slides from your final report and record your presentation.

Submit both your project report (in pdf format) and the recorded slides/powerpoint presentation for grading.

Qn 1: IPO Dataset

Model Hint/Suggestion: Multiple logistic regression model

Private companies often go public by issuing shares of stock referred to as initial public offerings (IPOs). A study of 482 IPOs was conducted to determine what are the characteristics of companies that attract venture capital funding. The response of interest is whether or not a company was financed with venture capital funds (Variable: funding). Potential predictors include the face value of the company (in millions), the number of shares offered (in millions), and whether or not the company underwent a leveraged buyout. Each line of the data set has an identification number and provides information on 4 other variables for a single person. The 5 variables are:

Variable	Variable Name				Description
1	idnum: Identification number				1 − 482
2	funding: Venture capital funding				Presence or absence of venture capital funding: 1 if yes; 0 otherwise
3	facevalue: Face value company				Estimated face value of company from prospectus (in Million dollars)
4	shares: Number of shares offered				Total number of shares offered (in Millions)
5	buyout: Leverage buyout				Presence or absence of leveraged buyout: 1 if yes; 0 otherwise
		idnum	funding	facevalue		shares	buyout
		1	0	1.2		3	0
		2	0	1.45		1.45	1
		3	0	1.5		0.3	0
		4	0	1.53		0.51	0
		…	…	…		…	…
		479	0	143.24		11.02	1
		480	0	159.5		7.25	0
		481	0	165		11	0
		482	0	234.6		9.2	0

Qn 2: Prostate Cancer Dataset

Model Hint/Suggestion: Multiple linear regression model

A university medical center urology group was interested in the association between prostate-specific antigen (PSA level) is the response variable and a number of prognostic clinical measurements in men with advanced prostate cancer. Data were collected on 97 men who were about to undergo radical prostectomies. Each line of the data set has an identification number and provides information on 8 other variables for each person. The 9 variables are:

Table 2: Adapted in part from: Hastie, T. J.; R. J. Tibshirani; and J. Friedman. The Elements of Statistical Learning: Data Mining. Inference. and Prediction. New York: Springer-Verlag, 2001.

Variable		Variable Name						Description
1		idnum: Identification number						1 − 972
2		psa: PSA level						Serum prostate-specific antigen level (mg/ml)
3		cancerv: Cancer volume						Estimate of prostate cancer volume (cc)
4		weight: Weight						Prostate weight (gm)
5		age: Age						Age of patient (years)
6		hyperplasia: Benign prostatic hyperplasia						Amount of benign prostatic hyperplasia (cm²)
7		seminal: Seminal vesicle invasion						Presence Or absence of seminal vesicle invasion: 1 if yes; 0 otherwise
8		capsular: Capsular penetration						Degree of capsular penetration (cm)
9		score: Gleason score						Pathologically determined grade of disease using total score of two patterns (summed scores were either 6, 7, or 8 with higher scores indicating worse prognosis)
	idnum		psa	cancerv	weight	age	hyperplasia		seminal	capsular	score
	1		0.65	0.56	15.96	50	0		0	0	6
	2		0.85	0.37	27.66	58	0		0	0	7
	3		0.85	0.6	14.73	74	0		0	0	7
	…		…	…	…	…	…		…	…	…
	95		170.72	18.36	29.96	52	0		1	11.7	8
	96		239.85	17.81	43.38	68	4.76		1	4.76	8
	97		265.07	32.14	52.98	68	1.55		1	18.17	8

Qn 3: Website Developer Dataset

Model Hint/Suggestion: Start with a Poisson regression model, check for over-dispersion, if present, consider fitting a Negative Binomial regression model

Recall that for Poisson regression, one of the assumptions for a valid model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more generalized form of distribution used for ‘count’ response data, allowing for greater dispersion or variance of counts. In practice, it is quite common for the variance of the outcome to be larger than the mean. This is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables. When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution.

Management of a company that develops websites was interested in determining which variables have the greatest impact on the number of websites developed and delivered to customers per quarter (Response variable: Websites delivered). Data were collected on website production output for 13 three-person website development teams, from January 2001 through August 2002. Each line of the data set has an identification number and provides information on 6 other variables for thirteen teams over time. The 8 variables are:

Variable		Variable Name					Description
1		idnum: Identification number					1 − 73
2		delivered: Websites delivered					Number of websites completed and delivered to customers during the quarter
3		backlog: Backlog of orders					Number of website orders in backlog at the close of the quarter
4		teamnum: Team number					1 − 13
5		experience: Team experience					Number of months team has been together
6		change: Process change					A change in the website development process occurred during the second quarter of 2002: 1 if quarter 2 or 3, 2002; 0 otherwise
7		year: Year					2001 or 2002
8		quarter: Quarter					1,2,3, or 4
	idnum		delivered	backlog	teamnum	experience		change	year	quarter
	1		1	12	1	3		0	2001	1
	2		2	18	1	6		0	2001	2
	3		7	26	1	9		0	2001	3
	4		2	28	1	12		0	2001	4
	…		…	…	…	…		…	…	…
	70		7	28	13	11		0	2001	4
	71		7	36	13	14		0	2002	1
	72		19	37	13	17		1	2002	2
	73		12	26	13	20		1	2002	3

Qn 4: Market Share Dataset

Model Hint/Suggestion: Multiple linear regression model

Company executives from a large packaged foods manufacturer wished to determine which factors influence the market share of one of its products (market share is the response variable). Data were collected from a national database (Nielsen) for 36 consecutive months. Each line of the data set has an identification number and provides information on 6 other variables for each month. The data presented here are for September, 1999, through August, 2002. The variables are:

Variable		Variable Name					Description
1		idnum: Identification number					1 − 36
2		marketshare: Market share					Average monthly market share for product (percent)
3		price: Price					Average monthly price of product (dollars)
4		gnrpoints: Gross Nielson rating points					An index of the amount of advertising exposure that the product received
5		discount: Discount price					Presence or absence of discount price during period: 1 if discount, 0 otherwise
6		promotion: Package promotion					Presence or absence of package promotion during period: 1 if promotion present, 0 otherwise
7		month: Month					Month (Jan-Dec)
8		year: Year					Year (1999 – 2002)
	idnum		marketshare	price	gnrpoints	discount		promotion	month	year
	1		3.15	2.2	498	1		1	Sep	1999
	2		2.52	2.19	510	0		0	Oct	1999
	3		2.64	2.29	422	1		1	Nov	1999
	4		2.55	2.42	858	0		1	Dec	1999
	…		…	…	…	…		…	NA	…
	33		2.88	2.42	145	1		1	May	2002
	34		2.8	2.52	270	1		0	Jun	2002
	35		2.48	2.5	322	0		1	Jul	2002
	36		2.85	2.78	317	1		1	Aug	2002

Qn 5: Disease Outbreak Dataset

Model Hint/Suggestion: Multiple logistic regression model source = Book Website

Adapted in part from H.G. Dantes, J.S. Koopman, C.L. Addy, et. al., “Dengue Epidemics on the Pacific Coast of Mexico.” International Journal of Epidemiology 17 (1988), pp. 178 − 86

The data set below provides information from a study based on 196 persons selected in a probability sample within two sectors in a city. Assume that the response variable (main outcome of interest) is disease: (Disease status) which is coded 1 if the person has a disease or 0 if they do not have a disease. Each line of the dat set has an identification number (id) and provides information on 5 other variables (exposure/independent variables) for each person. The 6 variables are:

Variable	Variable Name						Description
1	id: Identification number						1 − 196
2	ageyrs: Age						Age of person (in years)
3	ses: Socio-economic status						1 = upper, 2 = middle, 3 = lower
4	sector: Sector						Sector within city, where: 1 = sector 1, 2 = sector 2
5	disease: Disease status						1 = with disease, 0 = without disease
6	savings: Savings account status						1 = has savings account, 0 = does not have savings account
		id	ageyrs	ses	sector	disease		savings
		1	33	1	1	0		1
		2	35	1	1	0		1
		3	6	1	1	0		0
		4	60	1	1	0		1
		…	…	…	…	…		…
		193	10	3	1	0		1
		194	31	3	1	0		0
		195	85	3	1	0		1
		196	24	2	1	0		0

Qn 6: Mosquito larva infestation Dataset

Model Hint/Suggestion: Multiple Poisson regression and Negative Binomial regression Models

Use the data set DHF99 from the R package epiDisplay. Type library(epiDisplay) then ?DHF99 to see more details about the dataset.

The main outcome of interest (response variable) is counts of water containers infested with mosquito larvae in a field survey. This is variable containers in the data.

library(epiDisplay) data(“DHF99”) # create a new dataset to manipulate malaria <- DHF99 summ(malaria)
	## ## No. of observations = 300 ##
	## Var. name obs. mean median s.d. min.	max.
	## 1 houseid 300 174.27 154.5 112.44 1	385
	## 2 village 300 48.56 51 32.25 1	105
	## 3 education 300 2.09 1 1.455 1	5
	## 4 containers 299 0.35 0 1.01 0	11
	## 5 viltype 300 1.56 1 0.754 1	3

codebook(malaria)

## houseid : no

## obs. mean median s.d. min. max.

## 300 174.273 154.5 112.439 1 385

## ==================

## village : Village

## obs. mean median s.d. min. max.

## 300 48.56 51 32.253 1 105

## ==================

## education : Educational level

## Frequency Percent

## Primary	168	56.00
## Secondary	36	12.00
## High school	34	11.33
## Bachelor	25	8.33
## Other	37	12.33

## ==================

## containers : # infested vessels

## obs. mean median s.d. min. max.

## 299 0.351 0 1.014 0 11

## ==================

## viltype : Village type

## Frequency Percent

## rural 180 60 ## urban 72 24 ## slum 48 16

## ==================

	houseid	village	education	containers	viltype
1	1	22	Other	3	rural
2	2	22	Primary	1	rural
3	3	22	Primary	0	rural
4	4	22	Primary	0	rural
…	…	…	NA	…	NA
297	382	39	Primary	0	rural
298	383	39	Primary	0	rural
299	384	39	Primary	0	rural
300	385	39	Primary	0	rural

Project-cyueid.zip

STAT823 Project -Datasets and Instructions Solved

If Helpful Share:

Description

Qn 1: IPO Dataset

Qn 2: Prostate Cancer Dataset

Qn 3: Website Developer Dataset

Qn 4: Market Share Dataset

Qn 5: Disease Outbreak Dataset

Qn 6: Mosquito larva infestation Dataset

Related products

STAT823 Homework 6- Exploratory Data Analysis (EDA) Solved

STAT823 Homework 9-Multiple Linear Regression Solved

STAT823 Homework 11-Logistic Regression Solved