---
title: "Lab 2: What Makes a Product Successful? Fall 2021"
author: 'w203: Statistics for Data Science'
date: "November 1, 2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Imagine that you are part of a team of product data scientists at Acme, Inc. Your manager, Mx. Coy Ote, has given you the freedom to choose your own product to investigate, and evaluate a way to make it more successful.
**Your task is to select and develop a research question, find appropriate data, then conduct a regression study.**
## Research Question
Your research question must be specific: it should clearly state an $X$ and a $Y$, defined at a conceptual level. Your $X$ should be a design property or characteristic of a product that could be modified in the production process, and your $Y$ should be a metric of success.
In selecting your research question, you will have to use the skills you developed in RDADA to work on a question that can be addressed using a modeling approach from this course. It is not appropriate to ask "What product features increase success?" or "How does product design affect sales?". These types of questions are not amenable to a modeling-based approach, and your study would likely become a fishing expedition. Instead, your team will have to use your background knowledge to identify a relationship you want to measure between a specific design feature and a specific metric of success.
If your data set is large enough, you can begin your process by splitting the data into an exploration set and a testing set. As a rough guideline, you might put 30\% of your data into the exploration set, but make sure that both sets have a minimum of 200 rows of data. Use the exploration set to build your intuition, explore how the data is distributed, and identify your $X$ and $Y$ variables. Then use the testing set to fit your models.
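As a minimal sketch of the split described above, the following uses simulated data; the data frame `products` and its columns are hypothetical placeholders, and the 30% share is only the rough guideline given in the text.

```r
# Hypothetical data: 1,000 products with one design feature and one success metric.
set.seed(203)
products <- data.frame(
  feature = rnorm(1000),
  success = rnorm(1000)
)

# Randomly assign roughly 30% of the rows to the exploration set.
n <- nrow(products)
explore_idx <- sample(n, size = round(0.3 * n))

exploration <- products[explore_idx, ]
testing     <- products[-explore_idx, ]

# Both sets should have at least 200 rows before you proceed.
stopifnot(nrow(exploration) >= 200, nrow(testing) >= 200)
```

Setting a seed makes the split reproducible, so every team member works with the same exploration and testing rows.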
Because your manager is interested in *changes* to a product, they are fundamentally asking you to perform an explanatory study. As we have noted in the class, given observational data, an OLS regression is usually not a credible way to measure a causal effect. We have purposefully selected a domain in which the one-equation structural model is at least partially defensible. The most prominent causal pathways will go in one direction, from product design characteristics to success. While not a perfect reflection of reality, we expect your model to be plausible enough to make your results interesting. At the same time, you will need to analyze potential violations of the one-equation structural model and what effect any violations may have on your results.
## Data
For this lab, you and your team will be responsible for gathering the data that you use. The data should be publicly available, and should be relevant to your research question. To increase the diversity of products investigated, we are asking students to avoid working on data that is sourced from Yelp and Airbnb. Many very good data resources are available, for example:
- [New York Times](https://open.nytimes.com/data/home)
- [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday)
- [ICPSR](https://www.icpsr.umich.edu/web/pages/) for social and political data
- [Data.world](https://data.world/datasets/products)
- [Dataverse](https://dataverse.harvard.edu) for published research data
- [UC Irvine Machine Learning Data Repository](https://archive.ics.uci.edu/ml/index.php)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [Amazon Open Data Registry](https://registry.opendata.aws)
- [Azure Open Data Registry](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-catalog)
**Requirements for your data:**
- Data should be cross-sectional (i.e. not have multiple measurements for a single unit of observation). You may, for example, take a single cross section from a larger panel.
- We recommend a minimum of 100 to 200 observations. A team could choose to use an interesting dataset that is smaller than this; however, the team would then need to assess and satisfy the more stringent CLM assumptions.
- The outcome (or outcomes) that you use should be plausibly metric (e.g. number of sales of a product or number of views of a video). For this lab, however, to make it easier to find data, teams may use an ordinal outcome variable if necessary. If using an ordinal outcome such as a 1-7 Likert scale, the team should clearly discuss the consequences of failing to satisfy the assumptions of the OLS regression model.
- For any omitted variable that would call your results into question, the data should include the best possible variable that operationalizes this concept. At a minimum, the data should have a variable that serves as an imperfect measure - or *proxy* - for the omitted variable.
- Your models must include a mixture of numeric and categorical inputs (this requirement is for learning purposes). If it is appropriate, you may bin a metric variable into a categorical variable.
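Two of the requirements above, taking a single cross section from a panel and binning a metric variable, can be sketched in base R. The panel, the column names, and the price cut points below are all hypothetical and chosen only for illustration.

```r
# Hypothetical panel: 3 yearly measurements for each of 100 products.
set.seed(203)
panel <- data.frame(
  product_id = rep(1:100, each = 3),
  year       = rep(2019:2021, times = 100),
  price      = round(runif(300, min = 5, max = 100), 2)
)

# Keep a single cross section so that each product appears exactly once.
cross_section <- panel[panel$year == 2021, ]
stopifnot(!any(duplicated(cross_section$product_id)))

# Bin the metric price variable into three ordered categories.
cross_section$price_tier <- cut(
  cross_section$price,
  breaks = c(0, 25, 60, Inf),
  labels = c("budget", "mid-range", "premium")
)

table(cross_section$price_tier)
```

Binning discards information, so it is worth checking in your exploration set that the categories you choose are meaningful for the product.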
You may draw different variables from different data sources. You may use data sources not on the above list. You must document any data source you use in your report.
## Example of a Research Question
Suppose that your team is interested in learning how the length of the lanyard attached to a catapult affects customers’ satisfaction with the catapult. (A classic question from [Roadrunner cartoons](https://youtu.be/r9ENQ5j2zG4?t=15).)
You work to develop a primary outcome: proportion of boulders that land on their target.
On Acme’s servers, you find data on lanyard length, maximum-rated weight for the catapult and sales region. However, when reasoning about the product, you note that the length of the catapult arm and the size of the catapult wheels are also likely to affect customer satisfaction and are correlated with lanyard length. Because any model that does not include these confounding variables would yield estimates that conflate the importance of wheels and arms with the lanyard, you determine that the off-the-shelf data is not complete and that you need to encode the data yourself.
In the modeling phase of your project, your team proposes to build three models. One model estimates the relationship between targeting accuracy and lanyard length by itself. A second model is similar, but adds a set of covariates including length of catapult arm and size of catapult wheels. Finally, a third model includes an interaction term between lanyard length and customer type (first time or repeat), allowing you to investigate whether the effect of lanyard length is heterogeneous depending on the person operating the catapult.
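The three specifications in this example might be written as follows. This is only a sketch on simulated data: the data frame `catapults`, its column names, and every coefficient used to generate it are assumptions made for illustration, not Acme data.

```r
# Simulated catapult data; all names and effect sizes are hypothetical.
set.seed(203)
n <- 500
catapults <- data.frame(
  lanyard_length = runif(n, 1, 5),
  customer_type  = factor(sample(c("first_time", "repeat"), n, replace = TRUE))
)
# Arm length and wheel size are correlated with lanyard length by construction.
catapults$arm_length <- 2 + 0.5 * catapults$lanyard_length + rnorm(n)
catapults$wheel_size <- 1 + 0.3 * catapults$lanyard_length + rnorm(n, sd = 0.5)
catapults$accuracy <- with(
  catapults,
  0.5 - 0.05 * lanyard_length + 0.04 * arm_length + 0.02 * wheel_size +
    rnorm(n, sd = 0.05)
)

# Model 1: the key relationship by itself.
model_1 <- lm(accuracy ~ lanyard_length, data = catapults)

# Model 2: adds the covariates identified as likely confounders.
model_2 <- lm(accuracy ~ lanyard_length + arm_length + wheel_size,
              data = catapults)

# Model 3: adds an interaction to allow a heterogeneous effect by customer type.
model_3 <- lm(accuracy ~ lanyard_length * customer_type + arm_length + wheel_size,
              data = catapults)

names(coef(model_3))
```

In `model_3`, the coefficient on the interaction term measures how the slope on lanyard length differs for repeat customers relative to first-time customers, the reference level.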
## A Group Assignment
**This is a group assignment.** Your live session instructor will coordinate the formation of groups. We would like to encourage teams to focus on using the lab as a way to learn how to work as a team of collaborating data scientists on shared code; how to clean and organize data; and, how to present work in a compelling way. As a result, we encourage teams to allow individuals to take risks and be supportive in the face of successes and failures. Create an opportunity for people who want to improve a particular skill to do so -- this might be project coordination, management of code through git, plotting, or any of the many aspects that you'll work on. *We hope that you can support and learn from one another through this team-based project.*
# Deliverables
| Deliverable Name | Week Due | Grade Weight |
|------------------------|----------|--------------|
| [Research Proposal] | Week 12 | 10% |
| [Within-Team Review] | Week 12 | 5% |
| [Final Presentation] | Week 14 | 10% |
| [Final Report] | Week 14 | 75% |
# Final Project Components {.tabset}
## Research Proposal
After a week of work, the project team will produce a one-page research proposal that defines the team's research question, data sources and plan of action.
The research question should be informed by an understanding of the data and information that is available. This means that the team will need to pursue at least some preliminary exploratory data analysis. A motivated team might form their research question, and begin to build a functioning data pipeline as an investment in ongoing project success.
The research proposal is intended to provide a structure for the team to have an early conversation with their instructor. It will be graded credit/no credit for completeness (i.e. a reasonable effort by the team will receive full marks). Your instructor will read these proposals and will contact the team with any necessary course corrections, suggestions, or feedback.
**This proposal is due in week 12, in Gradescope, with one submission for the whole team.**
## Within-Team Review
Being an effective, supportive team member is a crucial part of data science work. Your performance in this lab includes the role you play in supporting your teammates. This includes being responsive, creating an environment in which all members feel included, and above all treating each other with respect. In line with this perspective, we will ask each team member to write two paragraphs to their instructor about the progress they have made individually, and the team has made as a whole toward completing their report.
This self-assessment should:
- Reflect on the strengths and weaknesses of the team and the team's process to this point in the project.
- Where your collaboration has worked well, how will you work to ensure that these successful practices continue to be employed?
- If there are places where collaboration has been challenging, what can the team do jointly to improve?
- If there are any individual performances that deserve special recognition, please let your instructor know in this evaluation.
- If there are any individual performances that require special attention, please also let your instructor know in this evaluation.
Instructors will treat these reviews as confidential and will not take any action without first consulting you.
**This reflection is due in week 12, in Gradescope, and requires one submission per person.** Like all parts of your educational record, it will be treated confidentially by the instructional team.
## Final Presentation
During the Unit 14 live session, each team will give a presentation of their work to their classmates, who will be seated with you as collaborating data scientists. As collaborating data scientists, your classmates will need to be informed of the specific product and research question that you are addressing.
### Presentation Guidelines
- **The presentation should be structured as 10 minutes of presentation and 5 minutes of questions from your classmates.** Please note that this is an *incredibly* limited amount of time to present.
- There should be no more than two slides that set up your research question, and these slides should take no more than two minutes to present. On these slides, it is quite alright to state bluntly: "**Research Question**: Do shorter lanyards increase the accuracy of catapult launches?" (2 minutes)
- You should ground the audience in an understanding of the data that you are using in your models. Take a short amount of time to describe key attributes of the variables that you are including in your model. This should minimally include a description of the outcome and key explanatory feature, but should also include any other variables or context that is necessary to reason about your model and results. (2-3 minutes)
- Do not present R code or discuss data wrangling or normality; details like these are best left to the full analysis. It is tempting to want to share these process-based stories with your peers, but save them for after the presentation.
- There should then be several slides that present what you've learned from your models. It is a good practice to show your final regression table on a slide by itself. If you show a regression table, you need to provide your audience with enough time to read and engage with it. As a general rule, any model table you show will take at least two minutes to discuss. For any table (or plot) that you show, you should minimally interpret the variables (or axes) and the key point that you are making with that piece of evidence.
Finally, a few more general thoughts:
- Practice your talk with a timer!
- If you divide your talk with your teammates, practice your section with a timer so that you do not spill over into your teammates' time.
- There is no need to say, “Now I am going to hand it off to Becca.” And for Becca to say, “Thank you Adam.” Whoever’s turn it is to talk can simply move the presentation forward.
## Final Report
Your final deliverable is a written statistical analysis documenting your findings. **Please limit your submission to 6,000 words, excluding code cells and R output.**
The exact format of your report is flexible, but it should include the following elements.
### 1. An Introduction
Your introduction should present a research question and explain the concept that you're attempting to measure and how it will be operationalized. This section should pave the way for the body of the report, preparing the reader to understand why the models are constructed the way that they are. It is not enough to simply say, "We are looking for product features that enhance product success." Your introduction must do work for you, focusing the reader on a specific measurement goal, making them care about it, and propelling the narrative forward. This is also a good time to put your work into context, discuss cross-cutting issues, and assess the overall appropriateness of the data.
### 2. A Description of the Data and Research Design
After you have presented the introduction and the concepts that are under investigation, what data are you going to use to answer the questions? What type of research design are you using? What type of models are you going to estimate, and what goals do you have for these models?
### 3. A Model Building Process
You will next build a set of models to investigate your research question, documenting your decisions. Here are some things to keep in mind during your model building process:
1. *What do you want to measure*? Make sure you identify one, or a few, variables that will allow you to derive conclusions relevant to your research question, and include those variables in all model specifications. How are the variables that you will be modeling distributed? Provide enough context and information about your data for your audience to understand whatever model results you will eventually present.
2. What [covariates](https://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms) help you achieve your modeling goals? Are there problematic covariates, either due to *collinearity* or because they will absorb some of a causal effect you want to measure?
3. What *transformations*, if any, should you apply to each variable? These transformations might reveal linearities in the data, make your results relevant, or help you meet model assumptions.
4. Are your choices supported by exploratory data analysis (*EDA*)? You will likely start with some general EDA to *detect anomalies* (missing values, top-coded variables, etc.). From then on, your EDA should be interspersed with your model building. Use visual tools to *guide* your decisions. You can also leverage statistical *tests* to help assess whether variables, or groups of variables, are improving model fit.
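As one illustration of items 3 and 4, the sketch below log-transforms a right-skewed outcome and then uses an F-test to ask whether a group of added variables improves model fit. The data are simulated, and the data frame `videos` and its columns are hypothetical. Note that the classical F-test from `anova()` assumes homoskedastic errors, which you would need to assess for your own data.

```r
# Simulated video data; all names and effect sizes are hypothetical.
set.seed(203)
n <- 400
videos <- data.frame(
  length_min = runif(n, 1, 30),
  n_tags     = rpois(n, 5),
  has_intro  = factor(sample(c("no", "yes"), n, replace = TRUE))
)
# Views are right-skewed by construction, so a log transformation is a natural candidate.
videos$views <- exp(6 + 0.03 * videos$length_min + 0.05 * videos$n_tags +
                      rnorm(n, sd = 0.8))

restricted <- lm(log(views) ~ length_min, data = videos)
full       <- lm(log(views) ~ length_min + n_tags + has_intro, data = videos)

# F-test of the joint null that the coefficients on n_tags and has_intro are both zero.
f_test <- anova(restricted, full)
print(f_test)
```

A histogram of `views` next to a histogram of `log(views)` in your exploration set is a quick visual check of whether the transformation is doing what you expect.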
At the same time, remember that you are not trying to create one perfect model. You will create several specifications, giving the reader a sense of how robust (or sensitive) your results are to modeling choices and showing that you're not just cherry-picking the specification that leads to the largest effects.
You need to estimate at least three model specifications:
The first model should include *only the key variables* you want to measure. These variables might be transformed, as determined by your EDA, but the model should include the absolute minimum number of covariates (usually zero or one covariate that is so crucial it would be unreasonable to omit it).
Additional models should each be defensible, and should continue to tell the story of how product features contribute to product success. This might mean including additional right-hand side features to remove omitted variable bias identified by your causal theory; or, instead, it might mean estimating a model that examines a related concept of success, or a model that investigates a heterogeneous effect. These models and your modeling process should be defensible, incremental, and clearly explained at all points.
Your goal is to choose models that encircle the space of reasonable modeling choices, and to give an overall understanding of how these choices impact results.
### 4. A Results Section
You should display all of your model specifications in a regression table, using a package like [`stargazer`](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf) to format your output. It should be easy for the reader to find the coefficients that represent key effects near the top of the regression table, and scan horizontally to see how they change from specification to specification. Make sure that you display the most appropriate standard errors in your table.
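One way to put robust standard errors into a regression table is sketched below, under several assumptions. The data are simulated with heteroskedastic errors, and the helper `robust_se()` is a hypothetical name that computes HC1 standard errors directly from the sandwich formula in base R; in practice you might instead use `sandwich::vcovHC()`. The `stargazer` call runs only if the package is installed.

```r
# Simulated data; the error variance grows with x, so classical
# standard errors would be unreliable here.
set.seed(203)
n <- 300
d <- data.frame(x = runif(n, 0, 10), z = rnorm(n))
d$y <- 1 + 0.5 * d$x + 0.3 * d$z + rnorm(n, sd = 0.2 + 0.2 * d$x)

m1 <- lm(y ~ x, data = d)
m2 <- lm(y ~ x + z, data = d)

# Hypothetical helper: heteroskedasticity-robust (HC1) standard errors.
robust_se <- function(model) {
  X <- model.matrix(model)
  e <- residuals(model)
  bread <- solve(crossprod(X))   # (X'X)^{-1}
  meat  <- crossprod(X * e)      # sum of e_i^2 * x_i x_i'
  n_obs <- nrow(X)
  k     <- ncol(X)
  vcov_hc1 <- (n_obs / (n_obs - k)) * bread %*% meat %*% bread
  sqrt(diag(vcov_hc1))
}

se_list <- list(robust_se(m1), robust_se(m2))

if (requireNamespace("stargazer", quietly = TRUE)) {
  # The se argument replaces the default standard errors, one vector per model.
  stargazer::stargazer(m1, m2, type = "text", se = se_list)
} else {
  print(cbind(estimate = coef(m2), robust_se = se_list[[2]]))
}
```

Listing the key variable first in each formula keeps its coefficient near the top of the table, which makes it easier to scan across specifications.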
In your text, comment on both *statistical significance and practical significance*. You may want to include statistical tests besides the standard t-tests for regression coefficients. Here, it is important that you make clear to your audience the practical significance of any model results. How should the product change as a result of what you have discovered? Are there limits to how much change you are proposing? What are the most important results that you have discovered, and what are the least important?
### 5. Limitations of your Model
#### 5a. Statistical limitations of your model
As a team, evaluate all of the large sample model assumptions. However, you do not necessarily want to discuss every assumption in your report. Instead, highlight any assumption that might pose significant problems for your analysis. For any violations that you identify, describe the statistical consequences. If you are able to identify any strategies to mitigate the consequences, explain these strategies.
Note that you may need to change your model specifications in response to violations of the large sample model.
#### 5b. Structural limitations of your model
What are the most important *omitted variables* that you were not able to measure and include in your analysis? For each variable you name, you should *reason about the direction of bias* caused by omitting this variable and whether the omission of this variable calls into question the core results you are reporting. What data could you collect that would resolve any omitted variables bias?
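To make the reasoning about direction of bias concrete, suppose the true model is $Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon$ and the omitted variable is related to the key variable by $Z = \delta_0 + \delta_1 X + u$. A regression of $Y$ on $X$ alone then estimates $\beta_1 + \beta_2 \delta_1$, so the sign of the bias is the sign of $\beta_2 \delta_1$. The simulation below is a minimal sketch of this; the coefficient values are arbitrary assumptions chosen for illustration.

```r
set.seed(203)
n <- 5000
x <- rnorm(n)
z <- 0.8 * x + rnorm(n)               # delta_1 = 0.8
y <- 1 + 2 * x + 1.5 * z + rnorm(n)   # beta_1 = 2, beta_2 = 1.5

short <- lm(y ~ x)       # omits z
long  <- lm(y ~ x + z)   # includes z
aux   <- lm(z ~ x)       # relationship between omitted and key variable

coef(short)["x"]   # close to beta_1 + beta_2 * delta_1 = 3.2
coef(long)["x"]    # close to beta_1 = 2

# In-sample, the omitted variable bias identity holds exactly.
implied <- coef(long)["x"] + coef(long)["z"] * coef(aux)["x"]
stopifnot(isTRUE(all.equal(unname(coef(short)["x"]), unname(implied))))
```

Here both $\beta_2$ and $\delta_1$ are positive, so omitting $Z$ pushes the estimate away from zero. If they had opposite signs, the bias would push the estimate toward zero.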
### 6. Conclusion
Make sure that you end your report with a discussion that distills key insights from your estimates and addresses your research question.
## Encouragement for the Project
This project touches on many of the skills that you have developed in the course.
- When you are reasoning about the world and the way that it works, you are implicitly reasoning about *random variables*. Although you might not reason with specific functions (e.g. $f_{x}(x) = x^2$) to describe these random variables, you are very likely to be reasoning about conditional expectations.
- This class is not a class in pure theory! And so, theories you have about the world need to be informed by samples of data. These samples might be iid, or they might not be. The team will have to assess how this, and other possible violations of model assumptions, shape what they learn.
- Given a set of input variables, OLS regression produces an estimate of the BLP. But how good a predictor is it? And does the team have enough data to rely on large-sample theory, or does the team need to engage with the more stringent requirements of the small-sample (CLM) setting?
- Throughout, you will have to communicate both to a technical and non-technical audience.
**Finally, have fun with this project.** You have worked hard this semester to build a foundation for reasoning about the world through statistical models. This project is a chance for you and a team of peers to work to apply this reasoning.