Analysis Task
Due by 1 PM Mountain Time, Tuesday, March 24.
I strongly recommend that you do something using UAS data. Once you include all the modules, the UAS has data on a huge number of different variables, and it has the most detailed well-being data of any survey in the world.
Before starting in on your analysis task, take this survey to get a good sense of our UAS module:
https://uas.usc.edu/survey/playground/uas571/test/index.php
What I expect for the analysis task:
Your Analysis Task needs to report the analysis with tables or figures and also have text that clearly explains the analysis. The idea is that this is a first draft of the “Results” section of your term paper.
How to structure your writeup of the Analysis Task:
You can design a different structure, but a typical writeup could look like this:
1. Here is an interesting question or questions. The answers matter (people care or should care) because: …
2. Here is a statistical analysis that seems to have some bearing on this question or questions:
3. On the surface the statistical results seem to say: …
4. However, the following confounding factors could be giving rise to an illusion, making it seem like something is there that isn’t or that something is bigger or different than it really is.
Don’t forget to talk about the confounding factors! (4.)
Here is a Q&A about the analysis task:
Q:
What is the level of analysis you are expecting for this assignment?
A:
I don't expect you to have consistent estimates of anything, but rather to be able to discuss any biases there might be in the estimates you do get, relative to something interesting. Please make the attempt to figure out the sign (+ or -) of any bias you discuss, and say what that would mean for the truth of the interesting thing one might care about. If there are multiple biases, try to figure out the sign of each one, even if all the biases put together can't be signed because some biases are likely to be + and others are likely to be -. Also, discuss whether you think a bias is likely to be large or small.
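As an illustration of signing a bias, here is the standard textbook omitted-variable-bias formula for the simplest case of one included regressor and one omitted confounder. The symbols (y, x, z, and the Greek letters) are my own notation for this sketch, not something taken from the assignment, and real analyses with several regressors are messier than this.

```latex
% Assumed true model, with z the omitted confounder:
y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon
% Auxiliary relationship between the confounder and the included regressor:
z = \delta_0 + \delta_1 x + u
% Regressing y on x alone then estimates, in large samples:
\operatorname{plim} \hat{\beta}_1 = \beta_1 + \beta_2 \delta_1
% so the sign of the bias is the sign of the product \beta_2 \delta_1.
```

For example, if the omitted variable raises the outcome (\beta_2 > 0) and is positively related to the included regressor (\delta_1 > 0), the bias is positive, and the estimated coefficient overstates \beta_1.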
Especially Frequently Needed Advice for the Analysis Task:
Always report p-values. This means you’ll want to do at least some regressions, since that is the easiest way to get p-values. Report the raw p-values, then do the Benjamini-Hochberg FDR adjustment for multiple hypothesis testing if appropriate. (It is confusing if you don’t also report the raw p-values.)
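Here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python, printed next to the raw p-values. The function name and the example p-values are made up for illustration; the p-values would normally come from your own regressions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over the current rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]  # made-up raw p-values
adj_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adj_p):
    print(f"raw p = {raw:.3f}   BH-adjusted p = {adj:.3f}")
```

If you already use statsmodels, its multipletests function with method="fdr_bh" performs the same adjustment.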
Always give the full details of the wordings of the questions and the response options. You can always get these by doing the survey again. When I do that sort of thing, I don’t give real answers to the questions; I just click anything until I get to the questions I wanted to look at.
I said I love scatterplots, but there is an exception: when one of the two variables has only a few possible values, box plots for the other variable given each of those few possible values are a better way to show the relationship. Note that box plots are a lot like bar charts—but they have more total information in them than bar charts.
If you have income in the regression, always use log(income). When people report income in ranges (bins), use log(midpoint of the range) as log(income). Using non-logged income is almost guaranteed to get you weird results. And using bin number gives a coefficient that is hard to interpret.
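A minimal sketch of the log(midpoint) conversion. The bin numbers and dollar bounds below are hypothetical; use the actual ranges from the survey question you are working with. An open-ended top range has no midpoint, so it needs a separate choice that you should state in your writeup.

```python
import math

# Hypothetical income bins: bin number -> (lower bound, upper bound),
# in dollars per year. Replace with the survey's actual ranges.
INCOME_BINS = {
    1: (5_000, 9_999),
    2: (10_000, 24_999),
    3: (25_000, 49_999),
    4: (50_000, 99_999),
}


def log_income_from_bin(bin_number):
    """Return log(midpoint of the range) for a reported income bin."""
    low, high = INCOME_BINS[bin_number]
    midpoint = (low + high) / 2
    return math.log(midpoint)


for bin_number in INCOME_BINS:
    print(bin_number, round(log_income_from_bin(bin_number), 3))
```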
If you have log(income) in a regression (and I think it will be household income—all the adult incomes should be counted), I highly recommend putting log(household size) in as another variable in the regression. That makes sure that you are accounting for a given amount of income being spread over more people while being fairly agnostic about economies of scale in the household.
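One way to see why this specification is agnostic about economies of scale is to rewrite it. The notation is mine for this sketch: y is the outcome, Y is household income, N is household size, and the rewrite assumes \beta is not zero.

```latex
y = \alpha + \beta \log Y + \gamma \log N + \varepsilon
  = \alpha + \beta \log\!\left( \frac{Y}{N^{\theta}} \right) + \varepsilon,
\qquad \theta = -\frac{\gamma}{\beta}
% \theta = 1: income per person (no economies of scale)
% \theta = 0: household size does not matter at all
% 0 < \theta < 1: some economies of scale in the household
```

Entering log(household size) as its own variable lets the data pick \theta, rather than imposing a value such as \theta = 1 by dividing income by household size beforehand.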
Make sure to discuss causality and to discuss causality in the context of your particular analysis, not just in general terms. What are the likely biases? What is their likely direction? What are some things that are possible but that you don’t think are issues in your particular case?
Carefully use non-causal words where you aren’t actually claiming causality. You don’t want to use causal words until you are really ready to discuss causality.
Other Advice for the Analysis Task:
Use lots of graphs. I love scatterplots, but other types of graphs and figures can be good, too.
It’s fine to do some statistics on individual variables, but make sure you do something that relates pairs of variables to each other.
Do some formal statistical tests.
When you test more than one hypothesis, set it up so you can do the multiple hypothesis test correction using the False Discovery Rate procedure!
Make a distinction between being significant at the 5% level and being significant at the 0.5% level.
If something isn’t statistically significant, you say “I can’t reject the null hypothesis that …” NOT “I reject the alternative hypothesis.” If you want to reject a hypothesis, you have to set it up as a null hypothesis.
Recognize reverse causality and cousin causality, including the consumer-theory-esque model I gave in class of how resources broadly construed help all good things, leading to the general principle (with only a few exceptions) that “All good things are positively correlated.” (This is a statement about the cross section.)
Define variables in full. You need to act like your reader doesn’t know what the abbreviations mean. So write out the full text of the aspects, and describe fully all other variables. (You will see that we do this in our papers.)
Don’t order response categories alphabetically! They need to be ordered logically. For example, political leanings should be ordered from Left to Right and levels of education should be ordered from less to more.
When you have interesting results for several variables that are along the same lines, think of creating a simple index to get more statistical power. That is, take simple averages of similar variables and treat that simple average as an index.
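A minimal sketch of a simple-average index for one respondent. The variable names and ratings are hypothetical, and the sketch assumes the variables are on the same scale and point in the same direction (higher is better for all of them). Skipping missing answers is my own choice here, not part of the advice above.

```python
def simple_index(row, variables):
    """Average the listed variables for one respondent, skipping missing ones."""
    values = [row[v] for v in variables if row.get(v) is not None]
    if not values:
        return None  # no answered items, so no index for this respondent
    return sum(values) / len(values)


# Hypothetical respondent with three similar 0-100 ratings.
respondent = {"physical_health": 70, "mental_health": 80, "energy": 60}
health_index = simple_index(
    respondent, ["physical_health", "mental_health", "energy"]
)
print(health_index)
```

The index can then go into a regression in place of the individual variables.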
Think about how nearly statistically exogenous your right-hand-side variables are. Other things equal, regressions with more nearly statistically exogenous right-hand-side variables are more interesting. That doesn’t mean you can’t do other things. Just think about this dimension.
Think seriously about scale use. Any statistical analysis you do with aspect-of-well-being data you can probably do both with the raw aspect ratings and with (aspect rating - average of calibration questions). Doing both of those analyses will be much more interesting than just the one analysis.
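A minimal sketch of the (aspect rating - average of calibration questions) adjustment for one respondent. The aspect names and all the numbers are hypothetical; the point is only that each respondent's own calibration average is subtracted from each of that respondent's aspect ratings.

```python
def calibration_adjusted(aspect_ratings, calibration_ratings):
    """Subtract a respondent's mean calibration rating from each aspect rating."""
    baseline = sum(calibration_ratings) / len(calibration_ratings)
    return {name: rating - baseline for name, rating in aspect_ratings.items()}


# Hypothetical respondent: two aspect ratings and three calibration answers.
aspects = {"health": 80, "family": 65}
calibration = [70, 50, 60]

adjusted = calibration_adjusted(aspects, calibration)
print(adjusted)
```

Running the same analysis on both the raw ratings and the adjusted values gives the two sets of results to compare.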