Social science genetics is on the rise. The article shown above is a recent triumph. Knowing someone’s genes alone, it is possible to predict 11–13% of the variance in the number of years of schooling they have. Such a prediction comes from adding up the tiny effects of many, many genes.
Since a substantial fraction of my readers are economists, let me mention that some of the important movers and shakers in social science genetics are economists. For example, behind some of the recent successes is a key insight, which David Laibson among others had to defend vigorously to government research funders: large sample sizes were so important in accurately measuring the effects of genes that it was worth sacrificing quality of an outcome variable if that would allow a much larger sample size. Hence, when they were doing the kind of research illustrated by the article shown above, it was a better strategy to put the lion’s share of effort into genetic prediction of number of years of education than genetic prediction of intelligence, simply because number of years of education was a variable collected along with genetic data for many more people than intelligence was.
One useful bit of terminology for genetics is that non-genetic information about an individual, such as years of education or blood pressure, is called a “phenotype.” Another is that linear combinations of data across many genes, intended to be best linear predictors of a particular phenotype, are called “polygenic scores.” Polygenic scores have known signal-to-noise ratios, making it possible to do measurement-error corrections for their effects. (See “Adding a Variable Measured with Error to a Regression Only Partially Controls for that Variable” and “Statistically Controlling for Confounding Constructs is Harder than You Think—Jacob Westfall and Tal Yarkoni.”)
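To make these two definitions concrete, here is a minimal Python sketch. It assumes a polygenic score is a plain weighted sum of allele counts (0, 1, or 2 per SNP), and it uses the textbook one-regressor errors-in-variables correction, in which the observed slope equals the true slope times the score's reliability (signal variance over total variance). The function names, weights, genotype, and reliability value are all made up for illustration.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts (each 0, 1, or 2) across SNPs."""
    if len(allele_counts) != len(weights):
        raise ValueError("need one weight per SNP")
    return sum(w * g for w, g in zip(weights, allele_counts))


def disattenuate(observed_slope, reliability):
    """Undo attenuation bias in a one-regressor model.

    reliability = Var(true score) / Var(measured score), in (0, 1].
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return observed_slope / reliability


# Hypothetical numbers, for illustration only.
weights = [0.10, -0.20, 0.05]
genotype = [0, 1, 2]
score = polygenic_score(genotype, weights)

corrected = disattenuate(observed_slope=0.30, reliability=0.6)
print(score, corrected)
```

With several regressors the correction is more involved than a single division, which is the point of the two posts cited above.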
Both private companies and government-supported initiatives around the world are rapidly increasing the amount of human genotype data linked to other data (with a growing appreciation of the need for large samples of people over the full range of ethnic origins). For common variations (SNPs) that have well-known short-range correlations, genotyping using a chip now costs about $25 per person when done in bulk, while “sequencing” to measure all variations, including rare ones, costs about $100 per person, with the costs rapidly coming down. Sample sizes are already above a million individuals, with concrete plans for several million more that will be data sets that are quite accessible to researchers.
In this post, I want to forecast where social science genetics is headed in the next few years. I don’t think I am sticking my neck out very much with these forecasts about cool things people will be doing. Those in the field might say “Duh. Of course!” Here are some types of research I think will be big:
Genetic Causality from Own Genes and Sibling Effects. Besides increasing the amount of data on non-European ethnic groups, a key direction data collection will move in the future is to collect genetic data on mother-father-self trios and mother-father-self-sibling quartets. (For example, the PSID is now in the process of collecting genetic data.) Conditional on the mother’s and father’s genes, both one’s own genes and the genes of a full sibling are as random as a coin toss. As a result, given such data, one can get clean causal estimates of the effects of one’s own genes on one’s outcomes and the effects of one’s sibling’s genes on one’s outcomes.
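A small simulation may help show why trio data identify the causal effect. The sketch below is a toy model with a single SNP, not anyone's actual estimation procedure: each child receives one randomly chosen allele from each parent, and the family environment is assumed to track the parents' genotypes. All parameter values (`direct_effect`, `nurture_effect`, `allele_freq`) are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_trios(n, direct_effect=0.5, nurture_effect=0.3,
                   allele_freq=0.4, seed=0):
    """Simulate one SNP in n mother-father-child trios."""
    rng = random.Random(seed)
    child, deviation, outcome = [], [], []
    for _ in range(n):
        mother = [int(rng.random() < allele_freq) for _ in range(2)]
        father = [int(rng.random() < allele_freq) for _ in range(2)]
        # Mendelian transmission: one randomly chosen allele from each parent.
        g = rng.choice(mother) + rng.choice(father)
        # Expected child genotype given the parents' genotypes.
        expected = (sum(mother) + sum(father)) / 2
        # Family environment that tracks the parents' genotypes (a confounder).
        environment = nurture_effect * (sum(mother) + sum(father))
        y = direct_effect * g + environment + rng.gauss(0, 1)
        child.append(g)
        deviation.append(g - expected)
        outcome.append(y)
    return child, deviation, outcome


child, deviation, outcome = simulate_trios(20000)
naive = ols_slope(child, outcome)        # confounded by family environment
within = ols_slope(deviation, outcome)   # uses only the coin-toss part
print(f"naive: {naive:.2f}, within-family: {within:.2f}")
```

In this toy setup the naive slope overstates the direct effect, while the slope on the child's deviation from the parental expectation lands close to the assumed `direct_effect`, because that deviation is random by construction.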
Genetic Nurturance. By looking at the genes of the mother and father that were not transmitted to self, one can also get important evidence on the effects of parental genes on the environment parents are providing. (Here, the evidence is not quite as clean. Everything the non-transmitted parental genes are correlated with could be having a nurture effect on self.) Effects of nontransmitted parental genes are interesting because most things that parents can do other than passing on their genes are things a policy intervention could imitate. That is, the effects of non-transmitted parental genes reflect nurture.
Recognizing Faulty Identification Claims that Involve Genetic Data. One reason expertise in social science genetics is valuable is that questionable identification claims will be made and are being made involving genetic data. While the emerging data will allow very clean identification of causal effects of genes on a wide range of outcomes, the pathway by which genes have their effects can be quite unclear. People will make claims that genes are good instruments. This is seldom true, because the exclusion restriction that genes only act through a specified set of right-hand side variables is seldom satisfied. Also, it is important to realize that large parts of the causal chain are likely to go through the social realm outside an individual’s body.
Treatment Effects that Vary by Polygenic Score. One interesting finding from research so far is that treatment effects often differ quite a bit when the sample is split by a relevant polygenic score. For example, effects of parental income on years of schooling are more important for women who have low polygenic scores for educational attainment. That is, women who have genes predicting a lot of education will get a lot of education even if parental income is low, but women whose genes predict less education will get a lot of education only if parental income is high. The patterns are different for men.
Note that the fact that treatment effects vary by polygenic score has obvious policy implications. For example, suppose we could identify kids in very bad environments who had genes suggesting they would really succeed if only they were given true equality of opportunity. This would sharpen the social justice criticism of the lack of opportunity they currently have. Notice that many of the policy implications based on treatment effects that vary by polygenic score would be highly controversial, so knowledge of the ins and outs of ethical debates about the use of genetic data in this way becomes quite important.
Enhanced Power to Test Prevention Strategies. One unusual aspect of genes is that the genetic data, with all the predictive power they provide, are available from the moment of birth and even before. This means that prevention strategies (say for teen pregnancy, teen suicide, teen drug addiction or dropping out of high school) can be tested on populations whose genes indicate elevated risk, which could dramatically increase power for field experiments.
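To see how much targeting could matter, here is a rough Python sketch using the standard normal-approximation sample-size formula for comparing two proportions. The baseline risks (5% in a general population, 20% in a genetically higher-risk group) and the 30% relative risk reduction are hypothetical numbers chosen only to illustrate the arithmetic.

```python
import math
from statistics import NormalDist


def n_per_arm(control_risk, relative_reduction, alpha=0.05, power=0.80):
    """Approximate sample size per arm for comparing two proportions."""
    treated_risk = control_risk * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (control_risk * (1 - control_risk)
                + treated_risk * (1 - treated_risk))
    diff = control_risk - treated_risk
    return math.ceil((z_alpha + z_power) ** 2 * variance / diff ** 2)


# Hypothetical baseline risks, for illustration only.
general = n_per_arm(control_risk=0.05, relative_reduction=0.30)
high_risk = n_per_arm(control_risk=0.20, relative_reduction=0.30)
print(general, high_risk)
```

Under these assumptions the higher-risk sample needs several times fewer participants per arm to detect the same relative reduction.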
The Option Value of Genetic Data. Suppose one is doing a lab experiment with a few hundred participants. With a few thousand dollars, one could collect genetic data. With that data, one could immediately begin to control for genetic differences that contribute to standard errors and look at differences in treatment effects on experimental subjects who have different polygenic scores. But with the same data, one would also be able to do a new analysis four years later using more accurate polygenic scores or using polygenic scores that did not exist earlier. In this sense, genetic data grows in value over time.
The example above was with genetic data in a computer file that can be combined with coefficient vectors to get improved or new linear combinations. If one is willing to hold back some of the genetic material for later genetic analysis with future technologies (as the HRS did), totally new measurements are possible. For example, many researchers have become interested in epigenetics—the methylation marks on genes that help control expression of genes.
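Returning to the lab-experiment case, the precision gain from controlling for a polygenic score can be sketched with simple arithmetic. In a randomized experiment with a large sample, adjusting for a covariate that explains a share R² of outcome variance shrinks the standard error of the treatment effect by roughly a factor of sqrt(1 − R²). The R² values below are hypothetical; the point is that a more accurate score applied later to the same stored genotypes buys more precision.

```python
import math


def se_ratio(r_squared):
    """Standard error with the covariate, relative to without it."""
    return math.sqrt(1 - r_squared)


def equivalent_sample_gain(r_squared):
    """Proportional increase in sample size giving the same precision."""
    return 1 / (1 - r_squared) - 1


for r2 in (0.05, 0.12, 0.25):   # hypothetical predictive power of a score
    print(r2, round(se_ratio(r2), 3), round(equivalent_sample_gain(r2), 3))
```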
Assortative Mating. Ways of using genetic data that are not about polygenic scores in a regression will emerge. My own genetic research—working closely with Patrick Turley and Rosie Li—has been about using genetic data on unrelated individuals to look at the history of assortative mating. Genetic assortative mating for a polygenic score is defined as a positive covariance between the polygenic scores of co-parents. But one need not have direct covariance evidence. A difference equation indicates that a positive covariance between the polygenic scores of co-parents shows up in a higher variance of the polygenic scores of the children. Hence, data on unrelated individuals shows the assortative mating covariance among the parents’ generation in the birth-year of those on whom one has data. One can go further. When a large fraction of a population is genotyped, the genetic data can, itself, identify cousins. This makes it possible to partition people’s genetic data in a way that allows one to measure assortative mating in even earlier generations.
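The variance logic can be illustrated with a stylized difference equation. The sketch below is a textbook additive simplification and not necessarily the equation used in the research described above: a child's score is assumed to be the average of the two parents' scores plus independent segregation noise, with the noise variance set so that the population variance stays constant under random mating. The function names and the 0.2 spousal correlation are illustrative assumptions.

```python
def next_variance(parent_variance, spouse_correlation, base_variance=1.0):
    """One generation of the variance recursion.

    child = (mother + father) / 2 + segregation noise, where the noise has
    variance base_variance / 2 (so variance is stable under random mating).
    """
    midparent = parent_variance * (1 + spouse_correlation) / 2
    return midparent + base_variance / 2


def implied_spouse_correlation(child_variance, parent_variance,
                               base_variance=1.0):
    """Invert the recursion: back out the parents' score correlation."""
    return 2 * (child_variance - base_variance / 2) / parent_variance - 1


variance = 1.0
for generation in range(1, 6):
    variance = next_variance(variance, spouse_correlation=0.2)
    print(generation, round(variance, 4))
```

In this toy model a positive correlation between co-parents raises the children's variance each generation, approaching `base_variance / (1 - spouse_correlation)`, and the second function shows how excess variance among children maps back to the parents' correlation.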
Conclusion: Why More and Better Data Will Make Amazing Things Possible in Social Science Genetics. One interesting thing about genetic data is that there is a critical sample size at which it becomes possible to get good accuracy on the genes for any particular outcome variable. Why is that? After a multiple-hypothesis-testing correction for the fact that there are many, many genes being tested, “genome-wide significance” requires a z-score of 5.45. (Note that, with large sample sizes, the z-score is essentially equal to the t-statistic.) But a characteristic of the normal distribution is that at such a high z-score, even a small change in z-score can make a huge difference in p-value. A z-score of 5.03 has a p-value ten times as big, and a z-score of 5.85 has a p-value ten times smaller. Because the z-score for a given coefficient estimate grows with the square root of the sample size, in this region an 18% increase in sample size is enough to change a p-value by an order of magnitude. Thinking about things the other way around, suppose genes have effect sizes that are normally distributed and that, to start with, one can only reliably detect effects far out in the tails of that distribution. Then a modest percentage increase in sample size makes a bigger slice of the distribution reliably detectable, which means identifying many times as many genes as genome-wide significant.
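The arithmetic in this paragraph is easy to check. The Python sketch below computes two-sided normal p-values for the three z-scores mentioned and, assuming the coefficient estimate is held fixed so that the z-score scales with the square root of the sample size, the proportional increase in sample size needed to move from one z-score to the next. The function names are my own.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def sample_growth(z_from, z_to):
    """Proportional sample-size increase that moves z_from to z_to,
    holding the coefficient estimate fixed (z scales with sqrt(n))."""
    return (z_to / z_from) ** 2 - 1


for z in (5.03, 5.45, 5.85):
    print(f"z = {z}: p = {two_sided_p(z):.2e}")

print(f"5.03 -> 5.45 needs {sample_growth(5.03, 5.45):.1%} more data")
print(f"5.45 -> 5.85 needs {sample_growth(5.45, 5.85):.1%} more data")
```

The p-values come out about a factor of ten apart, and the implied sample-size increases are roughly 15 to 17 percent, in the same ballpark as the 18% figure above.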
The critical sample size depends on what phenotype one is looking at. Most importantly, power is lower for diseases or other conditions that are relatively rare. The critical sample size at which we will get a good polygenic score for anorexia is much larger than the sample size at which one can get a good polygenic score for educational attainment. But as sample sizes continue to increase, at some point, relatively suddenly, we will be there with a good polygenic score for anorexia. Just imagine if parents knew in advance that one of their children was at particularly high risk for anorexia. They’d be likely to do things differently and might be able to avert that problem.
I am sure there are many cool things in the future of social science genetics that I can’t imagine. It is an exciting field. I am delighted to be along for the ride!
Here are some other posts on genetic research: