Hessler, Pöpping, Hollstein, Ohlenburg, Arnemann, Massoth, Seidel, Zarbock and Wenk: Availability of Cookies During an Academic Course Session Affects Evaluation of Teaching

*Link to the abstract shown above. Hat tip to Chris Chabris.*

In “False Advertising for College is Pretty Much the Norm” and “The Most Effective Memory Methods are Difficult—and That's Why They Work,” I discuss what I believe should be the main goal of college teaching: trying to encourage students to learn things that they can still remember a year or two later. But I have always thought that, in principle, how enjoyable the learning is could be an appropriate subsidiary goal. I have thought that typical course evaluations were more or less directed at measuring how enjoyable a learning experience was. But evidence from several quarters suggests that they are flawed measures even of that:

- male students tend to rate female instructors lower than male instructors, even when there is no obvious difference in teaching quality
- students who expect to get a better grade give higher ratings
- provision of cookies raises ratings

Let’s dig more into the results from “Availability of cookies during an academic course session affects evaluation of teaching” by Michael Hessler, Daniel Poepping, Hanna Hollstein, Hendrik Ohlenburg, Philip H Arnemann, Christina Massoth, Laura M Seidel, Alexander Zarbock and Manuel Wenk.

They emphasize the institutional importance of course evaluations by students:

End-of-course feedback in the form of student evaluations of teaching (SETs) has become a standard tool for measuring the ‘quality’ of curricular high-grade education courses. The results of these evaluations often form the basis for far-reaching decisions by academic faculty staff, such as those involving changes to the curriculum, the promotion of teachers, the tenure of academic appointments, the distribution of funds and merit pay, and the choice of staff.

Because the study was structured as a randomized controlled trial, the statistical identification seems quite credible. The sample size was not huge, but it was bigger than in many psychology experiments:

A total of 118 medical students in their fifth semester were randomly allocated to 20 groups. Two experienced lecturers, who had already taught the same course several times, were chosen to participate in the study and groups of students were randomly allocated to these two teachers. Ten groups (five for each teacher) were randomly chosen to receive the intervention (cookie group). The other 10 groups served as controls (control group).

Availability of cookies raised the scores by 0.38 of a standard deviation on average. The p-value, the probability of seeing a difference at least this large by chance if cookies had no effect, was 2.8%. The authors did not preregister an analysis plan, so they may well have searched over specifications to make their results look good. But this still provides suggestive evidence that cookies can raise ratings, something that should be confirmed or disconfirmed in a considerably larger sample. (See “Let's Set Half a Percent as the Standard for Statistical Significance.”) Administrators who encourage the use of student course evaluations should fund such a large study or be ashamed of themselves.
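To give a rough sense of what "considerably larger" might mean, here is a back-of-the-envelope power calculation. It is only a sketch under assumptions of my own choosing: the function name `students_per_arm` is hypothetical, the 80% power target is an assumption, the formula is the standard normal approximation for comparing two means, and students are treated as independent even though in the actual study they were taught in groups.

```python
from math import ceil
from statistics import NormalDist


def students_per_arm(effect_size: float, alpha: float, power: float) -> int:
    """Approximate per-arm sample size for a two-sided comparison of two means.

    Normal approximation: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2,
    where d is the standardized effect size. Assumes equal arm sizes and
    independent students; clustering into teaching groups is ignored.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# 0.38 is the effect size quoted above; 80% power is an assumed target.
for alpha in (0.05, 0.005):
    n = students_per_arm(0.38, alpha, 0.80)
    print(f"alpha={alpha}: about {n} students per arm, {2 * n} in total")
```

Under these assumptions, the calculation gives about 109 students per arm at the conventional 5% threshold and about 185 per arm (370 in total) at the half-percent threshold, compared with 118 students in total in the study. Since students within the same teaching group are unlikely to be independent, and the true effect may be smaller than the published estimate, a real replication would probably need more than this.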

Let me throw out another hypothesis that deserves to be tested. I have a theory that students who feel an instructor has a heavy accent want to express that somehow on the course evaluation forms. If there isn’t a separate question early on in the course evaluation asking about the instructor’s accent, students will mark down other ratings in order to get across their displeasure at the heavy accent. This can easily be tested within a single class taught by a non-native-English-speaking instructor: randomize which students receive an evaluation form with an early question about accent and which receive a form without one.
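The randomization itself is simple. As a minimal sketch, assuming a class roster is available as a list of student identifiers (the function name `assign_forms`, the form labels, and the roster below are all hypothetical), one could shuffle the roster and hand the first half one version of the form and the second half the other:

```python
import random


def assign_forms(student_ids, seed=None):
    """Randomly split one class between two versions of the evaluation form.

    Shuffles the roster and gives the first half the form with an early
    accent question and the rest the form without it, so the two arms
    differ in size by at most one student.
    """
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {sid: "accent_question_first" for sid in shuffled[:half]}
    assignment.update({sid: "no_accent_question" for sid in shuffled[half:]})
    return assignment


# Hypothetical roster of 30 students in one class.
roster = [f"student_{i:02d}" for i in range(1, 31)]
forms = assign_forms(roster, seed=2018)
for student in roster[:5]:
    print(student, forms[student])
```

Because both versions of the form are used in the same class with the same instructor, any difference in the other ratings between the two arms could be attributed to the presence of the accent question rather than to differences in teaching.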

The bottom line is that a lot more well-powered research should be done on things other than teaching quality that might affect course evaluations. Not taking these issues seriously would be a sign of college administrators who are not serious about learning and teaching.