With the academic semester coming to a close, students all over North America will be filling out student evaluation of teacher forms, or SETs. Typically, students answer a range of questions regarding their experience with a class, including how difficult the class was, whether work was returned in a timely manner and graded fairly, and how effective their professors were overall. Some students will even go the extra step and visit RateMyProfessors.com, the oldest and arguably most popular website for student evaluations. There, students can report how difficult the course was, indicate whether they would take a class with the professor again, and, alas, note whether one really needed to show up to class in order to pass.
While many students likely do not give their SETs much of a second thought once they’ve completed them (if they complete them at all), many universities take student ratings seriously, and often refer to them in determining raises, promotion, and even tenure decisions for early-career faculty. Recently, however, there have been growing concerns over the use of SETs in making such decisions. These concerns broadly fall into two categories: first, whether such evaluations are, in fact, a good indicator of an instructor’s quality of teaching, and second, that student evaluations correlate highly with a number of factors that are irrelevant to an instructor’s quality of teaching.
The first concern is one of validity, namely whether SETs are actually indicative of a professor’s teaching effectiveness. There is reason to think that they are not. For instance, as reported at University Affairs, a recent pair of studies performed by Berkeley professors Philip Stark and Richard Freishtat indicates that student evaluations correlate most highly with actual or expected performance in a class. In other words, a student who gets an A is much more likely to give their professor a higher rating, and a student who gets a D is much more likely to give them a lower rating. Evaluations also correlated highly with student interest (disinterested students gave lower ratings overall, interested students gave higher ratings) and perceived easiness (easier courses received higher ratings than difficult ones). These findings cast serious doubt on whether students are evaluating how effective their instructors were, rather than simply how well they did in, or how much they liked, the class.
These and other studies have recently led Ryerson University in Toronto, Canada, to officially stop using SETs as a metric to determine promotion and tenure – a decision reached at the behest of professors who had long argued that SETs are unreliable. Perhaps even more troubling than student evaluations correlating with class performance and interest, though, was that SETs showed biases against professors on the basis of “gender, ethnicity, accent, age, even ‘attractiveness’…making SETs deeply discriminatory against numerous ‘vulnerable’ faculty.” If SETs are indeed biased in this way, that would constitute a good reason to stop using them.
Perhaps the most egregious form of explicit bias in student ratings could be found, until recently, on the RateMyProfessors website, which allowed students to rate professor “hotness,” a score indicated on a “chili pepper” scale. That such a scale existed removed a significant amount of credibility from the website; the fact that there are no controls over who can make ratings on the site is another major reason why few take it seriously. The “hotness” rating was removed only after complaints from numerous professors who argued that it contributed to the objectification of female professors, and contributed overall to a climate in which it is somehow seen as appropriate to evaluate professors on the basis of their looks.
While no official SETs administered by a university may approximate the chili pepper scale, the effects of bias when it comes to perceived attractiveness are present regardless. The above-mentioned studies, for instance, found that when it comes to overall evaluations of professors, “attractiveness matters” – “more attractive instructors received better ratings” – and when it came to female professors specifically, students were more likely to comment directly on their physical appearance. The study provided one example from an anonymous student evaluation that stated: “The only strength she [the professor] has is she’s attractive, and the only reason why my review was 4/7 instead of 3/7 is because I like the subject.” As the report emphasizes, “Neither of these sentiments has anything at all to do with the teacher’s effectiveness or the course quality, and instead reflect gender bias and sexism.”
It gets worse: in addition to evaluations correlating with perceived attractiveness, characteristics like gender, ethnicity, race, and age all affect evaluations of professors as well. As Freishtat reports, “When students think an instructor is female, students rate the instructor lower on every aspect of teaching,” white professors are rated generally higher than professors of other races and ethnicities, and age “has been found to negatively impact teaching evaluations.”
If the only problems with SETs were that they were unreliable, universities would have a practical reason to stop using them: if the goal of SETs is to help identify which professors are deserving of promotion and tenure, and they are unable to contribute to this goal, it seems that they should be abandoned. But as we’ve seen, there is a potentially much more pernicious side to SETs, namely that they systematically display student bias in numerous ways. It seems, then, that universities have a moral obligation to revise the way that professors are assessed.
Given that universities need some means of gauging professors’ teaching ability, what is a better way of doing so? Freishtat suggests that SETs, if they are to be used at all, should represent only one component of a professor’s assessment. Ultimately, those evaluations must be made part of a more complete dossier in order to be put to better use; they need to be accompanied by letters from department heads, reviews from peers, and a reflective self-assessment of the instructor’s pedagogical approach.
But even if we can’t agree on the best way of evaluating instructor performance, it seems clear that a system that provides unreliable and biased results ought to be reformed or abandoned.