Stereotype threat once again fails to replicate

Stereotype threat essentially claims that, for example, girls will perform worse in a domain where the stereotype is that women are worse at it, if they are reminded beforehand that they are girls.

However, preregistered studies in particular have failed to replicate the corresponding findings. This raises the suspicion that the stereotype threat literature suffers from publication bias:

"Publication bias, also known as publication distortion, is the statistically skewed representation of the state of the evidence in scientific journals that results from the preferential publication of studies with 'positive', i.e. significant, results."
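The mechanism is easy to demonstrate with a small simulation (my own sketch, not from the study, with made-up parameters): the true effect is set to zero, and only "significant" results in the expected direction get "published".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

published = []
for _ in range(2000):
    # Two groups of 20, drawn from the SAME distribution: the true effect is zero.
    a = rng.normal(0, 1, 20)
    b = rng.normal(0, 1, 20)
    t, p = stats.ttest_ind(a, b)
    # Cohen's d with pooled standard deviation
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd
    # The journal only accepts significant results in the expected direction.
    if p < 0.05 and d > 0:
        published.append(d)

print(f"published studies: {len(published)} of 2000")
print(f"mean published effect size d = {np.mean(published):.2f}")
```

Although the true effect is zero, the "published" literature shows a substantial mean effect: with n = 20 per group, every significant result necessarily has d above roughly 0.6, so the selection alone manufactures the effect.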

Stereotype threat has been a topic here before:

A new preregistered study has now examined stereotype threat once again:

Stereotype threat has been proposed as one cause of gender differences in post-compulsory mathematics participation. Danaher and Crandall argued, based on a study conducted by Stricker and Ward, that enquiring about a student’s gender after they had finished a test, rather than before, would reduce stereotype threat and therefore increase the attainment of women students. Making such a change, they argued, could lead to nearly 5000 more women receiving AP Calculus AB credit per year. We conducted a preregistered conceptual replication of Stricker and Ward’s study in the context of the UK Mathematics Trust’s Junior Mathematical Challenge, finding no evidence of this stereotype threat effect. We conclude that the ‘silver bullet’ intervention of relocating demographic questions on test answer sheets is unlikely to provide an effective solution to systemic gender inequalities in mathematics education.

Source: Stereotype threat, gender and mathematics attainment: A conceptual replication of Stricker & Ward


From the study:

Our goal in this paper is to report a conceptual replication of Stricker and Ward’s [10] investigation of stereotype threat in an authentic high-stakes setting, using the analysis approach favored by Danaher and Crandall [9]. Such a replication is timely, as since Stricker and Ward’s [10, 11] debate with Danaher and Crandall [9], several researchers have questioned the reliability of lab-based stereotype threat research. One reason is that attempts to replicate Spencer et al.’s [5] original lab study have not always been successful [e.g., 18]. Stoet and Geary [19] reviewed 23 replication attempts, finding that only 55% had results consistent with Spencer et al.’s, and that half of these only did so when the researchers controlled for participants’ pre-existing mathematics achievement (an analytic choice not made by Spencer et al.).

Flore and Wicherts [6] pointed out that the lab-based literature on stereotype threat and mathematics has an excess of significant findings (more significant results than one would expect given the average statistical power of published studies). They investigated two possible reasons. First, earlier researchers may have engaged in p-hacking, by using questionable research practices (such as selectively including covariates) to obtain significant effects [20]. Second, the literature may be subject to publication bias, a phenomenon where articles which report significant results are more likely to be accepted for publication than those which do not [18]. Flore and Wicherts’s [6] meta-analysis of 47 lab studies that investigated stereotype threat and mathematics achievement found that publication bias might have “seriously distorted” the meta-analytic effect size estimate they derived from the literature. However, they found that questionable research practices such as p-hacking were not, on their own, sufficient to have created the effect. They left open the possibility that a combination of publication bias and questionable research practices may be present in the literature.

In sum, there is now some doubt about the reliability of the lab-based literature on stereotype threat. While lab studies, on average, report small effects in the same direction as Spencer et al.’s [5] original experiment, it is unclear whether this effect is robust or an artefact of publication bias. If stereotype threat effects cannot be robustly found in well-controlled lab studies, it seems unlikely that they could be found in authentic contexts such as real-world high-stakes tests.
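The "questionable research practices" mentioned above, such as selectively including covariates, can likewise be sketched in a few lines. In the following simulation (my own illustration, not code from the paper), a researcher analyzes the same null data several ways and reports whichever analysis comes out significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def hacked_p(a, b):
    """Try several analyses of the same null data and keep the best p-value."""
    p_full = stats.ttest_ind(a, b).pvalue
    # "Outlier exclusion": drop the most extreme value in each group.
    a_trim = np.delete(a, np.abs(a - a.mean()).argmax())
    b_trim = np.delete(b, np.abs(b - b.mean()).argmax())
    p_trim = stats.ttest_ind(a_trim, b_trim).pvalue
    # "Subgroup analysis": look at the first half of each group only.
    p_sub = stats.ttest_ind(a[:10], b[:10]).pvalue
    return min(p_full, p_trim, p_sub)

n_sim = 1000
false_positives = sum(
    hacked_p(rng.normal(0, 1, 20), rng.normal(0, 1, 20)) < 0.05
    for _ in range(n_sim)
)
fpr = false_positives / n_sim
print(f"false positive rate: {fpr:.1%}")  # well above the nominal 5%
```

Each individual test keeps its nominal 5% error rate; it is the undisclosed freedom to pick among them that inflates the rate of "significant" findings.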

So there is considerable doubt, and much suggests that something untoward was going on, or that the data may be flawed.

From the results:

To create our dependent variable we used the standard, and longstanding, JMC scoring system, which awarded 5 points for each correct answer to the first 15 questions, 6 points for each correct answer to the last 10 questions. One point and 2 points were deducted for incorrect answers to questions 16–20 and questions 21–25 respectively, subject to a minimum total score of zero. Participants’ scores varied from 0 to 110, and their mean, M = 44.9, SD = 19.7, was slightly higher than the overall average, N = 251,064, M = 38.71, SD = 19.2, t(1168) = 10.7, p < .001, d = 0.312.
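As an aside, the scoring rule just quoted can be written as a small function (my own sketch of the rule as described, not code from the study; answers are encoded as "correct", "incorrect" or "blank"):

```python
def jmc_score(answers):
    """Score a 25-question JMC paper.

    5 points per correct answer for questions 1-15, 6 points for questions
    16-25; 1 point deducted per incorrect answer for questions 16-20 and
    2 points for questions 21-25; blanks score 0; floor at zero.
    """
    assert len(answers) == 25
    score = 0
    for i, answer in enumerate(answers, start=1):
        if answer == "correct":
            score += 5 if i <= 15 else 6
        elif answer == "incorrect":
            if 16 <= i <= 20:
                score -= 1
            elif i >= 21:
                score -= 2
    return max(score, 0)

print(jmc_score(["correct"] * 25))                       # maximum: 15*5 + 10*6 = 135
print(jmc_score(["correct"] * 15 + ["incorrect"] * 10))  # 75 - 5 - 10 = 60
```

Note the penalty structure: guessing on the later, harder questions is discouraged, while a sheet full of wrong answers simply bottoms out at zero.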

Participants’ mean scores, split by answer-sheet version and gender, are shown in Fig 2. As stated in our preregistration, these scores were subjected to a 2 (version) by 2 (gender) between-subjects Analysis of Variance (ANOVA). This revealed a significant main effect of gender, F(1,1165) = 8.410, p = .004, η2 = .007, which reflected that female participants had a higher mean score than male participants, 46.2 versus 42.7, d = 0.177. There was no significant main effect of version, F(1,1165) = 1.586, p = .208, η2 = .001 (means 45.7 versus 44.0, d = 0.091). Crucially, we did not find the hypothesized version-by-gender interaction effect, F(1, 1165) = 0.525, p = .469, η2 = .000. Indeed, contrary to the prediction of the stereotype threat account, female participants in the gender-first condition had slightly (but non-significantly) higher scores than those in the gender-last condition, 47.3 v 45.0, t(717) = 1.61, p = .108, d = 0.120.
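For anyone who wants to see what such a 2 × 2 between-subjects ANOVA does under the hood — the hypothesized stereotype threat effect is precisely the version-by-gender interaction term — here is a minimal hand-rolled version for a balanced design, with made-up toy data rather than the study's data (real analyses would use a statistics package that also handles unbalanced cells):

```python
import numpy as np

def anova_2x2(cells):
    """Balanced 2x2 between-subjects ANOVA.

    `cells[i][j]` holds the observations for factor-A level i and
    factor-B level j; every cell must have the same size n.
    Returns the F statistics for factor A, factor B, and the interaction.
    """
    cells = np.asarray(cells, dtype=float)   # shape (2, 2, n)
    n = cells.shape[2]
    grand = cells.mean()
    a_means = cells.mean(axis=(1, 2))        # factor A level means
    b_means = cells.mean(axis=(0, 2))        # factor B level means
    cell_means = cells.mean(axis=2)

    ss_a = 2 * n * np.sum((a_means - grand) ** 2)
    ss_b = 2 * n * np.sum((b_means - grand) ** 2)
    interaction = cell_means - a_means[:, None] - b_means[None, :] + grand
    ss_ab = n * np.sum(interaction ** 2)
    ss_error = np.sum((cells - cell_means[..., None]) ** 2)

    ms_error = ss_error / (4 * (n - 1))      # each effect has 1 numerator df
    return ss_a / ms_error, ss_b / ms_error, ss_ab / ms_error

# Toy data with purely additive effects, i.e. no interaction
# (two observations per cell; rows = gender, columns = answer-sheet version).
data = [[[9, 11], [13, 15]],
        [[11, 13], [15, 17]]]
f_a, f_b, f_ab = anova_2x2(data)
print(f"F(gender) = {f_a:.1f}, F(version) = {f_b:.1f}, F(interaction) = {f_ab:.1f}")
```

In these toy data the two main effects are nonzero but the interaction F is exactly zero — the cell means are additive. That is essentially the pattern the study reports: effects of gender (and none of version), but no version-by-gender interaction, which is the term the stereotype threat account predicts.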

Here is the accompanying graph:

[Graph: Stereotype Threat]

As you can see, the women actually score better than the men in both versions, and in the "gender first" variant even better than in the "gender last" variant.

Helpful as I am, I would like to propose a rescue for stereotype threat: apparently there are now stereotypes against men in this field; there is no other way to explain the results. Or the stereotype is that girls simply do better at school than boys. And just like that, everything works again!