p-hacking
Synonyms
p-hacking, Garden of Forking Paths
Definitions
In drug research today, for example, it is common for the outcome
variables to be changed during data analysis: what the drug was actually
supposed to achieve turns out not to be significant. Instead, other
significant effects are sought and found.
This practice is sometimes called “p-hacking” or the “Garden of Forking
Paths”. Even if researchers in such situations act without ill intent,
that is, are unaware that they are engaging in p-hacking,
the consequence is the same: significant effects are produced that are
meaningless and misleading.
By Marko Kovic in the text Die Wissenschaft in der Replikationskrise (2016)
Remarks
These problems are now widely acknowledged. The American
Statistical Association (ASA), the US professional society for statistics, recently published a statement of warning and caution. It
criticizes that many researchers misunderstand so-called p-values,
the measure by which statistical significance is assessed.
The ASA also argues for not taking p-values too seriously and
for considering other statistical measures as well.
By Marko Kovic in the text Die Wissenschaft in der Replikationskrise (2016)
To illustrate how powerful p-hacking techniques can be, Joseph Simmons and colleagues Leif Nelson and Uri Simonsohn tested a pair of hypotheses they were pretty sure were untrue. One was an unlikely hypothesis; the other was impossible.
The unlikely hypothesis was that listening to children’s music makes people feel older than they really are. Volunteers listened to either a children’s song or a control song, and later were asked how old they felt. With a bit of p-hacking, the researchers concluded that listening to a children’s song makes people feel older, with statistical significance at the p < 0.05 level.
While suggestive, the initial study was not the most persuasive demonstration of how p-hacking can mislead. Maybe listening to a children’s song really does make you feel old. So the authors raised the bar and tested a hypothesis that couldn’t possibly be true. They hypothesized that listening to the classic Beatles song “When I’m Sixty-Four” doesn’t just make people feel younger, it literally makes them younger. Obviously this is ridiculous, but they conducted a scientific experiment testing it anyway. They ran a randomized controlled trial in which they had each subject listen either to the Beatles song or to a control song. Remarkably, they found that while people who listened to each song should have been the same age, people who heard “When I’m Sixty-Four” were, on average, a year and a half younger than people who heard the control. Moreover, this difference was significant at the p < 0.05 level! Because the study was a randomized controlled trial, the usual inference would be that the treatment—listening to the song—had a causal effect on age. Thus the researchers could claim (albeit tongue in cheek) to have evidence that listening to “When I’m Sixty-Four” actually makes people younger. To reach these impossible conclusions, the researchers deliberately p-hacked their study in multiple ways. They collected information about a number of characteristics of their study subjects, and then controlled for the one that happened to give them the result they were looking at. (It was the age of the subject’s father, for what that’s worth.) They also continued the experiment until they got a significant result, rather than predetermining the sample size. But such decisions would be hidden in a scientific report if the authors chose to do so. 
They could simply list the final sample size without acknowledging that it was not set in advance, and they could report controlling for the father’s age without acknowledging that they had also collected several additional pieces of personal information, which they ended up discarding because they did not give the desired result.
By Carl T. Bergstrom, Jevin D. West in the book Calling Bullshit (2020), in the text The Susceptibility of Science
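The two kinds of flexibility described in the quotations above, searching among several outcome measures and collecting data until a result turns significant, can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used by Simmons, Nelson and Simonsohn: all function names, the sample sizes, the choice of five independent outcomes, the peeking interval of ten subjects, and the simplified z-test with known unit variance are assumptions made for illustration. In every simulated study the true effect is zero, so each "significant" result is a false positive.

```python
import math
import random

ALPHA = 0.05  # conventional significance threshold


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_test(a, b):
    """p-value for a difference in means, assuming unit variance in both groups."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    standard_error = math.sqrt(1 / len(a) + 1 / len(b))
    return two_sided_p(diff / standard_error)


def draw(rng, n):
    """n observations from a standard normal: the treatment has no effect."""
    return [rng.gauss(0, 1) for _ in range(n)]


def fixed_n_study(rng, n=50):
    """Baseline design: one outcome, sample size fixed in advance."""
    return z_test(draw(rng, n), draw(rng, n)) < ALPHA


def multiple_outcomes_study(rng, n=50, outcomes=5):
    """Test several independent outcomes; count a success if any is significant."""
    return any(z_test(draw(rng, n), draw(rng, n)) < ALPHA for _ in range(outcomes))


def optional_stopping_study(rng, start=10, step=10, max_n=100):
    """Test after every batch of subjects and stop at the first significant result."""
    a, b = draw(rng, start), draw(rng, start)
    while True:
        if z_test(a, b) < ALPHA:
            return True
        if len(a) >= max_n:
            return False
        a += draw(rng, step)
        b += draw(rng, step)


def false_positive_rate(study, trials=2000, seed=0):
    """Share of simulated null studies that end with a 'significant' finding."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for study in (fixed_n_study, multiple_outcomes_study, optional_stopping_study):
        print(f"{study.__name__}: {false_positive_rate(study):.3f}")
    # Analytic value for k independent outcomes: 1 - (1 - alpha)^k
    print(f"analytic, 5 outcomes: {1 - (1 - ALPHA) ** 5:.3f}")
```

Under these assumptions the fixed design should produce false positives at a rate close to the nominal 0.05, while the two flexible designs should come out well above it. For five independent outcomes the analytic value is 1 - 0.95^5, about 0.23.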
Related objects
Related terms (co-word occurrence) | replication crisis (0.13), significance (0.04) |
Frequently co-cited persons | Uri Simonsohn, Leif D. Nelson, Joseph P. Simmons, John P. A. Ioannidis |
6 mentions
- False-Positive Psychology - Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant (Joseph P. Simmons, Leif D. Nelson, Uri Simonsohn) (2011)
- Die Wissenschaft in der Replikationskrise (Marko Kovic) (2016)
- New Dark Age - Technology and the End of the Future (James Bridle) (2018)
- Calling Bullshit - The Art of Skepticism in a Data-Driven World (Carl T. Bergstrom, Jevin D. West) (2020)
- Launching Registered Report Replications in Computer Science Education Research (Neil Brown, Eva Marinus, Aleata Hubbard Cheuoua) (2022)
- Prompting Considered Harmful (Meredith Ringel Morris) (2024)