A live study · running on this hub
We re-ran the code behind 83 papers.
Not "does the PDF look right" — we built each repo, ran the experiment, and checked the headline number against what the paper claimed, with a tolerance fixed in advance. Here's what came out.
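The build, run, compare sequence above can be sketched as a small classifier. This is a minimal illustration, not the hub's actual code: the `Claim` shape, the `classify` function, and the symmetric absolute tolerance are all assumptions, and the tolerance values in the usage lines are made up. Only the five outcome labels are taken from the results table on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    # Labels as they appear in the results table.
    REPRODUCED = "REPRODUCED"
    DIVERGED = "DIVERGED"
    RUN_FAILED = "RUN_FAILED"
    BUILD_FAILED = "BUILD_FAILED"
    TIMEOUT = "TIMEOUT"


@dataclass(frozen=True)  # frozen: the claim cannot be edited after it is registered
class Claim:
    """Headline number and tolerance, fixed before the run (hypothetical shape)."""
    value: float
    abs_tolerance: float  # assumption: a symmetric absolute band; the real rule may differ per paper


def classify(claim: Claim, built: bool, timed_out: bool, got: Optional[float]) -> Outcome:
    """Map one run to an outcome. The order of checks is an assumption."""
    if not built:
        return Outcome.BUILD_FAILED
    if timed_out:
        return Outcome.TIMEOUT
    if got is None:  # built, but the experiment produced no number
        return Outcome.RUN_FAILED
    if abs(got - claim.value) <= claim.abs_tolerance:
        return Outcome.REPRODUCED
    return Outcome.DIVERGED


# Claimed/got values are from the table; the tolerances are invented for illustration.
print(classify(Claim(0.934, 0.01), built=True, timed_out=False, got=0.9364).value)
print(classify(Claim(1.0, 0.05), built=True, timed_out=False, got=0.2355).value)
print(classify(Claim(0.65, 0.05), built=False, timed_out=False, got=None).value)
```

Keeping the claim immutable and the comparison on the hub side is one way to make sure a tolerance is never loosened after the number comes back.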
83 attempted · updating as the batch runs
34%
reproduced their headline result as shipped
28 of 83 papers, within a pre-registered tolerance
46%
even built & ran to a number
the rest failed to build, crashed, or timed out
74%
matched, given the code produced a number
so even when it runs, about one number in four disagrees (10 of 38)
Every outcome, honestly
One bar, 83 papers. Failures aren't noise — they're the finding.
28 reproduced as shipped · 10 ran, but the number didn't match · 18 built, but the experiment crashed · 12 environment wouldn't build · 15 exceeded the time budget
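The three headline percentages follow directly from these five counts. A short sketch of the arithmetic; the variable names are ours, the counts are the ones in the bar above:

```python
# Outcome counts as listed in the bar above.
counts = {
    "REPRODUCED": 28,
    "DIVERGED": 10,
    "RUN_FAILED": 18,
    "BUILD_FAILED": 12,
    "TIMEOUT": 15,
}

total = sum(counts.values())

# A run "produced a number" if it either matched or diverged.
produced_number = counts["REPRODUCED"] + counts["DIVERGED"]

reproduced_rate = counts["REPRODUCED"] / total                   # share of all papers
ran_rate = produced_number / total                               # share that built and ran
conditional_match_rate = counts["REPRODUCED"] / produced_number  # share of runs that matched

print(f"{total} papers")                                          # 83 papers
print(f"reproduced as shipped: {reproduced_rate:.0%}")            # 34%
print(f"built and ran: {ran_rate:.0%}")                           # 46%
print(f"matched, given a number: {conditional_match_rate:.0%}")   # 74%
```

The denominators differ: 34% and 46% are out of all 83 papers, while 74% is out of the 38 runs that produced a number.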
Every paper, one tile
83 papers, 83 tiles — colored by what happened. Hover for the paper; the hub owns the assertion, so authors can't self-pass.
✓ reproduced
≠ ran, but the number was off or missing
✗ wouldn't build / crashed / timed out
Full table: claimed vs got, per paper
| paper | outcome | claimed | got | err |
|---|---|---|---|---|
| catboost-default-vs-tuned | REPRODUCED | 0.27 | 0.2749 | 1.815% |
| graph-surprise-svd-ml100k-rmse | REPRODUCED | 0.934 | 0.9364 | 0.257% |
| hdbscan-blobs-ari | REPRODUCED | 0.9 | 0.9692 | 7.689% |
| hmmlearn-gaussian-hmm | REPRODUCED | 0.0 | 0.0074 | 0.74% |
| imbalanced-learn-smote-f1 | REPRODUCED | 0.0 | 0.1724 | 17.24% |
| lightgbm-speedup-claim | REPRODUCED | 20.0 | 11.17 | 44.15% |
| ml-copod-breastw-auc | REPRODUCED | 0.9936 | 0.9944 | 0.081% |
| ml-copod-cardio-auc | REPRODUCED | 0.8974 | 0.9219 | 2.73% |
| ml-denmune-aggregation-ari | REPRODUCED | 0.99 | 0.9927 | 0.273% |
| ml-gmm-iris-ari | REPRODUCED | 0.9 | 0.9039 | 0.433% |
| ml-isoforest-digits-auc | REPRODUCED | 0.95 | 0.9865 | 3.842% |
| ml-lof-synthetic-auc | REPRODUCED | 0.99 | 0.999 | 0.909% |
| ml-pyod-knn-synthetic-auc | REPRODUCED | 1.0 | 1.0 | 0.0% |
| ml-spectral-moons-ari | REPRODUCED | 1.0 | 1.0 | 0.0% |
| ml-umap-digits-trustworthiness | REPRODUCED | 0.97 | 0.9889 | 1.948% |
| nlp-crfsuite-conll2002-f1 | REPRODUCED | 0.77 | 0.7965 | 3.442% |
| nlp-langid-identification-acc | REPRODUCED | 0.94 | 1.0 | 6.383% |
| nlp-nltk-naivebayes-movie-acc | REPRODUCED | 0.8 | 0.81 | 1.25% |
| nlp-rankbm25-retrieval-mrr | REPRODUCED | 1.0 | 1.0 | 0.0% |
| prophet-cv-mape | REPRODUCED | 0.1 | 0.0743 | 25.7% |
| river-phishing-acc | REPRODUCED | 0.8879 | 0.8928 | 0.552% |
| sb3-ppo-cartpole | REPRODUCED | 500.0 | 500.0 | 0.0% |
| sentence-transformers-sts-spearman | REPRODUCED | 0.85 | 0.8203 | 3.494% |
| sklearn-20newsgroups-tfidf | REPRODUCED | 0.88 | 0.882 | 0.227% |
| sklearn-digits-svm-acc | REPRODUCED | 0.97 | 0.9689 | 0.113% |
| ts-mabwiser-sim | REPRODUCED | 0.9 | 0.986 | 9.556% |
| ts-pymc-coinflip | REPRODUCED | 0.6667 | 0.6654 | 0.195% |
| ts-qlearning-taxi | REPRODUCED | 7.9 | 7.9 | 0.0% |
| gensim-word2vec-analogy | DIVERGED | 0.6 | 0.2883 | 51.95% |
| ml-denmune-jain-ari | DIVERGED | 1.0 | 0.2355 | 76.45% |
| ml-finch-mnist10k-nmi | DIVERGED | 0.8905 | 0.9755 | 9.545% |
| ml-kmeans-digits-nmi | DIVERGED | 0.74 | 0.6264 | 15.351% |
| nanogpt-shakespeare-gpu | DIVERGED | 1.47 | 1.8857 | 28.279% |
| node2vec-linkpred-auc | DIVERGED | 0.97 | 0.7599 | 21.66% |
| ts-statsmodels-sarimax-airline | DIVERGED | 1022.299 | 922.205 | 9.791% |
| ts-thompson-bernoulli-regret | DIVERGED | 21.0 | 10.96 | 47.81% |
| tslearn-dtw-knn-ucr | DIVERGED | 0.9 | 1.0 | 11.111% |
| umap-mnist-runtime | DIVERGED | 42.0 | 129.54 | 208.429% |
| graph-deepwalk-blogcatalog-microf1 | RUN_FAILED | 0.4151 | — | |
| graph-ncf-neumf-ml1m-hr10 | RUN_FAILED | 0.73 | — | |
| ml-devnet-annthyroid-auc | RUN_FAILED | 0.783 | — | |
| ml-pidforest-mammography-auc | RUN_FAILED | 0.84 | — | |
| ml-pidforest-satimage2-auc | RUN_FAILED | 0.982 | — | |
| ml-pidforest-thyroid-auc | RUN_FAILED | 0.876 | — | |
| ml-quickshiftpp-blobs-ari | RUN_FAILED | 1.0 | — | |
| ml-suod-cardio-iforest-auc | RUN_FAILED | 0.9216 | — | |
| nlp-doc2vec-imdb-acc | RUN_FAILED | 0.87 | — | |
| nlp-fasttext-dbpedia-p1 | RUN_FAILED | 0.98 | — | |
| nlp-glove-analogy-acc | RUN_FAILED | 75.0 | — | |
| nlp-nbsvm-imdb-acc | RUN_FAILED | 91.55 | — | |
| nlp-sif-sts-correlation | RUN_FAILED | 0.717 | — | |
| nlp-vader-tweets-f1 | RUN_FAILED | 0.96 | — | |
| ts-minirocket-ucr | RUN_FAILED | 0.969 | — | |
| ts-pmdarima-wineind | RUN_FAILED | 2908.093 | — | |
| ts-rocket-ucr | RUN_FAILED | 0.969 | — | |
| xgboost-higgs-auc | RUN_FAILED | 0.84 | — | |
| beir-bm25-anserini-ndcg | BUILD_FAILED | 0.65 | — | |
| graph-edmot-cora-modularity | BUILD_FAILED | 0.4088 | — | |
| graph-openne-node2vec-wiki-microf1 | BUILD_FAILED | 0.651 | — | |
| graph-vgae-cora-auc | BUILD_FAILED | 0.914 | — | |
| huggingface-bert-glue-gpu | BUILD_FAILED | 0.93 | — | |
| nlp-brightmart-textcnn-acc | BUILD_FAILED | 0.65 | — | |
| nlp-flair-sentiment-acc | BUILD_FAILED | 1.0 | — | |
| nlp-textcnn-mindspore-sst2 | BUILD_FAILED | 0.7971 | — | |
| pomegranate-hmm-speedup | BUILD_FAILED | 13 | — | |
| pyserini-bm25-beir-ndcg | BUILD_FAILED | 0.679 | — | |
| ts-deeppilco-cartpole | BUILD_FAILED | 0.1 | — | |
| vit-pytorch-cifar-gpu | BUILD_FAILED | 0.88 | — | |
| graph-pygat-cora-acc | TIMEOUT | 0.84 | — | |
| graph-pygcn-cora-acc | TIMEOUT | 0.815 | — | |
| nlp-fasttext1607-agnews-acc | TIMEOUT | 92.5 | — | |
| nlp-han-agnews-acc | TIMEOUT | 92.7 | — | |
| nlp-scapt-absa-restaurant-acc | TIMEOUT | 90.0 | — | |
| nlp-simcse-sts-spearman | TIMEOUT | 76.25 | — | |
| nlp-textcnn-mr-acc | TIMEOUT | 76.1 | — | |
| nlp-textcnn-sst2-acc | TIMEOUT | 85.99 | — | |
| ts-darts-airpassengers | TIMEOUT | 5.11 | — | |
| ts-dlinear-etth1 | TIMEOUT | 0.375 | — | |
| ts-nbeats-m4 | TIMEOUT | 13.114 | — | |
| ts-nhits-ettm2 | TIMEOUT | 0.255 | — | |
| ts-pyro-eightschools | TIMEOUT | 4.4 | — | |
| ts-sb3-dqn-mountaincar | TIMEOUT | -100.849 | — | |
| ts-statsforecast-m4 | TIMEOUT | 0.94 | — |
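The err column is consistent with a relative error, |got − claimed| / |claimed|, shown as a percentage. For the two rows where the claim is 0.0, the listed values match the absolute difference times 100 instead. The sketch below is inferred from the table, not taken from the hub's code, and `percent_error` is a hypothetical name.

```python
def percent_error(claimed: float, got: float) -> float:
    """Percent error as it appears in the err column (inferred, not official)."""
    diff = abs(got - claimed)
    if claimed == 0:
        # Assumed fallback: relative error is undefined for a zero claim,
        # so use the absolute difference expressed in percentage points.
        return diff * 100
    return diff / abs(claimed) * 100


# (paper, claimed, got, err as listed in the table)
rows = [
    ("catboost-default-vs-tuned", 0.27, 0.2749, 1.815),
    ("hmmlearn-gaussian-hmm", 0.0, 0.0074, 0.74),
    ("ts-statsmodels-sarimax-airline", 1022.299, 922.205, 9.791),
    ("umap-mnist-runtime", 42.0, 129.54, 208.429),
]

for paper, claimed, got, listed in rows:
    computed = percent_error(claimed, got)
    # Allow for the table's rounding to three decimals.
    status = "ok" if abs(computed - listed) < 1e-3 else "mismatch"
    print(f"{paper}: computed {computed:.3f}%, listed {listed}% ({status})")
```

Note that err alone does not decide the outcome: lightgbm-speedup-claim is REPRODUCED at 44.15% while ml-finch-mnist10k-nmi is DIVERGED at 9.545%, which suggests the tolerance is set per claim rather than as one global threshold.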
Read this number honestly
- This is a pilot. 83 papers, not a random sample of all science — a signal, not a population estimate.
- The sample is generous. Many are well-maintained library/tool papers; a random GitHub sample would likely reproduce less, so treat this as closer to an upper bound.
- CPU-only, no GPU. GPU-dependent papers land in the failure modes by construction, so those failures reflect the test setup, not the field.
- "Reproduced as shipped" ≠ "true." It means the released code re-ran and the number held — the floor of trust, not the ceiling.
- Claims were pre-registered from each paper's own reported number and never tuned to pass.