A live study · running on this hub

We re-ran the code behind 83 papers.

Not "does the PDF look right" — we built each repo, ran the experiment, and checked the headline number against what the paper claimed, with a tolerance fixed in advance. Here's what came out.

83 attempted · updating as the batch runs

34%

reproduced their headline result as shipped

28 of 83 papers, within a pre-registered tolerance

46%

even built & ran to completion

the rest failed to build, crashed, or hit the time budget

74%

matched, given the code produced a number

so even when it runs, roughly one number in four disagrees

Every outcome, honestly

One bar, 83 papers. Failures aren't noise — they're the finding.

28 reproduced as shipped · 10 ran, but the number didn't match · 18 built, but the experiment crashed · 12 environment wouldn't build · 15 exceeded the time budget
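
The three headline percentages follow directly from these five counts. A minimal sketch of the arithmetic, assuming each figure is rounded to the nearest whole percent; the dictionary keys and the `pct` helper are our own names for illustration, not the hub's:

```python
# Outcome counts copied from the bar above.
counts = {
    "reproduced": 28,    # ran, number matched
    "diverged": 10,      # ran, number didn't match
    "run_failed": 18,    # built, experiment crashed
    "build_failed": 12,  # environment wouldn't build
    "timeout": 15,       # exceeded the time budget
}

total = sum(counts.values())
# Only reproduced and diverged runs produced a number to compare.
produced_number = counts["reproduced"] + counts["diverged"]


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


reproduced_rate = pct(counts["reproduced"], total)               # share of all papers
ran_rate = pct(produced_number, total)                           # share that produced a number
matched_given_ran = pct(counts["reproduced"], produced_number)   # conditional on a number

print(total, reproduced_rate, ran_rate, matched_given_ran)  # 83 34 46 74
```

Note the denominators: 34% and 46% are out of all 83 papers, while 74% is out of the 38 runs that produced a number.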

Every paper, one tile

83 papers, 83 tiles, colored by what happened. The hub owns the assertion, so authors can't self-pass.

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ≠ ≠ ≠ ≠ ≠ ≠ ≠ ≠ ≠ ≠

✓ reproduced · ≠ ran, off / no number · ✗ wouldn't build / crashed / timed out

Full table · claimed vs got, per paper

paper	outcome	claimed	got	err
catboost-default-vs-tuned	REPRODUCED	0.27	0.2749	1.815%
graph-surprise-svd-ml100k-rmse	REPRODUCED	0.934	0.9364	0.257%
hdbscan-blobs-ari	REPRODUCED	0.9	0.9692	7.689%
hmmlearn-gaussian-hmm	REPRODUCED	0.0	0.0074	0.74%
imbalanced-learn-smote-f1	REPRODUCED	0.0	0.1724	17.24%
lightgbm-speedup-claim	REPRODUCED	20.0	11.17	44.15%
ml-copod-breastw-auc	REPRODUCED	0.9936	0.9944	0.081%
ml-copod-cardio-auc	REPRODUCED	0.8974	0.9219	2.73%
ml-denmune-aggregation-ari	REPRODUCED	0.99	0.9927	0.273%
ml-gmm-iris-ari	REPRODUCED	0.9	0.9039	0.433%
ml-isoforest-digits-auc	REPRODUCED	0.95	0.9865	3.842%
ml-lof-synthetic-auc	REPRODUCED	0.99	0.999	0.909%
ml-pyod-knn-synthetic-auc	REPRODUCED	1.0	1.0	0.0%
ml-spectral-moons-ari	REPRODUCED	1.0	1.0	0.0%
ml-umap-digits-trustworthiness	REPRODUCED	0.97	0.9889	1.948%
nlp-crfsuite-conll2002-f1	REPRODUCED	0.77	0.7965	3.442%
nlp-langid-identification-acc	REPRODUCED	0.94	1.0	6.383%
nlp-nltk-naivebayes-movie-acc	REPRODUCED	0.8	0.81	1.25%
nlp-rankbm25-retrieval-mrr	REPRODUCED	1.0	1.0	0.0%
prophet-cv-mape	REPRODUCED	0.1	0.0743	25.7%
river-phishing-acc	REPRODUCED	0.8879	0.8928	0.552%
sb3-ppo-cartpole	REPRODUCED	500.0	500.0	0.0%
sentence-transformers-sts-spearman	REPRODUCED	0.85	0.8203	3.494%
sklearn-20newsgroups-tfidf	REPRODUCED	0.88	0.882	0.227%
sklearn-digits-svm-acc	REPRODUCED	0.97	0.9689	0.113%
ts-mabwiser-sim	REPRODUCED	0.9	0.986	9.556%
ts-pymc-coinflip	REPRODUCED	0.6667	0.6654	0.195%
ts-qlearning-taxi	REPRODUCED	7.9	7.9	0.0%
gensim-word2vec-analogy	DIVERGED	0.6	0.2883	51.95%
ml-denmune-jain-ari	DIVERGED	1.0	0.2355	76.45%
ml-finch-mnist10k-nmi	DIVERGED	0.8905	0.9755	9.545%
ml-kmeans-digits-nmi	DIVERGED	0.74	0.6264	15.351%
nanogpt-shakespeare-gpu	DIVERGED	1.47	1.8857	28.279%
node2vec-linkpred-auc	DIVERGED	0.97	0.7599	21.66%
ts-statsmodels-sarimax-airline	DIVERGED	1022.299	922.205	9.791%
ts-thompson-bernoulli-regret	DIVERGED	21.0	10.96	47.81%
tslearn-dtw-knn-ucr	DIVERGED	0.9	1.0	11.111%
umap-mnist-runtime	DIVERGED	42.0	129.54	208.429%
graph-deepwalk-blogcatalog-microf1	RUN_FAILED	0.4151	—
graph-ncf-neumf-ml1m-hr10	RUN_FAILED	0.73	—
ml-devnet-annthyroid-auc	RUN_FAILED	0.783	—
ml-pidforest-mammography-auc	RUN_FAILED	0.84	—
ml-pidforest-satimage2-auc	RUN_FAILED	0.982	—
ml-pidforest-thyroid-auc	RUN_FAILED	0.876	—
ml-quickshiftpp-blobs-ari	RUN_FAILED	1.0	—
ml-suod-cardio-iforest-auc	RUN_FAILED	0.9216	—
nlp-doc2vec-imdb-acc	RUN_FAILED	0.87	—
nlp-fasttext-dbpedia-p1	RUN_FAILED	0.98	—
nlp-glove-analogy-acc	RUN_FAILED	75.0	—
nlp-nbsvm-imdb-acc	RUN_FAILED	91.55	—
nlp-sif-sts-correlation	RUN_FAILED	0.717	—
nlp-vader-tweets-f1	RUN_FAILED	0.96	—
ts-minirocket-ucr	RUN_FAILED	0.969	—
ts-pmdarima-wineind	RUN_FAILED	2908.093	—
ts-rocket-ucr	RUN_FAILED	0.969	—
xgboost-higgs-auc	RUN_FAILED	0.84	—
beir-bm25-anserini-ndcg	BUILD_FAILED	0.65	—
graph-edmot-cora-modularity	BUILD_FAILED	0.4088	—
graph-openne-node2vec-wiki-microf1	BUILD_FAILED	0.651	—
graph-vgae-cora-auc	BUILD_FAILED	0.914	—
huggingface-bert-glue-gpu	BUILD_FAILED	0.93	—
nlp-brightmart-textcnn-acc	BUILD_FAILED	0.65	—
nlp-flair-sentiment-acc	BUILD_FAILED	1.0	—
nlp-textcnn-mindspore-sst2	BUILD_FAILED	0.7971	—
pomegranate-hmm-speedup	BUILD_FAILED	13	—
pyserini-bm25-beir-ndcg	BUILD_FAILED	0.679	—
ts-deeppilco-cartpole	BUILD_FAILED	0.1	—
vit-pytorch-cifar-gpu	BUILD_FAILED	0.88	—
graph-pygat-cora-acc	TIMEOUT	0.84	—
graph-pygcn-cora-acc	TIMEOUT	0.815	—
nlp-fasttext1607-agnews-acc	TIMEOUT	92.5	—
nlp-han-agnews-acc	TIMEOUT	92.7	—
nlp-scapt-absa-restaurant-acc	TIMEOUT	90.0	—
nlp-simcse-sts-spearman	TIMEOUT	76.25	—
nlp-textcnn-mr-acc	TIMEOUT	76.1	—
nlp-textcnn-sst2-acc	TIMEOUT	85.99	—
ts-darts-airpassengers	TIMEOUT	5.11	—
ts-dlinear-etth1	TIMEOUT	0.375	—
ts-nbeats-m4	TIMEOUT	13.114	—
ts-nhits-ettm2	TIMEOUT	0.255	—
ts-pyro-eightschools	TIMEOUT	4.4	—
ts-sb3-dqn-mountaincar	TIMEOUT	-100.849	—
ts-statsforecast-m4	TIMEOUT	0.94	—
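
The err column is consistent with a relative error against the claimed value, in percent. This is inferred from the rows, not taken from the hub's code: the two rows with a claimed value of 0 (where relative error is undefined) match an absolute difference times 100 instead. A sketch under that assumption, with `percent_error` as a hypothetical helper name:

```python
def percent_error(claimed: float, got: float) -> float:
    """Relative error in percent, rounded to 3 decimals.

    ASSUMPTION (inferred from the table): when the claimed value is 0,
    fall back to the absolute difference times 100.
    """
    diff = abs(got - claimed)
    if claimed == 0:
        return round(diff * 100, 3)
    return round(diff / abs(claimed) * 100, 3)


# (paper, claimed, got) copied from the table above.
rows = [
    ("catboost-default-vs-tuned", 0.27, 0.2749),
    ("hmmlearn-gaussian-hmm", 0.0, 0.0074),
    ("umap-mnist-runtime", 42.0, 129.54),
]

for paper, claimed, got in rows:
    print(f"{paper}: {percent_error(claimed, got)}%")
```

One consequence of a relative measure: it is scale-free, so an AUC near 1.0 and a runtime in seconds can share a column, but the zero-claim rows are not comparable with the rest.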

Read this number honestly

This is a pilot. 83 papers, not a random sample of all science — a signal, not a population estimate.
The sample is generous. Many are well-maintained library/tool papers; a random GitHub sample would likely reproduce less, so treat this as closer to an upper bound.
CPU-only, no GPU. GPU-dependent papers land in the failure modes by construction, which reflects our hardware, not how reproducible their fields are.
"Reproduced as shipped" ≠ "true." It means the released code re-ran and the number held — the floor of trust, not the ceiling.
Claims were pre-registered from each paper's own reported number and never tuned to pass.
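
To make "pre-registered and never tuned" concrete, here is a rough sketch of how a frozen claim could be checked and mapped to the five outcome labels in the table. Everything structural here is an assumption: the `Claim` and `classify` names, the order of the checks, and a single symmetric tolerance in percent. The table suggests the real tolerances vary per claim, and the 5% used below is made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    REPRODUCED = "REPRODUCED"
    DIVERGED = "DIVERGED"
    RUN_FAILED = "RUN_FAILED"
    BUILD_FAILED = "BUILD_FAILED"
    TIMEOUT = "TIMEOUT"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after registration
class Claim:
    paper: str
    claimed: float        # the paper's own reported number
    tolerance_pct: float  # ASSUMPTION: symmetric, in percent, fixed before the run


def classify(claim: Claim, built: bool, timed_out: bool,
             got: Optional[float]) -> Outcome:
    """Map one run to an outcome label. The order of checks is an assumption."""
    if not built:
        return Outcome.BUILD_FAILED
    if timed_out:
        return Outcome.TIMEOUT
    if got is None:  # built and started, but never emitted a number
        return Outcome.RUN_FAILED
    diff = abs(got - claim.claimed)
    # Relative error in percent; absolute difference x 100 when the claim is 0.
    err_pct = diff * 100 if claim.claimed == 0 else diff / abs(claim.claimed) * 100
    return Outcome.REPRODUCED if err_pct <= claim.tolerance_pct else Outcome.DIVERGED


# Claimed/got values are from the table; the 5% tolerance is hypothetical.
svm = Claim("sklearn-digits-svm-acc", claimed=0.97, tolerance_pct=5.0)
w2v = Claim("gensim-word2vec-analogy", claimed=0.6, tolerance_pct=5.0)

print(classify(svm, built=True, timed_out=False, got=0.9689).value)  # REPRODUCED
print(classify(w2v, built=True, timed_out=False, got=0.2883).value)  # DIVERGED
print(classify(svm, built=False, timed_out=False, got=None).value)   # BUILD_FAILED
```

The frozen dataclass is the point of the sketch: once a claim is registered, the comparison code can read it but nothing downstream can loosen the tolerance to turn a miss into a pass.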