Methodology

How the rankings are built

This page documents how the Top 100 list is constructed, what's in the data, and what's deliberately out.

Data sources

| Source | What it gives | Limitations |
|---|---|---|
| arXiv (math.NT, math.CO) | Preprint-level: titles, abstracts, authors, dates, co-author graph | Biased toward people who post preprints. Senior figures who publish only in journals are undercounted. |
| OpenAlex | Author-level: paper count, citations, affiliations, country | Concept tagging is noisy in math; surname-only matching can misidentify. |
| zbMATH Open | Curated math review database; canonical author codes; editor-assigned MSC classification (we use the three Goldbach-core classes) | Coverage of older non-Western mathematicians is the best of the three ranking sources; the REST API is gated behind a one-time Terms-of-Use acceptance. |
| Mathematics Genealogy Project | Advisor-student trees | Dissertation-era affiliations only; gaps for some non-Western mathematicians. |

Pipeline

Title-weighting. A paper can mention the Goldbach conjecture without being about it, for example a paper that cites Goldbach in its introduction as a famous open problem. To separate genuine work from passing mentions, the arXiv and OpenAlex pipelines weight a keyword match by where it appears: a match in the paper title counts at full weight, and a match only in the abstract counts at half (a factor of 0.5). zbMATH is not title-weighted, because its documents are classified by human editors, so the subject class itself is the relevance signal.
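
The weighting rule can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names, the dictionary layout of a paper, and the case-insensitive substring matching are all assumptions made for the example. Only the two weights (1.0 for a title match, 0.5 for an abstract-only match) come from the text above.

```python
TITLE_WEIGHT = 1.0     # keyword appears in the title (from the text)
ABSTRACT_WEIGHT = 0.5  # keyword appears only in the abstract (from the text)


def match_weight(title, abstract, keywords):
    """Return the weight one paper contributes, or 0.0 if nothing matches.

    Assumption: matching is a case-insensitive substring test.
    """
    title_l = title.lower()
    abstract_l = abstract.lower()
    lowered = [k.lower() for k in keywords]
    if any(k in title_l for k in lowered):
        return TITLE_WEIGHT
    if any(k in abstract_l for k in lowered):
        return ABSTRACT_WEIGHT
    return 0.0


def weighted_paper_count(papers, keywords):
    """Sum the title-weighted contributions of an author's papers.

    Each paper is a dict with hypothetical keys "title" and "abstract".
    """
    return sum(match_weight(p["title"], p["abstract"], keywords) for p in papers)
```

Under these assumptions, an author with one paper titled after the Goldbach problem and one that only mentions Goldbach in the abstract would have a weighted count of 1.5 rather than 2.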

  1. arXiv pull: 17 search terms (Goldbach, prime gap, twin prime, Vinogradov, Hardy-Littlewood, sum of two primes, and others) restricted to the math.NT and math.CO categories. Each paper's contribution to an author is title-weighted as above. A co-authorship graph is built, and each author's eigenvector centrality on that graph is the second factor. Composite: 0.60 * pr(weighted papers) + 0.40 * pr(eigen). Authors with at least 3 topical papers qualify. Result: 137 names.
  2. OpenAlex pull: 13 phrase queries (Goldbach conjecture, Goldbach problem, Goldbach's conjecture, and adjacent additive-prime phrases), with an author cap of 10 per work to remove physics megapapers. Works and their citations are title-weighted as above. Composite: 0.60 * pr(weighted works) + 0.40 * pr(weighted citations). Result: 572 ranked authors.
  3. zbMATH pull: documents tagged with any of the three Goldbach-core MSC classes, 11P32 (Goldbach-type additive problems), 11P55 (Hardy-Littlewood circle method), or 11N36 (applications of sieve methods). The broader distribution classes 11N05 (distribution of primes) and 11N13 (primes in arithmetic progressions) were tested and dropped: they pulled in general analytic and multiplicative number theorists whose connection to Goldbach is diffuse, while every genuine additive-prime and circle-method specialist survived the narrower set. Composite: 0.60 * pr(papers) + 0.40 * pr(eigen) over the co-authorship graph among authors with at least 3 topical documents. Result: 495 ranked authors. The editor-assigned MSC classes correct two systematic gaps in the other sources: pre-1995 Russian and Chinese number theorists, and specialists who publish in journals with sparse arXiv presence.
  4. Merge and scoring: the three rankings are surname-deduplicated and joined. The three ranks are combined with a weighted order statistic: each researcher's three ranks are sorted and weighted 0.70 on the best, 0.20 on the middle, and 0.10 on the worst. Sorting before weighting means the method rewards excellence in any one Goldbach pipeline (a researcher who is top in zbMATH but absent from arXiv still scores well), while a researcher strong across all three still finishes ahead. Lower combined score ranks higher. An earlier version simply summed the three ranks, which punished anyone outstanding in one source but weak in another; the weighted order statistic fixes that and made the hand-curated additions unnecessary.
  5. Estimating a missing rank (interpolation): a researcher ranked by only one or two of the three pipelines is not given a flat penalty. To estimate a missing rank, we order the whole pool by a pipeline the researcher does appear in, then walk outward to the two nearest researchers above and the two nearest below who carry a real rank in the missing pipeline, and average those (up to four) values. If two pipelines can each supply such a neighbourhood, we compute the estimate from each and average the two, so the two sources count equally. One rule protects the scoring: the 0.70 top weight may only land on a measured rank, so an estimate can support a researcher's score but can never be their headline signal. Estimated ranks show in [square brackets] on the Top 100 table; measured ranks show plain.
  6. Hand-curated edits: an exclusions file removes researchers the automated pipeline surfaced in error (see Audit decisions). The merge no longer hand-places any researcher: an earlier version floored a few canonical figures editorially, but the weighted order statistic in step 4 now lets them earn their rank on their own.
  7. Genealogy reseeded: the Top 100 becomes the seed list for the Mathematics Genealogy Project graph, and close-relations surface from the network analysis.
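
Steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the production merge: the function names and the `pool` layout (researcher name mapped to a dict of measured ranks per pipeline) are hypothetical. The text says the 0.70 weight may only land on a measured rank but does not say how that is enforced, so the sketch assumes the best measured rank takes the 0.70 slot and the remaining two ranks fill the 0.20 and 0.10 slots in sorted order.

```python
WEIGHTS = (0.70, 0.20, 0.10)  # best, middle, worst rank (from the text)


def estimate_missing_rank(target, known, missing, pool):
    """Estimate target's rank in `missing` from neighbours in `known`.

    pool maps researcher name -> {pipeline name: measured rank}.
    Returns None when no neighbour has a measured rank in `missing`.
    """
    # Order everyone who has a measured rank in the known pipeline.
    ordered = sorted(
        (name for name in pool if known in pool[name]),
        key=lambda name: pool[name][known],
    )
    i = ordered.index(target)
    # Walk outward: the two nearest above and the two nearest below
    # that carry a measured rank in the missing pipeline.
    above = [pool[n][missing] for n in reversed(ordered[:i]) if missing in pool[n]][:2]
    below = [pool[n][missing] for n in ordered[i + 1:] if missing in pool[n]][:2]
    neighbours = above + below
    if not neighbours:
        return None
    return sum(neighbours) / len(neighbours)


def combined_score(ranks):
    """Weighted order statistic over three (rank, is_measured) pairs.

    Assumption: the 0.70 slot takes the best measured rank; the other
    two ranks, measured or estimated, fill the 0.20 and 0.10 slots.
    Lower scores rank higher.
    """
    measured = [value for value, is_measured in ranks if is_measured]
    if len(ranks) != 3 or not measured:
        raise ValueError("need three ranks, at least one of them measured")
    top = min(measured)
    rest = sorted(value for value, _ in ranks)
    rest.remove(top)  # drop one copy of the headline rank
    return WEIGHTS[0] * top + WEIGHTS[1] * rest[0] + WEIGHTS[2] * rest[1]
```

With all three ranks measured, a researcher ranked 1, 40, and 300 scores 0.70 * 1 + 0.20 * 40 + 0.10 * 300 = 38.7, while one ranked 20th everywhere scores 20. A plain sum would give 341 against 60, which shows how much less the weighted order statistic punishes one weak source. When two pipelines can both supply a neighbourhood, the estimate described in step 5 would be the mean of two `estimate_missing_rank` calls.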

Audit decisions

Excluded

A small number of authors surfaced by the automated pipeline are removed by hand. Some work in unrelated fields, for example chemistry, mathematical demography, or coding theory, and were pulled in by surname collisions or noisy topic tagging. A few others are self-published authors whose output is not part of mainstream research. The specific names are kept internal: listing them here would only give them visibility, which is the opposite of the point.

Name aliases (forced MGP ids)

| Display name | Forced MGP record |
|---|---|
| R. C. Vaughan | Robert Charles Vaughan, MGP id 27012 (a surname-only match is ambiguous with Charles Vaughan, id 225220) |
| H. A. Helfgott | Harald Andres Helfgott, MGP id 69999 |
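
One way a forced-id table like this might be applied is as an override that is consulted before any surname-based lookup. The sketch below is hypothetical: `resolve_mgp_id` and the `surname_matches` argument are invented for illustration, and only the two ids come from the table above.

```python
# Forced records from the alias table above.
FORCED_MGP_IDS = {
    "R. C. Vaughan": 27012,
    "H. A. Helfgott": 69999,
}


def resolve_mgp_id(display_name, surname_matches):
    """Pick an MGP id for a display name.

    surname_matches is a hypothetical list of candidate ids returned
    by a surname-only search. A forced id always wins; otherwise a
    match is accepted only when it is unambiguous.
    """
    forced = FORCED_MGP_IDS.get(display_name)
    if forced is not None:
        return forced
    if len(surname_matches) == 1:
        return surname_matches[0]
    return None  # ambiguous or not found: leave for manual review
```

With this ordering, a surname search for Vaughan that returns both 27012 and 225220 still resolves to 27012, and an ambiguous name with no forced record is left unresolved rather than guessed.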

What's not in this list