Provenance · how credit is assigned and lost
On 18 June 2026 Jürgen Schmidhuber and David Ha republished a striking claim: that nearly every load-bearing part of the trillion-dollar AI boom was invented in a single Munich lab in a few months of 1991. Read it as an archival document and something becomes clear that the decade-long shouting match around it has obscured. The dates are largely right. It is the word “invented” that cannot carry the load, because it collapses three different things a record has to keep apart.
There is a sentence Jürgen Schmidhuber has been repeating, in one form or another, for a decade, and it is the key to everything else he writes. It opens his Annotated History of Modern AI and Deep Learning, the 100-page technical report he keeps revising: “Machine learning is the science of credit assignment.”1 He means it in the narrow technical sense first — the problem, named by Marvin Minsky in his 1961 survey Steps Toward Artificial Intelligence, of deciding which of a system’s many internal decisions deserves credit or blame for the eventual outcome.2 A neural network learns by solving exactly this: back-propagation is a procedure for assigning credit backward through layers, deciding which weight earned how much of the result. But Schmidhuber lets the sentence keep its other meaning, and the second one is what makes him notorious. Which person, which idea, which lab earned how much of the result? The discipline whose entire purpose is apportioning credit cannot agree on how to apportion its own — and the man insisting loudest that it must is running, in effect, a one-man provenance audit against the field’s collective memory.
I came to the Munich document expecting a credit grab and found something more interesting: an archival instrument, built with real rigour, pointed at a target that the format of the record makes almost impossible to hit cleanly. The fight it keeps starting is not, at bottom, about dates. It is about a confusion buried in a single verb. So this entry does what I do with any contested record — separate the claims by kind, mark where the document is on firm ground and where it overreaches, and name the structural reason the argument never ends.
Strip the rhetoric and the document is a dated catalogue. Between 26 March and 31 August 1991, a small group at the Technical University of Munich is said to have published the first version of the Transformer (the “T” in ChatGPT), unsupervised pre-training (the “P”), neural-network distillation, the residual connection at the heart of LSTM and ResNet, and the adversarial-network principle behind generative AI.3 Five claims, each attached to a real technical report with a real number. A catalogue is a thing an archivist can check, so I transcribed it — dates, report identifiers, and the modern artefact each is claimed to anticipate — without yet ruling on any of it.
An archivist who only ever caught people overclaiming would be useless; the discipline lives or dies on saying clearly where a contested claim is true. Two of these are, and they are the two worth dwelling on, because the field’s reflex is to wave the whole document away.
The 15 June entry is the soundest. In 1991 Sepp Hochreiter — Schmidhuber’s first student — wrote a diploma thesis that formally identified why deep and recurrent networks were nearly untrainable: the error signal back-propagated through many layers either shrinks toward nothing or blows up, decaying or growing roughly exponentially in depth.4 This is the vanishing-gradient problem, and that it was first set out in that thesis is not a Schmidhuber idiosyncrasy; it is the consensus account in the textbooks. The thesis also contained the fix that the rest of the decade would be built on — a connection of weight exactly 1.0 that lets the gradient pass through undiminished, the “constant error carousel” that became the memory cell of LSTM in 1997 and rhymes, structurally, with the residual shortcut of ResNet in 2015. The core identification — why depth fails — is uncontested; the line forward to LSTM is direct and his own; the reach to ResNet is a structural rhyme that the ResNet authors themselves dispute. Already, in one entry, the three kinds of claim are pulling apart — which is the whole point of what follows.
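A minimal numerical sketch of that exponential behaviour, assuming a deliberately simplified linear recurrence with one scalar self-connection. The function name `backpropagated_error` and the example weights are illustrative choices, not taken from the thesis. Under this assumption the chain rule multiplies the error by the weight once per step, so after `depth` steps it has been scaled by `weight ** depth`.

```python
def backpropagated_error(weight: float, depth: int, error: float = 1.0) -> float:
    """Error signal after flowing backward through `depth` linear steps.

    Simplifying assumption: each step is h_t = weight * h_{t-1}, so the
    local derivative is just `weight`, and the chain rule multiplies it
    into the error once per step.
    """
    for _ in range(depth):
        error *= weight
    return error


if __name__ == "__main__":
    # Slightly below 1, exactly 1, slightly above 1.
    for w in (0.9, 1.0, 1.1):
        print(w, backpropagated_error(w, depth=100))
```

With a weight of 0.9 the signal has shrunk to a few hundred-thousandths after 100 steps, with 1.1 it has grown past ten thousand, and only a weight of exactly 1.0 hands it back unchanged, which is the property the constant error carousel relies on.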
The 26 March entry is subtler and, to my eye, the most quietly remarkable. The report describes a “fast weight programmer”: one network learns, by gradient descent, to rapidly rewrite the weights of a second network, doing so through additive outer products of self-generated patterns.5 In 1991 this was a curiosity about memory. In 2021 three authors from Schmidhuber’s group (Schlag, Irie, and Schmidhuber himself) proved, in a peer-reviewed ICML paper, that this 1991 mechanism is formally equivalent to the linearised self-attention of a Transformer: the self-generated patterns are exactly what we now call keys and values; the outer-product update is exactly unnormalised linear attention.6 That is not a vibe or a family resemblance. It is an algebraic identity, checkable on paper, and it passed review. The 1991 device and one specific variant of the 2017 architecture are the same equation written thirty years apart.
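The identity can be checked numerically in a few lines. The sketch below is an illustration under stated assumptions, not the construction from either paper: the keys, values and queries are random vectors rather than outputs of a trained network, no feature map or normalisation is applied, and all function names are hypothetical. It computes the same outputs two ways, once by accumulating outer products into a weight matrix and once as causal attention without softmax.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fast_weight_outputs(keys, values, queries):
    """Fast-weight view: add the outer product v_t k_t^T to W, then read out W q_t."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Additive outer-product update of the fast weight matrix.
        W = [[w + vi * kj for w, kj in zip(row, k)] for row, vi in zip(W, v)]
        outputs.append([dot(row, q) for row in W])
    return outputs


def linear_attention_outputs(keys, values, queries):
    """Attention view: y_t = sum over i <= t of (k_i . q_t) * v_i, with no softmax."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = dot(k, q)
            y = [yi + score * vi for yi, vi in zip(y, v)]
        outputs.append(y)
    return outputs


def max_abs_difference(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Arbitrary small sizes and a fixed seed, chosen only for the demonstration.
rng = random.Random(0)
T, d_k, d_v = 6, 4, 3
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
values = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
queries = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]

fw = fast_weight_outputs(keys, values, queries)
la = linear_attention_outputs(keys, values, queries)
print(max_abs_difference(fw, la))  # differs only by floating-point rounding
```

The two functions agree because multiplying a sum of outer products by a query distributes into a sum of values weighted by key-query dot products; the code only reorders that arithmetic.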
So when the document says the “T” was anticipated in March 1991, it is, in a precise and verifiable sense, correct. And this is exactly where the trouble begins.
Here is the distinction the whole quarrel turns on, and the one a careful record has to hold open. “Who invented the Transformer?” sounds like one question. It is three, and they have three different answers.
Anticipation asks: did someone earlier write down something formally equivalent? For the linear Transformer, yes — Munich, 26 March 1991, provably.6 Influence asks something entirely different: did the later work actually descend from the earlier — read it, build on it, carry it forward? Here the answer is no. The 2017 paper that gave us the Transformer, Vaswani and colleagues’ “Attention Is All You Need,” did not derive its architecture from a 1991 fast-weight report; that report was recovered as an ancestor afterward, in 2021, by people looking backward with the modern object in hand. And canonization asks a third thing: which paper did the community crown, cite, and build its tooling around? That is unambiguously 2017.
All three statements are true at once. The 1991 work anticipated the linear Transformer; it did not influence the 2017 Transformer; the 2017 Transformer is the canonical one. The word “invented” demands that you pick one of these and call it the answer, and whichever you pick, the other two stand behind you as counter-evidence. Schmidhuber, holding the airtight anticipation, hears the field’s “Vaswani invented it” as a denial of a provable fact. The field, holding the influence and the canonization, hears Schmidhuber’s “we invented it in 1991” as a claim of descent that simply did not happen. Both sides are right about the thing they are holding and wrong about the thing they are answering. An anticipation, recovered in hindsight, is a real and citable fact about the history of ideas. It is not the same as having lit the fuse, and a record that uses one verb for both will generate this fight forever.
An anticipation is a fact about the structure of ideas. Influence is a fact about who read whom. Canonization is a fact about collective memory. “Invented” pretends they are one fact.
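One way to picture a record that keeps the three facts apart is a ledger row with a separate field for each. The sketch below is hypothetical: the class `ProvenanceEntry`, its field names and the `summarize` helper are invented for illustration, and the only data filled in is what the text above states about the linear Transformer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEntry:
    """One ledger row; each of the three questions gets its own field."""
    artefact: str
    anticipated_in: Optional[int]   # year of an earlier, formally equivalent write-up
    influenced_canonical: bool      # did the canonical work actually descend from it?
    canonized_in: int               # year of the paper the community cites


def summarize(entry: ProvenanceEntry) -> str:
    """Render the row without collapsing it into a single verb."""
    if entry.anticipated_in is not None:
        anticipated = f"anticipated in {entry.anticipated_in}"
    else:
        anticipated = "no known anticipation"
    influence = "influenced" if entry.influenced_canonical else "did not influence"
    return (
        f"{entry.artefact}: {anticipated}; {influence} the canonical work; "
        f"canonized in {entry.canonized_in}"
    )


# Values taken from the account above.
linear_transformer = ProvenanceEntry(
    artefact="linear Transformer",
    anticipated_in=1991,
    influenced_canonical=False,
    canonized_in=2017,
)

print(summarize(linear_transformer))
```

The entry is frozen so that a correction has to be filed as a new row rather than an overwrite, and there is deliberately no `invented_by` field for the three answers to be squeezed into.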
The clean cases are the spring cluster. The one in red on the ledger — 31 August 1991, the adversarial-networks claim — is where the document moves from recoverable anticipation to a contested equivalence, and it is worth being exact about why, because this is the claim that produced the field’s most public rupture.
In 1990–91 Schmidhuber described two systems built on networks playing a minimax game against each other: artificial curiosity, where a controller generates outputs and a second network tries to predict their consequences, each minimising what the other maximises; and predictability minimization, where an encoder tries to produce code units that a predictor network cannot predict from one another.7 In 2014 Ian Goodfellow and colleagues introduced the Generative Adversarial Network: a generator and a discriminator in a two-player minimax game, the generator learning to fool a network that judges real data from fake.8 The structural family resemblance — adversarial training, a minimax objective between two nets — is genuine, and Schmidhuber later argued, in a 2020 peer-reviewed paper, that GANs are a special case of his 1990 curiosity principle and closely related to predictability minimization.9
Then, at NIPS 2016, he stood up during Goodfellow’s tutorial on GANs and pressed the point in front of the room — an episode that became one of the field’s defining arguments.10 The detail that nearly every retelling drops is the one that matters most for the record: Schmidhuber did not claim to have invented GANs. He argued the method was close enough to his earlier work that it should be renamed — a credit correction, not an ownership claim. Goodfellow rebutted, then revised the published paper to cite the earlier work while laying out three concrete differences; chief among them, in GANs the adversarial competition is the sole training objective and suffices on its own, which is not true of predictability minimization.10 The community largely sided with Goodfellow.
That outcome is, I think, basically right, and it shows the anticipation/influence line doing its work. A shared abstract template (two nets, minimax) is a weaker thing than the Transformer case, where an actual equation matched. “Adversarial training” is closer to a genre than to a blueprint, and a genre is precisely the kind of resemblance that hindsight over-reads. The honest entry here is not “Schmidhuber invented GANs” and not “the 1991 work is irrelevant.” It is: a related adversarial idea was published in Munich in 1990–91; the 2014 line did not descend from it; whether the resemblance rises to equivalence is a genuine technical disagreement, and the field’s verdict is that it does not. File it as contested, with both positions dated, and move on.
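For reference, the two-player game in the 2014 paper is usually written as the value function below. The notation follows the standard textbook presentation rather than anything quoted in this entry: G is the generator, D the discriminator, p_data the data distribution and p_z the generator’s noise prior.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Nothing else appears in the objective, which is the sense in which the adversarial competition is the sole training signal in GANs.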
Step back from the particular claims and the deeper structure comes into view — and it is a structure historians of science mapped long before deep learning existed. The phenomenon Schmidhuber is fighting has a name: Stigler’s law of eponymy, the rueful principle that no scientific discovery is named after its original discoverer. Stephen Stigler stated it in 1980 and, with a precision I find delicious, named Robert Merton as its true discoverer — making the law an instance of itself.11 Merton had earlier given the mechanism behind it: the Matthew effect, by which eminent scientists accrue disproportionate credit and the obscure are forgotten, named for the verse in Matthew about those who have being given more.12 And Merton had documented the precondition that makes credit disputes inevitable in the first place — that simultaneous, independent discovery is not the exception in science but close to the rule. Calculus, natural selection, the telephone: the multiple is the normal case, the lone originator the rare one.13
If multiples are normal, then “who invented X” has, in most interesting cases, no single true answer — and eponymy is a lossy compression the field uses anyway, because it must. You cannot teach, cite, or converse while holding the full provenance tree of every idea in your head. So the community collapses the tree to a name, and the name it keeps is chosen by the Matthew effect — by who was visible, central, well-placed — not by who was first. This is the thing to see plainly: a citation graph is not a provenance chain. It is a record of canonization, optimised for navigation, and it systematically forgets anticipation. It was never built to do otherwise. Canonization, unlike provenance, can be counted — and the count is pointed: by Nature’s April 2025 reckoning the most-cited scientific paper of the entire century is ResNet, the residual-network paper of 2015,14 whose 1991 ancestor sits unremarked two rows up in the ledger. The crown is real, it is measurable, and it rests on the descendant.
Schmidhuber is, functionally, an archivist filing provenance corrections against that graph. His method is, by my own standards, very good: he never edits a claim in place; he accretes dated technical notes, each with a number, each leaving the prior record standing — which is exactly how a record should be corrected, by new dated entry rather than overwrite.1 The 1991 dates survive scrutiny. The under-credited anticipations are, in the strong cases, real. The failure is not in the archaeology. It is in the vocabulary: by phrasing recovered anticipations as claims of invention, he imports the very fiction — the single originator — that the sociology of his own field had already shown to be a compression artefact. He fights eponymy in the language of eponymy, and so the corrections land as counter-claims, and the counter-claims start fights, and the fights bury the genuinely valuable salvage underneath.
What does an archive owe a document like this? Not deference, and not dismissal. Schmidhuber and Ha have produced something genuinely useful — a dated, sourced salvage of work the canonical record under-weights — wrapped in a frame that guarantees it will be argued about rather than absorbed. The useful response is to do the disaggregation the frame refuses: keep the dates, keep the strong anticipations, demote the verb. “Invented” is the missing footnote in the whole enterprise — the word that papers over the difference between writing an equation down first, setting the fuse that actually burned, and being the one history chose to remember.
There is a closing irony I can’t leave unstated, because it is the through-line. Machine learning, Schmidhuber says, is the science of credit assignment. Within a network the problem is solvable: back-propagation hands each weight precisely the credit it earned, because the network’s structure is fully known and the chain of causation can be traced exactly backward. Across history it is not solvable that way, because the chain is not fully known — anticipation, influence, and memory diverge, multiples are the rule, and the record we keep was built to help us find things, not to tell us who was first. The discipline that solved credit assignment for weights cannot solve it for itself. That is not a scandal. It is what makes a provenance chain a harder, humbler object than a citation count — and worth the slow, qualified, dated work of building one line at a time.