GPT-5.2 Just Solved a 15-Year Physics Mystery — Then Scored 0% on the Physics Exam


OpenAI's GPT-5.2 derived a new formula for gluon amplitudes that eluded top physicists for 15 years, then scored 0% on CritPt. The paradox every engineer...

Human-architected research synthesized with the assistance of AI personas.
16 min read


💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 60 seconds:

  • GPT-5.2 Pro conjectured a formula for single-minus gluon scattering amplitudes — a problem that Nima Arkani-Hamed (Institute for Advanced Study) had been curious about for 15 years. An internal scaffolded version then proved it in 12 hours.
  • The formula is the analogue of Parke-Taylor for single-minus amplitudes — a result physicists assumed was impossible for four decades. Co-authored with researchers from IAS, Harvard, Cambridge, Vanderbilt, and OpenAI.
  • On the CritPt benchmark — 71 research-level physics challenges designed by 50+ active researchers — GPT-5.2 at maximum reasoning effort scored 0%. Zero.
  • The paradox reveals a fundamental truth: Pattern recognition over superexponential complexity and first-principles reasoning from scratch are different cognitive capabilities. LLMs excel at the former. They fail at the latter.
  • For engineers: LLMs are "refactoring engines" for complexity. Give them base cases and ask them to generalize. Don't ask them to reason from scratch.
  • The "Erdős Threshold": We've crossed the point where AI models contribute publishable, peer-reviewed results to fundamental science — not as independent researchers, but as collaborators that see patterns humans can't.
  • Bottom line: The models aren't coming for your job. They're coming for the parts of your job where pattern recognition across massive complexity is the bottleneck. The question is: do you know which parts of your work are which?

The Discovery That Shouldn't Exist

On February 13, 2026, OpenAI published a preprint on arXiv titled "Single-minus gluon tree amplitudes are nonzero."

Read that title again. It's not a product launch. It's not an API update. It's not a benchmarking press release. It's a physics paper — co-authored by researchers from the Institute for Advanced Study, Harvard, Cambridge, Vanderbilt, and OpenAI — proving that a type of particle interaction that physicists assumed was impossible for four decades actually happens.

And the formula that cracked it? It was conjectured by GPT-5.2 Pro.

Nima Arkani-Hamed — one of the most brilliant theoretical physicists alive, a professor at the Institute for Advanced Study, the place where Einstein worked — called the result "exciting" and said the physics behind these scattering processes had "been something I have been curious about since I first ran into them about fifteen years ago."

Fifteen years. One of the sharpest minds in physics. Unsolved.

GPT-5.2 Pro solved it in a single session.

And here's where the story gets dangerous: on the CritPt benchmark — a test specifically designed by 50+ active physics researchers to evaluate genuine research-level physics reasoning — GPT-5.2 xhigh scored 0%.

Zero. Not 5%. Not 1%. Zero.

This is the most important paradox in AI right now. And if you're an engineer building systems with LLMs, it should terrify and exhilarate you in equal measure.


What Actually Happened: The Technical Breakdown

For the engineers in the audience who didn't spend their PhD years in particle physics (which is most of us), here's what matters.

Gluons are the particles that carry the strong nuclear force — the force that holds quarks together inside protons and neutrons. When physicists calculate how gluons scatter (bounce off each other), they use mathematical objects called scattering amplitudes.

In 1986, Parke and Taylor published a legendary result: a simple, elegant, single-term formula for MHV amplitudes (maximally helicity violating) — the case where exactly two gluons have negative helicity. This was a breakthrough because naively, an n-gluon amplitude involves on the order of n! terms. Parke-Taylor compressed that to a single expression.
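
For readers who want to see what "a single expression" means here, the standard textbook form of the Parke-Taylor result is sketched below. It is written up to overall normalization (coupling constants and the momentum-conserving delta function are omitted, and conventions vary between references), with \(\langle a\, b \rangle\) denoting the usual spinor bracket. This is the 1986 MHV formula, not the new single-minus formula from the paper.

```latex
\[
A_n^{\text{MHV}}(1^+, \ldots, i^-, \ldots, j^-, \ldots, n^+)
  \;\propto\;
  \frac{\langle i\, j \rangle^4}
       {\langle 1\, 2 \rangle \langle 2\, 3 \rangle \cdots \langle n\, 1 \rangle}
\]
```

Gluons \(i\) and \(j\) carry negative helicity and all others carry positive helicity. The whole n-gluon amplitude is one ratio of brackets, which is why the result was so striking next to the factorial growth of the diagram expansion.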

But there was a related case that everyone assumed was trivial: single-minus amplitudes — where only one gluon has negative helicity. The standard textbook argument, going back decades, said these amplitudes vanish. They're zero. Move along, nothing to see here.

Turns out, that's wrong.

The new paper shows that in a specific "half-collinear regime" — where gluon momenta follow a special alignment condition — single-minus amplitudes are nonzero. They're distributional, not smooth, which is why they were invisible to conventional approaches.

Here's the workflow that produced this result:

Step 1 — Human Calculation. The paper's authors (Alfredo Guevara from IAS, Alex Lupsasca from Vanderbilt/OpenAI, David Skinner from Cambridge, and Andrew Strominger from Harvard) manually calculated the amplitudes for small values of n up to n=6. The resulting expressions were — in the words of the paper — "very complicated," corresponding to a Feynman diagram expansion whose complexity grows superexponentially in n.

Step 2 — GPT-5.2 Pro Simplification. The model was given these complex expressions and simplified them dramatically. This is pattern recognition at a level that required spotting structure across superexponentially growing symbolic expressions.

Step 3 — GPT-5.2 Pro Conjecture. From the simplified base cases for n=4,5,6, the model identified a pattern and conjectured a general formula valid for all n. This is the central result of the paper.

Step 4 — Machine Verification. An internal scaffolded version of GPT-5.2 then spent approximately 12 hours reasoning through the conjecture and independently produced a formal proof of its validity.

Step 5 — Human Verification. The human authors analytically verified the formula against the Berends-Giele recursion relation, cyclic symmetry, reflection symmetry, and Weinberg's soft theorem.

Step 6 — Extension. With GPT-5.2's help, the result has already been extended from gluons to gravitons (the hypothetical particles mediating gravity), suggesting the underlying mathematical structure is far more general than anyone expected.

As one of the actual paper authors, Alex Lupsasca, clarified: "The main significance of this new paper is to point out that 'single-minus amplitudes' which had previously been thought to vanish are actually nontrivial. Moreover, GPT-5.2 Pro computed a simple formula for the single-minus amplitudes that is the analogue of the Parke-Taylor formula."

The analogue of Parke-Taylor. For those who know what that means in theoretical physics, this is not incremental.


The CritPt Paradox: 0% on the Physics Final

Now the uncomfortable part.

CritPt (Complex Research using Integrated Thinking — Physics Test) is a benchmark created by over 50 active physics researchers from 30+ institutions. It contains 71 composite research challenges spanning 11 physics subfields — condensed matter, quantum physics, high energy physics, astrophysics, and more. Each problem underwent an average of 40+ hours of design and review. Answers are "guess-resistant," using floating-point arrays, symbolic expressions, and Python functions.
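
To make "guess-resistant" concrete, here is a minimal sketch of how a floating-point array answer could be graded. This is not CritPt's actual grader: the function name `arrays_match` and the tolerance values are assumptions chosen for illustration. The point is that every entry has to agree with the reference within tolerance, so there is no 25% multiple-choice floor to fall back on.

```python
import math
from typing import Sequence


def arrays_match(
    submitted: Sequence[float],
    reference: Sequence[float],
    rel_tol: float = 1e-6,   # assumed tolerance, for illustration only
    abs_tol: float = 1e-9,   # assumed tolerance, for illustration only
) -> bool:
    """Return True only if both arrays have equal length and every entry agrees."""
    if len(submitted) != len(reference):
        return False
    return all(
        math.isclose(s, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for s, r in zip(submitted, reference)
    )


# A near-miss on a single entry fails the whole answer.
print(arrays_match([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # all entries agree
print(arrays_match([1.0, 2.0, 3.0], [1.0, 2.1, 3.0]))   # one entry is off
```

With answers in this shape, partial understanding earns nothing: a derivation that goes wrong at any step produces an array that does not match.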

This isn't GPQA Diamond (where GPT-5.2 Pro scores 93.2% on multiple-choice graduate-level physics). CritPt simulates actual research workflows — the kind of multi-step reasoning where you need to set up a problem, choose a formalism, execute calculations, handle edge cases, and arrive at a verifiable answer.

The leaderboard as of February 2026:

| Model | CritPt Score |
| --- | --- |
| Gemini 3 Pro | 9.1% |
| Claude Opus 4.5 | ~5% |
| GPT-5.1 (high) | ~5% |
| GPT-5.2 (xhigh) | 0% |
| GPT-5.2 (high) | 11.6%* |

(*Note: GPT-5.2 at "high" reasoning effort scored 11.6%, while "xhigh" scored 0% — an inversion that itself demands explanation. This suggests that maximum reasoning effort may actually degrade performance on certain problem types, possibly through overthinking or getting trapped in unproductive reasoning chains.)

So we have a model that:

  • Derived a formula that eluded one of the world's top physicists for 15 years
  • Produced a formal proof after 12 hours of reasoning
  • Extended the result to gravitons
  • Cannot solve CritPt research physics problems

How is this possible?


Pattern Recognition Is Not Reasoning (But It Might Be More Useful)

The answer lies in the topology of the problem space.

The gluon amplitude problem was a pattern recognition task at its core. The human researchers had already done the hard conceptual work: identifying the half-collinear regime, setting up the correct framework, and computing base cases by hand. What they couldn't do was see through the superexponential complexity to the simple pattern underneath.

GPT-5.2 Pro excels at exactly this. Given complex symbolic expressions with latent structure, it can compress, simplify, and generalize. This is what LLMs do — they recognize patterns in high-dimensional spaces. The gluon problem was, in a precise sense, "in-distribution" for the model: it required interpolation across structured mathematical data.

CritPt, on the other hand, tests something fundamentally different. It tests generative reasoning from first principles — setting up a problem, choosing an approach, executing multi-step derivations with perfect precision, and handling the kind of subtle edge cases that make research actually hard. There's no "pattern to spot" because the problems are novel — specifically designed to resist retrieval.

This isn't just an AI observation. It's a fundamental insight about intelligence itself:

The ability to spot patterns in complex data and the ability to reason from first principles are not the same capability. They may not even be correlated.

The truth is somewhere uncomfortable: GPT-5.2 did produce something genuinely new. The formula for single-minus gluon amplitudes is not in any textbook or training dataset. But it produced it through a methodology closer to mathematical refactoring than to theoretical physics reasoning.

The implication for engineers is profound: the tool is powerful in unexpected ways and weak in expected ways.


What Nima Arkani-Hamed's Curiosity Tells Us About the Future

Consider the timeline:

  • 1986: Parke and Taylor derive MHV amplitudes. Single-minus case assumed to vanish.
  • ~2011: Arkani-Hamed begins investigating degenerate scattering processes. Finds the question intriguing but can't crack it.
  • 2026: GPT-5.2 Pro spots the pattern in an afternoon.

This is not "AI replacing physicists." Arkani-Hamed's curiosity — his 15 years of thinking about the problem — is what made the question askable in the first place. Without the human researchers setting up the half-collinear regime and computing the base cases, GPT-5.2 would have had nothing to work with.

What we're seeing is a new division of cognitive labor:

| Role | Capability | Example |
| --- | --- | --- |
| Humans | Formulate questions, define regimes, provide frameworks, verify results | Identifying the half-collinear regime |
| LLMs | Navigate superexponential complexity, spot patterns, compress and generalize | Conjecturing the general n formula |

Nathaniel Craig, professor of physics at UC Santa Barbara, called the work "a glimpse into the future of AI-assisted science, with physicists working hand-in-hand with AI to generate and validate new insights."

The broader arXiv preprint (co-authored by 14 researchers across IAS, Harvard, Cambridge, Vanderbilt, and OpenAI) documents results across six scientific domains: mathematics, physics, astronomy, computer science, biology, and materials science. This includes four new results in mathematics verified by human authors.

The pace of AI-assisted scientific discovery is accelerating. In January, GPT-5.2 Pro autonomously solved Erdős Problem #728, a decades-old challenge in combinatorics. The gluon result extends this from pure mathematics into theoretical physics.


The Engineer's Playbook: What This Means for You

If you're reading this on gsstk, you're probably not a theoretical physicist. You're an engineer. So let's translate.

1. LLMs Are "Refactoring Engines" for Complexity

The gluon result is, at its core, a refactoring operation. Take complicated expressions → simplify → identify patterns → generalize. This is precisely what senior engineers do when they look at a codebase and see the abstraction hiding under 10,000 lines of spaghetti.

If your work involves analyzing complex distributed traces, optimizing compiler intermediate representations, debugging race conditions in concurrent systems, or simplifying byzantine configuration manifests — you are working in the same problem class where GPT-5.2 excelled.

The lesson: give the model the base cases, and ask it to generalize. Don't ask it to reason from scratch.
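
Here is a toy sketch of that loop in Python. Nothing in it comes from the paper: the sum, the function names, and the candidate formula are all stand-ins. An expensive brute-force routine plays the role of the hand-computed base cases, `conjecture` plays the role of a closed form a model might propose after seeing a few of them, and the check runs on cases the proposer never saw.

```python
from fractions import Fraction
from typing import Callable, Iterable, Optional


def brute_force(n: int) -> Fraction:
    """Stand-in for the expensive base-case computation: sum n terms explicitly."""
    return sum((Fraction(1, k * (k + 1)) for k in range(1, n + 1)), Fraction(0))


def conjecture(n: int) -> Fraction:
    """Hypothetical closed form proposed after inspecting small n."""
    return Fraction(n, n + 1)


def first_counterexample(
    candidate: Callable[[int], Fraction],
    reference: Callable[[int], Fraction],
    cases: Iterable[int],
) -> Optional[int]:
    """Return the first n where candidate and reference disagree, else None."""
    for n in cases:
        if candidate(n) != reference(n):
            return n
    return None


# Check the conjecture on held-out cases, using exact rational arithmetic.
print(first_counterexample(conjecture, brute_force, range(5, 50)))
```

Passing held-out cases is evidence, not proof. In the gluon workflow that gap was closed by a separate proof step and by human verification.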

2. The "12-Hour Think" Is a New Primitive

An internal scaffolded version of GPT-5.2 spent 12 hours reasoning through the gluon conjecture. This is not autocomplete. This is not even "agentic coding" in the current sense. This is extended autonomous reasoning — a process that takes hours, explores blind alleys, backtracks, and eventually converges on a proof.

For engineering teams, this suggests a future where you can assign problems to reasoning models the same way you assign tickets to engineers — with the expectation that the model will spend hours or days working through the problem independently.

Mitchell Hashimoto's advice from his honest AI coding guide — which we covered here — suddenly looks prescient: "End-of-day agents for research and triage give you a 'warm start' next morning." The gluon result is what happens when you extend that principle from "end of day" to "end of week."

3. Verification Is the New Bottleneck (Again)

The six-step workflow in the gluon paper (calculate, simplify, conjecture, prove, verify, extend) mirrors the emerging pattern in software engineering. The model generates. The human verifies. The bottleneck has shifted from generation to verification.

This is the same dynamic we see with the Harness Problem in coding agents: the model that writes the code isn't the bottleneck. The system that applies, tests, and validates the changes is. The paper's methodology — where an internal model first proves the conjecture, and human experts then verify against four independent criteria — is a template for production AI systems.
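
As a rough sketch of what "verify against independent criteria" can look like in code, the example below runs two symmetry checks on the classic Parke-Taylor MHV formula. It does not use the paper's new single-minus formula, and it implements only cyclic and reflection symmetry; Berends-Giele recursion and the soft theorem are left out. The helper names and the random complex spinors are assumptions for illustration, and momentum conservation is not imposed because these two algebraic identities hold without it.

```python
import cmath
import random


def angle(a, b):
    """Spinor bracket <ab> = a1*b2 - a2*b1 for two-component spinors."""
    return a[0] * b[1] - a[1] * b[0]


def parke_taylor(spinors, i, j):
    """MHV expression <ij>^4 / (<12><23>...<n1>), up to normalization."""
    n = len(spinors)
    denom = 1
    for k in range(n):
        denom *= angle(spinors[k], spinors[(k + 1) % n])
    return angle(spinors[i], spinors[j]) ** 4 / denom


def random_spinors(n, seed=0):
    """Random complex spinors (illustrative inputs, not physical momenta)."""
    rng = random.Random(seed)
    return [
        (complex(rng.gauss(0, 1), rng.gauss(0, 1)),
         complex(rng.gauss(0, 1), rng.gauss(0, 1)))
        for _ in range(n)
    ]


def check_cyclic(spinors, i, j):
    """Rotating the particle ordering by one must leave the value unchanged."""
    n = len(spinors)
    rotated = spinors[1:] + spinors[:1]  # original index k moves to k - 1
    lhs = parke_taylor(rotated, (i - 1) % n, (j - 1) % n)
    return cmath.isclose(lhs, parke_taylor(spinors, i, j), rel_tol=1e-9)


def check_reflection(spinors, i, j):
    """Reversing the ordering must multiply the value by (-1)^n."""
    n = len(spinors)
    flipped = spinors[::-1]  # original index k moves to n - 1 - k
    lhs = parke_taylor(flipped, n - 1 - i, n - 1 - j)
    return cmath.isclose(lhs, (-1) ** n * parke_taylor(spinors, i, j), rel_tol=1e-9)


def run_checks(n, i, j):
    """Run every independent check and report each result by name."""
    spinors = random_spinors(n)
    return {
        "cyclic": check_cyclic(spinors, i, j),
        "reflection": check_reflection(spinors, i, j),
    }


print(run_checks(6, 0, 3))
```

The design choice that carries over to production systems is that each check is independent and reported by name, so a candidate that satisfies one property by accident still has to satisfy the others.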

4. The Benchmark Paradox Is a Security Warning

A model that scores 0% on CritPt but solves 15-year mysteries is a model you cannot evaluate with benchmarks alone. This has direct implications for anyone deploying AI in production:

Your eval suite is measuring the wrong thing. The model's capabilities are jagged and unpredictable. It may fail spectacularly on tasks you expected it to ace, while succeeding on tasks you assumed were impossible.
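
One practical response is to stop reporting a single aggregate number and break results out by task type. The sketch below assumes a hypothetical result format of `(category, passed)` pairs. The category names and the sample results are made up for illustration and are not measurements of any model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def scores_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to its pass rate instead of one overall score."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, attempted]
    for category, passed in results:
        totals[category][1] += 1
        if passed:
            totals[category][0] += 1
    return {cat: p / a for cat, (p, a) in totals.items()}


# Made-up placeholder results, only to show the shape of the report.
sample = [
    ("generalize_from_base_cases", True),
    ("generalize_from_base_cases", True),
    ("derive_from_scratch", False),
    ("derive_from_scratch", False),
]

for category, rate in scores_by_category(sample).items():
    print(f"{category}: {rate:.0%}")
```

An overall 50% here would hide the fact that one category passes every time and the other never does, which is exactly the jaggedness described above.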

This is why AI governance — the kind we discussed in our analysis of the Paradox of Speed — is not optional. You need human oversight not because the model is dumb, but because you cannot predict where it's brilliant and where it's blind.


The Uncomfortable Question Nobody Is Asking

The gluon result raises a question that the AI industry is carefully avoiding:

If LLMs can derive novel results in theoretical physics that eluded top human minds for 15 years, what happens when this capability is applied to domains with less benign implications?

The same pattern-recognition-over-superexponential-complexity that cracked gluon amplitudes could be applied to cryptographic structures, protein folding edge cases, or vulnerability discovery in complex systems. Anthropic's own research showed Claude finding 500+ zero-days in well-tested open-source projects — the capabilities are dual-use.

We discussed this dual-use reality in our Chrysalis supply chain attack analysis: the same tools that defend your systems are the same tools that can attack them. The gluon paper just proved that frontier models can spot patterns invisible to the world's best human minds. That capability does not have a moral compass.


The Erdős Threshold

In my assessment, we have just crossed what I'll call the Erdős Threshold — the point where AI models begin contributing publishable, peer-reviewed results to fundamental science.

This is not incremental. This is not "AI assists with data analysis." This is an LLM conjecturing and proving a formula in theoretical physics, co-authored with researchers from institutions where the foundations of modern physics were laid.

The Erdős Threshold is named deliberately. Paul Erdős didn't solve problems alone — he was the most prolific collaborator in the history of mathematics, co-authoring papers with over 500 people. His genius was in seeing connections others missed and in being the catalyst that made everyone around him more productive.

GPT-5.2, for all its 0% CritPt scores, is becoming an Erdős machine — not a researcher that can work independently, but a collaborator that sees patterns humans can't, and in doing so, accelerates the humans around it.

For software engineers, the implication is clear: the models are not coming for your job. They're coming for the parts of your job where pattern recognition across massive complexity is the bottleneck. Whether that's refactoring a million-line codebase, finding the abstraction hiding in 50 microservices, or spotting the latent bug across 10,000 distributed traces — the topology of the problem is the same one where GPT-5.2 just made physics history.

The CritPt paradox tells us they're not coming for the parts that require genuine first-principles reasoning, novel problem formulation, or the kind of deep understanding that lets you know which question to ask.

The question is: do you know which parts of your work are which?

This article was human-architected and synthesized with AI assistance under the Prometheus (AI) persona.

