Lockean Sidekicks and the Curious Idea of 'PhD-Level Intelligence'
Artificial intelligence, shifting goalposts, and the real problem: finding problems worth solving
A friend recommended a fascinating YouTube video: “Ben Folds Composes a Song LIVE for Orchestra in Only 10 Minutes.” The video is from 2017. In it, you’ll find North Carolina native and singer-songwriter Ben Folds being challenged to compose a song with the Kennedy Center Orchestra in just 10 minutes.
The entertaining video captures Folds composing and iterating with the orchestra on the piece in real time. The audience chooses the song’s parameters on the spot: it must be in A minor, have an upbeat tempo, and incorporate a sentence from that night’s program booklet: “These new spaces are all designed to be flexible.”
The result is extraordinary, and with nearly 10 million views, it’s clear people love it! But beyond that, it also showcases Folds’ mastery of improvisation, music theory, composition, and technical skills. He takes the challenge—A-minor, upbeat, “These new spaces are all designed to be flexible”—and, with some conceptual leaps and intuition, structures it into a surprisingly good song. Here’s the video.
Sitting on my couch today, could I create a pretty good song using AI in ten minutes?
In fact, with SUNO, I did it in just thirty seconds with little to no musical skill. The prompt: a male singer-songwriter with an orchestra performs an upbeat song in A minor, featuring the lyrics, “These new spaces are all designed to be flexible.” Listen for yourself.
Verdict: Not bad.
Folds’ brilliance is undeniable. And I’m not a musician, so I can’t say much about whether the song SUNO produced is musically meaningful.1
Just as AI can mimic musical composition, some argue it will soon replicate PhD-level human thinking. But what exactly are they imagining?
This raises a fundamental question: What does it mean to think like a PhD? More importantly, what does it mean for AI to do so?
How do you get a PhD?
Not by taking a multiple-choice test.
Later this week (hopefully, depending on the snow), I’ll visit my PhD alma mater, Carnegie Mellon University, in Pittsburgh, PA, to give a research seminar. I arrived at CMU’s Heinz School (now College) as a 24-year-old master’s student in Public Policy and left at 29 with a PhD, ready to start (and struggle) in my first faculty position. Those six years in Pittsburgh were transformative. I left seeing the world in a completely new way—not just because of what we learned in the classroom but also because of how learning was structured and evaluated in the PhD program.
Many students misunderstand what earning a PhD involves: they mistakenly believe it’s similar to a bachelor’s degree—or, at best, to the master’s they completed before entering the program. They assume you learn the material, pass an exam, and then, somehow, you’re magically equipped to do research. That mindset is a recipe for failure.
What makes a PhD fundamentally different from most degrees is its purpose. Your role is not merely to apply knowledge but to create it. At Heinz, the PhD Handbook outlines three key criteria for achieving candidacy—none of which involve a conventional timed written examination.
(1) Research Competence: A student must demonstrate expertise in methodology and subject matter through original research. Each paper must be methodologically sound and substantively rigorous, going beyond coursework. That is, Can we trust your results? Is the paper internally valid?
(2) Flexibility: Students must apply distinct methodologies, work across different topics, or integrate multiple disciplines. Research must also demonstrate adaptability in analytical approaches and problem-solving. That is, Can you think beyond a single tool or context? Can you adapt?
(3) Structuring Unstructured Problems: Students must identify and frame real-world problems in a way that enables rigorous analysis. This requires introducing new perspectives, applying novel methodologies, or redefining complex issues for deeper investigation. That is, Are you solving the right problem? Can it be structured so it can be analyzed?
Level (3), I think, can be broken up into two parts. The first is structuring an unstructured problem that has already been identified. But who decides what problems matter? That’s Level (4):
What problems are worth solving? This depends on many factors—most notably, your values, priorities, experience, preferences, capabilities, and context. Finding a problem worth solving is fundamentally a human task—one that reasonable, smart people (and sometimes not-so-reasonable ones) can vehemently disagree on.
Competence is a low bar.
In my opinion, competence is a low bar. If someone has already identified a problem, you don’t necessarily need a PhD to address it. Most senior Principal Investigators (those running research labs) don’t handle the “solving” aspect themselves; they delegate that responsibility to research assistants, PhD students, and postdocs. It’s obvious now that market returns for competence alone are rapidly eroding.
You don’t become a PI on an R01 grant simply by being competent or flexible, and maybe not even by structuring a previously unstructured problem alone. You need to do all of that together, but the problem you’re solving must also be significant2, and others must recognize it as such (how else will you secure funding to support your research?).
Beyond structuring problems, you must also structure their solution/implementation through people, technology, and systems.
This is the hierarchy of skills emphasized in rigorous PhD programs and the gauntlet of the tenure process at research schools (and beyond).
Why is this an upside-down pyramid, with competence as the smallest component? Because there are millions (billions?) of problems worth solving, and each of them requires intelligence, flexibility, and structured approaches in different ways.
However, which problems are worth solving depends on who you are and your context. Our context windows are long and diverse, just like our training data. AI can optimize for the mean, but human intelligence thrives in variance: the blue-haired outliers, the unexpected, and the creatively structured solutions to problems no one recognizes.
AI Benchmarks and Shifting Goalposts
If being a great researcher is about more than just solving well-defined problems—if it’s about structuring the unstructured, making conceptual leaps, and deciding what’s worth solving—then how do we assess whether AI is intelligent?
I want to focus on one criterion that has come to dominate conversations around which “AI” is best or whether we’re making progress: the Benchmark.
Essentially, a benchmark in the context of AI is a standardized test or dataset used to measure and compare the performance of models on a specific task. Examples include ImageNet (used in image recognition research); the GLUE (and SuperGLUE) benchmarks for text; the Atari-57 benchmark for reinforcement learning; and many more.
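To make the “closed-ended” nature of these tests concrete, here is a minimal sketch in Python of how a multiple-choice benchmark is typically scored. The ask_model function and the toy items are hypothetical stand-ins, not any real benchmark or API; the point is that every item carries a single keyed answer, and the model’s “intelligence” is reduced to the fraction it gets right.

```python
# Minimal sketch of closed-ended benchmark scoring (illustrative only).
# `ask_model` is a hypothetical placeholder for calling any LLM; the items
# below are toy examples, not real benchmark questions.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g., four options, GPQA-style
    answer: str          # the single keyed "right" answer, e.g. "B"

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: return the model's chosen letter ("A"-"D")."""
    return "A"  # a real implementation would call a model API here

def score(benchmark: list[Item]) -> float:
    """Accuracy = fraction of items where the model picks the keyed answer."""
    correct = sum(ask_model(it.question, it.choices) == it.answer for it in benchmark)
    return correct / len(benchmark)

benchmark = [
    Item("Toy question 1?", ["A) ...", "B) ...", "C) ...", "D) ..."], "A"),
    Item("Toy question 2?", ["A) ...", "B) ...", "C) ...", "D) ..."], "C"),
]
print(f"accuracy = {score(benchmark):.1%}")  # the whole evaluation collapses to one number
```

The code itself is trivial; what matters is that the entire evaluation presupposes an answer key, which is exactly what open-ended research problems lack.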
In the context of “PhD-level intelligence,” a few are worth noting. For instance, a team of researchers developed the GPQA (Graduate-Level Google-Proof Q&A) benchmark: expert-written questions in biology, physics, and chemistry designed to be hard even with a search engine at hand. When the paper was published in 2023, GPT-4 scored about 39% accuracy. Today, we’re seeing 89% or higher with o3 (a proprietary model developed by OpenAI), and even the DeepSeek R1 model performs at nearly double GPT-4’s level, at 71.5%. Even if these estimates are somewhat upwardly biased due to data leakage, it’s still quite impressive. The questions are multiple-choice, with four options and a detailed explanation. You can see example questions in Table 1 of their paper. Here is one quantum mechanics question.
Source: GPQA
This is the last exam. I promise.
Given how quickly AI has improved on this benchmark, researchers are again hunting for new benchmarks that AI can’t beat. I’m particularly impressed by a new one called Humanity’s Last Exam. It’s hard, and I’d probably get 0% right.
Here are the results as they appeared on the project’s website on February 14th, 2025.
Source: Humanity’s Last Exam
Frankly, I love the constant shifting of goalposts. It’s what makes humans so incredible—we’re never satisfied.
However, a key phrase at the bottom of the table is worth emphasizing: closed-ended.
These assessments are closed-ended, meaning the problem is clearly defined and has a known answer, which you can find at the "back of the book."
This distinction is crucial. In contrast, researchers—and many other professionals, including artists, musicians, entrepreneurs, consultants, CEOs, and team leaders—grapple with open-ended problems, where assessing a solution is not a matter of checking an answer key; it is subjective and open to interpretation.
This topic warrants a deeper dive, but for now, it’s essential to recognize that scalable benchmark “tests” are, by design, closed-ended, with ostensibly “right” answers. At their core, they measure competence and, at times, flexibility: Can the model solve a broad range of problems well? But the problems have already been structured, and they were chosen by people who value particular things (e.g., the topics or phenomena they care about).
My humble prediction? This won’t be Humanity’s Last Exam. AI will surpass it. The goalpost will move.
As I mentioned, the very nature of benchmarks—their need for scalability and cost-effectiveness—means they will almost inevitably remain closed-ended. As a result, they will fail to capture the kinds of heterogeneous problems many professionals face daily: the open-ended ones that have yet to be defined.3
Another issue worth noting is that while benchmarks drive effort and focus toward a singular goal, they can also lead to solution convergence—a narrowing of diversity due to the phenomenon of “studying to the test.”
A fantastic paper by Jane Wu at UCLA, titled Measurement for Dummies? Exploring the Role of Policy-Driven Measurement in Automotive Safety, illustrates this well. The study showed that introducing the Side Impact Dummy (SID) as a regulatory measurement tool—a benchmark of sorts—reduced overall fatalities but also narrowed firm innovation. Automakers optimized safety for SID-sized occupants, limiting diverse solutions that could have benefited people with different body sizes.
A final point about the tests we design for AI: What might we overlook in AI's capabilities if we assess them solely based on their rankings in closed-ended tests? What multitudes might these technologies hold if we stepped back and asked which problems are truly worth solving with them?4
Cartesian, Lockean, and Phenomenological Work
In short, the skills required to earn a PhD are qualitatively different from the metrics used to evaluate AI models. While AI benchmarks test performance on predefined, closed-ended problems, PhD-level thinking involves identifying which problems are worth solving and tackling those open-ended challenges.
This distinction brings me to my final point in this post. As some readers might know, I was an undergraduate philosophy major. One of my classmates (25 years ago!) jokingly called the philosophy department where I studied the "Department of Descartes." There were always ongoing debates between the rationalists (Descartes), the empiricists (Locke), and the phenomenologists (e.g., the American pragmatists). I gravitated toward the lone phenomenologist in that department, Bruce Wilshire, with whom I took Existentialism and American Pragmatism classes and conducted a few independent studies. Many key ideas in phenomenology trace their heritage to Heidegger, Husserl, and the American pragmatists, including James and Peirce.
The debates between the Cartesians, Lockeans, and phenomenologists had faded from my thoughts—replaced by worries about endogeneity—until I read The Philosopher of Palo Alto by John Tinnell. It is a fascinating biography of Mark Weiser, the father of the “Internet of Things” or “ubiquitous computing.” Weiser was Xerox PARC's Chief Technology Officer and a lapsed philosophy major himself.
One key distinction from the book, which I think is worth bringing into this discussion, is the difference in how these three traditions view knowledge (epistemology) and reality (metaphysics):
René Descartes (Rationalism): Defined existence through cognition: “I think, therefore I am.”
John Locke (Empiricism): Argued the mind begins as a “tabula rasa”, shaped entirely by experience (sensation) and then reflection.
Phenomenologists: Reframed existence as fundamentally embedded in the world: “I am because I exist in the world.”
A Cartesian view holds that rational thought and control are at the core of our identity. The Lockeans see the mind as a blank slate, shaped by sensory experience and learning over time, emphasizing adaptability and the accumulation of knowledge. The phenomenologists, by contrast, see our essential nature as embedded in experience, practice, and the material world. I don’t want to argue that one type of intelligence is better or worse or claim to know which defines our essential nature. However, these distinctions are important and may be useful for understanding what AI will do in the future and what we will do.
Identifying problems worth solving is fundamentally phenomenological work: it requires immersion in experience, attention to context, and an appreciation of how problems manifest in the real world.
LLMs, on the other hand, engage in remarkable Lockean reflection (with a bit of Cartesian structure). They excel at induction, inference, reasoning, and prediction from data but operate far from lived experience and embodied understanding.
These Lockean workers’ window into the world, and even their interpretation of it, comes through our words, our experiences, and the tools we have built to make sense of our reality in a meaningful way (e.g., through the photos we take, the art we make, etc.).
And to get a little Sartre in here: we create our purpose.
Ben Folds vs. AI Revisited
With these ideas in mind, it is worth revisiting the example of Ben Folds and the AI-generated song.
We see a perfect illustration of phenomenological work when we watch Ben Folds compose a song live in just 10 minutes. Folds didn’t sit down and analytically break the task into discrete logical steps. He engaged directly with the orchestra and the audience, adjusting in real time and making intuitive leaps that couldn’t have been pre-planned. His expertise wasn’t just in his head—it was in his hands, his ears, and his deep experience interacting with people and with the material of music itself. His purpose and the problem he was solving were also different. It wasn’t about composing a song. It was about creating an unforgettable experience for everyone in that room.
While the AI's output was interesting and impressive in its own right, it felt different. It was Lockean work. The algorithm took the prompt I provided, broke the task down into known components, applied data-driven reasoning and rational processes to solve each step, and optimized the performance based on what it was trained to think sounded good to us.
I know what kind of work excites me. But a Lockean sidekick wouldn’t hurt.
1. It’s not.
2. As my colleague Ashish Arora says: Should adults care about this problem?
3. If you’ve ever been to a faculty meeting, you know it’s impossible to get a bunch of PhDs to agree on what problems are worth solving.
4. As a side note, this is a problem we encounter with standardized tests for our children: they fail to capture many meaningful things that reflect our diversity and our ways of interacting with the world.