Diverging Benchmarks: AI Models Are Improving While Humans Are Losing Signal
What Does It Mean for Entry-Level Labor Markets That the Goalposts for Computers Are Moving in the Opposite Direction from Those for Humans?

This is the second post in my series on AI. In this post, I explore an overlooked dimension in the competition between humans and machines: the clarity of labor market signals.
TL;DR.
While AI advances through increasingly rigorous benchmarks that improve and communicate model capabilities, human potential becomes harder to evaluate due to grade inflation. This may lead to a dangerous divergence in signal clarity that could distort labor markets and undervalue human capabilities, particularly in entry-level labor markets.
Humans, AI, and the Changing Nature of Skill and Ability
There was an interesting op-ed in the New York Times from a LinkedIn executive sounding the alarm about the potential collapse of entry-level labor markets due to AI.
I've been thinking a lot recently about the competition between AI and humans, especially how we humans can come out ahead.
There's a puzzle I'm trying to solve: AI models keep improving as we make their tests more challenging. At the same time, humans are becoming increasingly difficult to evaluate. Students, in particular, face tests that are becoming easier and less informative. Think grade inflation.
This divergence raises an important question. What economic consequences will this shift create for the labor market?
One of the most crucial elements of this system is the educational institutions that train, sort, and help workers signal their abilities to firms. Education provides one of the few opportunities for effective worker signaling across the entire market. Once you work for a firm, much of your signaling ability fades because a significant portion of your value comes from firm-specific knowledge, and much of your information remains trapped within the firm.
What is a Benchmark? Signal and Learning in Education and AI
Before I get into the details of my hypotheses regarding the impact of the divergence in benchmarks for humans and AI, it's helpful to consider what a benchmark is. I briefly discussed this in my previous post, but it's worth reiterating.
A benchmark is a test.
This test has correct and incorrect answers, and it serves two main purposes.
The primary purpose is to convey a signal regarding the test taker’s abilities. This assumes that the test is well-designed and can differentiate between strong and weak test takers. It indicates something about their underlying abilities, which are more challenging to observe.
The second purpose is its treatment effect. A more challenging test means people are more likely to answer questions incorrectly. If the test and the testing methods are well designed, individuals then identify what they answered incorrectly, adjust their learning strategies, and learn how to correctly answer those questions. The next time they take the test, they perform better. This has a different significance for the test taker, as the test does not merely reveal their quality; it also contributes to improving it.
Easier tests mean that everyone performs well. You cannot distinguish the very best from the average or poor performers. Similarly, an easy test offers no real chance for failure; therefore, no feedback loop exists to alter the learning process. Without this feedback, the test taker does not improve.
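To make the signal point concrete, here is a minimal simulation under a simple logistic model of question answering that I'm assuming purely for illustration; the ability and difficulty numbers are invented, not drawn from any real exam.

```python
import math
import random

random.seed(0)

def score(ability: float, difficulty: float, n_questions: int = 50) -> int:
    """Each question is answered correctly with probability that rises
    with ability and falls with difficulty (a logistic link)."""
    p = 1 / (1 + math.exp(difficulty - ability))
    return sum(random.random() < p for _ in range(n_questions))

# Two groups of test takers with different (invented) abilities.
strong, weak = 2.0, 0.5

for difficulty, label in [(-2.0, "easy test"), (1.5, "hard test")]:
    s = [score(strong, difficulty) for _ in range(200)]
    w = [score(weak, difficulty) for _ in range(200)]
    print(f"{label}: strong avg {sum(s) / len(s):.1f}/50, "
          f"weak avg {sum(w) / len(w):.1f}/50")
```

On the easy test, both groups bunch near a perfect score, so the gap between them carries little signal. On the hard test, the gap is wide, and each wrong answer also tells the test taker what to study next, which is the treatment effect described above.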
Why AI Benchmarks Are Getting Harder and What That Means
With these definitions in mind, it is worth noting that one reason AI models are improving is that their benchmarks are becoming more challenging. These benchmarks are growing tougher not only because they include more difficult tasks, such as classifying harder images, but also because they are becoming far more diverse in what they measure. With the edge cases and complexities present in these new benchmarks, we accomplish two things simultaneously.

First, we are gaining a better understanding of what these models can do, which would not be possible without people creating improved and more comprehensive benchmarks.
Second, we are pushing the models to improve on tasks they could not previously handle. There are, of course, trade-offs, as getting better at one task may sometimes lead to worse performance in another. (And AI benchmarks aren’t perfect either. There are many problems, which I will explore in a future post.)
However, more challenging benchmarks benefit both sides of the market.
Competitive Implications of Rigorous Benchmarks
Firms such as OpenAI, Anthropic, Meta, and many others in the foundation model business have incentives to improve their models and showcase that enhancement to the market by excelling on benchmarks.
This benchmark-driven competition produces models for which we can obtain comparable performance evaluations on specific tasks. From these evaluations, we draw conclusions about each model’s capability to handle the work we do in our organizations.

This relies on another, somewhat hidden, assumption: that the benchmarks provide information relevant to firms.
That is, can the scores that models achieve on benchmarks be used to predict how well the models will perform when “working” within a firm?
Firms are figuring this out. We’ll hopefully see the results in the productivity statistics.
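As a sketch of what that figuring-out might look like, a firm could check whether public benchmark scores line up with its own internal evaluations. The model names and all the numbers below are hypothetical placeholders, not real results.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical: a public benchmark score and an in-house task pass rate
# for three candidate models.
benchmark = {"model_a": 72.1, "model_b": 80.4, "model_c": 88.9}
in_house = {"model_a": 0.61, "model_b": 0.64, "model_c": 0.79}

models = sorted(benchmark)
r = correlation([benchmark[m] for m in models], [in_house[m] for m in models])
print(f"benchmark vs. in-house performance: r = {r:.2f}")
```

A high correlation would justify treating benchmark scores as a hiring signal for models; a low one would mean the benchmark, however rigorous, is not the signal that particular firm needs.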
The Grade Inflation Crisis and Why Human Benchmarks Are Getting Easier
Now, let’s turn to human benchmarks.
There is ample evidence that grades are not what they once were.
Evidence suggests significant grade inflation exists everywhere, including high schools and colleges, especially elite ones. For example, the typical high school GPA in the late 1980s was around 3 (slightly higher for females, lower for males). By 2002, that average was closer to 3.2-3.4. Average GPAs are likely even higher now.

It is unlikely that high school students have become that much smarter. The more plausible explanation is that the benchmarks have become easier. We see the same pattern in assessments outside regular grading: AP scores, for example, have risen dramatically.
This can be interpreted in various ways. One possibility is that students are indeed improving; if so, exams with low ceilings are masking real gains among the very top students, information that would be valuable to both the student and the labor market.
Another possibility is that the benchmarks themselves are becoming easier: students are not improving, only their grades are.
In that case, the information in these scores is less valuable.
Evidence from one consistent, externally administered benchmark, the SAT, suggests that grades and test scores have decoupled. Average SAT scores remained roughly flat over this same period (from the early 1980s to 2006-07), in stark contrast to the GPA inflation above.
In any case, there are two possible outcomes. First, the signal becomes weaker, and markets cannot distinguish between the best and the rest.
Everyone seems to be the best.
Second, ability either stagnates or declines as incentives to learn decrease.
Economic Incentives Behind Grade Inflation
Many factors may have contributed to the rise of grade inflation. Some of these relate to the economics of educational institutions.
First, universities have expanded their range of programs.
In particular, there has been a massive increase in master’s programs, which help fill a revenue gap for these institutions. This expansion has produced a sharp increase both in the number of students earning these degrees and in what those degrees cost. Consequently, there has been a notable shift in how universities perceive their enrollees. The role of the student is evolving toward that of a customer.
The second set of factors is that this is not just a university phenomenon; it is a “systems” issue. As the value of a college degree, especially for knowledge work in the economy, has increased (though with some caveats), grades have become more important as a signal in the labor market.
Institutions then have an incentive to make each of their candidates appear as favorable as possible in the labor market, especially as more people attain college degrees.
There is also the issue of rankings that affect incentives and sometimes bad behavior. But rankings deserve their own post.
The final explanation presents a more micro-level incentive story.
The increasing economic value of education has created a greater need to assess professors. Student evaluations of teaching have risen markedly, particularly at universities. These evaluations feed into tenure and promotion decisions and are often required when applying for jobs at other universities. So faculty have many incentives to get good evaluations.
Having been through these evaluations myself, I know they can be influenced. The more I demand from my students, the more judged they feel, and the harsher the reviews I receive. Research supports this dynamic (see more).
All of these forces create a perfect storm that makes benchmarks for humans noisier.
When Signals Fail in Markets for Skill
If we take both Kenneth Arrow (“Higher Education as a Filter”) and Michael Spence (“Job Market Signaling”) seriously and accept that higher education acts as a filter providing signals to the labor market, then many universities and high schools may muddy that signal.
If we view the labor market as a system, the goal of higher education is twofold. First, it trains individuals and develops their skills and capabilities. Second, it provides clear signals that assist the market in matching workers to firms.
Because firms do not observe worker skills directly, they rely on signals to make decisions. A key question is whether signals like GPA remain useful.
Why Signal Clarity Matters As Much (Maybe More) Than Raw Capability
As I have written earlier, there is much more noise in the system because the cost of applying is now lower. AI is also making it easier to fake other signals, such as the quality of a cover letter.
These forces, combined with the fact that GPAs are no longer strong signals, mean firms must find other ways to separate the workers they want from those they do not.
This uncertainty contrasts with the much greater clarity regarding AI's capabilities. While AI may not possess all the necessary capabilities (just yet, or ever), and we do not fully understand how benchmark scores translate into actual performance within organizations, we can distinguish the best models from the worst on a given benchmark.
In contrast, humans are entering a classic market for lemons.
A very interesting paper by Amanda Pallais documents this dynamic: employers prefer workers who may be of lower quality but offer clearer signals over higher-quality workers who lack them.
When you know with greater certainty that a person is of quality q, even if it is not ideal, you can design around it. Uncertainty is harder to design around.
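One way to make this concrete is a textbook mean-variance utility, a standard shortcut I'm assuming for illustration rather than the model in Pallais's paper: the employer values expected quality but pays a penalty for uncertainty.

```python
def hire_value(mean_quality: float, variance: float,
               risk_aversion: float = 1.0) -> float:
    """Mean-variance utility: expected quality minus an uncertainty penalty."""
    return mean_quality - risk_aversion * variance

# Invented numbers: a mediocre candidate with a clear signal versus a
# candidate who is better on average but whose signal is noisy.
clear = hire_value(mean_quality=0.6, variance=0.05)   # 0.55
noisy = hire_value(mean_quality=0.7, variance=0.30)   # 0.40

print(clear > noisy)  # True: the known quantity wins despite lower expected quality
```

The clearer signal wins even though its expected quality is lower, which is exactly the pattern in the hiring experiment above.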
Swapability and Risk in Organizational Workflows
Currently, we have not yet determined where algorithms fit into organizational workflows. This process is just beginning, and we will learn much more in the coming years. However, AI models are quite easy to swap (at least technically). You could replace Claude with ChatGPT in just a few minutes for many workflows.
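Here is a minimal sketch of that swap, assuming the openai and anthropic Python SDKs; treat the exact model names and call signatures as assumptions to verify against current documentation. If every call goes through one thin adapter, changing vendors is a one-word configuration change.

```python
from anthropic import Anthropic
from openai import OpenAI

def complete(provider: str, prompt: str) -> str:
    """Route a prompt to whichever vendor is currently configured."""
    if provider == "openai":
        resp = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = Anthropic().messages.create(
            model="claude-3-5-sonnet-20241022",  # illustrative model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"unknown provider: {provider}")

# Swapping the "worker" is a one-word edit, not a termination process:
print(complete("anthropic", "Summarize this memo in three bullet points: ..."))
```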
Getting rid of a “bad” human worker, however, is costly. There is legal risk, endless performance improvement plans, and of course, the human cost. Some industry estimates suggest that replacing a “bad” worker can cost 2x their salary or far more (recruiter.com puts it at $840k for a person who makes $62k, more than 13x their salary; who knows whether that exact figure is true, but it is certainly high!).
This presents an intriguing dilemma worthy of consideration in the ongoing debate about who will prevail in the future: AI versus humans. Capabilities are one aspect influencing hiring decisions, but uncertainty is another.
If AI models provide clear signals while humans do not, this poses a fundamental risk to human labor markets.
In other words, AI models maintain a competitive advantage with reduced uncertainty even if their abilities are weaker.
Humans, conversely, emit weak signals and incur high correction costs. They carry risk.
Talented workers are particularly likely to be undervalued in this environment. If they sit in classrooms working hard and paying attention, yet receive the same grades as someone who never shows up, simply because both pay the same tuition, the degree becomes less useful to them. They will seek another way to demonstrate their skill.
This outcome is detrimental both to workers and to the institutions in the ecosystem responsible for providing clarity about which humans are worth hiring.
Strategic Responses for Workers, Employers, and Universities
What will begin to happen? How will the various agents in the system respond strategically?
If I were a capable worker in this era, I would put in a lot of effort to develop my “personal brand.” I would start building my own portfolio to showcase the work I have done and create my own profiles. I would not rely on my degree to signal my value because I know firms cannot distinguish me from someone else who has the same degree but less ability.
If I were an employer, I would conduct more internal assessments and training and rely less on universities to filter my labor pool.
A Crisis of Credibility?
Are universities undermining their position in the labor market as honest brokers by sending misleading signals?
There is a tension that makes this problem difficult to solve.
If a single university (or professor, for that matter) makes a unilateral choice to grade more stringently, it might penalize itself.
Evidence shows that many evaluators, from undergraduate through graduate admissions, do not adjust for the fact that some universities inflate grades more than others. Students from universities that do not inflate grades are at a disadvantage compared to those with impressive grades from institutions that award high grades to everyone. Princeton, for instance, decided to go back to inflating grades after trying to “deflate.”
I don’t believe we will resolve this dilemma easily, certainly not in a single Substack post. However, if benchmarks continue to diverge, with AI tests becoming more difficult while human educational systems persist in goldbricking (students pretend to learn, we pretend to teach, and everyone gets a trophy), we will encounter a significant problem.
So, as we worry about whether AI will replace humans, we might also want to consider signal clarity alongside capabilities.
This is a collective action problem that no single person can resolve.
Aligning Human and Machine Benchmarks for a Human-First Economy
So what is the punchline?
Markets punish opacity and reward clarity.
Much of our current discussion focuses on whether AI will replace humans due to its superior capabilities. However, in designing a human-first labor market, we should consider better ways to signal human capabilities.
These capabilities are obscured by vague, reassuring A’s or B’s that should have been C’s or worse. If we desire a human-first economy, we must establish human benchmarks that enable us all to communicate value clearly in the labor market.
This will help us compete with the machines.