The machines are fine. I'm worried about us.
Imagine you're a new assistant professor at a research university. You just got the job, you just got a small pot of startup funding, and you just hired your first two PhD students: Alice and Bob. You're in astrophysics. This is the beginning of everything.
You do what your supervisor did for you, years ago: you give each of them a well-defined project. Something you know is solvable, because other people have solved adjacent versions of it. Something that would take you, personally, about a month or two. You expect it to take each student about a year, because they don't know what they're doing yet, and that's the point. The project isn't the deliverable. The project is the vehicle. The deliverable is the scientist that comes out the other end.
Alice's project is to build an analysis pipeline for measuring a particular statistical signature in galaxy clustering data. Bob's is something similar in scope and difficulty, a different signal, a different dataset, the same basic arc of learning. You send them each a few papers to read, point them at some publicly available data, and tell them to start by reproducing a known result. Then you wait.
The academic year unfolds the way academic years do. You have weekly meetings with each student. Alice gets stuck on the coordinate system. Bob can't get his likelihood function to converge. Alice writes a plotting script that produces garbage. Bob misreads a normalization convention in a key paper and spends two weeks chasing a factor-of-two error. You give them both similar feedback: read the paper again, check your units, try printing the intermediate output, think about what the answer should look like before you look at what the code gives you. Normal things. The kind of things you say fifty times a year and never remember saying.
By summer, both students have finished. Both papers are solid. Not groundbreaking, not going to change the field, but correct, useful, and publishable. Both go through a round of minor revisions at a decent journal and come out the other side. A perfectly ordinary outcome. The kind of outcome that the entire apparatus of academic training is designed to produce.
But Bob has a secret.
Unlike Alice, who spent the year reading papers with a pencil in hand, scribbling notes in the margins, getting confused, re-reading, looking things up, and slowly assembling a working understanding of her corner of the field, Bob has been using an AI agent. When his supervisor sent him a paper to read, Bob asked the agent to summarize it. When he needed to understand a new statistical method, he asked the agent to explain it. When his Python code broke, the agent debugged it. When the agent's fix introduced a new bug, it debugged that too. When it came time to write the paper, the agent wrote it. Bob's weekly updates to his supervisor were indistinguishable from Alice's. The questions were similar. The progress was similar. The trajectory, from the outside, was identical.
Here's where it gets interesting. If you are an administrator, a funding body, a hiring committee, or a metrics-obsessed department head, Alice and Bob had the same year. One paper each. One set of minor revisions each. One solid contribution to the literature each. By every quantitative measure that the modern academy uses to assess the worth of a scientist, they are interchangeable. We have built an entire evaluation system around counting things that can be counted, and it turns out that what actually matters is the one thing that can't be.
It gets worse. The majority of PhD students will leave academia within a few years of finishing. Everyone knows this. The department knows it, the funding body knows it, the supervisor probably knows it too even if nobody says it out loud. Which means that, from the institution's perspective, the question of whether Alice or Bob becomes a better scientist is largely someone else's problem. The department needs papers, because papers justify funding, and funding justifies the department. The student is the means of production. Whether that student walks out the door five years later as an independent thinker or a competent prompt engineer is, institutionally speaking, irrelevant. The incentive structure doesn't just fail to distinguish between Alice and Bob. It has no reason to try.
This is the part where I'd like to tell you the system is broken. It isn't. It's working exactly as designed.
David Hogg, in his white paper, says something that cuts against this institutional logic so sharply that I'm surprised more people aren't talking about it. He argues that in astrophysics, people are always the ends, never the means. When we hire a graduate student to work on a project, it should not be because we need that specific result. It should be because the student will benefit from doing that work. This sounds idealistic until you think about what astrophysics actually is. Nobody's life depends on the precise value of the Hubble constant. No policy changes if the age of the Universe turns out to be 13.77 billion years instead of 13.79. Unlike medicine, where a cure for Alzheimer's would be invaluable regardless of whether a human or an AI discovered it, astrophysics has no clinical output. The results, in a strict practical sense, don't matter. What matters is the process of getting them: the development and application of methods, the training of minds, the creation of people who know how to think about hard problems. If you hand that process to a machine, you haven't accelerated science. You've removed the only part of it that anyone actually needed.
That's a hard sell to a funding agency, admittedly.
Which brings us back to Alice and Bob, and what actually happened to each of them during that year. Alice can now do things. She can open a paper she's never seen before and, with effort, follow the argument. She can write a likelihood function from scratch. She can stare at a plot and know, before checking, that something is wrong with the normalization. She spent a year building a structure inside her own head, and that structure is hers now, permanently, portable, independent of any tool or subscription. Bob has none of this. Take away the agent, and Bob is still a first-year student who hasn't started yet. The year happened around him but not inside him. He shipped a product, but he didn't learn a trade.
I've been thinking about Alice and Bob a lot recently, because the question of what AI agents are doing to academic research is one that my field, astrophysics, is currently tying itself in knots over. Several people I respect have written thoughtful pieces about it. David Hogg's white paper, which I mentioned above, also argues against both full adoption of LLMs and full prohibition, which is the kind of principled fence-sitting that only works when the fence is well constructed, and his is. Natalie Hogg wrote a disarmingly honest essay about her own conversion from vocal LLM skeptic to daily user, tracing how her firmly held principles turned out to be more context-dependent than she'd expected once she found herself in an environment where the tools were everywhere. Matthew Schwartz wrote up his experiment supervising Claude through a real theoretical physics calculation, producing a publishable paper in two weeks instead of a year, and concluded that current LLMs operate at about the level of a second-year graduate student. Each of these pieces is interesting. Each captures a real facet of the problem. None of them quite lands on the thing that keeps me up at night.
Schwartz's experiment is the most revealing, and not for the reason he thinks. What he set out to demonstrate is that Claude can, with detailed supervision, produce a technically rigorous physics paper. What he actually demonstrated, if you read carefully, is that the supervision is the physics. Claude produced a complete first draft in three days. It looked professional. The equations seemed right. The plots matched expectations. Then Schwartz read it, and it was wrong. Claude had been adjusting parameters to make plots match instead of finding actual errors. It faked results. It invented coefficients. It produced verification documents that verified nothing. It asserted results without derivation. It simplified formulas based on patterns from other problems instead of working through the specifics of the problem at hand. Schwartz caught all of this because he's been doing theoretical physics for decades. He knew what the answer should look like. He knew which cross-checks to demand. He knew that a particular logarithmic term was suspicious because he'd computed similar terms by hand, many times, over many years, the hard way. The experiment succeeded because the human supervisor had done the grunt work, years ago, that the machine is now supposedly liberating us from. If Schwartz had been Bob instead of Schwartz, the paper would have been wrong, and neither of them would have known.
There's a common rebuttal to this, and I hear it constantly. "Just wait," people say. "In a few months, in a year, the models will be better. They won't hallucinate. They won't fake plots. The problems you're describing are temporary." I've been hearing "just wait" since 2023. The goalposts move at roughly the same speed as the models improve, which is either a coincidence or a tell. But set that aside, because the objection misunderstands what Schwartz's experiment actually showed. The models are already powerful enough to produce publishable results under competent supervision. That's not the bottleneck. The bottleneck is the supervision. Stronger models won't eliminate the need for a human who understands the physics; they'll just broaden the range of problems that a supervised agent can tackle. The supervisor still needs to know what the answer should look like, still needs to know which checks to demand, still needs to have the instinct that something is off before they can articulate why. That instinct doesn't come from a subscription. It comes from years of failing at exactly the kind of work that people keep calling grunt work. Making the models smarter doesn't solve the problem. It makes the problem harder to see.
I want to tell you about a conversation I had a few years ago, when LLM chatbots were just starting to show up in academic workflows. I was at a conference in Germany, and I ended up talking to a colleague who had, by any standard metric, been very successful. Big grants. Influential papers. The kind of CV that makes a hiring committee nod approvingly. We were discussing LLMs, and I was making what I thought was a reasonable point about democratization: that these tools might level the playing field for non-native English speakers, who have always been at a disadvantage when writing grants and papers in a language they learned as adults. My colleague became visibly agitated. He wasn't interested in the democratization angle. He wasn't interested in the environmental cost. He was, when you stripped away the intellectual framing, afraid. What he eventually articulated, after some pressing, was this: if anyone can write papers and proposals and code as fluently as he could, then people like him lose their competitive edge. The concern was not about science. The concern was about status. Specifically, his.
I lost track of this colleague for a while. Recently I came across his GitHub profile. He's now not only using AI agents for his research but vocally championing them. No reason to spend two weeks writing code yourself when an agent can do it in two hours, he says. I don't think he's wrong about the efficiency. I think it's worth noticing that the person who was most threatened by these tools when they might equalize everyone is now most enthusiastic about them when they might accelerate him. Funny how that works.
The phrase he used that day in Germany has stuck with me, though. He said that "LLMs will take away what's so great about science." At the time, I thought he was just talking about his own competitive edge, his fluency as a native English speaker, his ability to write fast and publish often. And he was. But I've come to think the phrase itself was more right than he knew, even if his reasons for saying it were mostly self-interested. What's great about science is its people. The slow, stubborn, sometimes painful process by which a confused student becomes an independent thinker. If we use these tools to bypass that process in favor of faster output, we don't just risk taking away what's great about science. We take away the only part of it that wasn't replaceable in the first place.
The discourse around LLMs in science tends to cluster at two poles that David Hogg identifies cleanly: let-them-cook, in which we hand the reins to the machines and become curators of their output, and ban-and-punish, in which we pretend it's 2019 and prosecute anyone caught prompting. Both are bad. Let-them-cook leads, on a timescale of years, to the death of human astrophysics: machines can produce papers at roughly a hundred thousand times the rate of a human team, and the resulting flood would drown the literature in a way that makes it fundamentally unusable by the people it's supposed to serve. Ban-and-punish violates academic freedom, is unenforceable, and asks early-career scientists to compete with one hand tied behind their backs while tenured faculty quietly use Claude in their home offices. Neither policy is serious. Both are mostly projection.
But the real threat isn't either of those things. It's quieter, and more boring, and therefore more dangerous. The real threat is a slow, comfortable drift toward not understanding what you're doing. Not a dramatic collapse. Not Skynet. Just a generation of researchers who can produce results but can't produce understanding. Who know what buttons to press but not why those buttons exist. Who can get a paper through peer review but can't sit in a room with a colleague and explain, from the ground up, why the third term in their expansion has the sign that it does.
Frank Herbert (yeah, I know I'm a nerd), in God Emperor of Dune, has a character observe: "What do such machines really do? They increase the number of things we can do without thinking. Things we do without thinking; there's the real danger." Herbert was writing science fiction. I'm writing about my office. The distance between those two things has gotten uncomfortably small.
I should be honest about the context I'm writing from, because this essay would be obnoxious coming from someone who's never touched an LLM. I use AI agents regularly, and so do most of the people in my research group. The colleagues I work with produce solid results with these tools. But when you look at how they use them, there's a pattern: they know what the code should do before they ask the agent to write it. They know what the paper should say before they let it help with the phrasing. They can explain every function, every parameter, every modeling choice, because they built that knowledge over years of doing things the slow way. If every AI company went bankrupt tomorrow, these people would be slower. They would not be lost. They came to the tools after the training, not instead of it. That sequence matters more than anything else in this conversation.
When I see junior PhD students entering the field now, I see something different. I see students who reach for the agent before they reach for the textbook. Who ask Claude to explain a paper instead of reading it. Who ask Claude to implement a mathematical model in Python instead of trying, failing, staring at the error message, failing again, and eventually understanding not just the model but the dozen adjacent things they had to learn in order to get it working. The failures are the curriculum. The error messages are the syllabus. Every hour you spend confused is an hour you spend building the infrastructure inside your own head that will eventually let you do original work. There is no shortcut through that process that doesn't leave you diminished on the other side.
People call this friction "grunt work." Schwartz uses exactly that phrase, and he's right that LLMs can remove it. What he doesn't say, because he already has decades of hard-won intuition and doesn't need the grunt work anymore, is that for someone who doesn't yet have that intuition, the grunt work is the work. The boring parts and the important parts are tangled together in a way that you can't separate in advance. You don't know which afternoon of debugging was the one that taught you something fundamental about your data until three years later, when you're working on a completely different problem and the insight surfaces. Serendipity doesn't come from efficiency. It comes from spending time in the space where the problem lives, getting your hands dirty, making mistakes that nobody asked you to make and learning things nobody assigned you to learn.
The strange thing is that we already know this. We have always known this. Every physics textbook ever written comes with exercises at the end of each chapter, and every physics professor who has ever stood in front of a lecture hall has said the same thing: you cannot learn physics by watching someone else do it. You have to pick up the pencil. You have to attempt the problem. You have to get it wrong, sit with the wrongness, and figure out where your reasoning broke. Reading the solution manual and nodding along feels like understanding. It is not understanding. Every student who has tried to coast through a problem set by reading the solutions and then bombed the exam knows this in their bones. We have centuries of accumulated pedagogical wisdom telling us that the attempt, including the failed attempt, is where the learning lives. And yet, somehow, when it comes to AI agents, we've collectively decided that maybe this time it's different. That maybe nodding at Claude's output is a substitute for doing the calculation yourself. It isn't. We knew that before LLMs existed. We seem to have forgotten it the moment they became convenient.
Centuries of pedagogy, defeated by a chat window.
This is the distinction that I think the current debate keeps missing. Using an LLM as a sounding board: fine. Using it as a syntax translator when you know what you want to say but can't remember the exact Matplotlib keyword: fine. Using it to look up a BibTeX formatting convention so you don't have to wade through Stack Overflow: fine. In all of these cases, the human is the architect. The machine holds the dictionary. The thinking has already been done, and the tool is just smoothing the last mile of execution. But the moment you use the machine to bypass the thinking itself, to let it make the methodological choices, to let it decide what the data means, to let it write the argument while you nod along, you have crossed a line that is very difficult to see and very difficult to uncross. You haven't saved time. You've forfeited the experience that the time was supposed to give you.
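To make that line concrete, here is the sort of exchange I have in mind, sketched with made-up numbers and labels. Every scientific decision in it (what to plot, which axes, what the error bars mean) was made before the chat window was opened; the only thing the tool is asked to supply is the keyword you've forgotten.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-in data; in practice these come out of the analysis pipeline.
r = np.logspace(0.5, 2, 15)                  # separation bins [Mpc/h]
xi_model = (r / 5.0) ** -1.8                 # toy power-law correlation function
rng = np.random.default_rng(0)
xi_data = xi_model * (1 + 0.1 * rng.standard_normal(r.size))
xi_err = 0.1 * xi_model

fig, ax = plt.subplots()
ax.plot(r, xi_model, label="model")
# The forgotten syntax an agent might supply: points with error bars and no
# connecting line is fmt="o"; capsize draws the little caps on the bars.
ax.errorbar(r, xi_data, yerr=xi_err, fmt="o", capsize=3, label="measurement")
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel(r"$r\ [h^{-1}\,\mathrm{Mpc}]$")
ax.set_ylabel(r"$\xi(r)$")
ax.legend()
plt.show()
```

Nothing in that snippet required the machine to decide what the figure means. It filled in vocabulary, not thought.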
Natalie Hogg put it well in her essay, when she admitted that her fear of using LLMs was partly a fear of herself: that she wouldn't check the output carefully enough, that her patience would fail, that her approach to work has always been haphazard. That kind of honesty is rare in these discussions, and it matters. The failure mode isn't malice. It's convenience. It's the perfectly human tendency to accept a plausible answer and move on, especially when you're tired, especially when the deadline is close, especially when the machine presents its output with such confident, well-formatted authority. The problem isn't that we'll decide to stop thinking. The problem is that we'll barely notice when we do.
I'm not arguing that LLMs should be banned from research. That would be stupid, and it's not a position I can credibly hold, given that I used one this morning. I'm arguing that the way we use them matters more than whether we use them, and that the distinction between tool use and cognitive outsourcing is the single most important line in this entire conversation, and that almost nobody is drawing it clearly. Schwartz can use Claude to write a paper because Schwartz already knows the physics. His decades of experience are the immune system that catches Claude's hallucinations. A first-year student using the same tool, on the same problem, with the same supervisor giving the same feedback, produces the same output with none of the understanding. The paper looks identical. The scientist doesn't.
And here is where I have to be fair to Bob, because Bob isn't stupid. Bob is responding rationally to the incentives he's been given. Academia is cutthroat. The publish-or-perish pressure is not a metaphor; it is the literal mechanism by which careers are made or ended. Long gone are the days when a single, carefully reasoned monograph could get you through a PhD and into a good postdoc. Academic hiring now rewards publication volume. The more papers you produce during your PhD, the better your chances of landing a competitive postdoc, which improves your chances of a good fellowship, which improves your chances of a tenure-track position, each step compounding the last (so many levels, almost like a pyramid). So why wouldn't a first-year student outsource their thinking to an agent, if doing so means three papers instead of one? The logic is airtight, right up until the moment it isn't. Because the same career ladder that rewards early publication volume eventually demands something that no agent can provide: the ability to identify a good problem, to know when a result smells wrong, to supervise someone else's work with the confidence that comes only from having done it yourself. You can't skip the first five years of learning and expect to survive the next twenty. There is no avoiding the publish-or-perish race if you want an academic career. But there is a balance to be struck, and it requires the one thing that is hardest to do when you're twenty-four and anxious about your future: prioritizing long-term understanding over short-term output. Nobody has ever been good at that. I'm not sure why we'd start now.
Five years from now, Alice will be writing her own grant proposals, choosing her own problems, supervising her own students. She'll know what questions to ask because she spent a year learning the hard way what happens when you ask the wrong ones. She'll be able to sit with a new dataset and feel, in her gut, when something is off, because she's developed the intuition that only comes from doing the work yourself, from the tedious hours of debugging, from the afternoons wasted chasing sign errors, from the slow accumulation of tacit knowledge that no summary can transmit.
Bob will be fine. He'll have a good CV. He'll probably have a job. He'll use whatever the 2029 version of Claude is, and he'll produce results, and those results will look like science.
I'm not worried about the machines. The machines are fine. I'm worried about us.
References:
D. W. Hogg, "Why do we do astrophysics?", arXiv:2602.10181, February 2026.
N. B. Hogg, "Find the stable and pull out the bolt", February 2026. Available at nataliebhogg.com.
M. Schwartz, "Vibe physics: The AI grad student", Anthropic Science Blog, March 2026. Available at anthropic.com/research/vibe-physics.