Refine
I recently tried refine, an AI tool for refining academic articles, developed by Yann Calvó López and Ben Golub. I sent it the current draft of my booklet on inflation to see what it can offer. I have used it only once so far, in the free trial mode. I will be a regular user forever.
The results are stunning. The comments it offered were on a par with the best comments I’ve received on a paper in my entire academic career, and more concise and organized than the best ones. They aren’t perfect, but the kind of analysis the program is able to do is well past the point at which technology looks like magic. I don’t know how you get here from “predict the next word.”
Here are its main comments:
Operationalizing the “fiscal news” narrative
The core episode claim in Ch. 2.3 is that inflation “started” when agents learned in early 2021 that a material share of deficits would not be repaid, and “ended” in mid-2022 when expectations shifted regarding future unfunded programs. As currently written, the evidentiary support risks circularity: the text leans on the occurrence of inflation to infer that debt was perceived as “unbacked,” while offering mostly qualitative narrative signals.
To carry the identification burden, this narrative requires operationalizing. Ch. 4 explicitly claims recent history offers unusually clean “experiments.” To substantiate this, the text needs a tighter set of dated, externally visible observables—specific legislative hurdles, “debt doesn’t matter” rhetoric, or election probabilities—tied to sign and timing predictions for inflation expectations and long-bond prices. Without pinning the “news” to specific dates distinct from the inflation series itself, the episode reads as a post hoc overlay rather than a discriminating application of the FTPL mechanism.
The program found the core episode and summarized the main argument of an 80-page paper. I did write a disclaimer that finding separate measures of the present value of surpluses is not possible, and that my goal was to establish plausibility rather than proof. But I was lazy, and the program suggests the obvious way in which I should tighten up the presentation, within the realm of the possible. Back to work.
Clarifying the fiscal regime distinction in the “completeness” claim
The manuscript repeatedly asserts completeness, noting in Ch. 3 (“Now, stop and admire what we have accomplished. We have a full, complete… theory…”) and Ch. 4 (“the only known, complete, economic theory… consistent with current institutions”). However, Ch. 4.5 also correctly notes that “All of the equations… are in fact the same” between FTPL and New Keynesian (NK) models.
For the academic audience, the claim that the standard NK model “Doesn’t Work” (Ch. 6.1) or is incomplete creates immediate friction unless the distinction regarding regime closure is explicit early on. The substantive disagreement is not the presence or absence of the valuation equation, but the fiscal regime: does the equation determine prices (FTPL), or does it determine future surpluses via a “passive” fiscal assumption (NK)? Reframing the critique would strengthen the argument. Rather than stating the NK model fails on internal logic, the text should argue it requires an empirically untenable assumption of passive fiscal behavior during the 2020–2022 episode.
The point is fair, and a good case of how the program improves writing. There is a tension between the observational equivalence theorem and my claim that recent experiments tell us which model is right. The answer is in the paper — observational equivalence applies only to observed time series, and we can use what we know about how central banks behave from other sources to judge which regime is plausible. Central banks don’t threaten explosive behavior to select equilibria. But that’s spread out in the paper. I didn’t really stress the causal interpretation of the valuation equation — does the valuation equation express how inflation causes surpluses, or how surpluses cause inflation? (Like the 50-year-old debate over whether MV=PY expresses how money supply causes inflation or how inflation causes money demand, accommodated by supply.) The model picked this up on its own, likely by being trained on previous FTPL controversies. Back to work.
Resolving ambiguity in the transmission mechanism
There is a tension in how the text describes the transmission of interest rate hikes. In Ch. 3.1 and Fig. 3.2, the argument is that rate hikes can produce short-run disinflation by lowering the nominal market value of outstanding bonds (the numerator in the valuation equation). However, Ch. 2.2 and Ch. 4.5 emphasize that discount rates and interest costs are central to the present value of surpluses (the denominator).
A rise in real rates that compresses bond prices also generically lowers the present value of surpluses via discounting or raises interest costs. This can push the price level upward, creating a “stepping on a rake” effect where the central bank shifts inflation across time. The policy conclusions (“should promptly raise rates,” Ch. 3.1–3.2) need to be reconciled with this mechanism. The text should explicitly partition what the central bank can achieve alone (via the valuation effect on the numerator) versus what requires fiscal adjustment, and clarify when the disinflationary channel is generic versus conditional on duration.
Here too, the model caught my habit of introducing an idea and then refining it later. The long-term debt mechanism is a way that higher future interest rates can result in lower inflation today. And the typical persistent shock links higher current and future interest rates. But sticky prices work against that mechanism: with sticky prices, higher nominal rates mean higher real rates, which mean higher interest costs on the debt, which raise inflation. I sort of dribbled that out in several places. This may or may not be a good idea. I don’t like to overwhelm readers right away. But the program makes me think about that choice.
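For readers without the booklet at hand, the valuation equation the program refers to can be written, in generic FTPL notation (my shorthand here, not necessarily the booklet’s), as:

```latex
% Real market value of government debt = expected present value of primary surpluses
\frac{Q_t \, B_{t-1}}{P_t} \;=\; E_t \sum_{j=0}^{\infty} \beta^{j}\, s_{t+j}
```

Here \(Q_t\) is the nominal price of outstanding long-term bonds \(B_{t-1}\), \(P_t\) is the price level, and \(s_{t+j}\) are real primary surpluses; a constant \(\beta\) is a simplification, and with time-varying real rates the discount factor is exactly the “denominator” channel at issue. Higher rates can lower \(Q_t\) (the disinflationary numerator effect) while also discounting future \(s_{t+j}\) more heavily (the inflationary denominator effect).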
And more, which I won’t bore you with.
It also found algebra errors, such as a negative sign that slipped into the solution of a differential equation.
And that’s the free mode! I’ll rush back in paid mode (much more extensive) after I finish incorporating these comments.
****
This is the first time I’ve seen AI at work in something I do daily, and it really is revolutionary.
Refereeing and evaluating papers is one of the more unpleasant and time-consuming tasks in our profession. I’ve read a lot of referee reports in my 40 years as an economist, and this one is in the top 5% for sure. Most referee reports do not identify the major point of the paper, and do not assess whether the paper backs up that point. They do not notice glaring gaps in logic, basic theorems violated, or Econometrics 101 advice ignored. Editors are lucky if one out of three reports is vaguely useful. Clearly, this task is going to be radically affected by AI. If I were an editor, I’d feed every paper to refine on receipt. Or I would require the author to spend the $50 and send in the latest refine report! I will surely get refine’s opinion before writing any referee report in the future.
Will all the referees be out of jobs? No! You still have to read and evaluate what refine offers. But the speed, accuracy, and quality of reports will jump. And economists will save a lot of time.
Before anyone asks me for comments on a paper, I’m going to ask “did you submit it to refine?” That will save a lot of time too.
And it should produce much better written papers, which will also save a lot of time.
It does feel weird writing defensive prose to an AI program, as I do to the thoughtful humans who offer me comments. The last comment brings this up:
Strengthening the discrimination against the Monetarist alternative
Ch. 4.2’s flagship discrimination claim is that Monetarism predicts QE should be inflationarily equivalent to helicopter transfers, and that the observed contrast falsifies Monetarism. However, the manuscript emphasizes the institutional reality of Interest on Reserves (IOR) and “ample reserves.” A sophisticated “money view” would invoke precisely these facts to argue that reserves and T-bills are near-perfect substitutes under IOR, making QE neutral while transfers are not.
If the Monetarist benchmark—invoked here to be knocked down—is not the strongest version consistent with the institutional realism the book champions, the adjudication in Ch. 4 loses credibility. To persuade the readers most likely to scrutinize this section, the text should address why the FTPL explanation dominates even a sophisticated Monetarist view that accounts for IOR, rather than only the primitive version that ignores it.
Yes, I did leave out the liquidity value of treasury debt. But is there really a serious school of thought that holds treasury debt is a perfect substitute for money, so that we should think of the whole stock of treasury debt as the monetary base? That would mean open market operations are completely irrelevant, undoing a lot of standard monetarism. And the famous “stability” of the M2–nominal GDP ratio is totally absent for total debt, which has fluctuated from 25% of GDP in the 1970s to 100% now. Velocity shocks? Perhaps the paid version will send me some cites so I know whether I really have to deal with it.
But one way or another, it is astonishing that a computer program came up with this logical possibility, which is at least a worthy “what about…” comment from the back of the seminar room.
In the meantime, I also tried Claude to update some graphs. My prompt was just “write a matlab program that fetches data series xyz from Fred using the API, and make a graph that..” with a pretty detailed description of the graph. The code ran right out of the box, even doing a decent job of “put text labels on the graph in a way that doesn’t conflict with the plotted time series.” Claude did not do a good job of finding which FRED data series would work, but that was a small task. And it produced code using a lot of commands I don’t recognize. Making sure programs do what you think they do will be a new challenge. It went on and did things I didn’t ask for, like offering summary statistics! Still, a one-hour job took 5 minutes.
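For readers who want to try the same exercise, here is a minimal Python sketch of the data-fetching step (Claude’s actual MATLAB output isn’t shown here). It assumes FRED’s standard `series/observations` JSON endpoint; the series id and API key in the usage note are placeholders you would supply yourself, and the graphing step is omitted:

```python
import json
import urllib.parse
import urllib.request

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"

def parse_observations(payload):
    """Turn FRED's JSON payload into (date, value) pairs.

    FRED marks missing observations with the string "."; skip those.
    """
    return [
        (obs["date"], float(obs["value"]))
        for obs in payload["observations"]
        if obs["value"] != "."
    ]

def fred_observations(series_id, api_key):
    """Fetch one FRED series via the JSON API and return (date, value) pairs."""
    query = urllib.parse.urlencode({
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
    })
    with urllib.request.urlopen(f"{FRED_URL}?{query}") as resp:
        payload = json.load(resp)
    return parse_observations(payload)
```

A call like `fred_observations("CPIAUCSL", my_key)` would return dated CPI observations ready to hand to a plotting routine.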
This is all old news to most of my colleagues, who are integrating AI into workflows with great speed. But if you’re not using these tools, the time to start is now.
(I used no AI to write this substack, and have not done so in the past. I’ll acknowledge it when I do, as I included refine in the thanks of the inflation booklet.)


Good timing. I just tried the "free" version on a very technical finance paper that I am in the process of refereeing. I fed it the .docx that had been provided by the editor and ran it through their free (abridged) trial. Unfortunately, it appears that the authors cut and pasted equations from another (unknown) program into Word and, consequently, Refine saw them as pictures and ignored them. So the comments from Refine largely centered on the paper's text. Nonetheless, it didn't seem to fully grasp the primary thrust of the paper from the text, so Refine's feedback was of marginal help. It pointed out a few weak spots associated with estimation techniques that were mentioned in the body of the paper, but it probably only saved a few minutes of review time. Not a fair test of the program, but perhaps a cautionary tale. I think the only way to reliably test the program is to run it on early versions of one's own research, so that inputs can be controlled.
It is very verbose, which to us classicists indicates an inability to condense! That in turn indicates an inability to match language precisely to ideas: taking 100 words to try to capture your meaning because you cannot do it in 10.