"...the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”"
The best analogy I can think of is the transistor. I still don't understand how a bunch of, essentially, on-and-off switches can result in, for example, a movie streamed to my living room tv. But somehow it does. Still, definitely indistinguishable from magic. (And these are just the early days – the "Eniac" of LLM days...)
Oh, pshaw! It's not magic. It's the theory of dark in action. Light bulbs are really dark suckers which send the dark through the wires to the power plant where the dark is shot skyward from those big chimneys, and eventually the sun sucks up all the dark and has to take a break to digest it. The proof is how much dark is emitted when you short out a circuit in anything electrical.
Thanks for pointing this out. Hard to keep on top of tools. Was "meh" for me on a shortish paper I just submitted. Maybe a bit worse than Claude Opus 4.6 in project mode with all my research in the working directory. Perhaps on par with ChatGPT cold.
It is very verbose, which to us classists indicates inability to condense! That in turn indicates inability to precisely match language to ideas. Thus, taking 100 words to try to capture your meaning because you cannot do it in 10.
Good timing. I just tried the "free" version on a very technical finance paper that I am in the process of refereeing. I fed it the .docx that had been provided by the editor and ran it through their free (abridged) trial. Unfortunately, it appears that the authors cut and pasted equations from another (unknown) program into Word and, consequently, Refine saw them as pictures and ignored them. So the comments from Refine largely centered on the paper's text, Nonetheless, it didn't seem to fully grasp the primary thrust of the paper from the text. So, Refine's feedback was of marginal help. It pointed out a weak spots associated with estimation techniques that were mentioned in the body of the paper, but it probably only saved a few minutes of review time. Not a fair test of the program, but perhaps a cautionary tale. I think the only way to reliably test the program is to test it on early version of one's own research so that inputs can be controlled.
On the main circularity point raised by the AI, there’s no way this can be the first time this has been raised with respect to fiscal theory. The whole how do you know that people think future deficits are worse because bond yields rose argument was never the strongest.
When the Nobel committee is evaluating you in a few years, will they discriminate against you because of your use of AI? (Or will they use AI themselves to evaluate their nominees?)
I suggest that you put your 2023 book, The Fiscal Theory of The Price Level, through "Refine" and use Refine's output as a guide to revise the original text, graphics and equations as the starting point for an "updated, 2nd edition". It could do with a do-over.
I find it hilarious the AI pinged you for not taking seriously imperfect substitutability of government liabilities and lack of institutional realism. Ben’s program really is that good!
I am curious how this program would respond to writings of 100 years ago? I recently read Sir Winston Churchill's book "The Story of the Malakand Field Force," an account of British combat operations in the North-West frontier of India, written in 1898. Churchill's prose are considerably different from those of the 21st century. Would this program have Churchill rewrite the account to fit modern writing styles?
The fun thing is we've been doing this for over a year (originally for causal inference papers only, but readily expandable into other domains), it requires a system prompt of about a hundred lines of pseudocode, and it runs on any frontier model for free. One would think that by this stage, every researcher has their own review prompt at the ready.
Thanks for great inspiration, and very relevant warning. Audience capture, er training set capture of LLMs is certain to be a big and growing problem.
If reality, in econ & other areas, is only accurately described by multivariate non-linear multi-dimensional equations, which have been mathematically intractable and thus not part of any one theory, it should lead to partial falsifications of all understandable theories.
But at another level there is the question of understanding vs. policy-nudging/ coercion/ incentivizing. Is economics to understand or to influence, or even control?
I would also recommend those who cannot or don't want to pay 50 dollars for one run, to look into the "skills" and "rules" that have been shared for Claude Code or Cursor. Pedro Sant'Anna shared his setup recently. Maybe not as good as Refine, but it runs continuously and costs you 20 dollars a month.
If the profession shifts to “LLM summary first, paper second,” then whatever evaluation prior the summarizer bakes in (NK vs FTPL, structural vs reduced-form, “settled science” vs live dispute) becomes a quiet consensus engine.
One design response is AI councils: don’t ask “what does the LLM think of this paper?” Ask “what do multiple coherent evaluators think—each with explicit premises—and where do they disagree?”
Concretely:
Review should have a panel, not singleton. The digest is produced by a council of judges: an NK judge, an FTPL judge, an ID-first judge, a structural judge, a GE/behavioral judge, plus an “editor pragmatist” judge that answers the questions you listed (referee responsiveness, which comments are correctness vs importance vs optional extensions). The goal isn’t to force pluralism as an ideology; it’s to prevent the interface from silently choosing a worldview for the reader. LLM-as-judge for is already using this method for training the model that are producing these stunning gains.
"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" Verga et al 2024
Anthropic’s “persona selection” framing is helpful here: models have access to many possible “voices,” and post-training largely selects among them. That suggests we should deliberately select evaluator personas that correspond to real methodological lenses, rather than letting one latent default voice dominate the digest.
Aggregation shouldn’t average away disagreement. The council shouldn’t collapse into a single blended paragraph. Each judge should issue (i) a recommendation and (ii) an intensity/conviction signal, and the digest should surface high-conviction objections even if they’re minority views. That’s exactly where a lot of real methodological fights live: one camp sees a point as fatal, another sees it as irrelevant.
"From Many Voices to One: Statistically Principled Aggregation of LLM Judges"
This also reframes your proposed test: propose the proper set of judges and how they are coherently formed, ie create the persona that will power the AI evaluator in the council. Then see whether it can (a) generate multiple coherent methodological reads, and (b) preserve live disagreement rather than laundering it into a single authoritative tone. That’s the failure mode you’re pointing at.
Couldn't we just use it in journals? Referees don't get paid anyway, we could just substitute multiple referees and one editor with one editor one referee and multiple feedback sessions with Refine.
"...the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”"
The best analogy I can think of is the transistor. I still don't understand how a bunch of, essentially, on-and-off switches can result in, for example, a movie streamed to my living room tv. But somehow it does. Still, definitely indistinguishable from magic. (And these are just the early days – the "Eniac" of LLM days...)
It is even more perplexing how we got to where we are from "pass on your genes."
You're reminding me of the classic Calvin and Hobbes strip, John and Mike!
https://www.reddit.com/r/calvinandhobbes/comments/14y2kvr/dad_coming_through_with_the_explanation/
Oh, pshaw! It's not magic. It's the theory of dark in action. Light bulbs are really dark suckers which send the dark through the wires to the power plant where the dark is shot skyward from those big chimneys, and eventually the sun sucks up all the dark and has to take a break to digest it. The proof is how much dark is emitted when you short out a circuit in anything electrical.
Thanks for pointing this out. Hard to keep on top of tools. Was "meh" for me on a shortish paper I just submitted. Maybe a bit worse than Claude Opus 4.6 in project mode with all my research in the working directory. Perhaps on par with ChatGPT cold.
Just out of curiosity, what type of paper was it/what area? We'd happily give you some free reviews to see how you feel about it on a longer paper.
It does shine most on a project like John's 80-pager.
Would be happy to chat - feel free to ping me at ben@refine.ink
IMO, the human still has value. Refine goes for narrower critiques, but sometimes you need to disagree with large choices.
I say “still has value”, but of course, statement only valid for the next two years!
That AI tool left a really good comment about your critique of monetarism, and you didn’t listen to it!
It is very verbose, which to us classicists indicates an inability to condense! That in turn indicates an inability to precisely match language to ideas: taking 100 words to try to capture your meaning because you cannot do it in 10.
Good timing. I just tried the "free" version on a very technical finance paper that I am in the process of refereeing. I fed it the .docx that had been provided by the editor and ran it through the free (abridged) trial. Unfortunately, it appears that the authors cut and pasted equations from another (unknown) program into Word, so Refine saw them as pictures and ignored them. The comments from Refine therefore centered largely on the paper's text. Even so, it didn't seem to fully grasp the primary thrust of the paper from the text, so Refine's feedback was of marginal help. It pointed out a few weak spots in the estimation techniques mentioned in the body of the paper, but it probably only saved a few minutes of review time. Not a fair test of the program, but perhaps a cautionary tale. I think the only way to reliably test the program is to run it on an early version of one's own research, so that the inputs can be controlled.
Does it read LaTeX files and PDFs with ease?
On the main circularity point raised by the AI: there's no way this can be the first time this has been raised with respect to fiscal theory. The whole "how do you know that people think future deficits are worse because bond yields rose?" argument was never the strongest.
When the Nobel committee is evaluating you in a few years, will they discriminate against you because of your use of AI? (Or will they use AI themselves to evaluate their nominees?)
It should. This admission by Cochrane should leave us all wondering if anything he writes in the future is his own work.
I suggest that you put your 2023 book, The Fiscal Theory of the Price Level, through "Refine" and use Refine's output as a guide to revise the original text, graphics, and equations as the starting point for an "updated, 2nd edition". It could do with a do-over.
I find it hilarious the AI pinged you for not taking seriously imperfect substitutability of government liabilities and lack of institutional realism. Ben’s program really is that good!
I am curious how this program would respond to writings of 100 years ago. I recently read Sir Winston Churchill's book "The Story of the Malakand Field Force," an account of British combat operations on the North-West Frontier of India, written in 1898. Churchill's prose is considerably different from that of the 21st century. Would this program have Churchill rewrite the account to fit modern writing styles?
I wonder how well AI could rewrite it in other styles, like Mark Twain or Shakespeare.
The fun thing is we've been doing this for over a year (originally for causal inference papers only, but readily expandable into other domains), it requires a system prompt of about a hundred lines of pseudocode, and it runs on any frontier model for free. One would think that by this stage, every researcher has their own review prompt at the ready.
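For anyone curious what such a setup might look like, here is a minimal sketch. Everything in it is illustrative, not our actual prompt: it assumes the OpenAI Python client against an OpenAI-compatible endpoint, and the prompt text and model name are placeholders.

```python
# A minimal sketch of a standing review prompt. Assumes the OpenAI Python
# client and an OpenAI-compatible endpoint; prompt and model are placeholders.
from openai import OpenAI

REVIEW_PROMPT = """You are refereeing an empirical economics paper.
1. State the paper's main claim in one sentence.
2. List identification threats, ordered by severity.
3. Separate correctness issues from importance issues from optional extensions.
4. Flag any claim the evidence does not support."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def review(paper_text: str, model: str = "gpt-4o") -> str:
    """Run the standing review prompt against one draft."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": paper_text},
        ],
    )
    return response.choices[0].message.content
```

The real version is longer and domain-specific, but the shape is the same: one standing system prompt, one call per draft.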
Thanks for the great inspiration, and a very relevant warning. Audience capture, er, training-set capture of LLMs is certain to be a big and growing problem.
If reality, in econ and other areas, is accurately described only by multivariate, non-linear, multi-dimensional equations, which have been mathematically intractable and thus not part of any one theory, then we should expect partial falsifications of all understandable theories.
But at another level there is the question of understanding vs. policy-nudging/ coercion/ incentivizing. Is economics to understand or to influence, or even control?
I would also recommend those who cannot or don't want to pay 50 dollars for one run, to look into the "skills" and "rules" that have been shared for Claude Code or Cursor. Pedro Sant'Anna shared his setup recently. Maybe not as good as Refine, but it runs continuously and costs you 20 dollars a month.
If the profession shifts to “LLM summary first, paper second,” then whatever evaluation prior the summarizer bakes in (NK vs FTPL, structural vs reduced-form, “settled science” vs live dispute) becomes a quiet consensus engine.
One design response is AI councils: don’t ask “what does the LLM think of this paper?” Ask “what do multiple coherent evaluators think—each with explicit premises—and where do they disagree?”
Concretely:
Review should have a panel, not a singleton. The digest is produced by a council of judges: an NK judge, an FTPL judge, an ID-first judge, a structural judge, a GE/behavioral judge, plus an "editor pragmatist" judge that answers the questions you listed (referee responsiveness, which comments are correctness vs. importance vs. optional extensions). The goal isn't to force pluralism as an ideology; it's to prevent the interface from silently choosing a worldview for the reader. LLM-as-judge panels are already used this way to train the models that are producing these stunning gains.
"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" Verga et al 2024
https://arxiv.org/abs/2404.18796
Anthropic’s “persona selection” framing is helpful here: models have access to many possible “voices,” and post-training largely selects among them. That suggests we should deliberately select evaluator personas that correspond to real methodological lenses, rather than letting one latent default voice dominate the digest.
https://www.anthropic.com/research/persona-selection-model
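To make the panel-plus-personas idea concrete, here is a rough sketch in code. The judge personas are illustrative stubs (not anyone's actual evaluator design), and `call_model` stands in for whatever frontier-model client you prefer.

```python
# Sketch of a council of persona judges. Personas are illustrative stubs;
# call_model stands in for any frontier-model API call.
from dataclasses import dataclass


@dataclass
class Judge:
    name: str
    premises: str  # the explicit methodological lens the judge argues from


JUDGES = [
    Judge("NK", "Evaluate this paper as a New Keynesian macroeconomist."),
    Judge("FTPL", "Evaluate this paper from a fiscal-theory perspective."),
    Judge("ID-first", "Evaluate identification before anything else."),
    Judge("Structural", "Ask whether the model structure earns its keep."),
    Judge("Editor-pragmatist",
          "Classify each comment as correctness, importance, or optional "
          "extension, and judge how responsive the authors must be."),
]


def call_model(system_prompt: str, user_text: str) -> str:
    raise NotImplementedError("plug in your preferred frontier-model client")


def council_review(paper_text: str) -> dict[str, str]:
    """One read per judge; deliberately NOT merged into a single voice."""
    return {j.name: call_model(j.premises, paper_text) for j in JUDGES}
```

The design choice that matters is the return type: a dict of separate reads, so no step of the pipeline ever has a single "the model's opinion" to hand the reader.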
Aggregation shouldn’t average away disagreement. The council shouldn’t collapse into a single blended paragraph. Each judge should issue (i) a recommendation and (ii) an intensity/conviction signal, and the digest should surface high-conviction objections even if they’re minority views. That’s exactly where a lot of real methodological fights live: one camp sees a point as fatal, another sees it as irrelevant.
"From Many Voices to One: Statistically Principled Aggregation of LLM Judges"
Zhao et al 2025
https://neurips.cc/virtual/2025/loc/san-diego/125205
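A toy version of a digest that keeps disagreement visible might look like the following. The recommendation labels and conviction scale are my assumptions, not taken from either paper cited above.

```python
# Toy digest that preserves high-conviction minority objections instead of
# blending everything into one paragraph. Scales are assumed, not canonical.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Verdict:
    judge: str
    recommendation: str  # e.g. "accept", "major revision", "reject"
    conviction: float    # 0.0 (take it or leave it) .. 1.0 (fatal flaw)
    objection: str


def digest(verdicts: list[Verdict], threshold: float = 0.8) -> str:
    """Show the vote distribution, then every high-conviction objection,
    even when it is a minority view."""
    counts = Counter(v.recommendation for v in verdicts)
    lines = [f"Recommendations: {dict(counts)}"]
    for v in verdicts:
        if v.conviction >= threshold:
            lines.append(
                f"[{v.judge}, conviction {v.conviction:.1f}] {v.objection}")
    return "\n".join(lines)
```

The point of the threshold is exactly the fatal-vs-irrelevant asymmetry: a single judge at conviction 0.9 survives into the digest even if the other four shrug.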
This also reframes your proposed test: propose the proper set of judges and how they are coherently formed, i.e., create the personas that will power the AI evaluators in the council. Then see whether it can (a) generate multiple coherent methodological reads, and (b) preserve live disagreement rather than laundering it into a single authoritative tone. That's the failure mode you're pointing at.
You should write an equivalent document to Anthropic's constitution: https://www.anthropic.com/constitution
Couldn't we just use it in journals? Referees don't get paid anyway; we could replace multiple referees and one editor with one editor, one referee, and multiple feedback sessions with Refine.