For a few years now, AI sceptics have argued "well, it can answer question A, but it still gets harder question B wrong", ignoring that six months ago it couldn't answer A either, and that it's the direction of travel that matters. It feels like we are now beginning to run out of room to make the questions harder (unless we ask questions that humans can't answer either), and the rate of AI improvement shows no sign of slowing down.

Chains of thought! Forcing the model to give the "TRUE" or "FALSE" response first robs it of any chance to actually reason its way to an answer. Instead, I recommend prompting the AI to give:

- a set of relevant points,
- further inferences that can be made from those points,
- THEN an explanation building up to its ultimate answer,
- and only THEN the actual TRUE or FALSE.

This may seem like a lot of effort, but keep in mind that the AI does not have a consciousness or thought as we do; if we want it to actually "think about a problem", the thinking has to take place inside the text it outputs. Even students get to use a scratch pad.
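
Concretely, here is a minimal sketch of what such a prompt might look like, written as plain Python. The wording and section labels are my own illustration, not anything from the post:

```python
def build_cot_prompt(claim: str) -> str:
    """Build a chain-of-thought prompt that defers the TRUE/FALSE verdict
    until after the model has written out its reasoning."""
    return (
        f"Consider the following claim:\n\n{claim}\n\n"
        "Respond in exactly this order:\n"
        "1. RELEVANT POINTS: facts and definitions that bear on the claim.\n"
        "2. INFERENCES: further conclusions that follow from those points.\n"
        "3. EXPLANATION: an argument building up to your verdict.\n"
        "4. VERDICT: the single word TRUE or FALSE.\n"
        "Do not state the verdict before step 4."
    )

# Example usage with a made-up claim:
print(build_cot_prompt("Licensing requirements reduce prices."))
```

The point is simply that the verdict comes last, so all the model's "scratch work" happens in its output before it commits to an answer.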

Wait, I thought the bet was inflation-adjusted! That's a really unfair bet!

I was curious how much of these results is being driven by the perfect-score responses to "what did notable person X say about subject Y" questions, since those are questions where GPT has a predictable advantage over the typical student with a flesh-brain. Replacing the scores for those questions with the average score of the other T/F/E questions gives 63.5: still an A, but a more marginal one. It still doesn't augur well for your bet, but I'd buy derivatives of your bet at 10-15 cents on the dollar, up to 20, and up to 30 if you wanted to maximize your chance of winning by dropping "what does X think about Y" questions from future exams.
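
For anyone who wants to replicate that adjustment, here is a rough sketch with made-up per-question scores purely for illustration; these are not the actual exam numbers:

```python
# Hypothetical per-question scores, purely illustrative; not the actual exam data.
scores = {"q1": 20, "q2": 12, "q3": 20, "q4": 14, "q5": 10}
# Questions of the "what did X say about Y" form, where GPT scored perfectly.
lookup_questions = {"q1", "q3"}

other_scores = [s for q, s in scores.items() if q not in lookup_questions]
avg_other = sum(other_scores) / len(other_scores)  # 12.0 with these toy numbers

# Replace each lookup-question score with the average of the rest.
adjusted = {q: (avg_other if q in lookup_questions else s)
            for q, s in scores.items()}
print(sum(adjusted.values()))  # 60.0 with these toy numbers
```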

I want to know what happens to students if you write an exam that GPT does badly on. Use more complex examples that don't all have a similar form or (for quantitative problems) aren't carefully designed to have simple nice answers.

I bet you'd end up with an exam that is much better at differentiating students who actually have a conceptual understanding from those who have basically done what GPT is doing here: kinda learned the basic problem forms and how to pick out what the instructor is testing, but without necessarily having any ability to apply that in the real world.

Dang, that sucks about your bet. I think you probably could have predicted this differently, though. John Carmack, for example, was talking about how strong LLMs are, and he is usually much more sober than most people about technology-related predictions.

Now that you've lost, what are your predictions for AI capabilities? I think they're likely to automate large swaths of knowledge work, but I'm not sure when we reach the point where very little knowledge work is done by humans. I would guess that happens within 40 years, but I'm not sure why I guess that time specifically. It could happen a lot sooner, but I don't think that exponentials can continue forever.

When you made the bet, I commented that it was dumb. With the right training data, GPT-3 could have passed your test. So the bet has always been: is the right data available to answer your questions, and is the AI smart enough to make use of it? The answer to both, for this test, is clearly yes. You could design a test the AI doesn't have the data to answer, or a test that exploits a common mistake the AI makes to ensure wrong answers. So you can win the bet if you choose, but you've lost the war. The AI can pass a test like this; at this point, the mechanical chance that you'll win is not worth the time. Just concede and pay it out. You've lost; learn from it.

https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks

"As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.

In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation."

It sounds like it trained on questions that are similar to your exam questions. Have you tried asking it to generate exam questions similar to the ones you gave, to try to determine what it trained on? Another thing to try would be to ask new questions that test similar concepts.
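
In the spirit of the Codeforces check quoted above, here is a rough sketch of how one might probe for contamination. `ask_model` is a stand-in for whatever model call and grading procedure you'd actually use; it is not a real API:

```python
from typing import Callable

def contamination_check(
    before_cutoff: list[str],
    after_cutoff: list[str],
    ask_model: Callable[[str], bool],  # True if the model answers correctly
) -> None:
    """Compare solve rates on questions published before vs. after the
    model's training cutoff; a large gap is evidence of memorization."""
    def solve_rate(questions: list[str]) -> float:
        return sum(ask_model(q) for q in questions) / len(questions)

    print(f"Solve rate before cutoff: {solve_rate(before_cutoff):.0%}")
    print(f"Solve rate after cutoff:  {solve_rate(after_cutoff):.0%}")
```

A large gap between the two rates would suggest the model is retrieving rather than reasoning, which is exactly the worry with exam questions that have been discussed publicly.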

I am curious how you would grade this answer to question 5:

FALSE. The part about externalities is correct, since by definition they don’t affect private returns, so selfish students won’t care about them. Severe credit market imperfections, however, even if we presume they only result in too little credit rather than too much, need not make you more eager to continue your education either. If the goal of education is to increase your human capital and social capital, but imperfections make it more difficult to access financial capital, then spending down your financial capital and access to credit is more costly, and can stop you from having the complements necessary to profit from your education. One cannot assume that credit market imperfections will impact student loans or that loan rates reflect expected returns to education, since the student loan market is heavily subsidized by the government, so this does not provide strong evidence that returns to education are likely to be higher. One possible way, however, that excess returns could be implied is if, lacking good information on borrowers, those providing loans are using education as a proxy measure when giving out loans. In that case, continuing one’s education would provide additional access to more credit on better terms, which is valuable, increasing returns to education.

You were not my professor, but I (more or less) remember virtually all of these questions in some form or another. The set of training data for basic economics across the web (never mind all the “helper/tutor/cheater” sites) is breathtaking, widely distributed, and essentially syndicated. I’m not surprised at all. Yet this is not intelligence; this is distribution of intelligence (or otherwise). That said, the culmination of the world’s syndicated data will lead to increasing reliance (business models) on “good-enough” answers to many (most?) regular requirements across a very broad range of industries. People worry about the red herring known as AGI, and they spend time worrying about “alignment” as if it’s even an applicable word/concept outside theoretical models that left the station long ago. Alignment without understanding the power of markets is an academic exercise. Don’t worry about alignment; worry about markets. Don’t worry about AGI when “good enough” narrow AI is already here. We have arrived. Speculating about where we are going is a fun but useless exercise.

Interesting that it got the lowest score on what was, in my opinion, the easiest question (licensing requirements & prices).

I think this is why Tyler has been bullish on ChatGPT; he saw this exponential when it was first released.

Could the success of GPT-4 simply be a result of you posting the test results, answer key, and discussion of the correct answers?

For the next exam, do you mind sharing where we can read the relevant passages? I have The Accidental Theorist and probably whatever Landsburg's passage is from (unless it's from an article). However, I really didn't know the context of the passages, so it felt really difficult to answer them correctly. I got the Krugman answer correct, but, as you know from being besties with a philosopher, that may not have been a justified belief.

> "The most natural explanation to my mind was that my blog post made it into the new training data, but multiple knowledgeable friends assure me that there is no new training data."

Bryan, are you 100% sure about this? I've seen ChatGPT reference things after 2021. Example confirmation: https://twitter.com/tszzl/status/1638655356346957824

They might well be correct that your blog post is entirely separate from the training data. But I've heard the "training data ends in 2021" point referenced a few times, and at least in that simple form, it doesn't seem to be the full story. And the results of a bet like this absolutely hinge on the precise details.

Bryan: "Score: 15/20. The AI lost five points for failing to explain that (a) services are a large majority of jobs in modern economies

AI: ""This is important because services, such as meal preparation, are a significant part of modern economies."

???
