43 Comments

The wonder is not GPT's performance but that it was done under such a capricious and arbitrary grading scheme. Once again I am grateful never to have faced one of Caplan's tests.


>It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided.

This is just blatantly docking points for disagreeing with your politics. It's easy to justify a UBI experiment as an EA even if you think UBI wouldn't be the best thing to *personally* fund: if you think UBI has a good chance of being a significant improvement over the status quo, an experiment might be a low-cost way of helping bring about that change. EAs don't have to *just* be about malaria nets for the very poor.

I also concur with the sentiments of the other posters that the questions and answers are not that difficult but obviously loaded to make a point about your politics.

I wonder if the next generation of AI will seek to do well on tests by predicting how to flatter the graders.


Bryan, is this going to make you update your thoughts on AI safety?

If 'capabilities' is 5 years ahead of what you expected (this should be a huge shock to your model of AI), would you be worried that AI safety/alignment is 5+ years behind capabilities?


Apart from all the AI stuff, the exam seems so weird and biased. First of all, the questions all seem really easy; I feel like I could pass the exam by reading 7 random Caplan blog posts. And they seem so biased, especially number two. Politics is the mind-killer (https://www.lesswrong.com/posts/9weLK2AJ9JEt2Tt8f/politics-is-the-mind-killer): if you want people to think rationally or to learn something, don't use politics unless you have to. Six is also kind of weird, but I do like the mention of EA.

But I have never seen another Public Policy exam, so maybe the others are worse.


I'm incredibly surprised that this is what you use to grade your students. All the questions are incredibly loaded, easy, and basically boil down to "did you read me? Paraphrase my beliefs back to me."

Honestly, I'm shocked that this passes for an exam, and I would have been shocked if GPT-4 had failed it, given the things I've seen it accomplish.


So now the question is, what are you going to do about it? How do you give a fair test when GPT-4 makes cheating so easy? The way I see it, there are three options, none of them good.

Option 1: The chump. Just keep doing things the way they are now. Some students will use GPT-4 to cheat, some will try to do it themselves. The ones who do it themselves will be punished by a lower score, especially with GPT-4 bumping up the grading curve. Eventually, all the top students will be cheaters.

Option 2: The prison ward. All students are forced to take the tests in person, with pencil and paper only. All of them are forced to turn over all cell phones and electronic devices before the test, maybe going through something like airport security. The teacher must sit and watch them the entire time. Everyone will be miserable, but at least it's still fair.

Option 3: Give up. Classes will no longer be graded. This will make it very hard for students to motivate themselves to study, and also very hard for grad schools/employers to tell which students did a good job in the classes. The value of a college degree continues its downward slide.

Like I said, I really don't like any of these options, but what else is there? It seems terrible that we've invented such a powerful cheating device with no defense.


This is honestly an astoundingly vacuous exam.


The response to the second question (asking why liberal Californians moving to Texas is surprising) tells me that GPT-4 is ready to host its own television show.


Stumbled across this due to my interest in the AI content, but more interesting is the content of this "economics" exam. Q1 is really the only objective, non-loaded question on the exam, and I'd wager any real economist who didn't write this exam would score ChatGPT's answers better than Prof. Caplan's "suggested answers." Want to Bet On It??

Q3: "It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided." Perfect example of why so many scientists don't consider economics to be a science at all: lack of testable hypotheses, lack of consensus, and inherent political overtones.

Q2: Every year, 60,000-80,000 Californians move to Texas and 35,000-40,000 Texans move to California. What would be surprising is if all those individuals moved for political reasons. In liberal California, 49% of residents identify as Democrat, 30% as Republican, and 21% have no lean. In conservative Texas, 40% identify as Democrat, 39% as Republican, and 21% have no lean. Should we be surprised that literally half of each state's population doesn't move across the country to better self-sort by political affiliation? There's more to life than party affiliation, and party affiliation is more complex than any single issue such as guns or taxes. https://www.pewresearch.org/religion/religious-landscape-study/compare/party-affiliation/by/state/
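A quick back-of-the-envelope sketch (my own illustration, using only the flow and affiliation figures cited above) makes the point: even if migration were completely politics-blind, you'd expect roughly 29,000-39,000 Democrats among the Californians arriving in Texas each year.

```python
# If migration were politics-blind, movers would simply mirror their origin
# state's partisan mix. Figures are the ones cited above (Pew / annual flows).

ca_to_tx_flow = (60_000, 80_000)  # Californians moving to Texas per year
ca_party_mix = {"Democrat": 0.49, "Republican": 0.30, "No lean": 0.21}

for party, share in ca_party_mix.items():
    low, high = (int(f * share) for f in ca_to_tx_flow)
    print(f"{party}: ~{low:,}-{high:,} expected movers per year")
# Democrat: ~29,400-39,200 -- liberal Californians show up in Texas
# in large numbers even with zero political sorting.
```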

Q5: Even Milton Friedman agreed that occupational licensing's benefits can outweigh its costs in some cases where there is a public safety concern. An economist should also consider external costs, not only individual costs. Should we let a budget airline choose to save a few bucks by hiring cheaper unlicensed pilots? Should we allow insurance companies to save a few bucks by approving only cheaper unlicensed surgeons?


By contrast, GPT-4 failed my final exam egregiously: https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/


ChatGPT (not sure if GPT-4 or not) scores a zero on Steven Landsburg's exam: https://www.thebigquestions.com/2023/04/04/chatgpt-fails-economics/


Does this update any of your priors about other things machine learning programs may be able to do in the near future, and the likely consequences?


For what it's worth, I had trouble replicating results for Question 1 on Bing Chat, which I thought was powered by GPT-4. I tried all three of "Creative Mode", "Balanced Mode", and "Precise Mode". In Balanced Mode, Bing Chat actually found this Bet On It post and saw Bryan Caplan give the answer as 80M, but went through its own calculation anyway and came up with the wrong answer. In Creative Mode, it just gave an incorrect answer. In Precise Mode, it did the correct calculation.

However, I was suspicious because it clearly already knew about this Bet On It post. So, I decided to ask a similar question in Precise Mode to see whether it could answer. I asked, "Suppose that local authorities are considering banning a harmful technology. They estimate that the ban will reduce quality of life by 10% for a year but save 732 life-years. How high does total population have to be before the ban fails CBA?"

Bing Chat replied: "Cost-benefit analysis (CBA) is a technique used to compare the total costs of a policy or project with its benefits. In this case, the cost of the ban is a 10% reduction in quality of life for one year and the benefit is 732 life-years saved. To determine whether the ban fails CBA, we need to know the value of a statistical life-year (VSLY) and the total population. The ban fails CBA if the total cost of the ban (10% reduction in quality of life for one year for the entire population) is greater than the total benefit (732 life-years saved * VSLY). Without knowing the VSLY and total population, it is not possible to determine whether the ban fails CBA."

This version of the problem is even easier than the original because one doesn't need to subtract two numbers to get the net benefit of 732 life-years. I eliminated the required subtraction step because I didn't want Bing Chat to just be able to use the original solution in this Bet On It post. Using the same solution method as the original problem, the answer is that the ban fails CBA for populations above 732/(0.10) = 7320. (VSLY appears on both the cost and benefit side and, thus, cancels.)
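For concreteness, here is a minimal sketch of that break-even calculation (my own illustration; the function name and structure are not from the post). Since VSLY multiplies both the cost and the benefit, it cancels, and the threshold depends only on the life-years saved and the quality-of-life fraction lost.

```python
# Break-even population for the modified ban problem described above.
# VSLY appears on both sides of the comparison, so it cancels out.

def breakeven_population(life_years_saved: float, qol_loss_fraction: float) -> float:
    """Population above which the ban fails cost-benefit analysis.

    Cost    = qol_loss_fraction * population * VSLY  (one year, whole population)
    Benefit = life_years_saved * VSLY
    Setting cost equal to benefit and solving for population cancels VSLY.
    """
    return life_years_saved / qol_loss_fraction

print(breakeven_population(732, 0.10))  # 7320.0, matching the hand calculation
```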

Was there any special prompting that Collin Gray had to use to get GPT to work? It's possible that Bing Chat's implementation is just worse than other GPT-4 implementations. I have found Bing Chat makes many mathematical errors, e.g., algebraic errors, even though others have reported GPT-4 doing very well on SAT-Math and GRE-Quant. It's also possible that I'm not prompting it skillfully.


It seems that the AI has been trained to become a teacher's pet and docilely submit to indoctrination, aligning with whatever fuzzy philosophy the teacher is espousing.


But doesn't testing GPT-4 on your exams contradict your free-market principles?

Shouldn't consumers be free to (blindly) trade off quality against price in markets with asymmetric information (market for lemons)? Didn't Milton Friedman urge us to be "free to AI hallucinate"?


How much time did people have to complete the test? Do students usually get a chance to answer every question? If students had more time to do the test, would they do a lot better?

Consider that it's quite easy to make an exam that humans suck at and ChatGPT sucks at less:

1. Make a very long test of easy questions

2. Give students a short amount of time to take the test—the shorter the better

Predictably, ChatGPT is able to answer all of the questions. The questions are fairly easy, so it does okay on most of them and gets partial credit on all of them. Humans weren't able to answer every question, so they do a lot worse overall. However, you might expect the number of questions humans got perfect scores on to be higher than the number ChatGPT got.
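To make this concrete, here is a toy simulation of the effect (my own sketch; every parameter is an illustrative assumption, not data from the post): the human aces every question they reach but runs out of time, while the model attempts everything for partial credit.

```python
# Toy model of the "long test, short clock" effect described above.
# All numbers are made-up assumptions for illustration.

N_QUESTIONS = 40         # a very long test of easy questions
HUMAN_ANSWERED = 15      # how many a human can reach before time runs out
HUMAN_PER_Q = 1.0        # humans ace the easy questions they do reach
AI_PER_Q = 0.6           # the model attempts all, earning partial credit

human_total = HUMAN_ANSWERED * HUMAN_PER_Q  # 15.0 points
ai_total = N_QUESTIONS * AI_PER_Q           # 24.0 points

print(f"human: {human_total}/{N_QUESTIONS}, model: {ai_total}/{N_QUESTIONS}")
# The model scores higher overall even though the human earned a perfect
# score on every question they attempted.
```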
