Jonathan's Blog

GPT-5 Follow Up Thoughts

The initial disappointment of the release has worn off and I have some assorted followup thoughts that are probably a little more clear-eyed. Instead of a half dozen posts about GPT-5 I’ll group them together into one post.

This is a smart UI decision

I talked with some friends last night. They’re your average phone-user-but-not-techie young adults and they confirmed for me the same thing OpenAI and Altman have been saying about the model picker: the vast majority of people go with the default. That means the vast majority of people are just using GPT-4o for everything, even though it’s often not the best model for the job. For most people this will be a huge improvement.

Some people still want options

When I was browsing around online to try and get a temperature check of how people are responding to the new model, one source of uproar on the r/ChatGPT subreddit that I hadn’t considered was the role playing crowd. When GPT-5 was released, OpenAI immediately deprecated all their other models including GPT-4o, the previous default. A lot of people are apparently using ChatGPT to role play scenarios and stories. I’ve never even considered doing this, but apparently to get the perfect character you build up large and complex prompts to define them, and those prompts break when you switch models. People also complained that the new GPT-5 is trained to be too helpful and task oriented, rendering it practically unusable for creative writing and role playing. When you have years of experience with one model’s specific style, changing the model’s output style is definitely a major change.

I’m also not sure how much NSFW stuff plays into this. It felt like there was an unspoken undertone of GPT-5 being a little too goody-goody for NSFW role play.

Judging by the Reddit AMA that OpenAI did, these sentiments caught OpenAI completely off guard. It’s not hard to see why. When you’re immersed in benchmarks, capabilities, agents, and the pervasive view of LLMs as a tool, you forget about all the people using these models in a completely separate way. There’s also a blind spot from an upgrade point of view: someone like me, who doesn’t care much about the output style (within reason), decides what model to use by the benchmark numbers, and swaps constantly between models, doesn’t have a large attachment to a specific output style. But when people are using models for creative writing, releasing models becomes much more difficult. Instead of only having to keep track of quantifiable capabilities like general knowledge (all the QA benchmarks), tool calling (agentic benchmarks like SWE-bench), and code generation (Aider Polyglot), you now also have to make sure you don’t break everyone’s character styles, which is vastly more difficult to quantify.

There’s also the issue that breaking someone’s role playing character seems to trigger a much more visceral reaction, because instead of just breaking a tool, you’re breaking a character that people have built a (possibly unhealthy) emotional attachment to. It feels much more akin to killing their friend or giving their friend head trauma that totally changes their personality.

The solution for all of this seems simple and, like I’ve said before, it’s to just let power users have more options. Sane defaults are always a good idea, but a hidden power user option that lets people keep using a particular model, at least for a while, keeps everyone happy.

Does this mean model capabilities are plateauing?

I think the jury is still out on this one, but this might be the first indication that the “One Model to Rule Them All” era first started by GPT-3 all those years ago may be ending. Instead of training small custom models for specific tasks, the breakthrough with Large Language Models was the first bit: “Large”. Once models got large enough they seemed to have more generalized knowledge that allowed few-shotting and one-shotting novel tasks. The goal therefore was to keep scaling up size and datasets until we got AGI which could handle everything.

Abandoning the one unified model in favor of a router that switches between models feels like an implicit statement that we’ve hit a dead end on generalizable models and it’s time to go back to specialized models to keep eking out improvements.

Is this a cost savings measure?

There have been mutterings that perhaps one of the main goals of GPT-5 is to make dynamic cost savings easier for OpenAI. When users chat with a single model, you either have to put a hard rate limit, or show them that you’re switching them to a cheaper model. Both of these are easy to notice and not great UX. But when you’re silently switching things automatically on the backend already, it would be trivial to adjust the thresholds for more compute intensive models when your system is under more load to keep everything snappy.

OpenAI is already doing this kind of rate limiting explicitly by switching you to mini models after a certain amount of messages, but I would not be at all surprised if they start doing more dynamic fiddling with the router based on system load.
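To make the load-based idea concrete, here is a minimal sketch of what that kind of threshold fiddling could look like. This is purely my own illustration and not how OpenAI's router actually works: the difficulty score, the model names, and the threshold values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadAwareRouter:
    """Hypothetical router: sends hard prompts to an expensive model,
    but raises the bar for "hard" as system load increases."""

    base_threshold: float = 0.5   # assumed cutoff when the system is idle
    max_threshold: float = 0.95   # assumed cutoff when the system is saturated

    def threshold(self, load: float) -> float:
        # Clamp load to [0, 1], then interpolate linearly between the cutoffs.
        load = min(max(load, 0.0), 1.0)
        return self.base_threshold + (self.max_threshold - self.base_threshold) * load

    def pick(self, difficulty: float, load: float) -> str:
        # difficulty is an assumed 0..1 score from some upstream classifier.
        if difficulty >= self.threshold(load):
            return "reasoning-model"
        return "fast-model"


router = LoadAwareRouter()
quiet_choice = router.pick(difficulty=0.6, load=0.0)
busy_choice = router.pick(difficulty=0.6, load=1.0)
```

The same prompt gets the expensive model when things are quiet and the cheap one when the system is busy, and nothing in the chat interface has to change for the user to be moved between them.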

This is further sacrificing API users

Currently, many if not most coding tools that use LLMs through an API default to using Anthropic’s models, most commonly Claude Sonnet. It’s also easier to use Anthropic’s models. They’re available on more API providers like OpenRouter without needing extra steps like another separate API key and ID verification. Most coding tools already have good defaults and use different models depending on the task. GPT-5 therefore won’t be much of an improvement. The router also makes the responses less predictable. All of this means that OpenAI will probably cede some of its share of the third party API space to Anthropic in return for more first party chat interface (ChatGPT.com) usage.

This is probably a smart move, since from what I can find online API usage is estimated to only make up around 15% of OpenAI’s revenue. Subscription users are much more profitable, so if a substantial upgrade for the majority of ChatGPT subscribers hurts API usage, that’s not a very tough pill to swallow.

Appendix

GPT-5 wants to turn everything into an SEO article

Here’s a weird stylistic quirk I ran into while using GPT-5. While I never use AI to do any of the actual writing for blog posts, I always run posts through an LLM to get feedback and catch any trivial mistakes. Sonnet and Opus appear to be the best for this kind of thing, but when I was testing GPT-5 it nearly always suggested adding TL;DRs and making paragraphs and sentences shorter. It seems to have a style in mind that it leans towards (maybe because of the prevalence of AI generated SEO slop?) rather than matching my writing style.

