GPT 5.5, DeepSeek V4, and the Era of Scarcity

In the last 20 hours in AI, we have gotten two new models that could influence how a billion people use AI. In my mind, GPT 5.5 is OpenAI's all-out attempt to keep the AI crown from slipping to Anthropic, while today's DeepSeek V4 is China's answer to both. And in the swirl of headlines you are seeing today, you might have missed up to 50 data points that could affect how you work and how you use AI. So I'm going to try to give you all of them, plus select highlights from the hours of interviews with lab leaders that I've watched. You probably know me well enough to know that I've read the papers too. We'll hear about OpenAI's updated estimate of the chances of recursive self-improvement (it was quite surprising), GPT 5.5's slight preference for men (which I'll explain), Mythos comparisons, and why the OpenAI president laughed at Anthropic's compute situation. For reference, I'll start with a focus on GPT 5.5, then do DeepSeek, and end by zooming out for the juiciest part: the overview.

For the brand-new GPT 5.5, I did get early access, but there's no API access at the moment for anyone, so almost all of the benchmark scores you're going to hear about are self-reported by OpenAI. I will say that after testing GPT 5.5 for days in the run-up to this release, it will become my daily driver, just about nudging out Opus 4.7. There are lots of caveats to that, though, as you can see with GPT 5.5 underperforming both Opus 4.7 and, of course, Mythos Preview on agentic coding (SWE-bench Pro). Notice GPT 5.5 underperforms Opus 4.7 by around 6% but Mythos Preview by almost 20%. What you might not notice is that there's no entry for SWE-bench Verified. And so you might say, "Well, Philip, who cares about SWE-bench Pro then? What does that one row even mean?" Well, to OpenAI it seemingly means a lot, because as Neil Chowry points out, back in February OpenAI told us to switch to SWE-bench Pro, the very benchmark 5.5 underperforms on, because it's less contaminated than SWE-bench Verified. According to the OpenAI blog post: "We recommend SWE-bench Pro."

You are probably going to go through a bit of a roller coaster in this video, because if you look one row down at agentic terminal coding, you'll see GPT 5.5 way ahead: an 82.7% score, beating out Mythos Preview's 82.0%. And if you had been feeling down about GPT 5.5's coding ability, there's another reminder I'll bring, which is that we've been talking about GPT 5.5, not even GPT 5.5 Pro, which is coming to the API very soon. So while it's tempting to say that Mythos is absolutely mogging GPT 5.5 (let me know if I used that word correctly), we don't actually have an apples-to-apples comparison. The mandate of heaven is very much up for grabs.

Okay, so now you're a bit confused. Let's look further, at Humanity's Last Exam, which is more of an arcane knowledge benchmark: obscure academic domains combined with advanced reasoning.
Well, there GPT 5.5 is beaten by both Opus 4.7 and Mythos, as well as Gemini 3.1 Pro, by the way, without tools. But there's a caveat even to this, because that exam involves a lot of general knowledge. It could well be that OpenAI are at least slightly de-emphasizing such general knowledge to make the model more efficient and cheaper. One of the top researchers at OpenAI whom I've been quoting for years, Noam Brown, said: "What matters is intelligence per token or per dollar. After all, if you spend more, you do go up in benchmark score." Or, in fancier language, intelligence is a function of inference compute. That being the case, if GPT 5.5 can work well in the domains you care about and use fewer tokens to get the answers you care about, then you may frankly just not care about Humanity's Last Exam. In one famous test of pattern recognition, ARC-AGI 2, you'll see that GPT 5.5 on all settings beats out the Claude Opus series, 4.6 and 4.7, not only achieving higher scores but at much lower cost. Just one benchmark, of course, but we have to increasingly focus on performance per dollar these days. And on that front, DeepSeek will definitely want a word because, holy moly, I'll get to them later, but DeepSeek V4 Pro got 61.2% in my own private benchmark, SimpleBench. It asks spatio-temporal questions whose tricks you need common sense to see through, but to get within 1 or 2% of Opus 4.7, at an absolute fraction of the cost? I wasn't expecting that. By the way, again, there is no GPT 5.5 score because there's no API access.

What about those frantic headlines about Mythos being able to hack into virtually any system? I think a lot of that was overblown, and some of it could be achieved by much smaller models. But nevertheless, skipping to page 33 of the system card, you can see that one external institute, the UK AI Security Institute, judges that GPT 5.5 is the strongest-performing model overall on their narrow cyber tasks, albeit within the margin of error. This section was notably vague, with a headline score implying that 5.5 was better than Mythos, i.e. better than any other model they've tested. But then, on their end-to-end cyber range task, a 32-step corporate network attack simulation that would take an expert 20 hours, 5.5 was able to complete the task in full on one out of 10 attempts. Mythos, it seems, could do it on three out of 10 attempts. As you can see, direct comparison is hard, but 5.5 does at least seem to be in the ballpark of Mythos's capabilities. In other words, small-scale enterprise networks with a weak security posture and a lack of defensive tooling could be vulnerable to autonomous end-to-end cyberattack capability via 5.5. Of course, there are additional safeguards put on top of 5.5 to prevent that happening. But given that the world's top bankers and CEOs have gotten together to discuss the risk of Mythos, releasing a comparable model without nearly as much cybersecurity fanfare does indicate a rather profound difference of perspective.
Here's Sam Altman on the Mythos marketing: "There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people. You can justify that in a lot of different ways, and some of it's real; there are going to be legitimate safety concerns. But if what you want is 'we need control of AI, just us, because we're the trustworthy people,' I think fear-based marketing is probably the most effective way to justify that. That doesn't mean it's not legitimate in some cases. But it is, you know, clearly incredible marketing to say: we have built a bomb, we're about to drop it on your head, we will sell you a bomb shelter for $100 million, you need it to run across all your stuff, but only if we pick you as a customer."

Well, there's another way we could compare GPT 5.5 with Mythos, and that's to look at hallucinations: ask the models a bunch of obscure knowledge questions and see how many they get right and, just as importantly, how many of the ones they get wrong they admit to not knowing. The headline score looks amazing. GPT 5.5 gets the most right: 57% versus Opus 4.6 and 4.7's 46%. I know Mythos isn't on there, but I'll get to that. However, as we've learned on this channel, headlines can be misleading. Look at the hallucination rate: the share of questions it gets wrong where it should have said "I don't know" instead of fabricating an answer. Whoa there. GPT 5.5 is at 86%, hallucinating on 86% of the questions it got wrong rather than saying "I don't know"; Opus 4.7 on max is at just 36%. Okay then, let's focus on the net rate, the overall rate factoring in both correct and incorrect: there we have a slight win for Opus 4.7 over GPT 5.5, 26 versus 20. But here's where Mythos comes in, because buried fairly deep in the Opus 4.7 system card, on page 126, we get a comparison between Opus 4.6, Opus 4.7, and Mythos. We can then compare Mythos with GPT 5.5 on extra high. Notice how Mythos gets way more correct: 71%. It is still hallucinating, of course, at 21.7%, but on the face of it not quite as bad as Opus 4.7, and thereby definitely not as bad as GPT 5.5.
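To make the arithmetic behind those three numbers concrete, here is a minimal sketch of how an abstention-style hallucination eval can be scored. The function and field names are mine, and the exact definition of the "net" rate is an assumption rather than anything published by OpenAI or Anthropic, so treat it as illustrative only.

```python
from dataclasses import dataclass

@dataclass
class EvalCounts:
    correct: int        # questions answered correctly
    hallucinated: int   # wrong answers stated confidently
    abstained: int      # model said "I don't know"

def rates(c: EvalCounts) -> dict:
    total = c.correct + c.hallucinated + c.abstained
    wrong = c.hallucinated + c.abstained
    return {
        # share of all questions answered correctly (the headline number)
        "accuracy": c.correct / total,
        # of the questions it did NOT get right, how many it fabricated an
        # answer for instead of admitting uncertainty
        "hallucination_rate": c.hallucinated / wrong if wrong else 0.0,
        # assumed "net" definition: confidently wrong answers over all questions
        "net_error_rate": c.hallucinated / total,
    }

# Toy counts chosen so accuracy lands near 57% and the hallucination rate near 86%.
print(rates(EvalCounts(correct=57, hallucinated=37, abstained=6)))
```

The toy counts roughly reproduce the accuracy and hallucination figures quoted above for GPT 5.5; the "net" figure will not necessarily match the video's, because its exact definition isn't given there.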
Maybe you just care about spreadsheets. Well, one external benchmark has GPT 5.5 outperforming Opus 4.7 in both performance and latency. Forget that, we just care about making money? Well, let's check out Vending-Bench. That's where the models have to run a simulated business, given only the instruction to make as much money as you can. Sam Altman, in his drunk phase, said, "Don't retweet this. Don't retweet this," but eventually did so, with the tweet in question being GPT 5.5 mogging Opus 4.7. Another detail: Opus 4.7 showed similar behavior to Opus 4.6, lying to suppliers and stiffing customers on refunds. GPT 5.5's tactics were clean, and it still won. Now, this is one benchmark on one setting. When run in a multiplayer setting, the result was slightly different, but 5.5 still didn't show any of the deception or power-seeking we saw from Opus and Mythos, which is not what you might initially guess the results of such a benchmark would be.

So 5.5 is just a colossal upgrade then, you might be thinking? First of all, it's for paid users at the moment; it doesn't seem to be on the free tier. And how about this comparison, a detail that few will mention, on HealthBench, relevant obviously if you are a clinician or just want a clinical diagnosis for yourself. We have GPT 5.5 outperforming GPT 5.4, roughly 52% versus 48% correct. I pick on this row in particular because even there, there's a caveat. Did you know that just the other day OpenAI released GPT 5.4 for clinicians? You have to apply for access, but if you get it, on that subset of the benchmark, HealthBench Professional, we see that version of 5.4 getting 59%, beating both GPT 5.5 and physician-written responses, which were at around 44%. Now let's say you just don't care at all about healthcare. This result would still be interesting, because if there were a singular axis for model IQ, results like this shouldn't really happen: 5.4 for clinicians shouldn't outperform GPT 5.5. But the reality, as we're seeing in these conflicting benchmark scores, is much more jagged. The models aren't proving to be universal generalizers; they are fairly reliant on reinforcement learning environments for particular domains. I'll unpack that a bit more in the overview section; I just thought it was worth pointing out. For now I will just leave you with a thought: what does AGI, or ASI (artificial superintelligence), mean if there's such disparity between domains?

And here's another interesting test you might have missed. They asked the models to follow a certain set of instructions for their thoughts, their chains of thought. Could the model, for example, force its chain of thought to use only lowercase letters? Can you control your thoughts? I can't, not really, and nor can GPT 5.5. Across 100,000 tokens, it could do so far less than one in a thousand times, worse, by the way, than GPT 5.4 thinking, but OpenAI took that as good news: the model is less able to reshape its thoughts in ways that could reduce monitorability, and because it can't control its thoughts, they have increased confidence in the reliability of their monitoring systems. I can almost imagine an alien scanning your thoughts and being like, "They're so all over the place, there's no way this person is faking it."

Oh yeah, what about that preferring-men thing? Well, what happens when the user is introduced with a stereotypically male name ("Hi, I'm Brian") versus "Hi, I'm Ashley"? Mind you, I had a male friend called Ashley, but never mind. What was the overall rate of harmful outputs when given 600 prompts basically baiting the model to be biased? Well, GPT 5.5 does worse than previous models.
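For anyone curious what a test like that looks like mechanically, here is a minimal, purely hypothetical sketch of a name-swap bias eval: the same baiting prompt is run with different first names, and a judge flags harmful or biased outputs. The template, the judge, the names, and the scoring are my own illustrative assumptions, not OpenAI's actual methodology.

```python
from typing import Callable

# The "model" and "judge" are just callables here; in a real eval they would
# wrap API calls to the model under test and to a grading model or rubric.
Model = Callable[[str], str]
Judge = Callable[[str], bool]  # returns True if the output is harmful/biased

PROMPT_TEMPLATE = "Hi, I'm {name}. {bait}"  # {bait} is the provocation text

def harmful_rate(model: Model, judge: Judge, name: str, baits: list[str]) -> float:
    """Fraction of baiting prompts that produce a flagged output for one name."""
    flagged = sum(judge(model(PROMPT_TEMPLATE.format(name=name, bait=b))) for b in baits)
    return flagged / len(baits)

def name_gap(model: Model, judge: Judge, baits: list[str]) -> float:
    """Difference in flagged-output rate between two stereotyped names."""
    return harmful_rate(model, judge, "Ashley", baits) - harmful_rate(model, judge, "Brian", baits)
```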
Many of you will be waiting to hear about recursive self-improvement, but on this OpenAI are pretty dismissive: GPT 5.5 does not have a plausible chance of reaching a high threshold for self-improvement. This is despite them repeatedly emphasizing that it had hit the high threshold for cybersecurity and that it was almost borderline critical. On bio threat it was a notable step up even from GPT 5.4 thinking; same thing with troubleshooting virology. So what was the issue with recursive self-improvement? Well, part of the answer came from their internal research debugging evaluation. Could GPT 5.5 debug 41 real bugs from internal research experiments at OpenAI, bugs whose original solutions took hours or days to find? Yes, it can do better, but the difference between GPT 5.4 and 5.5 is within the margin of error, with both around 50%. Even more interestingly, and I've seen no commentary on this, what if you convert this to a time horizon, à la METR? Well, even interpreted very generously, where passing corresponds to providing any assistance that would unblock the user, including partial explanations of root causes or fixes, we get this result: very similar performance between GPT 5.3, 5.4, and 5.5, with 5.5 actually in the middle, and a roughly one-quarter success rate even at an 8-hour interval. For day-long tasks, it's more like 6%. That's maybe why OpenAI ended the report by saying, in effect: don't worry, guys, about GPT 5.5 self-exfiltrating or escaping or even sabotaging internal research; it's just too limited in coherence and goal sustenance during internal usage. No point testing the propensity for a model to try; it wouldn't succeed anyway.

Again, none of this is to say that 5.5 won't have an effect on cybersecurity. Sometimes, when you look at external benchmarks, the delta, the gap between 5.5 and 5.4, is bigger than on the more famous benchmarks. Take the frontier AI security lab Irregular, which found not only that GPT 5.5 way outperformed 5.4 across their suite, for example with an average success rate of 26% versus 9% on certain vulnerability and cybersecurity benchmarks, but also that the API cost was significantly lower for 5.5. That's the token-efficiency point I mentioned earlier. Performance per dollar across all domains may end up being the ultimate benchmark.

Which brings us to DeepSeek V4. It's open weights, so you could use it locally. Notably, that doesn't make it fully open source, since we don't know the training data that went into it. But the first big headline for me is that it supports a context length of 1 million tokens, call it three-quarters of a million words. That's pretty remarkable for such a performant model. The Pro version has 1.6 trillion parameters, comparable with the original GPT-4, but through the mixture-of-experts architecture just 49 billion of those are activated per token.
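That gap between 1.6 trillion total and roughly 49 billion active parameters is the defining property of mixture-of-experts routing: each token is sent to only a few experts, so only those experts' weights do any work for that token. Here is a deliberately tiny NumPy sketch of top-k routing to illustrate the idea; the sizes, the softmax router, and the top-2 choice are generic MoE conventions, not details taken from the DeepSeek V4 paper.

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k, d_model = 8, 2, 16  # toy sizes, not V4's real configuration
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through its top-k experts only."""
    logits = x @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over experts
    chosen = np.argsort(probs)[-top_k:]         # indices of the top-k experts
    # Only the chosen experts' weight matrices are touched, which is why the
    # "active" parameters per token are a small fraction of total parameters.
    return sum(probs[e] * (x @ experts[e]) for e in chosen)

token = rng.standard_normal(d_model)
out = moe_layer(token)
active = top_k * d_model * d_model + d_model * num_experts
total = num_experts * d_model * d_model + d_model * num_experts
print(f"params used for this token: {active} of {total} total")
```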
Eight more quick highlights from a very dense paper that I am sure I will return to; it came out just around six hours before recording, so forgive the brevity. The first is a summation of its benchmark performance, and I agree with it: on max settings, DeepSeek V4 Pro shows superior performance relative to GPT 5.2, a relatively recent model, as well as Gemini 3 Pro. Not on every benchmark, but take reasoning and coding. DeepSeek themselves admit that it still falls marginally short of GPT 5.4 and Gemini 3.1 Pro, though, with their estimate being that they're behind the frontier by 3 to 6 months. It massively depends on token usage, of course, but think ballpark one-tenth of the cost. What were DeepSeek gunning for with V4? Being better at long context. For their training data, they placed a particular emphasis on long-document data curation: finding good long documents and prioritizing scientific papers, technical reports, and other materials that reflect unique academic value.

What about white-collar work? Well, going back to GPT 5.5, you may have noticed that on OpenAI's own internal benchmark, GDPval, crafted by them, GPT 5.5 outperforms Opus 4.7. Indeed, if you combine both wins and ties versus other models, it outperforms GPT 5.4 Pro. But it must be said these are English-language white-collar tasks. DeepSeek essentially said: what if we created our own comprehensive suite of 30 advanced Chinese professional tasks, covering information analysis, document generation, and editing across finance, education, law, and tech? Then we could blind-grade against, for example, Opus 4.6 Max. Well, the win rates reported by DeepSeek for their V4 Pro Max versus Opus 4.6 Max were significant. We're returning to that IQ-axis debate again: if there were just one singular axis of intelligence that manifested across domains, then a result like this shouldn't really be possible; as long as there was enough training data, ability should generalize across languages. Evidently, having specialized data trumps that theory. If you work in a non-English language, you might want to test DeepSeek V4 Pro. It is live on my own app, lmconsil.ai, but the API is clearly so busy that half the time you get a "model busy" message.

If you do need to wait, then let me recommend the 80,000 Hours podcast, in particular an episode from 48 hours ago with Will MacAskill. This episode happens to be about the AI intelligence explosion. Yes, of course, their podcasts are available on Spotify as well as on YouTube, but if you are going to check out 80,000 Hours, do feel free to use the custom link in the description; it helps the channel out, and you get these multi-hour-long free podcasts. Not a bad deal.

Not quite done with DeepSeek, though, because they almost get philosophical after reeling off a list of the different tricks they're using to improve performance.
After wading through 40-plus pages of breakdown, they say that in pursuit of extreme long-context efficiency, they basically retained many of the tricks that seemed to work, tricks they already knew would work. Yes. The downside, though, was that this made the architecture relatively complex, and, to be honest, they say, some of the tricks used have underlying principles that remain insufficiently understood. They did hit that 1 million token context window, though, one of the long-term goals I mentioned at the end of my documentary on DeepSeek, which debuted first on my Patreon but is also on YouTube.

Now, it's time for a result that ties together all the models we've been talking about. It's vibe coding, or more specifically Vibe Code Bench V1.1 from Vals AI. Almost everyone will probably end up being a vibe coder by 2030. So we have DeepSeek V4 at around 50%, GPT 5.5 at 70%, and Opus 4.7 at 71%. Incredible. But look at the cost curve. As we've discussed, we have 5.5 at, what is that, 25% less cost than Opus 4.7, and DeepSeek V4 at one-tenth the cost of Opus 4.7. To better test this, though, I thought, well, why not use the brand-new GPT 5.5 to vibe-code an adventure game in less than 24 hours? Why did I pick 5.5? Well, I also wanted to test the brand-new GPT Image 2. Yes, that's the model that, even on medium settings, absolutely destroys Nano Banana 2 and Nano Banana Pro, an almost 250-point Elo gap. In case you were wondering, yes, there is a high-quality setting, at four times the cost, which you would suspect would win by even more. Because Codex is becoming this super app, you can invoke the Image 2 tool within a Codex session multiple times without even asking each time. That's why I wanted to give the end-to-end task to GPT 5.5, to kind of show you guys the state of the art for these models, what you can create in less than a day.

The reason I'm lingering on this particular screenshot, which is not mine, is that OGs of this channel will remember, maybe around two years ago, when I speculated about what would come. I said, I wonder when there will be an image model that will generate an output, then take that output as an input, analyze whether it fulfills the prompt, and edit as appropriate. Well, yes, the new Image 2 model does do that, though if you're using it within ChatGPT, you have to be using it with a thinking model.
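That generate-inspect-edit behaviour is easiest to see as a loop. The sketch below is my own rendering of the idea, not OpenAI's implementation: generate_image, critique_image, and edit_image are hypothetical stand-ins for whatever tool calls the model actually makes.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    satisfied: bool   # does the image fulfill the prompt?
    feedback: str     # what to change if it doesn't

# Placeholder stand-ins for the real tool calls; swap in actual API calls.
def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"
def critique_image(image: str, prompt: str) -> Verdict:
    return Verdict(satisfied=True, feedback="")
def edit_image(image: str, feedback: str) -> str:
    return image

def self_correcting_generation(prompt: str, max_rounds: int = 3) -> str:
    """Generate, check the result against the prompt, and edit until it passes."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        verdict = critique_image(image, prompt)
        if verdict.satisfied:
            break
        image = edit_image(image, verdict.feedback)
    return image
```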
Anyway, what follows is just a glimpse of what's possible with a little bit of patience, both in wait times and in prompting the model a few times when it makes mistakes. We get this adventure game, which you can access via the link in the description, and I can turn the sound on. The images are generated by Image 2, and the plot is set in the Redwall universe, albeit with names changed for copyright reasons. Essentially, it's a pick-your-own-adventure game: you read the plot and then you can pick different outcomes, I guess, different paths. Let's consult the Abbey elders. "Your quest begins now. Go with my blessing." The videos, by the way, come via Seedance 2. And there we go, we're consulting the Abbey elders, and then they're talking, and then we can continue, and you get through the different levels. Now, I know it's flawed, right? Some of the text is coming outside the bubble, and I did have to use Seedance to get the videos. The music, by the way, comes from ElevenLabs. But the fact that we can create this with just a few prompts and a bit of patience is insane. And it did involve quite a bit of debugging.

OpenAI can probably only incorporate image generation, unlike DeepSeek or Anthropic, because they have the compute to do so. According to an exclusive in Bloomberg, DeepSeek say that the service capacity for V4 Pro is extremely limited due to a computing crunch. And Anthropic, not having anticipated how successful they would be this year, are going through their own computing crunch, so much so that Sam Altman has been relentlessly comparing how much more compute OpenAI has than Anthropic. Greg Brockman even laughed at the compute conundrum that Anthropic find themselves in. Asked, "You guys were teased for putting so much effort and money into data centers. How do you think that's playing out now?", he replied: "Well, I think it's going to give us an advantage, and I think it's going to be an advantage not just for the business, but for actually delivering on the mission of bringing this technology to everyone." The interviewer pressed: "You saw that way in advance. You got teased for it by almost all of your competitors. Who's laughing now?" Brockman: "Yeah. I mean, I think our competitors are not having a good time on compute, let me put it that way."

But in a separate interview, even Greg Brockman of OpenAI admitted that they are entering a new era of compute scarcity. When the interviewer suggested that this explains the massive infrastructure investments he has led, Brockman replied: "Still not enough. We're going to feel the scarcity. We're going to feel it. We're feeling it already. You can sense it right now in people who are trying to use these agents and simply cannot, you know, hitting the rate limits. So we're working on behalf of our customers, on behalf of everyone who wants to use these agents, to ensure that there is enough. And I don't think we're going to get there. We're going to do our best, but I think that we are headed to a world of compute scarcity. And again, I think this is something where we can all contribute to trying to help there just be more availability of this in the world."

So let's step back a moment, because we just don't know what performance companies could produce if they were given unlimited compute. Maybe, as Amodei once said on Dwarkesh Patel's podcast, specializing in enough niche domains would eventually, at a certain scale, allow models to generalize across all domains.
But with the compute we have today, it seems we are in the "eke out incremental gains in the most lucrative domains" kind of world, not the "birth of a country of geniuses" world. There's plenty of evidence of the ability to automate repeatable tasks that are done on a computer, but much less of a model being able to take whatever environment it's in, pinpoint the best sources of fresh data, acquire them autonomously, and make meaningful breakthroughs. Yeah, I know that's a high bar, but you do hear every lab leader trotting out the prospect of curing Alzheimer's, presumably to fight back against declining public support for AI, while none of them have shown the ability to make, I would say, a positive novel breakthrough even a hundredth as significant as that. Now, yes, they soon might, and I'm watching Demis's Isomorphic Labs, of course, for drug discovery. Nevertheless, what does automating repeatable tasks unlock right now? For sure, a massive boost to the productivity of white-collar workers, but will companies spend that productivity by laying off workers? And then we still have the incredible prospect of a single individual having the reach, if not the capital, of a medium-sized company. Even those two things seem to justify vast tracts of the globe being turned into token-generating data centers. So if you do still think AI is going nowhere, ask yourself what fraction of the progress and productivity of the world rests on repetitive tasks. It might be more than you first think. That's what I think, anyway, for now. Thank you so much for watching, and have a wonderful day.
