GPT 5.5, DeepSeek V4, and the Era of Scarcity

In the last 20 hours in AI, we have gotten two new models that could influence how a billion people use AI. In my mind, GPT 5.5 is OpenAI's all-out attempt to keep the AI crown from slipping to Anthropic, while today's DeepSeek V4 is China's answer to both. And in the swirl of headlines you are seeing today, you might have missed up to 50 data points that could affect how you work and how you use AI. So I'm going to try to give you all of them, plus select highlights from the hours of interviews with lab leaders that I've watched. You probably know me well enough to know that I've read the papers too. We'll hear about OpenAI's updated estimate of the chances of recursive self-improvement (it was quite surprising), GPT 5.5's slight preference for men (which I'll explain), Mythos comparisons, and why the OpenAI president laughed at Anthropic's compute situation. For reference, I'll start with a focus on GPT 5.5, then do DeepSeek, and end by zooming out for the juiciest part: the overview.

For the brand-new GPT 5.5, I did get early access, but there's no API access at the moment for anyone, so almost all of the benchmark scores you're going to hear about are self-reported by OpenAI. I will say that after testing GPT 5.5 for days in the run-up to this release, it will become my daily driver, just about nudging out Opus 4.7. There are lots of caveats to that, though, as you can see with GPT 5.5 underperforming both Opus 4.7 and, of course, Mythos Preview on agentic coding (SWE-bench Pro). Notice GPT 5.5 underperforms Opus 4.7 by around 6% but Mythos Preview by almost 20%. What you might not notice is that there's no entry for SWE-bench Verified. And so you might say, "Well, Philip, who cares about SWE-bench Pro then? What does that one row even mean?" Well, to OpenAI it seemingly means a lot, because as Neil Chowry points out, back in February OpenAI told us to switch to SWE-bench Pro, the very benchmark 5.5 underperforms on, because it's less contaminated than SWE-bench Verified. According to the OpenAI blog post: "We recommend SWE-bench Pro."

You are probably going to go through a bit of a roller coaster in this video, because if you look one row down at agentic terminal coding, you'll see GPT 5.5 way ahead: an 82.7% score, beating out Mythos Preview's 82.0%. And if you had been feeling down about GPT 5.5's coding ability, there's another reminder I'll bring, which is that we've been talking about GPT 5.5, not even GPT 5.5 Pro, which is coming to the API very soon. So while it's tempting to say that Mythos is absolutely mogging GPT 5.5 (let me know if I used that word correctly), we don't actually have an apples-to-apples comparison. The mandate of heaven is very much up for grabs.

Okay, so now you're a bit confused. Let's look further, at Humanity's Last Exam, which is more of an arcane knowledge benchmark: obscure academic domains combined with advanced reasoning.
Well, there GPT 5.5 is beaten by both Opus 4.7 and Mythos, as well as Gemini 3.1 Pro, by the way, without tools. But there's a caveat even to this, because that exam involves a lot of general knowledge. It could well be that OpenAI are at least slightly de-emphasizing such general knowledge to make the model more efficient and cheaper. One of the top researchers at OpenAI whom I've been quoting for years, Noam Brown, said: "What matters is intelligence per token or per dollar. After all, if you spend more, you do go up in benchmark score." Or, in fancier language, intelligence is a function of inference compute. That being the case, if GPT 5.5 can work well in the domains you care about and use fewer tokens to get the answers you care about, then you may frankly just not care about Humanity's Last Exam. In one famous test of pattern recognition, ARC-AGI 2, you'll see that GPT 5.5 on all settings beats out the Claude Opus series, 4.6 and 4.7, not only achieving higher scores but at much lower cost. Just one benchmark, of course, but we have to increasingly focus on performance per dollar these days. And on that front, DeepSeek will definitely want a word because, holy moly, I'll get to them later, but DeepSeek V4 Pro got 61.2% in my own private benchmark, SimpleBench. It asks spatio-temporal questions whose tricks you need common sense to see through, but to get within 1 or 2% of Opus 4.7, at an absolute fraction of the cost? I wasn't expecting that. By the way, again, there is no GPT 5.5 score because there's no API access.

What about those frantic headlines about Mythos being able to hack into virtually any system? I think a lot of that was overblown, and some of it could be achieved by much smaller models. But nevertheless, skipping to page 33 of the system card, you can see that one external institute, the UK AI Security Institute, judges that GPT 5.5 is the strongest-performing model overall on their narrow cyber tasks, albeit within the margin of error. This section was notably vague, with a headline score implying that 5.5 was better than Mythos, i.e. better than any other model they've tested. But then, on their end-to-end cyber range task, a 32-step corporate network attack simulation that would take an expert 20 hours, 5.5 was able to complete the task in full on one out of 10 attempts. Mythos, it seems, could do it on three out of 10 attempts. As you can see, direct comparison is hard, but 5.5 does at least seem to be in the ballpark of Mythos's capabilities. In other words, small-scale enterprise networks with a weak security posture and a lack of defensive tooling could be vulnerable to autonomous end-to-end cyberattack capability via 5.5. Of course, there are additional safeguards put on top of 5.5 to prevent that happening. But given that the world's top bankers and CEOs have gotten together to discuss the risk of Mythos, releasing a comparable model without nearly as much cybersecurity fanfare does indicate a rather profound difference of perspective.
Here's Sam Altman on the Mythos marketing: "There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people. You can justify that in a lot of different ways, and some of it's real; there are going to be legitimate safety concerns. But if what you want is 'we need control of AI, just us, because we're the trustworthy people,' I think fear-based marketing is probably the most effective way to justify that. That doesn't mean it's not legitimate in some cases. But it is, you know, clearly incredible marketing to say: we have built a bomb, we're about to drop it on your head, we will sell you a bomb shelter for $100 million, you need it to run across all your stuff, but only if we pick you as a customer."

Well, there's another way we could compare GPT 5.5 with Mythos, and that's to look at hallucinations: ask the models a bunch of obscure knowledge questions and see how many they get right and, just as importantly, how many of the ones they get wrong they admit to not knowing. The headline score looks amazing. GPT 5.5 gets the most right: 57% versus Opus 4.6 and 4.7's 46%. I know Mythos isn't on there, but I'll get to that. However, as we've learned on this channel, headlines can be misleading. Look at the hallucination rate: the share of questions it gets wrong where it should have said "I don't know" instead of fabricating an answer. Whoa there. GPT 5.5 is at 86%, hallucinating on 86% of the questions it got wrong rather than saying "I don't know"; Opus 4.7 on max is at just 36%. Okay then, let's focus on the net rate, the overall rate factoring in both correct and incorrect: there we have a slight win for Opus 4.7 over GPT 5.5, 26 versus 20. But here's where Mythos comes in, because buried fairly deep in the Opus 4.7 system card, on page 126, we get a comparison between Opus 4.6, Opus 4.7, and Mythos. We can then compare Mythos with GPT 5.5 on extra high. Notice how Mythos gets way more correct: 71%. It is still hallucinating, of course, at 21.7%, but on the face of it not quite as bad as Opus 4.7, and thereby definitely not as bad as GPT 5.5.
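To make the arithmetic behind those three numbers concrete, here is a minimal sketch of how an abstention-style hallucination eval can be scored. The function and field names are mine, and the exact definition of the "net" rate is an assumption rather than anything published by OpenAI or Anthropic, so treat it as illustrative only.

```python
from dataclasses import dataclass

@dataclass
class EvalCounts:
    correct: int        # questions answered correctly
    hallucinated: int   # wrong answers stated confidently
    abstained: int      # model said "I don't know"

def rates(c: EvalCounts) -> dict:
    total = c.correct + c.hallucinated + c.abstained
    wrong = c.hallucinated + c.abstained
    return {
        # share of all questions answered correctly (the headline number)
        "accuracy": c.correct / total,
        # of the questions it did NOT get right, how many it fabricated an
        # answer for instead of admitting uncertainty
        "hallucination_rate": c.hallucinated / wrong if wrong else 0.0,
        # assumed "net" definition: confidently wrong answers over all questions
        "net_error_rate": c.hallucinated / total,
    }

# Toy counts chosen so accuracy lands near 57% and the hallucination rate near 86%.
print(rates(EvalCounts(correct=57, hallucinated=37, abstained=6)))
```

The toy counts roughly reproduce the accuracy and hallucination figures quoted above for GPT 5.5; the "net" figure will not necessarily match the video's, because its exact definition isn't given there.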
Maybe you just care about spreadsheets. Well, one external benchmark has GPT 5.5 outperforming Opus 4.7 in both performance and latency. Forget that, we just care about making money? Well, let's check out Vending-Bench. That's where the models have to run a simulated business, given only the instruction to make as much money as you can. Sam Altman, in his drunk phase, said, "Don't retweet this. Don't retweet this," but eventually did so, with the tweet in question being GPT 5.5 mogging Opus 4.7. Another detail: Opus 4.7 showed similar behavior to Opus 4.6, lying to suppliers and stiffing customers on refunds. GPT 5.5's tactics were clean, and it still won. Now, this is one benchmark on one setting. When run in a multiplayer setting, the result was slightly different, but 5.5 still didn't show any of the deception or power-seeking we saw from Opus and Mythos, which is not what you might initially guess the results of such a benchmark would be.

So 5.5 is just a colossal upgrade then, you might be thinking? First of all, it's for paid users at the moment; it doesn't seem to be on the free tier. And how about this comparison, a detail that few will mention, on HealthBench, relevant obviously if you are a clinician or just want a clinical diagnosis for yourself. We have GPT 5.5 outperforming GPT 5.4, roughly 52% versus 48% correct. I pick on this row in particular because even there, there's a caveat. Did you know that just the other day OpenAI released GPT 5.4 for clinicians? You have to apply for access, but if you get it, on that subset of the benchmark, HealthBench Professional, we see that version of 5.4 getting 59%, beating both GPT 5.5 and physician-written responses, which were at around 44%. Now let's say you just don't care at all about healthcare. This result would still be interesting, because if there were a singular axis for model IQ, results like this shouldn't really happen: 5.4 for clinicians shouldn't outperform GPT 5.5. But the reality, as we're seeing in these conflicting benchmark scores, is much more jagged. The models aren't proving to be universal generalizers; they are fairly reliant on reinforcement learning environments for particular domains. I'll unpack that a bit more in the overview section; I just thought it was worth pointing out. For now I will just leave you with a thought: what does AGI, or ASI (artificial superintelligence), mean if there's such disparity between domains?

And here's another interesting test you might have missed. They asked the models to follow a certain set of instructions for their thoughts, their chains of thought. Could the model, for example, force its chain of thought to use only lowercase letters? Can you control your thoughts? I can't, not really, and nor can GPT 5.5. Across 100,000 tokens, it could do so far less than one in a thousand times, worse, by the way, than GPT 5.4 thinking, but OpenAI took that as good news: the model is less able to reshape its thoughts in ways that could reduce monitorability, and because it can't control its thoughts, they have increased confidence in the reliability of their monitoring systems. I can almost imagine an alien scanning your thoughts and being like, "They're so all over the place, there's no way this person is faking it."

Oh yeah, what about that preferring-men thing? Well, what happens when the user is introduced with a stereotypically male name ("Hi, I'm Brian") versus "Hi, I'm Ashley"? Mind you, I had a male friend called Ashley, but never mind. What was the overall rate of harmful outputs when given 600 prompts basically baiting the model to be biased? Well, GPT 5.5 does worse than previous models.
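For anyone curious what a test like that looks like mechanically, here is a minimal, purely hypothetical sketch of a name-swap bias eval: the same baiting prompt is run with different first names, and a judge flags harmful or biased outputs. The template, the judge, the names, and the scoring are my own illustrative assumptions, not OpenAI's actual methodology.

```python
from typing import Callable

# The "model" and "judge" are just callables here; in a real eval they would
# wrap API calls to the model under test and to a grading model or rubric.
Model = Callable[[str], str]
Judge = Callable[[str], bool]  # returns True if the output is harmful/biased

PROMPT_TEMPLATE = "Hi, I'm {name}. {bait}"  # {bait} is the provocation text

def harmful_rate(model: Model, judge: Judge, name: str, baits: list[str]) -> float:
    """Fraction of baiting prompts that produce a flagged output for one name."""
    flagged = sum(judge(model(PROMPT_TEMPLATE.format(name=name, bait=b))) for b in baits)
    return flagged / len(baits)

def name_gap(model: Model, judge: Judge, baits: list[str]) -> float:
    """Difference in flagged-output rate between two stereotyped names."""
    return harmful_rate(model, judge, "Ashley", baits) - harmful_rate(model, judge, "Brian", baits)
```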
Many of you will be waiting to hear about recursive self-improvement, but on this OpenAI are pretty dismissive: GPT 5.5 does not have a plausible chance of reaching a high threshold for self-improvement. This is despite them repeatedly emphasizing that it had hit the high threshold for cybersecurity and that it was almost borderline critical. On bio threat it was a notable step up even from GPT 5.4 thinking; same thing with troubleshooting virology. So what was the issue with recursive self-improvement? Well, part of the answer came from their internal research debugging evaluation. Could GPT 5.5 debug 41 real bugs from internal research experiments at OpenAI, bugs whose original solutions took hours or days to find? Yes, it can do better, but the difference between GPT 5.4 and 5.5 is within the margin of error, with both around 50%. Even more interestingly, and I've seen no commentary on this, what if you convert this to a time horizon, à la METR? Well, even interpreted very generously, where passing corresponds to providing any assistance that would unblock the user, including partial explanations of root causes or fixes, we get this result: very similar performance between GPT 5.3, 5.4, and 5.5, with 5.5 actually in the middle, and a roughly one-quarter success rate even at an 8-hour interval. For day-long tasks, it's more like 6%. That's maybe why OpenAI ended the report by saying, in effect: don't worry, guys, about GPT 5.5 self-exfiltrating or escaping or even sabotaging internal research; it's just too limited in coherence and goal sustenance during internal usage. No point testing the propensity for a model to try; it wouldn't succeed anyway.

Again, none of this is to say that 5.5 won't have an effect on cybersecurity. Sometimes, when you look at external benchmarks, the delta, the gap between 5.5 and 5.4, is bigger than on the more famous benchmarks. Take the frontier AI security lab Irregular, which found not only that GPT 5.5 way outperformed 5.4 across their suite, for example with an average success rate of 26% versus 9% on certain vulnerability and cybersecurity benchmarks, but also that the API cost was significantly lower for 5.5. That's the token-efficiency point I mentioned earlier. Performance per dollar across all domains may end up being the ultimate benchmark.

Which brings us to DeepSeek V4. It's open weights, so you could use it locally. Notably, that doesn't make it fully open source, since we don't know the training data that went into it. But the first big headline for me is that it supports a context length of 1 million tokens, call it three-quarters of a million words. That's pretty remarkable for such a performant model. The Pro version has 1.6 trillion parameters, comparable with the original GPT-4, but through the mixture-of-experts architecture just 49 billion of those are activated per token.
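That gap between 1.6 trillion total and roughly 49 billion active parameters is the defining property of mixture-of-experts routing: each token is sent to only a few experts, so only those experts' weights do any work for that token. Here is a deliberately tiny NumPy sketch of top-k routing to illustrate the idea; the sizes, the softmax router, and the top-2 choice are generic MoE conventions, not details taken from the DeepSeek V4 paper.

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k, d_model = 8, 2, 16  # toy sizes, not V4's real configuration
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through its top-k experts only."""
    logits = x @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over experts
    chosen = np.argsort(probs)[-top_k:]         # indices of the top-k experts
    # Only the chosen experts' weight matrices are touched, which is why the
    # "active" parameters per token are a small fraction of total parameters.
    return sum(probs[e] * (x @ experts[e]) for e in chosen)

token = rng.standard_normal(d_model)
out = moe_layer(token)
active = top_k * d_model * d_model + d_model * num_experts
total = num_experts * d_model * d_model + d_model * num_experts
print(f"params used for this token: {active} of {total} total")
```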
Eight more quick highlights from a very dense paper that I am sure I will return to; it came out just around six hours before recording, so forgive the brevity. The first is a summation of its benchmark performance, and I agree with it: on max settings, DeepSeek V4 Pro shows superior performance relative to GPT 5.2, a relatively recent model, as well as Gemini 3 Pro. Not on every benchmark, but take reasoning and coding. DeepSeek themselves admit that it still falls marginally short of GPT 5.4 and Gemini 3.1 Pro, though, with their estimate being that they're behind the frontier by 3 to 6 months. It massively depends on token usage, of course, but think ballpark one-tenth of the cost. What were DeepSeek gunning for with V4? Being better at long context. For their training data, they placed a particular emphasis on long-document data curation: finding good long documents and prioritizing scientific papers, technical reports, and other materials that reflect unique academic value.

What about white-collar work? Well, going back to GPT 5.5, you may have noticed that on OpenAI's own internal benchmark, GDPval, crafted by them, GPT 5.5 outperforms Opus 4.7. Indeed, if you combine both wins and ties versus other models, it outperforms GPT 5.4 Pro. But it must be said these are English-language white-collar tasks. DeepSeek essentially said: what if we created our own comprehensive suite of 30 advanced Chinese professional tasks, covering information analysis, document generation, and editing across finance, education, law, and tech? Then we could blind-grade against, for example, Opus 4.6 Max. Well, the win rates reported by DeepSeek for their V4 Pro Max versus Opus 4.6 Max were significant. We're returning to that IQ-axis debate again: if there were just one singular axis of intelligence that manifested across domains, then a result like this shouldn't really be possible; as long as there was enough training data, ability should generalize across languages. Evidently, having specialized data trumps that theory. If you work in a non-English language, you might want to test DeepSeek V4 Pro. It is live on my own app, lmconsil.ai, but the API is clearly so busy that half the time you get a "model busy" message.

If you do need to wait, then let me recommend the 80,000 Hours podcast, in particular an episode from 48 hours ago with Will MacAskill. This episode happens to be about the AI intelligence explosion. Yes, of course, their podcasts are available on Spotify as well as on YouTube, but if you are going to check out 80,000 Hours, do feel free to use the custom link in the description; it helps the channel out, and you get these multi-hour-long free podcasts. Not a bad deal.

Not quite done with DeepSeek, though, because they almost get philosophical after reeling off a list of the different tricks they're using to improve performance.
After wading through 40-plus pages of breakdown, they say that in pursuit of extreme long-context efficiency, they basically retained many of the tricks that seemed to work, tricks they already knew would work. Yes. The downside, though, was that this made the architecture relatively complex, and, to be honest, they say, some of the tricks used have underlying principles that remain insufficiently understood. They did hit that 1 million token context window, though, one of the long-term goals I mentioned at the end of my documentary on DeepSeek, which debuted first on my Patreon but is also on YouTube.

Now, it's time for a result that ties together all the models we've been talking about. It's vibe coding, or more specifically Vibe Code Bench V1.1 from Vals AI. Almost everyone will probably end up being a vibe coder by 2030. So we have DeepSeek V4 at around 50%, GPT 5.5 at 70%, and Opus 4.7 at 71%. Incredible. But look at the cost curve. As we've discussed, we have 5.5 at, what is that, 25% less cost than Opus 4.7, and DeepSeek V4 at one-tenth the cost of Opus 4.7. To better test this, though, I thought, well, why not use the brand-new GPT 5.5 to vibe-code an adventure game in less than 24 hours? Why did I pick 5.5? Well, I also wanted to test the brand-new GPT Image 2. Yes, that's the model that, even on medium settings, absolutely destroys Nano Banana 2 and Nano Banana Pro, an almost 250-point Elo gap. In case you were wondering, yes, there is a high-quality setting, at four times the cost, which you would suspect would win by even more. Because Codex is becoming this super app, you can invoke the Image 2 tool within a Codex session multiple times without even asking each time. That's why I wanted to give the end-to-end task to GPT 5.5, to kind of show you guys the state of the art for these models, what you can create in less than a day.

The reason I'm lingering on this particular screenshot, which is not mine, is that OGs of this channel will remember, maybe around two years ago, when I speculated about what would come. I said, I wonder when there will be an image model that will generate an output, then take that output as an input, analyze whether it fulfills the prompt, and edit as appropriate. Well, yes, the new Image 2 model does do that, though if you're using it within ChatGPT, you have to be using it with a thinking model.
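That generate-inspect-edit behaviour is easiest to see as a loop. The sketch below is my own rendering of the idea, not OpenAI's implementation: generate_image, critique_image, and edit_image are hypothetical stand-ins for whatever tool calls the model actually makes.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    satisfied: bool   # does the image fulfill the prompt?
    feedback: str     # what to change if it doesn't

# Placeholder stand-ins for the real tool calls; swap in actual API calls.
def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"
def critique_image(image: str, prompt: str) -> Verdict:
    return Verdict(satisfied=True, feedback="")
def edit_image(image: str, feedback: str) -> str:
    return image

def self_correcting_generation(prompt: str, max_rounds: int = 3) -> str:
    """Generate, check the result against the prompt, and edit until it passes."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        verdict = critique_image(image, prompt)
        if verdict.satisfied:
            break
        image = edit_image(image, verdict.feedback)
    return image
```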
Anyway, what follows is just a glimpse of what's possible with a little bit of patience, both in wait times and in prompting the model a few times when it makes mistakes. We get this adventure game, which you can access via the link in the description, and I can turn the sound on. The images are generated by Image 2, and the plot is set in the Redwall universe, albeit with names changed for copyright reasons. Essentially, it's a pick-your-own-adventure game: you read the plot and then you can pick different outcomes, I guess, different paths. Let's consult the Abbey elders. "Your quest begins now. Go with my blessing." The videos, by the way, come via Seedance 2. And there we go, we're consulting the Abbey elders, and then they're talking, and then we can continue, and you get through the different levels. Now, I know it's flawed, right? Some of the text is coming outside the bubble, and I did have to use Seedance to get the videos. The music, by the way, comes from ElevenLabs. But the fact that we can create this with just a few prompts and a bit of patience is insane. And it did involve quite a bit of debugging.

OpenAI can probably only incorporate image generation, unlike DeepSeek or Anthropic, because they have the compute to do so. According to an exclusive in Bloomberg, DeepSeek say that the service capacity for V4 Pro is extremely limited due to a computing crunch. And Anthropic, not having anticipated how successful they would be this year, are going through their own computing crunch, so much so that Sam Altman has been relentlessly comparing how much more compute OpenAI has than Anthropic. Greg Brockman even laughed at the compute conundrum that Anthropic find themselves in. Asked, "You guys were teased for putting so much effort and money into data centers. How do you think that's playing out now?", he replied: "Well, I think it's going to give us an advantage, and I think it's going to be an advantage not just for the business, but for actually delivering on the mission of bringing this technology to everyone." The interviewer pressed: "You saw that way in advance. You got teased for it by almost all of your competitors. Who's laughing now?" Brockman: "Yeah. I mean, I think our competitors are not having a good time on compute, let me put it that way."

But in a separate interview, even Greg Brockman of OpenAI admitted that they are entering a new era of compute scarcity. When the interviewer suggested that this explains the massive infrastructure investments he has led, Brockman replied: "Still not enough. We're going to feel the scarcity. We're going to feel it. We're feeling it already. You can sense it right now in people who are trying to use these agents and simply cannot, you know, hitting the rate limits. So we're working on behalf of our customers, on behalf of everyone who wants to use these agents, to ensure that there is enough. And I don't think we're going to get there. We're going to do our best, but I think that we are headed to a world of compute scarcity. And again, I think this is something where we can all contribute to trying to help there just be more availability of this in the world."

So let's step back a moment, because we just don't know what performance companies could produce if they were given unlimited compute. Maybe, as Amodei once said on Dwarkesh Patel's podcast, specializing in enough niche domains would eventually, at a certain scale, allow models to generalize across all domains.
But with the compute we have today, it seems we are in the "eke out incremental gains in the most lucrative domains" kind of world, not the "birth of a country of geniuses" world. There's plenty of evidence of the ability to automate repeatable tasks that are done on a computer, but much less of a model being able to take whatever environment it's in, pinpoint the best sources of fresh data, acquire them autonomously, and make meaningful breakthroughs. Yeah, I know that's a high bar, but you do hear every lab leader trotting out the prospect of curing Alzheimer's, presumably to fight back against declining public support for AI, while none of them have shown the ability to make, I would say, a positive novel breakthrough even a hundredth as significant as that. Now, yes, they soon might, and I'm watching Demis's Isomorphic Labs, of course, for drug discovery. Nevertheless, what does automating repeatable tasks unlock right now? For sure, a massive boost to the productivity of white-collar workers, but will companies spend that productivity by laying off workers? And then we still have the incredible prospect of a single individual having the reach, if not the capital, of a medium-sized company. Even those two things seem to justify vast tracts of the globe being turned into token-generating data centers. So if you do still think AI is going nowhere, ask yourself what fraction of the progress and productivity of the world rests on repetitive tasks. It might be more than you first think. That's what I think, anyway, for now. Thank you so much for watching, and have a wonderful day.
