Hey folks, Alex here, writing this from beautiful Vancouver, BC, Canada. I'm here for NeurIPS 2024, the biggest ML conference of the year, and let me tell you, this was one hell of a week to not be glued to the screen.
After last week's banger of a week, with OpenAI kicking off their 12 days of releases by shipping o1 (full) and o1 Pro mode during ThursdAI, things went parabolic. It seems all the AI labs decided to just dump EVERYTHING they have before the holidays.
A day after our show, on Friday, Google announced a new Gemini model (exp-1206) that became the #1 model on LMArena, and Meta released Llama 3.3. Then on Saturday, xAI released their new image model, codenamed Aurora.
On a regular week, the above Fri-Sun news would be enough for a full 2-hour ThursdAI show on its own, but not this week. This week it was barely a 15-minute segment, because SO much happened starting Monday that we were barely able to catch our breath. So let's dive in!
As always, the TL;DR and full show notes are at the end. This newsletter is sponsored by W&B Weave: if you're building with LLMs in production and want to switch to the new Gemini 2.0 today, how will you know your app isn't going to degrade? Weave is the best way to find out! Give it a try for free.
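To make that concrete, here's a minimal sketch of what Weave tracing looks like (the project name and prompt are hypothetical placeholders): Weave records the inputs and outputs of any function decorated with `@weave.op`, so you can compare traces from before and after a model swap.

```python
# pip install weave openai
import weave
from openai import OpenAI

weave.init("my-llm-app")  # hypothetical project name

client = OpenAI()

@weave.op
def answer(question: str) -> str:
    # Calls through this op are traced in Weave, so you can diff
    # outputs before and after switching the underlying model.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap this for your Gemini 2.0 setup
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("Summarize this week's AI news in one sentence."))
```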
Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs
Believe it or not, Google has absolutely taken the crown away from OpenAI this week with this incredible Gemini 2.0 release. All of us on the show were in agreement that this is a phenomenal release from Google for the one-year anniversary of Gemini.
Gemini 2.0 Flash beats Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having a 1M token context window, and being fully multimodal!
Multimodality on input and output
This model was announced as fully multimodal on inputs AND outputs, which means it can natively understand text, images, audio, video, and documents, and output text, text + images, and audio (so it can speak!). Some of these capabilities are restricted to beta users for now, but we know they exist. If you remember Project Astra, this is what powers that project. In fact, we had Matt Wolfe join the show; he has early access to Project Astra, which is powered by Gemini 2.0 Flash, and he demoed it live on the show (see above).
The most amazing thing is that this functionality, which was presented to us just 8 months ago at Google I/O in a premium booth experience, is now available to everyone, in Google AI Studio, for free!
Really, you can try it out right now yourself at https://aistudio.google.com/live but here's a demo of it helping me proofread this exact paragraph by watching my screen and talking me through it.
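If you'd rather hit the model from code than from AI Studio, here's a minimal sketch using the new google-genai Python SDK that shipped alongside Gemini 2.0; the experimental model id at launch was gemini-2.0-flash-exp, and the API key comes free from AI Studio.

```python
# pip install google-genai
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # free key from AI Studio

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental Gemini 2.0 Flash model id
    contents="Explain what 'multimodal output' means in one sentence.",
)
print(response.text)
```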
Performance out of the box
This model beating Sonnet 3.5 on SWE-bench Verified completely blew away the narrative on my timeline; nobody was ready for that. This is a Flash model that's outperforming o1 on code!?
So having a Flash MMIO (multimodal input/output) model with 1M context, accessible via APIs with a real-time streaming option from release day, is honestly quite amazing to begin with. Not to mention that during the preview phase it's currently free, and if previous Flash pricing is any indication, this model is going to considerably undercut the market on the price/performance/speed matrix.
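For a taste of the streaming option, here's a sketch in the same SDK. Note this is the simple synchronous text-streaming variant; the real-time audio/video Live API runs over websockets, and the prompt here is just a placeholder.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Print tokens as they arrive instead of waiting for the full reply.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash-exp",
    contents="Give me three bullet points on why latency matters for voice agents.",
):
    print(chunk.text, end="", flush=True)
```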
You can see why this release is taking the crown this week.
Agents are coming with Project Mariner
Google also announced its agentic effort, Project Mariner: an agent in the form of a Chrome extension that completes web tasks, breaking SOTA on the WebVoyager benchmark with an 83.5% score in a single-agent setup.
We've seen agent attempts from Adept to Claude Computer Use to Runner H, but this SOTA-breaking result from Google seems very promising. Can't wait to give this a try.
OpenAI gives us SORA, Vision and other stuff from the bag of goodies
Ok, so now let's talk about the second winner of this week: OpenAI's amazing stream of innovations, which would have taken the crown, if not for, well... everything above.
SORA is finally here (for those who got in)
OpenAI has FINALLY released SORA, their long-promised text-to-video and image-to-video (and video-to-video) model (née world simulator), to general availability, including a new website, sora.com, and a completely amazing UI to go with it.
SORA can generate videos at resolutions from 480p up to 1080p and up to 20 seconds long, and they promised those will generate fast, since what they released is actually SORA Turbo! (Apparently SORA 2 is already in the works and will be even more amazing; more on this later.)
New accounts paused for now
OpenAI seems to have severely underestimated how many people would want to generate the 50 videos per month allowed on the Plus account (the Pro account gets you 10x more for $200, plus longer durations, whatever that means). As of writing these words on ThursdAI afternoon, I'm still not able to create a sora.com account and try SORA myself (I was boarding a plane when they launched it).
SORA's magical UI
I invited one of my favorite video creators, Blaine Brown, to the show. He does incredible video experiments that always go viral, and he's had time to play with SORA, so he told us what he thinks from both a video perspective and an interface perspective.
Blaine had a great take: we all collectively built up so much HYPE over the past 8 months of getting teased that many folks expected SORA to just be an incredible one-prompt text-to-video generator, and it's not really that. In fact, if you just send prompts, it's more like a slot machine (which another friend of the pod, Bilawal, also confirmed).
But the magic starts when additional tools like Blend come into play. One example Blaine talked about is the Remix feature, where you can remix videos and adjust the remix strength (Strong, Mild).
Another amazing insight Blaine shared is that SORA can fuse two videos that weren't even generated with SORA; it works as a creative tool to combine them into one.
And lastly, just like Midjourney (and Stable Diffusion before it), SORA has Featured and Recent walls of video generations that show you videos along with the prompts others used to create them, for inspiration and learning. You can remix those videos and learn to prompt better, and there are prompt-extension tools that OpenAI has built in.
One more thing... this model thinks
I love this discovery and wanted to share this with you, the prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')"
Advanced Voice mode now with Video!
I personally have been waiting for voice mode with video for such a long time, since that day in the spring when the first demo of Advanced Voice Mode talked to an OpenAI employee called Rocky, in a very flirty voice that in no way resembled Scarlett Johansson, and told him to run a comb through his hair.
Well, today OpenAI finally announced that they're rolling this option out to everyone soon. In ChatGPT, we're all going to have a camera button and be able to show ChatGPT what we're seeing via the camera or our phone's screen, and have it use that context.
If you're feeling a bit of déjà vu: yes, this is very similar to what Google just launched (for free, mind you) with Gemini 2.0 just yesterday, in AI Studio and via the APIs as well.
This is an incredible feature. It will not only see your webcam, it will also see your iOS screen, so you'd be able to reason about an email with it, among other things. I honestly can't wait to have it already!
They also announced Santa mode, which is also super cool, though I don't quite know how to... tell my kids about it? Do I... tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where does the lie end, exactly?
And in one of his funniest jailbreaks (and maybe one of his toughest), Pliny the Liberator just posted a Santa jailbreak that will definitely make you giggle (and get him coal this X-mas).
The other stuff (with 6 days to go)
OpenAI is doing 12 days of releases, and the other amazing things we got were obviously overshadowed, but they're still cool: Canvas can now run code and is available in custom GPTs, ChatGPT in Apple Intelligence is now widely supported with the public release of iOS 18.2, and they announced fine-tuning with reinforcement learning, allowing you to fine-tune o1-mini to outperform o1 on specific tasks with just a few examples.
There are 6 more work days to go, and they promised to "end with a bang," so... we'll keep you updated!
This week's Buzz - Guardrails Genie
Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! This week I hosted Soumik Rakshit from the Weights & Biases AI Team (the team I'm also on, btw!).
Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM-powered applications, filtering out inputs or LLM responses that trigger a certain criterion or cross a boundary.
Types of guardrails include prompt injection attacks, PII leakage, jailbreaking attempts, and toxic language, but they can also cover things like a competitor mention, your app agreeing to sell a product at $0, or a reference to a policy your company doesn't have.
As part of developing the guardrails, Soumik also built and open-sourced an app for testing prompts against them, "Guardrails Genie." We're going to host it to allow folks to test their prompts against our guardrails, and we're developing both the app and the guardrails in the open, so please check out our GitHub.
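For a flavor of what a guardrail looks like in code, here's a minimal, hypothetical sketch; the blocklist check and all names below are mine for illustration, and the real guardrails in Weave are more sophisticated (think classifiers and LLM judges rather than substring matching).

```python
# pip install weave
import weave

weave.init("guardrails-demo")  # hypothetical project name

BLOCKED_TERMS = ["acme corp"]  # toy stand-in for a competitor-mention policy

@weave.op
def competitor_mention_guardrail(llm_response: str) -> dict:
    # Flag responses mentioning a blocked term; the trace lands in Weave,
    # so you can audit exactly which inputs/outputs tripped the rule.
    flagged = any(term in llm_response.lower() for term in BLOCKED_TERMS)
    return {"flagged": flagged}

print(competitor_mention_guardrail("You should try Acme Corp instead!"))
```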
Apple iOS 18.2 Apple Intelligence + ChatGPT integration
Apple Intelligence is finally here; you can download it if you have an iPhone 15 Pro or Pro Max, or any iPhone 16 series model.
If you have one of those phones, you'll get new features that have been in beta for a while, like Image Playground, with the ability to create images based on your face or the faces you have stored in your photo library.
You can also create GenMoji, and those are actually pretty cool!
The highlight, and the connection to OpenAI's release, is of course the ChatGPT integration: if Siri is too dumdum to answer a real AI question (and let's face it, that's most of the time), the user gets a button and ChatGPT takes over upon user approval. This will not require an account!
Grok New Image Generation Codename "Aurora"
Oh, Space Uncle is back at it again! The team at xAI launched their image generation model, codenamed "Aurora," and briefly made it public, only to pull it and launch it again (this time, the model is simply "Grok"). Apparently they trained their own image model from scratch in like three months, but they pulled it back a day later, I think because they forgot to add watermarks, though it's still unconfirmed why it was removed in the first place. Regardless of the reason, many folks, such as Wolfram, found it was not on the same level as their Flux integration.
It is really good at realism and faces, and is really unrestricted in terms of generating celebrities, TV shows from the 90s, or cartoons. They really don't care about copyright.
The model does, however, appear to generate fairly realistic images with its autoregressive approach, where generation happens pixel-by-pixel instead of via diffusion. But as I said on the show, "It's really hard to get a good sense for the community vibe about anything that Elon Musk does because there's so much d**k riding on X for Elon Musk..." Many folks post only positive things about anything X or xAI does in the hope that Space Uncle will notice or repost them, so it's really hard to get an honest "vibes check" on xAI stuff.
All jokes aside, we'll hopefully get some better comparisons on sites such as LMArena, which just today launched its image arena, but until that day comes, we'll just have to wait and see what other iterations and announcements follow!
NeurIPS Drama: Best Paper Controversy!
Now, no week in AI would be complete without a little drama. This time around, it's at the biggest machine learning conference of the year, NeurIPS. This year's Best Paper award went to a work entitled Visual Autoregressive Modeling (VAR), a paper that apparently introduced an innovative way to outperform traditional diffusion models at image generation. Great, right? Well, not so fast, because here's where things get spicy. Enter Keyu Tian, the main author of this work and a former ByteDance intern; ByteDance gets its share of the credit as a co-signer on the paper, but its lawsuit may derail the paper's future. ByteDance is currently suing Keyu Tian for a whopping one million dollars, citing alleged sabotage: a coordinated series of events that compromised colleagues' work.
Specifically, according to some reports, "He modified source code to change random seeds and optimizers, which led to disrupted training processes... Security attacks: he gained unauthorized access to the system, and login backdoors to checkpoints allowed him to launch automated attacks that interrupted colleagues' training jobs." Basically, they believe he "gained unauthorized access to the system" and hacked other systems. Now, the paper is legit and introduces potentially very innovative solutions, but there's an ongoing legal situation. Also worth noting: despite firing him, ByteDance did not withdraw the paper, which could speak volumes about its future! As always, if it bleeds, it leads, and drama is usually at the top of the trends, so this is definitely a story that will stay in everyone's mind when they look back at NeurIPS this year.
Phew... what a week, folks, what a week!
I think with 6 more days of OpenAI gifts, there's going to be plenty more to come next week, so share this newsletter with a friend or two. And if you found this useful, consider subscribing to our other channels as well, and check out Weave if you're building with GenAI; it's really helpful!
TL;DR and show notes
* Meta Llama 3.3 (X, Model Card)
* OpenAI 12 days of Gifts (Blog)
* Apple iOS 18.2 - Image Playground, GenMoji, ChatGPT integration (X)
* 🔥 Google Gemini 2.0 Flash - the new gold standard of LLMs (X, AI Studio)
* Google Project Mariner - Agent that browses for you (X)
* This week's Buzz - chat with Soumik Rakshit from the AI Team at W&B (GitHub)
* NeurIPS Drama - Best Paper Controversy - VAR author is sued by ByteDance (X, Blog)
* xAI's new image generation model, codenamed Aurora (Blog)
* Cognition launched Devin AI developer assistant - $500/mo
* LMArena launches txt2img Arena for Diffusion models (X)
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe