A stylish woman walks down a Tokyo street…

Sora, a revolutionary AI model by OpenAI (yes, those guys again) that can create realistic and imaginative scenes from text instructions, has just been launched. The dust hasn’t settled yet, and oh boy, it was a massive cloud, but what we have learned so far is that the release has shaken up the Internet like never before. Is it the Midjourney 5 moment for the ChatGPT guys? Time will tell. For now, we would like to take a closer look at this breathtaking novelty.

‘Sora’ means ‘sky’ in Japanese, as in ‘the sky is the limit’, of course, but did you know it has at least two other meanings, both of which will surely make the AI skeptics a bit happier than they were on launch day? Depending on the context, Sora can also mean “emptiness” or “void.” Pick your favorite and come on in.

Like its older brother ChatGPT, Sora is built on a transformer architecture, but it is not a language model: OpenAI describes it as a diffusion model, one that starts from noise and refines it step by step into a video. Its special skill is transforming text into video. Nothing new, you might think, and you would probably be right, but OpenAI shocked everyone with the hyper-realism and incredibly high quality of the videos its generative AI returns.

A single prompt like this:

“A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.”

Returned a video like this:

Yeah, we know her legs switch sides after 15+ seconds, but either way, it is a jaw-dropping result. There are more fascinating clips out there. Since the launch, the company has been sharing more and more amazing videos created with its new engine. You can check them all out on the official X profile, on OpenAI's website, or among the ones shared by CEO Sam Altman. They are cinematic, right out of the box. Incredible, to say the least.

prompt: The camera rotates around a large stack of vintage televisions all showing different programs — 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery.

The length, the quality, the character and scene consistency, the everything… However, some “out-of-the-void” voices are calling for extra caution before Sora gets any closer to near-perfect. Apart from the broader ethical and societal implications, there are serious concerns that some people may use Sora for propaganda and misinformation, especially with 2024’s potentially fraught US election cycle ahead. Luckily, it’s not up to politicians to call the shots: OpenAI said it was taking important safety steps before making Sora widely available to the public.

“We are working with red teamers – domain experts in areas like misinformation, hateful content, and bias – who will be adversarially testing the model,” the company said.

For now, Sora is still available only to selected testers, and nobody really knows the real power of the software OpenAI presented. All that glitters is not gold, after all. On top of that, we know barely anything about which images and videos were used to train Sora. We are looking forward to seeing what happens when the general public gets their hands on it. This is going to be a year of video!

In the meantime, while waiting for Sora to reveal its papers, we stumbled upon a really interesting analysis of how it works, put together by Brett Goldstein: not the one from Ted Lasso, but a former M&A guy at Google, an investor, and an entrepreneur. In his latest thread on X, Brett breaks the software down in a very smart and thorough way.

“Sora’s video quality seems impossible so I dug into how it works under the hood it uses both diffusion (starting with noise, refining towards a desired video) and transformer architectures (handling sequential video frames)”

– Brett on X

source: X

This is just the introduction. Brett goes wild on this, and it must have been really hard to put together. Well done, Sir! You definitely want to see this, so head over to his thread on X.
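
If “starting with noise, refining towards a desired video” sounds abstract, here is a tiny toy sketch of the idea in Python. To be clear, this is not how Sora is implemented: the function names, the five-number “frame”, and the fixed step size are all made up for illustration. A real diffusion model uses a trained neural network to predict what noise to remove, while this toy simply cheats and peeks at the target.

```python
import random


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def refine_from_noise(target, steps=50, rate=0.2, seed=0):
    """Toy illustration: start from random noise and nudge it toward `target`.

    Assumption: the 'denoiser' here is just a fixed pull toward the known
    target. A real diffusion model has no access to the target and relies on
    a trained network instead.
    """
    rng = random.Random(seed)
    # Step 0: pure Gaussian noise, one value per "pixel".
    frame = [rng.gauss(0.0, 1.0) for _ in target]
    history = [mean_abs_error(frame, target)]
    for _ in range(steps):
        # Remove a fraction of the remaining "noise" at every step.
        frame = [x + rate * (t - x) for x, t in zip(frame, target)]
        history.append(mean_abs_error(frame, target))
    return frame, history


# A hypothetical five-pixel "frame" we want to end up with.
target = [0.0, 0.25, 0.5, 0.75, 1.0]
frame, history = refine_from_noise(target)
print(f"error at start: {history[0]:.3f}, error at end: {history[-1]:.6f}")
```

Run it and the error shrinks with every step: the picture emerges from static a little at a time rather than being drawn in one go. That gradual clean-up is the intuition behind the “diffusion” half of Brett's description.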