SambaNova breaks Llama 3 speed record with 1,000 tokens per second

There is no single speedometer for a generative AI model, but one of the leading approaches is to measure how many tokens per second a model handles.
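As a rough illustration of what the metric means, the sketch below computes tokens per second from a token count and an elapsed time, and the time a steady rate implies for a response of a given length. The function names and the 500-token response are hypothetical, chosen for this example. It also ignores factors such as time to first token, so it is not how any particular benchmark is run.

```python
def tokens_per_second(output_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock generation time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return output_tokens / elapsed_seconds


def generation_time(output_tokens: int, rate_tokens_per_second: float) -> float:
    """Seconds needed to produce a response of the given length at a steady rate."""
    if rate_tokens_per_second <= 0:
        raise ValueError("rate_tokens_per_second must be positive")
    return output_tokens / rate_tokens_per_second


# Hypothetical example: a 500-token answer generated in half a second.
rate = tokens_per_second(500, 0.5)   # 1000.0 tokens per second
seconds = generation_time(500, 800)  # 0.625 seconds at 800 tokens per second
```

At these rates the difference for a single short answer is a fraction of a second. The gap matters more when many responses are chained together or served in high volume.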

Today, SambaNova Systems announced that it has achieved a new milestone in gen AI performance, hitting a whopping 1,000 tokens per second with the Llama 3 8B parameter instruct model. Until now, the fastest benchmark for Llama 3 had been claimed by Groq at 800 tokens per second. The 1,000 tokens-per-second milestone was independently validated by testing firm Artificial Analysis. The faster speed has numerous enterprise implications that can potentially lead to significant business benefits, such as faster response times, better hardware utilization and lower costs.

“We are seeing the AI chip race accelerate at a faster rate than most expected and we were excited to validate SambaNova’s claims in our benchmarks which were conducted independently and which focus on benchmarking real-world performance,” George Cameron, Co-Founder at Artificial Analysis, told VentureBeat. “AI developers now have more hardware options to choose from and it is particularly exciting for those with speed-dependent use-cases including AI agents, consumer AI applications which demand low response times and high volume document interpretation.”

How SambaNova uses software and hardware to accelerate Llama 3 and gen AI

SambaNova is an enterprise-focused gen AI vendor, with both hardware and software assets.

On the hardware side, the company develops a type of AI chip it refers to as a reconfigurable dataflow unit (RDU). An RDU, much like an Nvidia AI accelerator, can be used for both training and inference. SambaNova has a specific focus on enabling its RDU for enterprise workloads and fine-tuning of models. The company’s latest chip is the SN40L, which was announced in Sept. 2023.

On top of the silicon, SambaNova has built out its own software stack, which includes the Samba-1 model, first released on Feb. 28. Samba-1 is a 1-trillion parameter model that is also known as Samba-CoE (Combination of Experts). The CoE approach enables enterprises to use multiple models in combination, or alone, and have the model fine-tuned and trained on corporate data.

For the 1,000 tokens-per-second speed, SambaNova actually used its Samba-1 Turbo model, which is just the API version that has been made available for testing. The company plans on incorporating the speed updates into its mainline model for enterprise in the coming weeks. Cameron cautioned that Groq’s measurement of 800 tokens per second is of its public API shared endpoint, while SambaNova’s is of a dedicated private endpoint. As such, he noted that his firm does not suggest comparing them directly, as it is not exactly apples to apples.

“That being said, this is more than 8X the median output tokens/s speed of the API providers we benchmark, and is multiple times faster than the typical output tokens/s speed achievable on Nvidia H100s,” Cameron said.

Reconfigurable dataflow enables iterative optimization

The key to SambaNova’s performance is its reconfigurable dataflow architecture, which is at the heart of the company’s RDU silicon technology.

The reconfigurable dataflow architecture enables SambaNova to optimize resource allocation for individual neural network layers and kernels through compiler mapping.

“With dataflow you can continuously improve on the mappings of these models, because it’s fully reconfigurable,” Rodrigo Liang, CEO and Founder of SambaNova, told VentureBeat. “So you’re able to get not incremental, but fairly significant gains, both in terms of efficiency and in terms of performance as your software improves.”

When Llama 3 first came out, Liang’s team ran it and initially had a performance of 330 tokens per second on Samba-1. Liang said that through a series of optimizations over the last few months, that speed has tripled to the current high of 1,000 tokens per second. Liang explained that optimization was a process of balancing resource allocation between kernels to avoid bottlenecks and maximize throughput across the entire neural network pipeline. It’s the same basic approach that SambaNova uses as part of its software stack to help enterprises optimize their own fine-tuning efforts.
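The general idea of balancing resources across pipeline stages can be sketched with a toy model. This is not SambaNova’s compiler or its actual method. It assumes a simple pipeline in which each stage’s time is its work divided by the compute units assigned to it, so the slowest stage sets the pace for the whole pipeline. The function names, the three-stage workload and the unit counts are all hypothetical.

```python
def pipeline_throughput(work, units):
    """Items per unit time for a pipeline whose stages run concurrently.

    Toy assumption: stage i takes work[i] / units[i], and the slowest
    stage (the bottleneck) limits the whole pipeline.
    """
    stage_times = [w / u for w, u in zip(work, units)]
    return 1.0 / max(stage_times)


def balanced_allocation(work, total_units):
    """Split total_units in proportion to each stage's work,
    so that every stage takes the same amount of time."""
    total_work = sum(work)
    return [total_units * w / total_work for w in work]


# Hypothetical three-stage pipeline where the middle stage is the heaviest.
work = [1.0, 4.0, 1.0]
even = [4.0, 4.0, 4.0]                      # same 12 units, split evenly
balanced = balanced_allocation(work, 12.0)  # [2.0, 8.0, 2.0]

even_rate = pipeline_throughput(work, even)          # limited by the middle stage
balanced_rate = pipeline_throughput(work, balanced)  # no single bottleneck
```

In this toy case, the same 12 units deliver twice the throughput once they are shifted toward the heaviest stage. That illustrates why remapping alone, without new hardware, can produce large gains.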

Enterprise quality and customization with more speed

Liang emphasized that SambaNova is using 16-bit precision to reach its speed milestone, which provides the higher level of quality that enterprises demand.

He noted that dropping to 8-bit precision is not an option for enterprise users.

“For our customer base, we’ve been shipping 16-bit, because they care a lot about quality, and we want to make sure that we minimize hallucination.”

The speed is particularly important to enterprise users for several reasons. As organizations increasingly move to AI agent-based workflows, where one model flows into the next, speed matters more than ever. There is also an economic incentive to speed things up.

“The faster we can generate, the more it frees up the machine for other people to use,” he said. “So it’s really ultimately the compaction of the infrastructure to reduce costs.”
