Together AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments. (ReadTogether AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments. (Read

Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware

2026/04/02 03:17
Okuma süresi: 4 dk
Bu içerikle ilgili geri bildirim veya endişeleriniz için lütfen crypto.news@mexc.com üzerinden bizimle iletişime geçin.

Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware

Timothy Morano Apr 01, 2026 19:17

Together AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments.

Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware

The team behind FlashAttention has quietly become one of the most consequential groups in AI infrastructure. Together AI's kernel research unit, now about 15 engineers strong, is solving a problem most people don't even know exists: the massive performance gap between AI models and the hardware running them.

Their latest win? Taking a voice AI company's time-to-first-token from 281ms down to 77ms—a 3.6x improvement that translated to 7.2x better unit economics.

The Hidden Bottleneck

Here's what most AI discourse misses: having great models and expensive GPUs doesn't guarantee performance. The bottleneck sits in between—the kernel layer that translates mathematical operations into actual silicon instructions.

"The gap between what researchers design and what actually runs fast on hardware is vast," explains Dan Fu, who leads a parallel research lab at UCSD. Get kernels right and you unlock hardware's full potential. Get them wrong and your expensive GPUs sit partially idle.

For companies building AI-native products, this isn't academic. When inference costs run 2x higher than necessary, or when latency breaks the user experience, kernel optimization becomes existential.

One Week Versus One Year

The team's capabilities showed clearly when NVIDIA's Blackwell GPUs arrived in March 2025. NVIDIA had spent a year with dozens of engineers optimizing kernels for the new architecture. Together AI had a week.

Their secret weapon: ThunderKittens, a library developed with Stanford researchers that reduces kernel code from 1,000+ lines of CUDA to roughly 100-200 lines. The abstraction layer is built around NVIDIA's tensor cores, the specialized matrix multiplication units on modern GPUs.

Within seven days of hardware access, the team had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, achieving up to 2x speedups over cuBLAS on H100s.

Real-World Impact

The voice AI case study illustrates what this means in production. The customer had a hard constraint: time-to-first-64-tokens above roughly 100ms breaks conversational flow. Their B200 deployment was hitting 281ms.

Together's team hand-optimized a "Megakernel" implementation—running an entire model in a single kernel, targeting the HBM bandwidth ceiling of NVIDIA H100s. Results on Llama-3.2-1B: 77ms. On Qwen 2.5 1.5B: 127ms, down from 292ms.

The approach traces back to FlashAttention's original insight. That Memorial Day 2022 paper proved the AI establishment wrong about attention being fully optimized. By applying database systems principles—data locality, memory hierarchies—to transformer attention, the team achieved 2-3x speedups where previous sparsity methods showed only 10% real gains.

Academic-Industry Pipeline

The team operates through an unusual model. Dan Fu runs his UCSD lab on higher-risk fundamental research. Together AI co-founder Tri Dao is at Princeton. Simran Arora is at Caltech. Ideas get de-risked in academia, then productionized at Together AI. PhD students join the company. Interns work on longer-term research in academic labs.

This produces engineers who bridge theory and production—people who, as Fu puts it, "lose sleep over memory access patterns" and "find beauty in data flow diagrams."

The work isn't glamorous. No announcements when a kernel optimization lands. Just faster training times, lower costs, higher throughput. But these margins determine whether AI-native products feel instant or sluggish, whether unit economics work or don't, whether companies scale to millions of users or plateau at thousands.

For enterprise AI deployments where every millisecond matters—and every percentage point of efficiency translates to significant cost savings—this invisible infrastructure layer may be where the real competitive advantage lies.

Image source: Shutterstock
  • together ai
  • gpu optimization
  • nvidia
  • ai infrastructure
  • machine learning
Piyasa Fırsatı
Major Logosu
Major Fiyatı(MAJOR)
$0.06036
$0.06036$0.06036
+0.54%
USD
Major (MAJOR) Canlı Fiyat Grafiği
Sorumluluk Reddi: Bu sitede yeniden yayınlanan makaleler, halka açık platformlardan alınmıştır ve yalnızca bilgilendirme amaçlıdır. MEXC'nin görüşlerini yansıtmayabilir. Tüm hakları telif sahiplerine aittir. Herhangi bir içeriğin üçüncü taraf haklarını ihlal ettiğini düşünüyorsanız, kaldırılması için lütfen crypto.news@mexc.com ile iletişime geçin. MEXC, içeriğin doğruluğu, eksiksizliği veya güncelliği konusunda hiçbir garanti vermez ve sağlanan bilgilere dayalı olarak alınan herhangi bir eylemden sorumlu değildir. İçerik, finansal, yasal veya diğer profesyonel tavsiye niteliğinde değildir ve MEXC tarafından bir tavsiye veya onay olarak değerlendirilmemelidir.

Ayrıca Şunları da Beğenebilirsiniz

Siren (SIREN) Crashes 68% in 24 Hours: On-Chain Data Reveals Selling Pressure

Siren (SIREN) Crashes 68% in 24 Hours: On-Chain Data Reveals Selling Pressure

Siren (SIREN) experienced a catastrophic 68.3% price collapse in 24 hours, falling from $0.807 to $0.245. Our analysis of on-chain data and trading patterns reveals
Paylaş
Blockchainmagazine2026/04/02 05:04
This U.S. politician’s suspicious stock trade just returned over 200% in weeks

This U.S. politician’s suspicious stock trade just returned over 200% in weeks

The post This U.S. politician’s suspicious stock trade just returned over 200% in weeks appeared on BitcoinEthereumNews.com. United States Representative Cloe Fields has seen his stake in Opendoor Technologies (NASDAQ: OPEN) stock return over 200% in just a matter of weeks. According to congressional trade filings, the lawmaker purchased a stake in the online real estate company on July 21, 2025, investing between $1,001 and $15,000. At the time, the stock was trading around $2 and had been largely stagnant for months. Receive Signals on US Congress Members’ Stock Trades Stocks Stay up-to-date on the trading activity of US Congress members. The signal triggers based on updates from the House disclosure reports, notifying you of their latest stock transactions. Enable signal The trade has since paid off, with Opendoor surging to $10, a gain of nearly 220% in under two months. By comparison, the broader S&P 500 index rose less than 5% during the same period. OPEN one-week stock price chart. Source: Finbold Assuming he invested a minimum of $1,001, the purchase would now be worth about $3,200, while a $15,000 stake would have grown to nearly $48,000, generating profits of roughly $2,200 and $33,000, respectively. OPEN’s stock rally Notably, Opendoor’s rally has been fueled by major corporate shifts and market speculation. For instance, in August, the company named former Shopify COO Kaz Nejatian as CEO, while co-founders Keith Rabois and Eric Wu rejoined the board, moves seen as a return to the company’s early innovative spirit.  Outgoing CEO Carrie Wheeler’s resignation and sale of millions in stock reinforced the sense of a new chapter. Beyond leadership changes, Opendoor’s surge has taken on meme-stock characteristics. In this case, retail investors piled in as shares climbed, while short sellers scrambled to cover, pushing prices higher.  However, the stock is still not without challenges, where its iBuying model is untested at scale, margins are thin, and debt tied to…
Paylaş
BitcoinEthereumNews2025/09/18 04:02
DigiByte Price Prediction 2026, 2027 and 2030: Is DGB Ready to See a Pump?

DigiByte Price Prediction 2026, 2027 and 2030: Is DGB Ready to See a Pump?

DigiByte DGB price prediction 2026–2030: $0.004, Arizona reserve bill, DigiDollar testnet, Taproot upgrade. Can DGB pump? Full honest analyst forecast 2026.
Paylaş
Blockchainreporter2026/04/02 05:00

Trade GOLD, Share 1,000,000 USDT

Trade GOLD, Share 1,000,000 USDTTrade GOLD, Share 1,000,000 USDT

0 fees, up to 1,000x leverage, deep liquidity