As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model's weights. Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model's existing architecture.

The limits of next-token prediction

Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “cha [...]
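To make that ceiling concrete, here is a minimal sketch of the standard next-token decoding loop, assuming a Hugging Face-style causal LM (gpt2 is a stand-in, not the team's model): every new token requires a full forward pass, so a thousand-token answer costs a thousand passes.

```python
# Minimal sketch of the next-token-prediction loop (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Step one of the proof is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(50):                    # one forward pass per token
        logits = model(ids).logits
        next_id = logits[0, -1].argmax()   # greedy pick of a single token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))
```

Any scheme that amortizes more than one generated token per forward pass attacks this loop directly, which is where a throughput multiplier would come from.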
Baseten, the AI infrastructure company recently valued at $2.15 billion, is making its most significant product pivot yet: a full-scale push into model training that could reshape how enterprises wean [...]
For more than three decades, modern CPUs have relied on speculative execution to keep pipelines full. When it emerged in the 1990s, speculation was hailed as a breakthrough — just as pipelining and [...]
Enterprises expanding AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.

Speculators are smaller AI models that w [...]
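For context, the draft-and-verify loop that a speculator powers looks roughly like the sketch below. It uses greedy acceptance for brevity (production systems use rejection sampling so outputs match the target model's distribution exactly), and `draft` and `target` are placeholders rather than any vendor's API.

```python
# Sketch of speculative decoding's draft-and-verify step (illustrative).
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    # 1) The small speculator drafts k tokens cheaply, one at a time.
    drafted = ids
    for _ in range(k):
        nxt = draft(drafted).logits[0, -1].argmax()
        drafted = torch.cat([drafted, nxt.view(1, 1)], dim=1)
    # 2) The large target model scores all k drafts in one forward pass.
    logits = target(drafted).logits
    preds = logits[0, ids.shape[1] - 1 : -1].argmax(-1)  # target's picks
    # 3) Accept the longest prefix where target and draft agree.
    agree = (preds == drafted[0, ids.shape[1]:]).int().cumprod(0)
    return drafted[:, : ids.shape[1] + int(agree.sum())]
```

The acceptance rate in step 3 is exactly what degrades when a static speculator drifts away from the live workload: fewer drafted tokens survive verification, and the speedup evaporates.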
Lowering the cost of inference typically takes a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions [...]
Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin. The [...]
Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique [...]
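A back-of-the-envelope view of why costs spiral rather than grow linearly: self-attention does work on every pair of tokens, so its FLOPs scale with the square of context length. The model dimensions below are assumed for illustration, not figures from the Tsinghua/Z.ai work.

```python
# Rough arithmetic: attention cost vs. context length (illustrative).
def attn_flops(seq_len, d_model=4096, n_layers=32):
    # ~2*L^2*d per layer for Q·K^T, and the same again for scores·V
    return n_layers * 4 * seq_len**2 * d_model

print(f"{attn_flops(20_000):.1e}")    # ~2.1e14 FLOPs
print(f"{attn_flops(200_000):.1e}")   # ~2.1e16 FLOPs: 10x longer, 100x cost
```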
The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time [...]
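As a rough illustration of the tradeoff those guidelines miss (our back-of-the-envelope accounting, not the study's own formulation), a model's lifetime compute splits into a training term and an inference term:

```latex
% N = parameters, D = training tokens, T = tokens served over deployment.
% Standard approximations: ~6 FLOPs per parameter per training token,
% ~2 FLOPs per parameter per generated token.
C_{\text{total}} \approx \underbrace{6ND}_{\text{training}} + \underbrace{2NT}_{\text{inference}}
```

Compute-optimal recipes such as Chinchilla minimize only the first term for a fixed budget; once lifetime traffic T grows large, a smaller model trained on more data can win on total cost.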
A new study from researchers at Stanford University and Nvidia proposes a way for AI models to keep learning after deployment — without increasing inference costs. For enterprise agents that have to [...]
Market researchers have embraced artificial intelligence at a staggering pace, with 98% of professionals now incorporating AI tools into their work and 72% using them daily or more frequently, according [...]