MoE - AI Infrastructure Intelligence Search

NVIDIA Other High Signal 2026-04-15

NVIDIA Shifts AI Infrastructure Metric from FLOPS to Cost Per Token

NVIDIA advocates for "cost per token" as the primary economic metric for AI infrastructure, replacing "FLOPS per dollar." This shift moves the focus from computational inputs to business outputs, requiring full-stack optimization across hardware, software, and networking to lower enterprise AI inference TCO.
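
To make the contrast between the two metrics concrete, here is a minimal sketch of the arithmetic. The function names and all numbers are illustrative assumptions, not NVIDIA figures. It shows that two systems with the same hourly cost and the same peak FLOPS can still differ in cost per token once delivered throughput is taken into account.

```python
def flops_per_dollar(peak_flops: float, hourly_cost_usd: float) -> float:
    """Input-side metric: peak floating-point operations bought per dollar."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly_cost_usd must be positive")
    return peak_flops * 3600 / hourly_cost_usd


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Output-side metric: dollars spent per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd * 1_000_000 / tokens_per_hour


# Hypothetical systems: same price and peak FLOPS, different delivered throughput
# (for example because of software or networking differences).
baseline = cost_per_million_tokens(hourly_cost_usd=36.0, tokens_per_second=10_000)
optimized = cost_per_million_tokens(hourly_cost_usd=36.0, tokens_per_second=20_000)
```

Under these assumed numbers the FLOPS-per-dollar figure is identical for both systems, while the cost per million tokens halves for the one with doubled throughput.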

Google Other High Signal 2026-04-03

Google Launches Gemma 4 Open Models, Targeting Edge Inference and AI Agent Architecture

Google introduces the Gemma 4 open model family, with four sizes from 2B to 31B parameters, emphasizing breakthrough intelligence-per-parameter and native support for agentic workflows, multimodality, and long context. The small models are engineered for edge devices, aiming to bring frontier reasoning to mobile and IoT scenarios.

Google Other Medium Signal 2026-04-03

Google Launches Gemma 4 Open Model Family

Google introduces the Gemma 4 open model family in four size variants, optimized for edge and mobile devices. The series supports multimodal processing, long context windows, and 140+ languages, and is released under the Apache 2.0 license.

NVIDIA Other High Signal 2026-03-12

NVIDIA Launches Nemotron 3 Super for Agentic AI Inference Optimization

NVIDIA releases Nemotron 3 Super, a 120B-parameter model with a hybrid MoE architecture that combines Mamba and Transformer layers and delivers a 5x throughput improvement. It is designed for multi-agent workflows, with a 1M-token context window intended to prevent task drift. Open weights and cloud deployment lower enterprise adoption barriers.

NVIDIA Other 2025-06-01

NVIDIA RTX Spark and Nemotron-3 Ultra: AI Control Shifts from Cloud to Personal Edge

NVIDIA launched the RTX Spark personal AI supercomputer (co-developed with MediaTek) and the Nemotron-3 Ultra open-source model at GTC Taipei 2026. The N1X chip delivers 1 PFLOPS of local AI compute, bringing LLM inference to PCs. This marks NVIDIA's pivot from cloud GPU vendor to edge AI infrastructure monopolist, redefining the PC as an AI-native device.

Research Other

Z.ai GLM-5.2 Open-Source: 744B MoE, 1M Context, MIT License as Geopolitical Shield

Z.ai releases GLM-5.2, a 744B-parameter MoE model with 40B activated parameters, 1M tokens of input context and 131K tokens of output context, under the MIT license. Released one day after Anthropic Fable 5's government takedown, it offers a downloadable, unbannable alternative with Anthropic API compatibility for zero-code migration, giving enterprises a sovereign AI option.
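
For a sense of scale, the sketch below works out what the 744B total and 40B activated parameter counts imply. The helper names and the 2-bytes-per-parameter (BF16) storage assumption are illustrative and do not come from Z.ai.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for each token in an MoE model."""
    return active_params / total_params


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


TOTAL = 744e9   # total parameters, from the summary above
ACTIVE = 40e9   # activated parameters per token, from the summary above

fraction = active_fraction(TOTAL, ACTIVE)            # roughly 5.4% of weights per token
full_weights_gb = weight_memory_gb(TOTAL, 2)         # assumes BF16 (2 bytes per parameter)
active_weights_gb = weight_memory_gb(ACTIVE, 2)      # weights touched per token, same assumption
```

Per-token compute scales with the activated parameters, but all 744B weights still have to be stored somewhere, which is why expert placement and prefetching matter for serving models of this kind.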

NVIDIA Other

SGLang 0.5.13: Two-Stage MoE Routing Prefetch & Sparse KV Cache Deliver 25x Inference Speedup

SGLang 0.5.13 introduces two MoE-specific optimizations: two-stage routing prefetch, in which a lightweight proxy network predicts the top-k experts so that their weights can be preloaded, and a sparse KV cache grouped by activation path. Together they achieve a 25x inference speedup on NVIDIA GB300 NVL72. On A100, throughput rises 65%, latency falls 40%, memory use falls 10%, and routing overhead falls 62%, outperforming vLLM.
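
The general idea of a two-stage routing prefetch can be illustrated with a toy simulation. This is a sketch under stated assumptions, not SGLang's implementation or API: the `ExpertCache` class, the `route_token` and `simulate` functions, and the choice of a proxy that scores experts using only the first few input features are all hypothetical.

```python
import random


def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ExpertCache:
    """Toy stand-in for fast memory that holds a few experts' weights."""

    def __init__(self):
        self.resident = set()
        self.hits = 0
        self.misses = 0

    def prefetch(self, expert_ids):
        # Replace the resident set with the experts the proxy predicted.
        self.resident = set(expert_ids)

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
        else:
            self.misses += 1  # would be a slow on-demand load in a real system
            self.resident.add(expert_id)


def route_token(token, router_w, proxy_dims, k, prefetch_k, cache):
    # Stage 1: cheap proxy scores from a truncated view of the input (assumption).
    proxy_scores = [dot(token[:proxy_dims], w[:proxy_dims]) for w in router_w]
    cache.prefetch(top_k(proxy_scores, prefetch_k))
    # Stage 2: the full router makes the actual top-k decision.
    full_scores = [dot(token, w) for w in router_w]
    chosen = top_k(full_scores, k)
    for expert_id in chosen:
        cache.fetch(expert_id)
    return chosen


def simulate(num_tokens=200, dim=16, num_experts=8, k=2, prefetch_k=3,
             proxy_dims=8, seed=0):
    """Return (hits, misses) for random tokens and random router weights."""
    rng = random.Random(seed)
    router_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    cache = ExpertCache()
    for _ in range(num_tokens):
        token = [rng.gauss(0, 1) for _ in range(dim)]
        route_token(token, router_w, proxy_dims, k, prefetch_k, cache)
    return cache.hits, cache.misses
```

Calling `simulate()` returns the number of expert lookups served from prefetched weights and the number that missed. Prefetching slightly more experts than the router selects (`prefetch_k` greater than `k`) is one way to trade memory for a higher hit rate when the proxy is imperfect.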
