

GPT-5.4, Gemini 3 Flash, and the New Reality Check for “Research Agents” (Mar 2026)
GPT-5.4’s launch pairs frontier capability claims with explicit risk classification, a dedicated system card, and a firm retirement date for GPT-5.2 Thinking. Meanwhile, Gemini 3 Flash intensifies benchmark-driven competition in agentic coding, and new research benchmarks like PaperBench and MLR-Bench quantify how often “research agents” fail, often by fabricating results. With EU AI Act timelines and California SB 53 now in force, documentation and reproducibility are becoming…
2 hours ago · 4 min read


Agentic “Computer Use” Becomes the New Battleground: GPT‑5.4, OSWorld‑Verified, and Fast Model Turnover
Recent releases and documentation show a clear pivot from chat-centric models to agentic “computer use.” GPT‑5.4’s 75.0% and Anthropic’s 72.5% on OSWorld‑Verified make benchmarks a public scoreboard, but not a substitute for internal evaluation. Meanwhile, OpenAI’s ChatGPT model retirements and Realtime API deprecations highlight faster model turnover, raising the operational bar for teams shipping tool-using agents.
2 hours ago · 4 min read


Mar 2025–Mar 2026: Frontier AI Updates That Actually Improved Reproducibility (and What Teams Should Do Next)
From Feb 2026 system cards (OpenAI GPT-5.3‑Codex; Anthropic Claude Opus 4.6) to Google DeepMind’s Gemini 3 benchmark tables and the Gemini 3.1 Pro model card, the last year’s frontier-model story is unusually documentation-heavy. In parallel, the EU AI Act timeline and new governance artifacts (the EU GPAI Code of Practice; NIST’s draft Cyber AI Profile) are turning “responsible AI” into evidence-driven engineering. This post focuses on what changed, why it matters now, and what teams should do next.
3 hours ago · 4 min read


The Evaluation Gap Gets Named: What Recent Benchmarks, Leaderboards, and NIST Guidance Change for AI Teams (2025–2026)
In the past year, the most important shift isn’t a single new model: it’s a clearer admission that our evaluations don’t match reality. The International AI Safety Report 2026 names an “evaluation gap,” while new efforts like ACE and TabArena push toward grounded, reproducible, continuously maintained measurement. Meanwhile, LMArena’s changelog underscores that leaderboard movement needs provenance, and NIST’s draft Cyber AI Profile signals that AI security expectations are about to…
4 hours ago · 4 min read


