“Agent” became the word of the year in AI, and like most words that get that popular, it stopped meaning much. A polished launch video shows software booking a trip, filing an expense report, and refactoring a codebase while you sip coffee. The reality on most teams is quieter, and more useful, than that.
What actually works right now
The agents earning their keep today are narrow. They do one job inside one system with a clear definition of done: triage a support inbox, reconcile a spreadsheet against a database, draft a first pass at a report from a template. When the task is bounded and the tools are reliable, an agent that can plan a few steps and check its own work is a real productivity gain.
Where they still fall over
Open-ended tasks are where the demos and the reality split. The longer the chain of steps, the more chances there are for a small mistake to compound into a confidently wrong result. Agents still struggle to notice when they’re stuck, to ask for help at the right moment, and to recover gracefully from a tool that returns something unexpected.
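A quick back-of-the-envelope sketch shows how fast that compounding bites. It assumes every step succeeds independently with the same probability, which real agents rarely do, and the 95% per-step figure is an illustrative assumption rather than a measurement. The function name `chain_success_rate` is made up for this example.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # 0.95 is an assumed per-step reliability, chosen only for illustration.
    for steps in (1, 5, 10, 20, 50):
        rate = chain_success_rate(0.95, steps)
        print(f"{steps:>2} steps: {rate:.0%} chance of a clean run")
```

Under those assumptions, a 10-step chain finishes cleanly about 60% of the time and a 20-step chain a bit over a third of the time, even though each individual step looks reliable.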
How to read a demo
Two questions cut through most of the hype. First, was the demo run once and recorded, or does it work on the tenth try with messy real input? Second, what happens when it fails: does it stop and flag the problem, or does it quietly hand you something broken? An honest answer to those two questions tells you more than any benchmark.
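If you can get hands-on access, both questions can be turned into a small tally. The sketch below is one possible way to do it, not a standard harness: `score_runs`, `AgentFlaggedProblem`, and `toy_agent` are hypothetical names, and the toy agent is a stand-in for whatever system you are evaluating. It runs the agent over several messy inputs and sorts each run into "ok", "flagged" (it stopped and said so), or "silent_failure" (it handed back something wrong without complaint).

```python
from collections import Counter
from typing import Callable, Iterable


class AgentFlaggedProblem(Exception):
    """Hypothetical signal: the agent stopped and reported it could not finish."""


def score_runs(
    run_agent: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    inputs: Iterable[str],
) -> Counter:
    """Tally each run as 'ok', 'flagged' (failed loudly), or 'silent_failure'."""
    tally: Counter = Counter()
    for item in inputs:
        try:
            output = run_agent(item)
        except AgentFlaggedProblem:
            tally["flagged"] += 1
            continue
        tally["ok" if is_correct(item, output) else "silent_failure"] += 1
    return tally


def toy_agent(text: str) -> str:
    """Stand-in agent: upper-cases its input, with two deliberate failure modes."""
    cleaned = text.strip()
    if not cleaned:
        # Loud failure: the agent refuses and says why.
        raise AgentFlaggedProblem("empty input")
    if "\t" in cleaned:
        # Quiet failure: returns the text without doing the job.
        return cleaned
    return cleaned.upper()


if __name__ == "__main__":
    messy_inputs = ["refund request", "  ", "late\tinvoice", "password reset"]
    tally = score_runs(toy_agent, lambda i, o: o == i.strip().upper(), messy_inputs)
    print(dict(tally))
```

The number to watch is `silent_failure`. A high `flagged` count is annoying, but a human can pick up those cases. A silent failure is the one that ends up in front of a customer.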