AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
This post describes work by Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, and Hamed Haddadi, from Brave.
Summary
Running LLM-based agents locally protects privacy: your prompts and reasoning traces never leave your device and stay out of cloud logs. But there’s a hidden cost that rarely gets talked about: local AI agents, regardless of the size of their LLMs, can drain the battery fast, sometimes spending enormous amounts of LLM inference, reasoning, and tool calls on runs that never lead anywhere. This matters because it frustrates users and, in many cases, even makes them anxious. To address this, we designed AgentStop, a lightweight efficiency supervisor that monitors the agent’s LLM backend in real time, predicts when the agent is heading toward wasted computation, and terminates unpromising inference chains before they drain your battery.
AgentStop has been accepted at the 1st ACM Conference on AI and Agentic Systems (ACM CAIS 2026), comes with a fully open-source implementation, and has been awarded three reproducibility badges by the Artifact Evaluation Committee: Artifact Available, Artifact Functional, and Results Reproduced. AgentStop will be presented at the conference between May 27th and 29th in San Jose, California.
Local AI Agents are Necessary for Privacy
Imagine asking an AI agent to fix a bug in your codebase. If the agent is running in the cloud, your entire codebase leaves your machine and is sent to a third-party server. Local AI agents reduce this privacy risk by keeping inference and data processing on your own device, so sensitive files no longer need to be uploaded to external infrastructure. They also eliminate API costs and reduce dependence on an internet connection.
Recent advances in model efficiency, including 4-bit quantization and Mixture-of-Experts (MoE) architectures 1, have made local deployment far more practical, enabling 30-billion-parameter models to run on consumer hardware such as a 24GB laptop. Truly capable on-device agents are no longer a distant dream. However, running those agents comes at a cost that is easy to overlook until you’re staring at a 20% battery warning.
But Local AI Agents Are Expensive to Run
LLM-based agents differ fundamentally from simple LLM chats in how they consume resources. Agents operate through multi-step cycles in which each step requires a new inference: reasoning, tool calling, taking actions, observing outcomes, and reasoning again. This cycle of inference, tool use, and retries means that agentic workloads consume vastly more compute than a simple LLM chat interaction. Worse, a significant portion of that compute is spent on runs that were never likely to succeed.
In our tests on a MacBook Pro M1 Max (as shown in Figure 1), a single coding task could:
- Last over 10 minutes
- Trigger 30+ separate LLM inference calls
- Push GPU power draw past 40 watts
- Sustain GPU temperatures above 90°C for extended periods
That’s not a quick query. That’s your laptop working as hard as possible for ten minutes straight, on a task that may ultimately fail.
For context, a single failed coding task can drain roughly 3% of a laptop’s entire battery. That might sound small, but consider what it means in practice. If you’re running an agent to help debug a complex piece of software, it might fail five or ten times before finding a working solution, or give up entirely. Those failed attempts alone could cost 15-30% of your battery, before you’ve even seen a useful result.
Web-based question answering is lighter, but the same principle applies: fail ten web search tasks in a row, and you’ve burned through another 3-7% of battery for nothing.
That adds up fast, especially on mobile devices where battery anxiety is already a well-documented, real psychological concern. Research has shown that low battery levels are a meaningful source of stress for many users, a phenomenon sometimes called nomophobia 2, and that this anxiety directly affects how people use and trust their devices 3.
AgentStop: Brave’s First Step Towards Building An Efficiency Supervisor for Agents
AgentStop is a lightweight efficiency supervisor that watches an agent as it works and predicts, early on, whether it’s likely to succeed. If the outlook is bleak, it pulls the plug before the agent wastes any more energy.
The key insight is that agents often signal their own failure without realizing it. By monitoring subtle patterns in the agent’s output, AgentStop can spot a doomed run within the first few steps.
The features it watches include:
- Token log-probabilities: a measure of how “confident” the model is in each word it generates. Lower confidence often correlates with a struggling agent.
- Token counts per step: unusually long chains of reasoning can indicate the agent is going in circles.
- Token overlap between steps: if the agent keeps repeating itself, it’s probably stuck in a loop.
These signals are already produced during normal inference, so collecting them adds virtually zero extra energy cost.
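To make the three signals above concrete, here is a minimal sketch of how per-step features might be computed from the tokens and log-probabilities a backend already returns. The names `StepFeatures`, `token_overlap`, and `extract_features` are hypothetical, and the specific choices (mean and minimum as log-prob aggregates, Jaccard similarity for overlap) are assumptions made for illustration; AgentStop’s actual feature definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class StepFeatures:
    """Illustrative per-step signals (hypothetical container, not AgentStop's)."""
    mean_logprob: float           # average token confidence in this step
    min_logprob: float            # least confident token in this step
    token_count: int              # length of the step's output
    overlap_with_previous: float  # repetition relative to the previous step


def token_overlap(current: list[str], previous: list[str]) -> float:
    """Jaccard overlap between the token sets of two steps (0.0 to 1.0).

    Assumption: overlap is measured on sets of tokens; other measures
    (n-gram overlap, for example) would work just as well for a sketch.
    """
    a, b = set(current), set(previous)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def extract_features(
    tokens: list[str],
    logprobs: list[float],
    previous_tokens: list[str],
) -> StepFeatures:
    """Summarise one agent step from data the LLM backend already emits."""
    if not tokens or len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must be non-empty and aligned")
    return StepFeatures(
        mean_logprob=sum(logprobs) / len(logprobs),
        min_logprob=min(logprobs),
        token_count=len(tokens),
        overlap_with_previous=token_overlap(tokens, previous_tokens),
    )
```

Because everything here is simple arithmetic over values produced during decoding, the extra cost is negligible next to the inference itself.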
AgentStop trains a gradient-boosted decision tree model (using XGBoost) on a labelled dataset of successful and failed agent runs. The model is deliberately lightweight: each inference costs less than 0.01 mWh, so the supervisor doesn’t undo its own savings.
When deployed, a single classifier runs once after each agent step and returns a simple verdict: keep going or stop now.
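The per-step verdict can be sketched as a small loop. This is an illustration under stated assumptions, not AgentStop’s implementation: `supervise`, `toy_predictor`, and the 0.5 threshold are hypothetical, and the toy predictor merely stands in for the trained classifier so that the example runs without any dependencies.

```python
from typing import Callable, Iterable, Sequence

Features = Sequence[float]


def supervise(
    step_features: Iterable[Features],
    predict_success: Callable[[Features], float],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> tuple[str, int]:
    """Check the classifier once after every agent step.

    Returns ("stop", step) at the first step whose predicted success
    probability falls below the threshold, or ("completed", steps_run)
    if the run is never terminated.
    """
    steps_run = 0
    for steps_run, features in enumerate(step_features, start=1):
        if predict_success(features) < threshold:
            return "stop", steps_run
    return "completed", steps_run


def toy_predictor(features: Features) -> float:
    """Stand-in for a trained classifier: trusts steps with a high mean log-prob."""
    mean_logprob = features[0]
    return 0.9 if mean_logprob > -1.0 else 0.2


# The third step looks unconfident, so the run is cut short there.
verdict = supervise([[-0.2], [-0.5], [-2.0], [-0.1]], toy_predictor)
print(verdict)
```

In a real deployment, `predict_success` would wrap the trained model, and the threshold would be tuned to trade energy savings against the risk of stopping a run that would have succeeded.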
AgentStop Reduces Energy Consumption at Minimal Utility Cost
We evaluated AgentStop across two representative tasks:
Web-Based Question Answering
We tested on two datasets: FRAMES (824 multi-hop reasoning questions) and SimpleQA (4,326 factual questions), with agents powered by Qwen3-30B-A3B and given access to web search via the Brave Search API.
| Dataset | Exit Step Number | Energy Wastage Reduction | Task Utility Drop |
|---|---|---|---|
| FRAMES | 5 | ~22% | <2% |
| SimpleQA | 4 | ~23% | <2% |
On both datasets, AgentStop outperforms simpler baselines (random stopping, min log-prob thresholding, mean log-prob thresholding), particularly in the critical early steps where intervention is most valuable.
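For reference, the three baselines named above are simple enough to sketch in a few lines. The function names and threshold values below are hypothetical, and the exact baseline definitions used in the evaluation may differ; the point is only that each one relies on a single signal or on no signal at all, whereas AgentStop combines several.

```python
import random


def stop_by_mean_logprob(logprobs: list[float], threshold: float) -> bool:
    """Stop when the average token log-probability of a step is below the threshold."""
    return sum(logprobs) / len(logprobs) < threshold


def stop_by_min_logprob(logprobs: list[float], threshold: float) -> bool:
    """Stop when the least confident token of a step is below the threshold."""
    return min(logprobs) < threshold


def stop_at_random(stop_probability: float, rng: random.Random) -> bool:
    """Stop with a fixed probability, ignoring the agent's output entirely."""
    return rng.random() < stop_probability


step_logprobs = [-0.5, -1.5]  # made-up values for illustration
print(stop_by_mean_logprob(step_logprobs, threshold=-0.8))  # mean is -1.0
print(stop_by_min_logprob(step_logprobs, threshold=-2.0))   # min is -1.5
```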
Coding
Coding is a much harder problem. Our agent, powered by Qwen3-Coder-30B-A3B, achieves an 18.8% success rate on the 500-task SWE-Bench Verified benchmark, competitive with GPT-4o’s 21.2% in the same setting. Failed runs are expensive: a single failed coding attempt can cost nearly 3,000 mWh, roughly 3% of a 100Wh laptop battery.
| Dataset | Exit Step Number | Energy Wastage Reduction | Task Utility Drop |
|---|---|---|---|
| SWE-Bench Verified | 5 | ~19% | ~3% |
Notably, around 60% of total energy consumption occurs within the first 10 agent steps, meaning early intervention has an outsized impact. AgentStop achieves 0.6-0.7 AUC in classifying success vs. failure within those first 10 steps, which is where it matters most.
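The battery arithmetic behind the coding figures is easy to check. The sketch below uses only numbers quoted in this post (about 3,000 mWh per failed coding run, a 100 Wh battery, under 0.01 mWh per supervisor call, and roughly 30 inference calls per task). It treats "nearly 3,000 mWh" as exactly 3,000, so the results are upper-bound estimates, and it assumes one supervisor call per inference call. The helper name `battery_fraction` is hypothetical.

```python
BATTERY_WH = 100.0          # laptop battery capacity quoted in the post
FAILED_RUN_MWH = 3000.0     # "nearly 3,000 mWh" per failed coding attempt
SUPERVISOR_CALL_MWH = 0.01  # upper bound per classifier call
STEPS_PER_TASK = 30         # assumption: one supervisor call per LLM call


def battery_fraction(energy_mwh: float, battery_wh: float = BATTERY_WH) -> float:
    """Fraction of a full battery consumed by the given energy (1 Wh = 1000 mWh)."""
    return energy_mwh / (battery_wh * 1000.0)


per_failure = battery_fraction(FAILED_RUN_MWH)
five_failures = 5 * per_failure
ten_failures = 10 * per_failure
supervisor_overhead = battery_fraction(STEPS_PER_TASK * SUPERVISOR_CALL_MWH)

print(f"one failed run:      {per_failure:.1%}")
print(f"five failed runs:    {five_failures:.1%}")
print(f"ten failed runs:     {ten_failures:.1%}")
print(f"supervisor overhead: {supervisor_overhead:.6%} per task")
```

Even when it is called after every step, the supervisor’s own energy use is several orders of magnitude smaller than the cost of a single failed run.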
Across both tasks, the results are consistent: AgentStop recovers 15-20% of wasted energy with less than a 5% drop in task completion, using a supervisor that costs almost nothing to run.
As local AI agents become more capable and autonomous, efficiency will matter just as much as intelligence. Local AI agents already protect your privacy, eliminate API costs, and reduce dependence on an internet connection; now, AgentStop is an early step toward making on-device agents not only private and useful, but also energy-aware.
The code and datasets for AgentStop are available at https://github.com/brave-experiments/AgentStop.

