
Your AI agent is live. The features are great. The month-end bill is terrifying.

This is the new challenge for AI engineers in 2025-2026: LLM API pricing has dropped, but agent token usage scales multiplicatively. A “simple” workflow agent can rack up hundreds of dollars per day.

TL;DR

AI agent billing spikes have three root causes: (1) using a stronger (and more expensive) model than needed, (2) no depth limit on tool call loops causing infinite iterations, (3) passing the full conversation history on every round causing token waste. The solution isn’t sacrificing quality — it’s precise matching of task complexity to model capability.

Prerequisites

  • An LLM-based agent deployed or in development
  • Using a commercial API (OpenAI, Anthropic, Google, DeepSeek, etc.)
  • Goal: reduce token usage while maintaining functional quality

Steps

Step 1: Assess Actual Task Complexity

The first step is not model selection. It is classifying the tasks your agent actually performs.

Task Type | Examples | Required Model Capability
Structured data extraction | Extract amounts from receipts | Low: rule-like, fixed format
Classification / routing | Which team should this issue go to | Low-medium: understanding needed, not reasoning
Complex code generation | Implement an algorithm | High: multi-step reasoning required
Long document summarization | Compress a 20-page report | Medium: comprehension, not complex reasoning
Agent planning | Decompose tasks, select tools, handle errors | High: reliable tool use required

Common mistake: sending everything to Claude Opus or GPT-5, even for “convert this JSON to a table.”

Step 2: Build a Model Tier

Configure different models for different task complexity levels:

MODELS = {
    "classification": "claude-haiku-4-5",      # cheap, fast
    "extraction": "claude-haiku-4-5",
    "summarization": "claude-sonnet-4-6",      # medium
    "code_generation": "claude-sonnet-4-6",
    "complex_planning": "claude-opus-4-6",     # reserve for truly complex tasks
}

def get_model_for_task(task_type: str) -> str:
    return MODELS.get(task_type, "claude-sonnet-4-6")

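Cost reference (approximate list prices for input tokens; these change, so confirm against the provider's pricing page):

  • Claude Haiku 4.5: ~$1/M input tokens
  • Claude Sonnet 4.6: ~$3/M input tokens
  • Claude Opus 4.6: ~$5/M input tokens

For the same task, Haiku input costs roughly one-fifth of Opus input at these rates, and output tokens follow a similar ratio.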
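To make the tiering argument concrete, the sketch below compares sending every call to the largest tier against routing by task type. The tier names, per-million-token prices, and call volumes are placeholders chosen for illustration, not quoted rates; substitute figures from your provider's pricing page and your own logs.

```python
# Placeholder input prices in USD per million tokens (illustrative, NOT quoted rates).
PRICE_PER_MTOK = {"small": 1.0, "medium": 3.0, "large": 5.0}


def daily_input_cost(calls: dict[str, int], tokens_per_call: int) -> float:
    """Sum input cost over tiers: calls * tokens_per_call * price / 1e6."""
    return sum(
        n * tokens_per_call * PRICE_PER_MTOK[tier] / 1_000_000
        for tier, n in calls.items()
    )


# Hypothetical workload: 10,000 calls per day at 2,000 input tokens each.
all_large = daily_input_cost({"large": 10_000}, 2_000)
tiered = daily_input_cost(
    {"small": 7_000, "medium": 2_000, "large": 1_000}, 2_000
)
print(f"all on large tier: ${all_large:.2f}/day, tiered: ${tiered:.2f}/day")
```

The saving depends almost entirely on how much of the workload genuinely fits the small tier, which is why the classification in Step 1 comes first.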

Step 3: Control Tool Call Depth

An agent without iteration limits can fall into infinite tool call loops when it encounters problems:

import logging

logger = logging.getLogger(__name__)

# Wrong: no iteration ceiling; a stuck agent keeps calling tools (and billing) indefinitely
while not task_complete:
    response = llm.call(tools=all_tools, messages=history)
    handle_tool_calls(response)

# Correct: cap the number of iterations
MAX_ITERATIONS = 10
for i in range(MAX_ITERATIONS):
    response = llm.call(tools=all_tools, messages=history)
    if not response.tool_calls:
        break  # no tool calls = task complete
    handle_tool_calls(response)
else:
    # for/else: this branch runs only when the loop ended without `break`
    logger.warning("Agent hit max iterations for task: %s", task_id)

Each additional tool call loop = one more LLM call fee + the cost of re-sending the entire context.
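Because every iteration re-sends what came before, billed input tokens grow faster than the iteration count. The rough model below assumes a fixed starting context and a constant number of tokens appended per iteration; both are simplifications, and the function name and numbers are illustrative.

```python
def cumulative_input_tokens(
    base_tokens: int, added_per_iter: int, iterations: int
) -> int:
    """Total input tokens billed across all iterations.

    Iteration k (0-based) sends the base context plus everything
    appended by the k earlier iterations.
    """
    return sum(base_tokens + k * added_per_iter for k in range(iterations))


# 2,000-token starting context, 1,500 tokens of tool output added per round.
ten_rounds = cumulative_input_tokens(2_000, 1_500, 10)
fifty_rounds = cumulative_input_tokens(2_000, 1_500, 50)
print(ten_rounds)    # 87500
print(fifty_rounds)  # 1937500
```

Under these assumptions, five times as many iterations costs about 22 times as many input tokens, which is why an iteration cap matters more than it first appears.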

Step 4: Trim Context Passing

Passing the entire conversation history to every LLM call is one of the biggest token waste sources.

Strategy 1: Sliding window

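def trim_history(messages: list, max_turns: int = 5) -> list:
    """Keep all system messages plus the last `max_turns` exchanges (2 messages each)."""
    system_messages = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Guard max_turns <= 0: conversation[-0:] would return the whole list.
    recent = conversation[-(max_turns * 2):] if max_turns > 0 else []
    # If the history contains tool calls, make sure the cut does not separate
    # a tool result from the call that produced it.
    return system_messages + recent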
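A turn count is a coarse proxy, since one turn can hold a very large tool result. An alternative is to trim to an approximate token budget, sketched below. The helper names are hypothetical, the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, and the code assumes each message's "content" is a plain string.

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); use the provider's
    # tokenizer or token-counting endpoint for real numbers.
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages, then the newest messages that fit in `budget`."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(approx_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = approx_tokens(m["content"])
        if used + cost > budget:
            break  # stop at the first message that does not fit
        kept.append(m)
        used += cost
    return system + kept[::-1]  # restore chronological order
```

Stopping at the first message that does not fit keeps the retained history contiguous, at the price of sometimes leaving part of the budget unused.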

Strategy 2: Compress intermediate tool results

Tool call returns are often large (e.g., full API responses), but the agent only needs key fields:

def compress_tool_result(tool_name: str, result: dict) -> str:
    if tool_name == "search_web":
        return "\n".join(
            f"- {r['title']}: {r['snippet']}"
            for r in result.get("results", [])[:3]
        )
    return str(result)
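The fallback `str(result)` above still passes output from any unlisted tool through at full size. One hedge is to cap the serialized length, as in this sketch; `truncate_result` and the 2,000-character default are assumptions for illustration, not part of the original code.

```python
import json


def truncate_result(result: dict, max_chars: int = 2000) -> str:
    """Serialize a tool result and cap its length, noting what was cut."""
    # default=str keeps non-JSON values (dates, Decimals) from raising.
    text = json.dumps(result, ensure_ascii=False, default=str)
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... [truncated {omitted} chars]"
```

Telling the model that text was truncated lets it decide to re-query with a narrower request instead of reasoning over silently incomplete data.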

Step 5: Cache Repeated Calls

If your agent calls the same tool across different tasks (e.g., fetching user data every time), caching can significantly reduce costs:

import functools

@functools.lru_cache(maxsize=128)
def get_user_profile(user_id: str) -> dict:
    """Cache user data, avoid repeated DB queries and LLM summarization"""
    return db.get_user(user_id)
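One caveat with `functools.lru_cache` is that entries never expire, so a changed user record stays stale until it is evicted, and callers share the same cached dict object. A minimal time-based variant is sketched below; `ttl_cache` is a hypothetical helper written for this example, not a standard-library function, and it does not bound the cache size.

```python
import functools
import time
from typing import Any, Callable


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache single-argument function results for `ttl_seconds`."""
    def decorator(fn):
        store: dict[Any, tuple[float, Any]] = {}  # key -> (stored_at, value)

        @functools.wraps(fn)
        def wrapper(key):
            now = clock()
            hit = store.get(key)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # still fresh: skip the underlying call
            value = fn(key)
            store[key] = (now, value)
            return value

        return wrapper
    return decorator


@ttl_cache(ttl_seconds=300)
def get_user_profile(user_id: str) -> dict:
    return {"id": user_id}  # stand-in for db.get_user(user_id)
```

The injectable `clock` parameter exists only to make expiry testable without sleeping.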

Common Questions

Q: Will dropping to cheaper models make the agent dumber, causing more errors and retries?

A: Depends on task type. For structured extraction and classification, Haiku accuracy is usually sufficient. For multi-step reasoning tasks, forcing a cheaper model can actually increase total cost through retries. Test accuracy first, then make cost decisions.
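The retry effect can be estimated with simple arithmetic. If each attempt succeeds independently with probability p and failed attempts are retried until one succeeds, the expected number of attempts is 1/p, so the expected cost per successful task is the per-attempt cost divided by p. The sketch below uses hypothetical costs and success rates, and it ignores the cost of detecting a failure.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Easy extraction task: the cheap model is nearly as accurate, so it wins.
cheap_easy = expected_cost_per_success(0.002, 0.95)
strong_easy = expected_cost_per_success(0.010, 0.99)

# Hard multi-step task: the cheap model mostly fails, so retries dominate.
cheap_hard = expected_cost_per_success(0.002, 0.15)
strong_hard = expected_cost_per_success(0.010, 0.90)

print(f"easy: cheap ${cheap_easy:.4f} vs strong ${strong_easy:.4f}")
print(f"hard: cheap ${cheap_hard:.4f} vs strong ${strong_hard:.4f}")
```

With these made-up numbers the cheaper model wins on the easy task and loses on the hard one, which is the pattern to look for in your own accuracy tests.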

Q: How do I measure cost savings after optimization?

A: Log token usage on every LLM call:

response = client.messages.create(...)
logger.info({
    "task_id": task_id,
    "model": model,
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "cost_usd": calculate_cost(model, response.usage)
})
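The logging snippet calls `calculate_cost` without defining it. One possible shape is sketched below, assuming a usage object with `input_tokens` and `output_tokens` attributes as in the snippet. The model keys and prices are placeholders, not quoted rates; key the table by the model IDs you actually call. The sketch also ignores cached-token and batch discounts.

```python
from types import SimpleNamespace

# Placeholder (input, output) prices in USD per million tokens; NOT quoted rates.
PRICES = {
    "small-model": (1.0, 5.0),
    "large-model": (5.0, 25.0),
}


def calculate_cost(model: str, usage) -> float:
    """Return the USD cost of one call from its token usage."""
    # An unknown model raises KeyError on purpose, so a missing price
    # is noticed instead of being logged as a silent zero.
    input_price, output_price = PRICES[model]
    return (
        usage.input_tokens * input_price + usage.output_tokens * output_price
    ) / 1_000_000


# Stand-in for response.usage from a real API call.
example_usage = SimpleNamespace(input_tokens=2_000, output_tokens=500)
example_cost = calculate_cost("small-model", example_usage)
```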

Establish per-task-type token baselines and continuously track deviations.
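A baseline can be as simple as a running mean per task type with a multiplier threshold. The class below is a minimal sketch under that assumption; `TokenBaseline`, the 2x threshold, and the minimum sample count are illustrative choices, and a production version would bound the stored history.

```python
from collections import defaultdict
from statistics import mean


class TokenBaseline:
    """Flag calls whose token usage far exceeds the task type's running mean."""

    def __init__(self, threshold: float = 2.0, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self._history: dict[str, list[int]] = defaultdict(list)

    def record(self, task_type: str, total_tokens: int) -> bool:
        """Store one observation; return True if it looks like an outlier."""
        history = self._history[task_type]
        # Compare against the mean of earlier calls only, and only once
        # enough samples exist for the mean to be meaningful.
        is_outlier = (
            len(history) >= self.min_samples
            and total_tokens > self.threshold * mean(history)
        )
        history.append(total_tokens)
        return is_outlier
```

Calling `record(task_type, input_tokens + output_tokens)` next to the logging call gives an early signal when a prompt change or a looping agent pushes a task type away from its usual footprint.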
