When your LLM application underperforms, the problem usually lives at one of three distinct layers: how the model selects tokens during generation (decoding), how you’ve structured the task into steps (workflow), or whether the model has enough reasoning capability for the problem at hand (reasoning). These three layers are routinely conflated in discussions, but they solve different problems and are optimized differently.
TL;DR
- Decoding: token-level. Controls how the model samples from its probability distribution. Greedy is stable, sampling is creative, and beam search tracks several candidates to approximate the highest-probability sequence. For reasoning tasks, temperature=0 usually wins.
- Workflow: task-level. How you decompose a problem into steps, tool calls, and agent coordination. Chain-of-thought, ReAct, multi-agent patterns live here.
- Reasoning: model capability level. Whether the model can self-correct, explore multiple paths, and think longer. Inference-time scaling, ES-CoT, Coconut all apply here.
- Optimize each layer separately. Don’t mix tools across layers.
Layer 1: Decoding Strategies
Decoding is how the model picks each next token from the vocabulary’s probability distribution. It’s the most overlooked optimization lever.
Greedy decoding: always pick the highest-probability token. Fast, deterministic, reproducible. Prone to local optima — one wrong choice can’t be corrected.
Sampling with temperature: sample from the distribution. Low temperature → approaches greedy; high temperature → more random and creative. Good for generation tasks, poor for reasoning.
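To make the greedy-versus-temperature contrast concrete, here is a minimal sketch in plain Python. The logits are made-up numbers and the function names are illustrative, not taken from any particular library. Temperature 0 would divide by zero in this formula, so it is conventionally treated as a switch to greedy decoding.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the highest logit (the temperature -> 0 limit)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Temperature sampling: draw one index from the scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)  # nearly all mass on token 0
hot = softmax_with_temperature(logits, 2.0)   # mass spread across tokens
rng = random.Random(0)
print(greedy_pick(logits), sample_pick(logits, 1.0, rng))
print([round(p, 3) for p in cold], [round(p, 3) for p in hot])
```

At temperature 0.1 the top token takes almost all of the probability mass, which is why low temperatures behave like greedy decoding; at 2.0 the distribution flattens and lower-ranked tokens get picked more often.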
Beam search: maintain multiple candidate sequences in parallel and return the highest-scoring one among them. This approximates the most probable overall sequence but does not guarantee finding it. Better for reasoning-heavy tasks; compute cost scales roughly linearly with beam width.
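A toy example shows how tracking more than one candidate can beat greedy. The `TOY_MODEL` table below is a hypothetical stand-in for a real model's next-token probabilities, chosen so that the greedy first choice leads to a worse full sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, beam_width):
    """Keep the beam_width best prefixes by cumulative log-probability."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the best few
    return beams[0]

# A beam width of 1 is exactly greedy decoding.
greedy_seq, greedy_lp = beam_search(TOY_MODEL, steps=2, beam_width=1)
beam_seq, beam_lp = beam_search(TOY_MODEL, steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))
print(beam_seq, round(math.exp(beam_lp), 2))
```

Greedy commits to "A" (0.6) and ends with a sequence of probability 0.30, while a beam of width 2 keeps "B" alive and finds the sequence with probability 0.36. The inner loop runs once per beam, which is where the roughly linear cost in beam width comes from.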
Top-k / Top-p (Nucleus Sampling): restrict sampling to the k most probable tokens (top-k) or to the smallest set of tokens whose cumulative probability reaches p (top-p). Balances quality and diversity.
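Both filters can be sketched in a few lines. The vocabulary and probabilities here are invented for illustration; a real implementation would operate on the model's full distribution and then sample from the filtered result.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    keep, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        keep.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in keep}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

With k=2 only "the" and "a" survive; with p=0.9 the filter keeps "the", "a", and "an" because the first two cover only 0.8 of the mass. The practical difference is that top-p adapts the number of kept tokens to how peaked the distribution is, while top-k always keeps the same count.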
A key 2025 finding: for RL-trained reasoning models, temperature=0 (greedy) significantly outperforms temperature>0. This is the opposite of best practice for creative generation tasks.
```mermaid
graph LR
    A[Model output<br>Logits] --> B[Softmax<br>Probability distribution]
    B --> C1[Greedy<br>Take max]
    B --> C2[Sampling<br>Sample by probability]
    B --> C3[Beam Search<br>Track multiple paths]
    C1 --> D[Stable, good for reasoning]
    C2 --> E[Diverse, good for creation]
    C3 --> F[Approximate best sequence, expensive]
```
Layer 2: Workflow Design
Workflow is how you decompose a problem at the application layer — independent of which model you use. Good workflow design can significantly improve output quality even with a weaker base model.
Chain-of-Thought (CoT) Prompting: instead of asking for the answer directly, ask the model to write out its reasoning steps. This gives the model a chance to catch and correct errors as it writes. Simple and highly effective for math and logic problems.
ReAct (Reason + Act): interleave reasoning and tool calls. The model reasons about what to do → calls a tool (search, calculation, database query) → continues reasoning from the tool’s output. Best for tasks requiring external information.
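The reason, act, observe cycle can be sketched as a short loop. Everything here is an assumption for illustration: `scripted_model` stands in for a real LLM call, `calculator` is a hypothetical tool, and the `Action: tool[input]` text format is just one common convention.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expression):
    """Hypothetical tool: evaluates 'a op b' for + - * / only."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))

TOOLS = {"calculator": calculator}

def scripted_model(transcript):
    """Stand-in for an LLM call: acts first, answers once it sees an observation."""
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: calculator[12 * 7]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {observation}"

def react_loop(question, model, tools, max_steps=5):
    """Alternate model replies and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip(), transcript
        if "Action:" not in reply:
            break  # the model neither acted nor answered
        # Parse "Action: tool[input]" and run the named tool.
        action = reply.split("Action:", 1)[1].strip()
        name, arg = action.split("[", 1)
        result = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {result}\n"
    return None, transcript

answer, trace = react_loop("What is 12 * 7?", scripted_model, TOOLS)
print(answer)
```

The key design point is that each tool result is appended to the transcript as an observation, so the next model call reasons from real output rather than from what the model guesses the tool would return. The `max_steps` cap keeps a confused model from looping forever.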
Multi-step / Multi-agent Workflows: decompose complex tasks across specialized agents, each handling a subtask, with results aggregated. Good for tasks requiring parallel processing or distinct domain knowledge.
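Self-consistency: sample multiple reasoning chains for the same question (with temperature above 0, so the chains differ) and take the majority vote over the final answers. Improves reasoning accuracy by leveraging sampling diversity; cost multiplies by the number of samples. This is the one place where sampling is deliberately used for reasoning, because the diversity is the point.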
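The voting step itself is simple. In this sketch `noisy_solver` is a hypothetical stand-in for one sampled reasoning chain that returns only the final answer; in practice each call would be a full model generation, which is where the multiplied cost comes from.

```python
import random
from collections import Counter

def self_consistency(sample_answer, n_samples, rng):
    """Draw n_samples answers; return the most common one and its vote share."""
    answers = [sample_answer(rng) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def noisy_solver(rng):
    """Hypothetical sampled chain: correct ("42") about 60% of the time."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])

answer, share = self_consistency(noisy_solver, n_samples=25, rng=random.Random(0))
print(answer, share)
```

Voting helps when wrong answers scatter across different values while correct chains agree with each other. The vote share also doubles as a rough confidence signal: a narrow majority suggests the question may deserve more samples or a different approach.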
Layer 3: Reasoning Capability
Reasoning is the model’s intrinsic capability to self-explore, self-correct, and transcend the ceiling of what workflow design alone can achieve.
Inference-time Scaling: give the model more “thinking time” via longer chains of thought or an explicit thinking-token budget. OpenAI o1/o3, Gemini Thinking, and Claude’s extended thinking all use this approach. Returns are approximately log-linear: doubling compute doesn’t double accuracy, but meaningful improvements continue well past the point where larger models stop helping.
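One way to read "approximately log-linear" is as a fitted curve of the following form. The constants a and b are hypothetical fit parameters rather than values from any cited result, C is inference compute, and the approximation can only hold over a limited range because accuracy is capped at 100%.

```latex
\begin{aligned}
\mathrm{Acc}(C) &\approx a + b \log_2 C \\
\mathrm{Acc}(2C) - \mathrm{Acc}(C) &\approx b \\
\mathrm{Acc}(8C) - \mathrm{Acc}(C) &\approx 3b
\end{aligned}
```

Under this assumption each doubling of compute buys the same fixed gain b. If b were 5 percentage points, for example, spending 8x the compute would add about 15 points, not 8x the accuracy.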
ES-CoT (Early Stopping Chain-of-Thought): stop the reasoning chain when the answer has been stable for several consecutive steps. Research shows ~41% token reduction while maintaining accuracy comparable to full CoT. Drop-in, no retraining required.
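The stopping rule described above can be sketched with a fixed patience threshold. This is a simplification for illustration and not necessarily the exact criterion used in the ES-CoT paper; the per-step answers in `trace` are invented, and in a real system each one would come from asking the model for its current answer after a reasoning step.

```python
def early_stop_cot(step_answers, patience=3):
    """Stop once the same answer appears `patience` times in a row.

    Returns (last answer seen, number of steps consumed).
    """
    last, run, steps = None, 0, 0
    for answer in step_answers:
        steps += 1
        run = run + 1 if answer == last else 1  # extend or restart the run
        last = answer
        if run >= patience:
            break  # answer has converged; skip the remaining steps
    return last, steps

# Hypothetical trace: the model's current best answer after each reasoning step.
trace = ["12", "15", "18", "18", "18", "18", "18", "18"]
answer, used = early_stop_cot(iter(trace), patience=3)
print(answer, used, len(trace))
```

Here the chain stops after 5 of 8 steps with the same final answer. The patience value trades cost against risk: a small value saves more tokens but may stop on an answer the model would later have revised.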
Coconut (Chain of Continuous Thought): instead of expressing reasoning steps as natural language tokens, the model uses its last hidden state (a “continuous thought”) directly as the next input embedding. Reasoning happens in continuous latent space rather than the discrete vocabulary. Theoretically enables breadth-first search across reasoning paths rather than committing to one path as in standard CoT.
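The difference between a token step and a continuous-thought step can be shown with a toy model. This is only an illustration of the data flow, not the paper's implementation: the one-layer `hidden_state` function, the 2x2 weights, and the two-token vocabulary are all hypothetical, and a real model must be trained to make use of continuous thoughts.

```python
import math

def hidden_state(weights, x):
    """Toy one-layer 'model': h = tanh(W x), standing in for the last hidden state."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def token_step(weights, embeddings, x):
    """Standard CoT step: score the vocabulary, commit to one token, embed it."""
    h = hidden_state(weights, x)
    scores = [sum(e * hi for e, hi in zip(emb, h)) for emb in embeddings]
    token = max(range(len(scores)), key=lambda i: scores[i])
    return embeddings[token], token  # next input is one of a finite set of embeddings

def continuous_step(weights, x):
    """Coconut-style step: skip the vocabulary and feed h back in as the next input."""
    return hidden_state(weights, x)

W = [[0.5, -0.2], [0.1, 0.9]]  # hypothetical weights
E = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical embeddings for a 2-token vocabulary
x0 = [1.0, 1.0]

next_discrete, token = token_step(W, E, x0)
next_continuous = continuous_step(W, x0)
print(token, next_discrete, [round(v, 3) for v in next_continuous])
```

The token step collapses the hidden state to a single vocabulary entry, discarding how much weight the model put on the alternatives. The continuous step passes the full vector along, which is what lets it carry information about several candidate paths at once.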
Comparison
| Technique | Layer | Extra Cost | Best For |
|---|---|---|---|
| Greedy decoding | Decoding | None | Reasoning, reproducible output |
| Sampling + temperature | Decoding | None | Creative generation, diversity |
| Chain-of-Thought | Workflow | Low (prompt) | Math, logic problems |
| ReAct | Workflow | Medium (tool calls) | Tasks needing external info |
| Self-consistency | Workflow + Decoding | High (3-10x inference) | High-accuracy reasoning |
| Inference-time scaling | Reasoning | High (longer output) | Difficult reasoning, cost-insensitive |
| ES-CoT | Reasoning | Negative (saves tokens) | Cost/latency-constrained scenarios |
| Coconut | Reasoning | Requires special training | Research stage, not yet deployed widely |
Summary
The most common mistake when optimizing LLM applications is conflating layers. If your reasoning accuracy is low, you don’t necessarily need a bigger model: you might just have temperature set too high, or you might not be using CoT. If your costs are too high, you don’t necessarily need to downgrade models: ES-CoT might cut token usage by roughly 40%.
Identify which layer the problem lives in first. Then pick the right tool.
References
- Demystifying Long Chain-of-Thought Reasoning in LLMs (arxiv 2502.03373)
- Early Stopping Chain-of-thoughts in Large Language Models (arxiv 2509.14004)
- Training LLMs to Reason in a Continuous Latent Space / Coconut (arxiv 2412.06769)
- RL of Thoughts: Navigating LLM Reasoning with Inference-time RL (arxiv)
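- AI 能自我修正嗎?從 decoding、workflow 到 reasoning 的技術發展整理 (YouTube; title in English: "Can AI Self-Correct? A Survey of Technical Developments from Decoding and Workflow to Reasoning")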
🇺🇸 English
If your LLM application isn't performing the way you want, there's a good chance you're reaching for the wrong fix. And the reason that happens so often is that people treat LLM optimization as one big knob to turn, when actually there are three completely separate layers — each solving a different problem, each optimized differently.
Let's walk through them.
**The first layer is decoding.** This is the most overlooked one. When an LLM generates text, it doesn't just pick words — it's sampling from a probability distribution over its entire vocabulary at every single step. How you do that sampling matters a lot.
The simplest approach is greedy decoding: always pick the highest probability token. It's fast, deterministic, and reproducible. The downside is it can get stuck — one early wrong choice locks you into a suboptimal path.
Sampling with temperature does the opposite: instead of always taking the most likely token, you sample from the distribution. High temperature makes outputs more creative and varied. Low temperature pulls you back toward greedy. This is great for creative writing, but it's actually bad for reasoning.
Beam search is the middle ground for reasoning-heavy tasks: instead of committing to one path, you maintain several candidate sequences in parallel, then return the highest-scoring one. It costs more compute, but it often finds a higher-probability sequence than greedy does.
There's a finding from 2025 research that's worth calling out directly: for reasoning models trained with reinforcement learning, temperature zero — meaning pure greedy — significantly outperforms any sampling. That's the opposite of what most people intuitively do when they want better outputs.
**The second layer is workflow.** This is where you, as the developer, decide how to structure the task. The model doesn't change — your architecture does.
Chain-of-thought prompting is the classic move: instead of asking the model to jump straight to an answer, you prompt it to write out its reasoning steps. This works because errors that would be invisible in a one-shot answer become visible mid-chain, giving the model a chance to catch and correct them. Simple, effective, almost free.
ReAct goes a step further: the model alternates between reasoning and action. It thinks about what it needs, calls a tool — a search engine, a database, a calculator — then continues reasoning from the result. This is the right pattern when the model needs external information it doesn't have in its weights.
Self-consistency is a heavier-weight technique: generate the same question multiple times with sampling turned up, then take the majority vote across answers. It's expensive — you might run the same query five or ten times — but for high-stakes reasoning tasks, the accuracy gains can justify it.
Multi-agent workflows are the most complex: decompose a task across several specialized agents that each handle a subtask, then aggregate their outputs. Best for tasks that genuinely benefit from parallelism or specialized domain knowledge.
**The third layer is reasoning capability itself** — this is about what the model can intrinsically do, not how you prompt it.
Inference-time scaling is the big one. Instead of training a bigger model, you give the model more thinking time: longer chains of thought and explicit token budgets for exploration. OpenAI's o-series models, Gemini Thinking, and Claude's extended thinking all work this way. The returns follow a roughly log-linear curve. Doubling compute doesn't double accuracy, but meaningful gains continue well past the point where scaling the model size stops helping.
Early Stopping Chain-of-Thought is a clever efficiency trick: you monitor the reasoning chain, and when the model's answer has been stable for several consecutive steps, you stop early. Research shows this cuts token usage by around 41% while keeping accuracy basically intact. No retraining required. If you're running inference at scale and costs are biting, this is worth knowing about.
And then there's Coconut — Chain of Continuous Thought — which is more experimental but conceptually fascinating. Standard chain-of-thought forces reasoning to happen in natural language tokens. Coconut instead lets the model's internal hidden state serve directly as the input to the next step. Reasoning happens in continuous latent space rather than in words you can read. The theoretical advantage is that it enables something like breadth-first search across reasoning paths, rather than committing to one linear chain. It requires special training and isn't widely deployed yet, but it points at where the research is heading.
---
So, here are the three things to take away from this.
First: diagnose the layer before you reach for a fix. Low reasoning accuracy doesn't always mean you need a bigger model — it might just mean temperature is too high, or you haven't added chain-of-thought. High inference costs don't always require a model downgrade — early stopping might cut your token usage by 40% without touching anything else.
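Second: for reasoning tasks, temperature zero is your default. This surprises a lot of people, but the 2025 results point the same way: sampling tends to hurt RL-trained reasoning models. The one exception is self-consistency, which needs sampling to produce diverse answers to vote over.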
Third: workflow design is underrated. A well-structured chain-of-thought or ReAct loop running on a mid-tier model will often outperform an expensive model being queried naively. The architecture matters as much as the model.
These three layers — decoding, workflow, reasoning — are separate tools. Use the right one for the problem you actually have.
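🇹🇼 中文 (English translation)
If your LLM application isn't performing well, the problem usually isn't that "the model isn't strong enough". More often, you've misjudged which layer the problem lives in.
LLM inference has three completely different layers: Decoding, Workflow, and Reasoning. These three terms are often lumped together, but they solve different problems and are optimized in completely different directions. Today we'll take them apart one layer at a time.
---
**Layer 1: Decoding, that is, the decoding strategy.**
This is how the model chooses from the probability distribution over the vocabulary each time it generates a token. It sounds low-level, but it is actually the most commonly overlooked optimization point.
The simplest is Greedy Decoding: pick the highest-probability token every time. It is fast, stable, and reproducible, but there is a catch: once one token is chosen wrong, it is hard to fix later, and generation easily gets stuck in a local optimum.
Next is Sampling, that is, random sampling. Rather than taking the top token, you draw according to the probability distribution. There is a parameter called temperature: turn it down and you approach greedy; turn it up and the output gets more random and more creative. Good for writing tasks, not for reasoning.
Beam Search tracks several candidate paths at once and, at the end, picks the one with the highest overall probability among them. It has an advantage on reasoning tasks, but compute cost grows linearly with the number of paths tracked.
There are also Top-k and Top-p (the latter also known as Nucleus Sampling), which restrict sampling to the few most probable tokens, balancing quality and diversity.
One 2025 research finding is worth remembering: **for reasoning models, setting temperature to 0, that is, using greedy decoding directly, usually performs significantly better than temperature above 0.** This is exactly the opposite of best practice for creative generation, and many people trip up here.
---
**Layer 2: Workflow, the design of the work process.**
This layer has no direct relationship with the model itself; it is where you decide, at the application layer, how to break the problem into steps. Even with a model of ordinary capability, good workflow design can greatly improve output quality.
Chain-of-Thought means not asking for the answer directly but asking the model to write out its reasoning steps. That way, after "writing down" a mistake, the model has a chance to notice and correct it in later steps.
ReAct is another common pattern. The name stands for Reason plus Act: alternate between reasoning and tool calls. The model thinks about what to do next, then searches, queries a database, or runs a calculation, and continues reasoning based on the result. Suited to tasks that need external information.
Multi-agent Workflow splits a complex task among several specialized agents, each responsible for a subtask, with the results aggregated at the end. Suited to scenarios that need parallel processing or cross-domain knowledge.
There is also Self-consistency: generate multiple answers to the same question, then take a majority vote. Accuracy improves, but the price is that inference cost multiplies, typically three to ten times. Consider it only when you need high accuracy.
---
**Layer 3: Reasoning, the reasoning capability itself.**
This layer is the model's internal capability ceiling; it determines whether the model can explore and correct itself at inference time.
Inference-time Scaling gives the model more "thinking time", letting it generate longer reasoning chains or allocating it a larger compute budget. OpenAI's o1 and o3, as well as Gemini Thinking, all go in this direction. Research shows that the benefit of spending more compute at inference time on complex reasoning tasks grows approximately log-linearly: pay twice the cost and the benefit does not double, but it does keep improving.
ES-CoT, Early Stopping Chain-of-Thought, optimizes in the opposite direction: when the model's answer stays stable across several consecutive reasoning steps, stop early rather than continuing. Experiments show it can save roughly 40% of token consumption with accuracy almost unchanged. Very practical for cost- or latency-sensitive scenarios.
The last one is more cutting-edge: Coconut, Chain of Continuous Thought. Ordinary CoT outputs reasoning steps as text tokens, but Coconut does not. It keeps the reasoning process in the model's continuous latent space, using the hidden state directly as the next step's input, free of the vocabulary's discrete constraint. In theory it can perform breadth-first search, unlike standard CoT, which can follow only one path at each step. It is still at the research stage and not yet widely deployed.
---
**Three core points in summary:**
First, Decoding, Workflow, and Reasoning are three independent layers. Solve the problem at the layer where it occurs, and don't mix tools across layers.
Second, if accuracy on reasoning tasks falls short, check the decoding strategy first. Temperature set too high is the most common problem; you don't necessarily need a bigger model or a more complex workflow.
Third, if cost is too high, don't rush to switch to a smaller model. Approaches like ES-CoT may save a large share of tokens (around 40%) with very little effect on accuracy.
Confirm which layer the problem is in first, then pick the right tool. That is usually far more efficient than brute-force scaling.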