Reasoning Traces

Reasoning Traces

Handling Reasoning Traces in Multi-Turn Conversations

Trinity-Large-Thinking generates explicit chain-of-thought reasoning inside <think>...</think> blocks before producing its response. When served via vLLM, these blocks are parsed into a dedicated reasoning_content field in the API response, separate from content and tool_calls.

Preserving reasoning across turns is critical for reliable multi-step tool use and agentic workflows. This guide covers how to do it correctly.

How reasoning flows through a conversation

Turn 1: User sends message

         Model generates: <think>reasoning</think> content + tool_calls

         vLLM parses into: { reasoning_content, content, tool_calls }

Turn 2: Client appends full assistant message (reasoning_content + content + tool_calls)
         Client appends tool result
         Client sends updated history

         Chat template re-wraps reasoning_content in <think>...</think> during tokenization

         Model sees prior chain-of-thought → generates next step correctly

If the client drops reasoning_content between turns, the model loses its prior chain-of-thought and may produce malformed output.

Quick reference: assistant message shape

For assistant turns that call tools, include all three fields when appending back to history:

Field
Required
Notes

content

Yes

Use "" if the model returned null. Never pass null.

reasoning_content

Strongly recommended

From the API response's reasoning_content field. reasoning is also accepted on input.

tool_calls

Conditional

Include when the model made tool calls (omit for final non-tool assistant turns).

Field name: reasoning vs reasoning_content

The naming can be confusing because different layers use different names:

Layer
Field name
Direction

Arcee API / vLLM API response

reasoning_content

Output

OpenAI Python SDK

reasoning_content

Output (attribute name)

Arcee API input (assistant messages)

reasoning_content (also accepts reasoning)

Input

Self-hosted vLLM input

reasoning (some versions don't accept reasoning_content)

Input

Chat template (Jinja)

both reasoning and reasoning_content

Input

The safe rule: use reasoning_content when constructing assistant messages for input. This matches what the API returns, so you can pass it straight back. The backend also accepts reasoning on input and converts it automatically.

Note for self-hosted vLLM users: Some vLLM versions only read reasoning on input messages (vllm#38488). If you're hitting vLLM directly (not through Arcee's API), you may need to map reasoning_contentreasoning on assistant messages.

Python implementation

Installation

Complete agentic loop

TypeScript implementation

Installation

Complete agentic loop

OpenRouter integration

When using Trinity through OpenRouter, reasoning is returned in a reasoning_details object (OpenRouter's unified reasoning shape). For multi-turn conversations, pass reasoning_details back as-is on assistant messages — OpenRouter handles the model-specific upstream translation automatically.

Debugging upstream requests

To verify that reasoning is being sent upstream correctly, enable echo mode:

See OpenRouter debugging docs for details.

Common pitfalls

1. xml_in_reasoning — tool call XML inside reasoning field

Symptom: The response has tool_calls: [] (empty) and reasoning_content contains raw XML like <function=get_details_by_id><parameter=id>L1001</parameter>...

Cause: The previous assistant turn likely lost reasoning context (and/or had invalid message shape), so the model generated the tool call inside its thinking block instead of as structured output.

Fix: Ensure every assistant message in the conversation history includes reasoning_content from the prior API response.

2. reasoning_content ignored on self-hosted vLLM

Symptom: You're passing reasoning_content on assistant messages to a self-hosted vLLM instance, but the model behaves as if reasoning is missing.

Cause: Some vLLM versions only read reasoning (not reasoning_content) from input messages (vllm#38488). This does not affect Arcee's hosted API, which accepts both.

Fix: If you're hitting vLLM directly, map reasoning_contentreasoning on input:

3. content: null on assistant tool-call turns

Symptom: Degraded tool call quality or malformed output on subsequent turns.

Cause: When the model makes a tool call, content may be null in the API response. Passing null back can contribute to malformed follow-up behavior in some integrations.

Fix: Normalize null to empty string:

4. Missing vLLM serving flags

Symptom: Reasoning leaks into content with visible <think>...</think> tags, or tool calls appear as raw XML instead of structured tool_calls.

Fix: Ensure vLLM is started with the correct flags:

Flag
Purpose

--reasoning-parser deepseek_r1

Separates <think> blocks into the reasoning_content field

--enable-auto-tool-choice

Enables tool call parsing

--tool-call-parser qwen3_coder

Parses tool calls into structured tool_calls array

Note: vLLM versions before 0.18 may also require --enable-reasoning. If the flag is not recognized, --reasoning-parser alone is sufficient.

vLLM serving reference

Minimal serving command

Production serving command (example only)

Context length guidance

Set --max-model-len based on your real conversation lengths, tool-chain depth, and GPU memory budget. Higher values pre-allocate more KV cache at startup. A practical approach is to start conservatively, monitor truncation/finish_reason: "length", then scale up incrementally.

Last updated