# Reasoning Traces

## Handling Reasoning Traces in Multi-Turn Conversations

Trinity-Large-Thinking generates explicit chain-of-thought reasoning inside `<think>...</think>` blocks before producing its response. When served via vLLM, these blocks are parsed into a dedicated reasoning field in the API response (typically `reasoning` in raw API responses, often exposed as `reasoning_content` in SDK objects), separate from `content` and `tool_calls`.

Preserving reasoning across turns is **critical** for reliable multi-step tool use and agentic workflows. This guide covers how to do it correctly.

***

### How reasoning flows through a conversation

```
Turn 1: User sends message
         ↓
         Model generates: <think>reasoning</think> content + tool_calls
         ↓
         vLLM parses into: { reasoning, content, tool_calls }
         ↓
Turn 2: Client appends full assistant message (reasoning + content + tool_calls)
         Client appends tool result
         Client sends updated history
         ↓
         Chat template re-wraps reasoning in <think>...</think> during tokenization
         ↓
         Model sees prior chain-of-thought → generates next step correctly
```

If the client drops the `reasoning` field between turns, the model loses its prior chain-of-thought and may emit subsequent tool calls as raw XML inside its thinking block instead of as structured `tool_calls`.

***

### Quick reference: assistant message shape

For assistant turns that call tools, include all three fields when appending back to history:

```json
{
  "role": "assistant",
  "content": "",
  "reasoning": "The user wants to cancel. I need their customer_id first, so I'll look them up by email.",
  "tool_calls": [
    {
      "id": "call-1",
      "type": "function",
      "function": {
        "name": "get_customer_by_email",
        "arguments": "{\"email\": \"jane@example.com\"}"
      }
    }
  ]
}
```

| Field        | Required             | Notes                                                                             |
| ------------ | -------------------- | --------------------------------------------------------------------------------- |
| `content`    | Yes                  | Use `""` if the model returned `null`. Never pass `null`.                         |
| `reasoning`  | Strongly recommended | From the API response's `reasoning` (or SDK `reasoning_content`) field.           |
| `tool_calls` | Conditional          | Include when the model made tool calls (omit for final non-tool assistant turns). |
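These rules can be checked mechanically before appending to history. A minimal sketch of such a check (the function name and error strings are illustrative, not part of any API):

```python
def validate_assistant_message(msg: dict) -> list[str]:
    """Return a list of problems with an assistant message destined for history."""
    problems = []
    if msg.get("role") != "assistant":
        problems.append("role must be 'assistant'")
    if msg.get("content") is None:
        problems.append("content is null (use empty string)")
    if msg.get("tool_calls") and "reasoning" not in msg:
        problems.append("tool-call turn is missing the 'reasoning' field")
    return problems
```

Running this over every assistant message before each request makes the pitfalls below fail fast instead of surfacing as degraded model behavior several turns later.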

***

### Field name: `reasoning` vs `reasoning_content`

The naming can be confusing because different layers use different names:

| Layer                               | Field name                                           | Direction               |
| ----------------------------------- | ---------------------------------------------------- | ----------------------- |
| vLLM API response                   | `reasoning`                                          | Output                  |
| OpenAI Python SDK                   | `reasoning_content`                                  | Output (attribute name) |
| vLLM API input (assistant messages) | `reasoning`                                          | Input                   |
| Chat template (Jinja)               | `reasoning` and `reasoning_content` (template-level) | Input                   |

**The safe rule**: always use `reasoning` when constructing assistant messages for input. The chat template accepts both, but some vLLM versions only pass `reasoning` through to the template ([vllm#38488](https://github.com/vllm-project/vllm/issues/38488)).

***

### Python implementation

#### Installation

```bash
pip install openai
```

#### Complete agentic loop

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)

MODEL = "arcee-ai/Trinity-Large-Thinking"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_customer_by_email",
            "description": "Look up a customer by email address.",
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "Customer email address"}
                },
                "required": ["email"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_subscription",
            "description": "Cancel a customer's subscription. Requires customer_id.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "reason": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        }
    }
]


def execute_tool(name: str, arguments: str) -> str:
    """Replace with your actual tool implementations."""
    args = json.loads(arguments)
    if name == "get_customer_by_email":
        return json.dumps({"customer_id": "C2001", "name": "Jane Doe", "plan": "Premium"})
    elif name == "cancel_subscription":
        return json.dumps({"success": True, "message": f"Cancelled for {args['customer_id']}"})
    return json.dumps({"error": "Unknown tool"})


def build_assistant_message(msg) -> dict:
    """
    Build an assistant message dict that preserves reasoning for the next turn.

    Key details:
    - Map reasoning_content -> reasoning (vLLM input field name)
    - Use "" instead of None for content (avoids tokenization issues)
    - Preserve the full tool_calls array
    """
    assistant_msg = {
        "role": "assistant",
        "content": msg.content or "",  # never null
    }

    # Preserve reasoning — critical for multi-turn
    reasoning = getattr(msg, "reasoning_content", None) or getattr(msg, "reasoning", None)
    if reasoning:
        assistant_msg["reasoning"] = reasoning

    # Preserve tool calls
    if msg.tool_calls:
        assistant_msg["tool_calls"] = [
            {
                "id": tc.id,
                "type": "function",
                "function": {
                    "name": tc.function.name,
                    "arguments": tc.function.arguments
                }
            }
            for tc in msg.tool_calls
        ]

    return assistant_msg


def run_agent(user_message: str, system_prompt: str = "You are a helpful customer service agent."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    max_steps = 10  # safety limit

    for step in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0,
            max_tokens=1000
        )

        msg = response.choices[0].message

        # Append full assistant message with reasoning preserved
        messages.append(build_assistant_message(msg))

        # No tool calls → final response
        if not msg.tool_calls:
            return msg.content or ""

        # Execute each tool call and append results
        for tc in msg.tool_calls:
            result = execute_tool(tc.function.name, tc.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result
            })

    raise RuntimeError("Agent exceeded maximum steps")


# Usage
response = run_agent("I want to cancel my subscription. My email is jane@example.com")
print(response)
```

***

### TypeScript implementation

#### Installation

```bash
npm install openai
```

#### Complete agentic loop

```typescript
import OpenAI from "openai";
import type {
  ChatCompletionMessageParam,
  ChatCompletionTool,
} from "openai/resources/chat/completions";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "your-api-key",
});

const MODEL = "arcee-ai/Trinity-Large-Thinking";

const tools: ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_customer_by_email",
      description: "Look up a customer by email address.",
      parameters: {
        type: "object",
        properties: {
          email: { type: "string", description: "Customer email address" },
        },
        required: ["email"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "cancel_subscription",
      description: "Cancel a customer's subscription. Requires customer_id.",
      parameters: {
        type: "object",
        properties: {
          customer_id: { type: "string" },
          reason: { type: "string" },
        },
        required: ["customer_id"],
      },
    },
  },
];

async function executeTool(name: string, args: string): Promise<string> {
  // Replace with your actual tool implementations
  const parsed = JSON.parse(args);
  switch (name) {
    case "get_customer_by_email":
      return JSON.stringify({ customer_id: "C2001", name: "Jane Doe", plan: "Premium" });
    case "cancel_subscription":
      return JSON.stringify({ success: true, message: `Cancelled for ${parsed.customer_id}` });
    default:
      return JSON.stringify({ error: "Unknown tool" });
  }
}

/**
 * Build an assistant message that preserves reasoning for the next turn.
 *
 * Key details:
 * - Extract reasoning from the response and pass it as "reasoning" (vLLM input field)
 * - Use "" instead of null for content
 * - Preserve the full tool_calls array
 */
function buildAssistantMessage(
  msg: OpenAI.Chat.Completions.ChatCompletionMessage
): ChatCompletionMessageParam {
  const assistantMsg: Record<string, unknown> = {
    role: "assistant",
    content: msg.content ?? "", // never null
  };

  // Preserve reasoning — critical for multi-turn
  // The SDK exposes it as reasoning_content; vLLM expects "reasoning" on input
  const reasoning = (msg as any).reasoning_content ?? (msg as any).reasoning;
  if (reasoning) {
    assistantMsg.reasoning = reasoning;
  }

  // Preserve tool calls
  if (msg.tool_calls?.length) {
    assistantMsg.tool_calls = msg.tool_calls.map((tc) => ({
      id: tc.id,
      type: "function" as const,
      function: {
        name: tc.function.name,
        arguments: tc.function.arguments,
      },
    }));
  }

  return assistantMsg as ChatCompletionMessageParam;
}

async function runAgent(
  userMessage: string,
  systemPrompt = "You are a helpful customer service agent."
): Promise<string> {
  const messages: ChatCompletionMessageParam[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ];

  const maxSteps = 10; // safety limit

  for (let step = 0; step < maxSteps; step++) {
    const response = await client.chat.completions.create({
      model: MODEL,
      messages,
      tools,
      tool_choice: "auto",
      temperature: 0,
      max_tokens: 1000,
    });

    const msg = response.choices[0].message;

    // Append full assistant message with reasoning preserved
    messages.push(buildAssistantMessage(msg));

    // No tool calls → final response
    if (!msg.tool_calls?.length) {
      return msg.content ?? "";
    }

    // Execute each tool call and append results
    for (const tc of msg.tool_calls) {
      const result = await executeTool(tc.function.name, tc.function.arguments);
      messages.push({
        role: "tool",
        tool_call_id: tc.id,
        content: result,
      });
    }
  }

  throw new Error("Agent exceeded maximum steps");
}

// Usage
const response = await runAgent(
  "I want to cancel my subscription. My email is jane@example.com"
);
console.log(response);
```

***

### OpenRouter integration

When using Trinity through [OpenRouter](https://openrouter.ai/), reasoning is returned in a `reasoning_details` array (OpenRouter's unified reasoning shape). For multi-turn conversations, pass `reasoning_details` back as-is on assistant messages — OpenRouter handles the model-specific upstream translation automatically.
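In Python, the assistant message can be rebuilt from a raw response message while carrying `reasoning_details` through untouched. A sketch, assuming the response has already been converted to a plain dict (`build_openrouter_assistant_message` is an illustrative helper, not an OpenRouter API):

```python
def build_openrouter_assistant_message(msg: dict) -> dict:
    """Rebuild an assistant message for history, passing reasoning_details through as-is."""
    assistant_msg = {"role": "assistant", "content": msg.get("content") or ""}
    if msg.get("reasoning_details"):
        # Do not reshape or merge these entries; OpenRouter translates them
        # into the model-specific upstream format itself.
        assistant_msg["reasoning_details"] = msg["reasoning_details"]
    if msg.get("tool_calls"):
        assistant_msg["tool_calls"] = msg["tool_calls"]
    return assistant_msg
```

The key difference from direct vLLM serving: with OpenRouter you forward `reasoning_details` verbatim rather than mapping `reasoning_content` to `reasoning` yourself.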

#### Debugging upstream requests

To verify that reasoning is being sent upstream correctly, enable echo mode:

```typescript
const response = await client.chat.completions.create({
  model: "arcee-ai/trinity-large-thinking",
  messages,
  tools,
  // OpenRouter extension field
  // @ts-ignore
  debug: { echo_upstream_body: true },
});
```

See [OpenRouter debugging docs](https://openrouter.ai/docs/api/reference/errors-and-debugging#debugging) for details.

***

### Common pitfalls

#### 1. `xml_in_reasoning` — tool call XML inside reasoning field

**Symptom**: The response has `tool_calls: []` (empty) and `reasoning` contains raw XML like `<function=get_details_by_id><parameter=id>L1001</parameter>...`

**Cause**: The previous assistant turn likely lost reasoning context (and/or had invalid message shape), so the model generated the tool call inside its thinking block instead of as structured output.

**Fix**: Ensure every assistant message in the conversation history includes the `reasoning` field from the prior API response.

```
❌ Broken — no reasoning, null content
{"role": "assistant", "content": null, "tool_calls": [...]}

✅ Fixed — reasoning preserved, content non-null
{"role": "assistant", "content": "", "reasoning": "...", "tool_calls": [...]}
```

#### 2. `reasoning_content` silently ignored on input

**Symptom**: You're passing `reasoning_content` on assistant messages but the model still behaves as if reasoning is missing.

**Cause**: Some vLLM versions only read `reasoning` (not `reasoning_content`) from input messages ([vllm#38488](https://github.com/vllm-project/vllm/issues/38488)).

**Fix**: Always use `reasoning` as the field name on input assistant messages. The OpenAI SDK returns `reasoning_content` on the response — map it to `reasoning` when appending to history.

```python
# The SDK returns reasoning_content on the response object
reasoning = msg.reasoning_content

# But vLLM expects "reasoning" on input
assistant_msg["reasoning"] = reasoning
```

#### 3. `content: null` on assistant tool-call turns

**Symptom**: Degraded tool call quality or malformed output on subsequent turns.

**Cause**: When the model makes a tool call, `content` may be `null` in the API response. Passing `null` back can contribute to malformed follow-up behavior in some integrations.

**Fix**: Normalize `null` to empty string:

```python
"content": msg.content or ""   # Python
```

```typescript
content: msg.content ?? ""     // TypeScript
```

#### 4. Missing vLLM serving flags

**Symptom**: Reasoning leaks into `content` with visible `<think>...</think>` tags, or tool calls appear as raw XML instead of structured `tool_calls`.

**Fix**: Ensure vLLM is started with the correct flags:

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

| Flag                             | Purpose                                               |
| -------------------------------- | ----------------------------------------------------- |
| `--reasoning-parser deepseek_r1` | Separates `<think>` blocks into the `reasoning` field |
| `--enable-auto-tool-choice`      | Enables tool call parsing                             |
| `--tool-call-parser qwen3_coder` | Parses tool calls into structured `tool_calls` array  |

> **Note**: Older vLLM versions may also require `--enable-reasoning`. If the flag is not recognized, `--reasoning-parser` alone is sufficient.

***

### vLLM serving reference

#### Minimal serving command

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

#### Production serving command (example only)

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --served-model-name arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --tensor-parallel-size <set-for-your-hardware> \
  --max-model-len <set-for-your-workload> \
  --gpu-memory-utilization <set-for-your-cluster-policy> \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "$API_KEY" \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

#### Context length guidance

Set `--max-model-len` based on your real conversation lengths, tool-chain depth, and GPU memory budget. Higher values pre-allocate more KV cache at startup. A practical approach is to start conservatively, monitor truncation/`finish_reason: "length"`, then scale up incrementally.
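One way to monitor truncation is to check `finish_reason` on every response. A minimal sketch, operating on the response as a plain dict (the helper name is illustrative):

```python
def truncated_choices(response: dict) -> list[int]:
    """Return indices of choices that stopped because the token limit was hit."""
    return [
        i for i, choice in enumerate(response.get("choices", []))
        if choice.get("finish_reason") == "length"
    ]
```

If this fires regularly in production, raise `max_tokens` (and, if needed, `--max-model-len`) rather than retrying, since a truncated reasoning trace fed back into history degrades subsequent turns.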
