# Reasoning Traces

### Handling reasoning traces in multi-turn conversations

Trinity-Large-Thinking generates explicit chain-of-thought reasoning inside `<think>...</think>` blocks before producing its response. When served via vLLM, these blocks are parsed into a dedicated `reasoning_content` field in the API response, separate from `content` and `tool_calls`.
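
For example, with the OpenAI Python SDK the three fields land directly on the message object. A minimal sketch (using the `client`, `MODEL`, `messages`, and `tools` defined in the full example below; `reasoning_content` is a server-side extension rather than a typed SDK attribute, so it is read defensively with `getattr`):

```python
# Read the three parsed fields off a chat completion message.
response = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
msg = response.choices[0].message

print(getattr(msg, "reasoning_content", None))  # chain-of-thought from <think>...</think>
print(msg.content)                              # user-facing text; may be None on tool-call turns
print(msg.tool_calls)                           # structured tool calls, if any
```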

Preserving reasoning across turns is critical for reliable multi-step tool use and agentic workflows. This guide covers how to do it correctly.

### How reasoning flows through a conversation

```
Turn 1: User sends message
         ↓
         Model generates: <think>reasoning</think> content + tool_calls
         ↓
         vLLM parses into: { reasoning_content, content, tool_calls }
         ↓
Turn 2: Client appends full assistant message (reasoning_content + content + tool_calls)
         Client appends tool result
         Client sends updated history
         ↓
         Chat template re-wraps reasoning_content in <think>...</think> during tokenization
         ↓
         Model sees prior chain-of-thought → generates next step correctly
```

If the client drops `reasoning_content` between turns, the model loses its prior chain-of-thought and may produce malformed output.

### Quick reference: assistant message shape

For assistant turns that call tools, include all three fields when appending back to history:

```json
{
  "role": "assistant",
  "content": "",
  "reasoning_content": "The user wants to cancel. I need their customer_id first, so I'll look them up by email.",
  "tool_calls": [
    {
      "id": "call-1",
      "type": "function",
      "function": {
        "name": "get_customer_by_email",
        "arguments": "{\"email\": \"jane@example.com\"}"
      }
    }
  ]
}
```

| Field               | Required             | Notes                                                                                     |
| ------------------- | -------------------- | ----------------------------------------------------------------------------------------- |
| `content`           | Yes                  | Use `""` if the model returned null. Never pass `null`.                                   |
| `reasoning_content` | Strongly recommended | From the API response's `reasoning_content` field. `reasoning` is also accepted on input. |
| `tool_calls`        | Conditional          | Include when the model made tool calls (omit for final non-tool assistant turns).         |

### Field name: `reasoning` vs `reasoning_content`

The naming can be confusing because different layers use different names:

| Layer                                | Field name                                                   | Direction               |
| ------------------------------------ | ------------------------------------------------------------ | ----------------------- |
| Arcee API / vLLM API response        | `reasoning_content`                                          | Output                  |
| OpenAI Python SDK                    | `reasoning_content`                                          | Output (attribute name) |
| Arcee API input (assistant messages) | `reasoning_content` (also accepts `reasoning`)               | Input                   |
| Self-hosted vLLM input               | `reasoning` (some versions don't accept `reasoning_content`) | Input                   |
| Chat template (Jinja)                | both `reasoning` and `reasoning_content`                     | Input                   |

The safe rule: use `reasoning_content` when constructing assistant messages for input. This matches what the API returns, so you can pass it straight back. The backend also accepts `reasoning` on input and converts it automatically.

> **Note for self-hosted vLLM users:** Some vLLM versions only read `reasoning` on input messages ([vllm#38488](https://github.com/vllm-project/vllm/issues/38488)). If you're hitting vLLM directly (not through Arcee's API), you may need to map `reasoning_content` → `reasoning` on assistant messages.
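
If you need that mapping, a small compatibility shim keeps the rest of your code on `reasoning_content`. A sketch (the helper name is illustrative):

```python
def to_vllm_input(assistant_msg: dict) -> dict:
    """Rename reasoning_content -> reasoning for vLLM builds that only
    read `reasoning` on input messages (see vllm#38488)."""
    msg = dict(assistant_msg)  # shallow copy; leave the original history intact
    if "reasoning_content" in msg and "reasoning" not in msg:
        msg["reasoning"] = msg.pop("reasoning_content")
    return msg
```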

### Python implementation

#### Installation

```bash
pip install openai
```

#### Complete agentic loop

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)

MODEL = "arcee-ai/Trinity-Large-Thinking"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_customer_by_email",
            "description": "Look up a customer by email address.",
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "Customer email address"}
                },
                "required": ["email"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_subscription",
            "description": "Cancel a customer's subscription. Requires customer_id.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "reason": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        }
    }
]


def execute_tool(name: str, arguments: str) -> str:
    """Replace with your actual tool implementations."""
    args = json.loads(arguments)
    if name == "get_customer_by_email":
        return json.dumps({"customer_id": "C2001", "name": "Jane Doe", "plan": "Premium"})
    elif name == "cancel_subscription":
        return json.dumps({"success": True, "message": f"Cancelled for {args['customer_id']}"})
    return json.dumps({"error": "Unknown tool"})


def build_assistant_message(msg) -> dict:
    """
    Build an assistant message dict that preserves reasoning for the next turn.

    Key details:
    - Pass reasoning_content straight back (matches what the API returns)
    - Use "" instead of None for content (avoids tokenization issues)
    - Preserve the full tool_calls array
    """
    assistant_msg = {
        "role": "assistant",
        "content": msg.content or "",  # never null
    }

    # Preserve reasoning — critical for multi-turn
    reasoning = getattr(msg, "reasoning_content", None) or getattr(msg, "reasoning", None)
    if reasoning:
        assistant_msg["reasoning_content"] = reasoning

    # Preserve tool calls
    if msg.tool_calls:
        assistant_msg["tool_calls"] = [
            {
                "id": tc.id,
                "type": "function",
                "function": {
                    "name": tc.function.name,
                    "arguments": tc.function.arguments
                }
            }
            for tc in msg.tool_calls
        ]

    return assistant_msg


def run_agent(user_message: str, system_prompt: str = "You are a helpful customer service agent."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    max_steps = 10  # safety limit

    for step in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0,
            max_tokens=1000
        )

        msg = response.choices[0].message

        # Append full assistant message with reasoning preserved
        messages.append(build_assistant_message(msg))

        # No tool calls → final response
        if not msg.tool_calls:
            return msg.content or ""

        # Execute each tool call and append results
        for tc in msg.tool_calls:
            result = execute_tool(tc.function.name, tc.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result
            })

    raise RuntimeError("Agent exceeded maximum steps")


# Usage
response = run_agent("I want to cancel my subscription. My email is jane@example.com")
print(response)
```

### TypeScript implementation

#### Installation

```bash
npm install openai
```

#### Complete agentic loop

```typescript
import OpenAI from "openai";
import type {
  ChatCompletionMessageParam,
  ChatCompletionTool,
} from "openai/resources/chat/completions";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "your-api-key",
});

const MODEL = "arcee-ai/Trinity-Large-Thinking";

const tools: ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_customer_by_email",
      description: "Look up a customer by email address.",
      parameters: {
        type: "object",
        properties: {
          email: { type: "string", description: "Customer email address" },
        },
        required: ["email"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "cancel_subscription",
      description: "Cancel a customer's subscription. Requires customer_id.",
      parameters: {
        type: "object",
        properties: {
          customer_id: { type: "string" },
          reason: { type: "string" },
        },
        required: ["customer_id"],
      },
    },
  },
];

async function executeTool(name: string, args: string): Promise<string> {
  // Replace with your actual tool implementations
  const parsed = JSON.parse(args);
  switch (name) {
    case "get_customer_by_email":
      return JSON.stringify({ customer_id: "C2001", name: "Jane Doe", plan: "Premium" });
    case "cancel_subscription":
      return JSON.stringify({ success: true, message: `Cancelled for ${parsed.customer_id}` });
    default:
      return JSON.stringify({ error: "Unknown tool" });
  }
}

/**
 * Build an assistant message that preserves reasoning for the next turn.
 *
 * Key details:
 * - Pass reasoning_content straight back (matches what the API returns)
 * - Use "" instead of null for content
 * - Preserve the full tool_calls array
 */
function buildAssistantMessage(
  msg: OpenAI.Chat.Completions.ChatCompletionMessage
): ChatCompletionMessageParam {
  const assistantMsg: Record<string, unknown> = {
    role: "assistant",
    content: msg.content ?? "", // never null
  };

  // Preserve reasoning — critical for multi-turn
  const reasoning = (msg as any).reasoning_content ?? (msg as any).reasoning;
  if (reasoning) {
    assistantMsg.reasoning_content = reasoning;
  }

  // Preserve tool calls
  if (msg.tool_calls?.length) {
    assistantMsg.tool_calls = msg.tool_calls.map((tc) => ({
      id: tc.id,
      type: "function" as const,
      function: {
        name: tc.function.name,
        arguments: tc.function.arguments,
      },
    }));
  }

  return assistantMsg as ChatCompletionMessageParam;
}

async function runAgent(
  userMessage: string,
  systemPrompt = "You are a helpful customer service agent."
): Promise<string> {
  const messages: ChatCompletionMessageParam[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ];

  const maxSteps = 10; // safety limit

  for (let step = 0; step < maxSteps; step++) {
    const response = await client.chat.completions.create({
      model: MODEL,
      messages,
      tools,
      tool_choice: "auto",
      temperature: 0,
      max_tokens: 1000,
    });

    const msg = response.choices[0].message;

    // Append full assistant message with reasoning preserved
    messages.push(buildAssistantMessage(msg));

    // No tool calls → final response
    if (!msg.tool_calls?.length) {
      return msg.content ?? "";
    }

    // Execute each tool call and append results
    for (const tc of msg.tool_calls) {
      const result = await executeTool(tc.function.name, tc.function.arguments);
      messages.push({
        role: "tool",
        tool_call_id: tc.id,
        content: result,
      });
    }
  }

  throw new Error("Agent exceeded maximum steps");
}

// Usage
const response = await runAgent(
  "I want to cancel my subscription. My email is jane@example.com"
);
console.log(response);
```

### OpenRouter integration

When using Trinity through [OpenRouter](https://openrouter.ai/), reasoning is returned in a `reasoning_details` field (OpenRouter's unified reasoning shape). For multi-turn conversations, pass `reasoning_details` back unchanged on assistant messages; OpenRouter translates it to the model-specific upstream format automatically.
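
A sketch of replaying an OpenRouter assistant turn, working from the raw response JSON (the exact contents of `reasoning_details` are opaque by design, so they are copied through untouched):

```python
# Append an OpenRouter assistant turn with reasoning_details preserved.
msg = raw_response["choices"][0]["message"]  # raw JSON dict from OpenRouter

assistant_msg = {"role": "assistant", "content": msg.get("content") or ""}  # never null
if msg.get("reasoning_details") is not None:
    assistant_msg["reasoning_details"] = msg["reasoning_details"]  # pass back as-is
if msg.get("tool_calls"):
    assistant_msg["tool_calls"] = msg["tool_calls"]

messages.append(assistant_msg)
```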

#### Debugging upstream requests

To verify that reasoning is being sent upstream correctly, enable echo mode:

```json
{
  "debug": { "echo_upstream_body": true }
}
```

See [OpenRouter debugging docs](https://openrouter.ai/docs/api/reference/errors-and-debugging#debugging) for details.

### Common pitfalls

#### 1. `xml_in_reasoning` — tool call XML inside reasoning field

**Symptom:** The response has `tool_calls: []` (empty) and `reasoning_content` contains raw XML like `<function=get_details_by_id><parameter=id>L1001</parameter>...`

**Cause:** The previous assistant turn likely lost its reasoning context (or had an invalid message shape), so the model generated the tool call inside its thinking block instead of as structured output.

**Fix:** Ensure every assistant message in the conversation history includes `reasoning_content` from the prior API response.

```
❌ Broken — no reasoning, null content
{"role": "assistant", "content": null, "tool_calls": [...]}

✅ Fixed — reasoning preserved, content non-null
{"role": "assistant", "content": "", "reasoning_content": "...", "tool_calls": [...]}
```

#### 2. `reasoning_content` ignored on self-hosted vLLM

**Symptom:** You're passing `reasoning_content` on assistant messages to a self-hosted vLLM instance, but the model behaves as if reasoning is missing.

**Cause:** Some vLLM versions only read `reasoning` (not `reasoning_content`) from input messages ([vllm#38488](https://github.com/vllm-project/vllm/issues/38488)). This does **not** affect Arcee's hosted API, which accepts both.

**Fix:** If you're hitting vLLM directly, map `reasoning_content` → `reasoning` on input:

```python
# Self-hosted vLLM workaround
assistant_msg["reasoning"] = msg.reasoning_content
```

#### 3. `content: null` on assistant tool-call turns

**Symptom:** Degraded tool call quality or malformed output on subsequent turns.

**Cause:** When the model makes a tool call, `content` may be `null` in the API response. Passing `null` back can contribute to malformed follow-up behavior in some integrations.

**Fix:** Normalize `null` to empty string:

```python
"content": msg.content or ""   # Python
```

```typescript
content: msg.content ?? ""     // TypeScript
```

#### 4. Missing vLLM serving flags

**Symptom:** Reasoning leaks into `content` with visible `<think>...</think>` tags, or tool calls appear as raw XML instead of structured `tool_calls`.

**Fix:** Ensure vLLM is started with the correct flags:

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

| Flag                             | Purpose                                                       |
| -------------------------------- | ------------------------------------------------------------- |
| `--reasoning-parser deepseek_r1` | Separates `<think>` blocks into the `reasoning_content` field |
| `--enable-auto-tool-choice`      | Enables tool call parsing                                     |
| `--tool-call-parser qwen3_coder` | Parses tool calls into structured `tool_calls` array          |

Note: vLLM versions before 0.18 may also require `--enable-reasoning`. If the flag is not recognized, `--reasoning-parser` alone is sufficient.

### vLLM serving reference

#### Minimal serving command

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

#### Production serving command (example only)

```bash
vllm serve arcee-ai/Trinity-Large-Thinking \
  --served-model-name arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --tensor-parallel-size <set-for-your-hardware> \
  --max-model-len <set-for-your-workload> \
  --gpu-memory-utilization <set-for-your-cluster-policy> \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "$API_KEY" \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

#### Context length guidance

Set `--max-model-len` based on your real conversation lengths, tool-chain depth, and GPU memory budget. Higher values pre-allocate more KV cache at startup. A practical approach is to start conservatively, monitor for responses that finish with `finish_reason: "length"` (truncation), and scale up incrementally.
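
A minimal sketch of that monitoring step, using the SDK response object from the examples above:

```python
# Flag truncated responses so you know when to raise limits.
choice = response.choices[0]
if choice.finish_reason == "length":
    # The reply hit max_tokens or the context window; consider raising
    # max_tokens, trimming history, or serving with a larger --max-model-len.
    print("warning: response truncated (finish_reason=length)")
```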


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.arcee.ai/capabilities/reasoning-traces.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
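
For example, from Python (a minimal sketch using `requests`; the question text is illustrative):

```python
import requests

# Ask the documentation a question and print the answer.
resp = requests.get(
    "https://docs.arcee.ai/capabilities/reasoning-traces.md",
    params={"ask": "Which vLLM flags are required for Trinity-Large-Thinking?"},
    timeout=30,
)
print(resp.text)
```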

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
