# Knowledge Retrieval

The knowledge retrieval node stores your documents and data for use in workflows, enabling retrieval-augmented generation (RAG), semantic search, and document retrieval.

The knowledge retrieval node automates the complexities of setting up a vector database, pre-processing your documents, loading the data into the database, and executing queries to retrieve relevant documents.

### Retrieval Node Components

**Vector Database**

Orchestra Enterprise customers are automatically provisioned with a vector database when their organization is created. Each database is isolated per customer: your data is secure, and only you can access the data in your database.

**Data Upload**

To upload data, select the Knowledge Retrieval Node and click on "+ Add Parameter". This will give you the ability to upload documents. Current supported document types include TXT, PDF, JSON, MD, XLSX, DOCX, and PPTX, with an individual file max size of 15MB.

<figure><img src="/files/0AOx1pZpv6PACRMN6hOr" alt="" width="375"><figcaption><p>Data Upload</p></figcaption></figure>
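As a sketch, the upload limits above can be checked client-side before sending a file. The function and constant names below are illustrative, not part of Orchestra's API:

```python
from pathlib import Path

# Limits documented above: supported extensions and a 15 MB per-file cap.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".json", ".md", ".xlsx", ".docx", ".pptx"}
MAX_FILE_SIZE_BYTES = 15 * 1024 * 1024

def is_uploadable(filename: str, size_bytes: int) -> bool:
    """Return True if a file's type and size satisfy the upload limits."""
    return (Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
            and size_bytes <= MAX_FILE_SIZE_BYTES)
```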

**Data Pre-Processing**

Once data has been uploaded, it is processed to be optimally stored in the vector database. The data is parsed, cleaned, chunked, embedded, and indexed.

1. For TXT and PDF files, content is parsed into text, and any embedded images undergo optical character recognition (OCR) to extract their text.
2. Deduplication is applied to remove redundant data for more relevant search.
3. Documents are split into text segments, or "chunks".
4. Text chunks are vectorized using a top [MTEB](https://github.com/embeddings-benchmark/mteb) model.
5. Vectors are indexed using a proprietary vector indexing algorithm.
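The pre-processing steps above can be sketched in simplified form. The hash-based deduplication, fixed-size chunking, and placeholder embedding below are illustrative assumptions; Orchestra's actual parsing, embedding model, and indexing are managed for you:

```python
import hashlib

def dedupe(docs: list[str]) -> list[str]:
    """Step 2: drop exact duplicates by content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Step 3: split text into overlapping fixed-size chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Step 4 stand-in: a real pipeline uses an MTEB-ranked embedding model."""
    # Toy 4-dimensional placeholder vectors -- NOT Orchestra's actual model.
    return [[float(b) for b in hashlib.sha256(c.encode()).digest()[:4]] for c in chunks]

# Parsed text (step 1) flows through deduplication, chunking, and embedding;
# indexing the vectors (step 5) is handled by the platform.
docs = ["First parsed document.", "First parsed document.", "Second parsed document."]
chunks = [c for d in dedupe(docs) for c in chunk(d)]
vectors = embed(chunks)
```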

**Data Storage**

Data is stored in an index within the vector database.

#### Search Prompt

Once all data is uploaded to the knowledge retrieval node, you specify a prompt that is used to search the database via vector search, semantic search, or full-text search. The most common approach is to pass the prompt in dynamically, either provided by the user or produced earlier in the workflow.

<figure><img src="/files/t5b36cYPDVy3M8yeUVa8" alt="" width="375"><figcaption><p>Knowledge Retrieval Node</p></figcaption></figure>
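A minimal sketch of assembling a dynamic search prompt from values produced earlier in a workflow. The `{placeholder}` templating shown is plain Python string formatting used for illustration, not Orchestra's own variable syntax:

```python
def build_search_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill a search-prompt template with values produced earlier in the
    workflow (e.g. a user question captured by a previous step)."""
    return template.format(**variables)

prompt = build_search_prompt(
    "Find policy documents relevant to: {user_question}",
    {"user_question": "What is our remote-work policy?"},
)
```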

**Inference / Data Retrieval**

When the knowledge retrieval node is invoked:

1. The prompt is embedded with the same model used to embed the data in the vector database.
2. Vector search or full-text search retrieves the most relevant documents.
3. A reranker reorders the results before the relevant data is returned to the workflow.
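The retrieval steps above can be sketched with toy data. The cosine-similarity search and pass-through reranker below are simplified stand-ins for Orchestra's managed search and reranking, and the query vector is supplied directly rather than produced by the embedding model of step 1:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def vector_search(query_vec: list[float], index: list[dict], k: int = 3) -> list[dict]:
    """Step 2: return the k indexed chunks closest to the query vector."""
    return sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)[:k]

def rerank(query: str, candidates: list[dict], k: int = 2) -> list[dict]:
    """Step 3 stand-in: a real reranker re-scores (query, chunk) pairs with a
    model; this placeholder simply truncates to the top k."""
    return candidates[:k]

# Toy index with 2-dimensional vectors; real indexes hold model embeddings.
index = [
    {"text": "refund policy", "vector": [1.0, 0.0]},
    {"text": "shipping times", "vector": [0.0, 1.0]},
    {"text": "returns process", "vector": [0.9, 0.1]},
]
hits = rerank("How do refunds work?", vector_search([1.0, 0.0], index))
```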

**Knowledge Retrieval Node Output**

The knowledge retrieval node returns the model's response to the prompt, based on the documents retrieved from the vector database.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.arcee.ai/arcee-orchestra/workflows/workflow-components/knowledge-retrieval.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
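For example, the query URL can be constructed with the Python standard library (the question text below is illustrative):

```python
from urllib.parse import urlencode

BASE = "https://docs.arcee.ai/arcee-orchestra/workflows/workflow-components/knowledge-retrieval.md"

def ask_url(question: str) -> str:
    """Build the documentation-query URL; an HTTP GET on it returns the answer."""
    return f"{BASE}?{urlencode({'ask': question})}"

url = ask_url("What file types does the knowledge retrieval node support?")
```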
