Good news! I've extended my lightweight AI orchestrator, **CleverChatty**, to support Retrieval-Augmented Generation (RAG), integrated via the **Model Context Protocol (MCP)**.
### Quick Recap
* **RAG (Retrieval-Augmented Generation)** is an AI technique that enhances language models by retrieving relevant external documents (e.g., from databases or vector stores) based on a user’s query. These documents are then used as additional context during response generation, enabling more accurate, up-to-date, and grounded outputs.
* **MCP (Model Context Protocol)** is a standard for how external systems—such as tools, memory, or document retrievers—communicate with language models. It enables structured, portable, and extensible context exchange, making it ideal for building complex AI systems like assistants, copilots, or agents.
* **[CleverChatty](https://github.com/Gelembjuk/cleverchatty)** is a simple AI orchestrator that connects LLMs with tools over MCP and supports external memory. My goal is to expand it to work with modern AI infrastructure—RAG, memory, tools, agent-to-agent (A2A) interaction, and beyond. It’s provided as a library, and you can explore it via the CLI interface: [CleverChatty CLI](https://github.com/Gelembjuk/cleverchatty-cli).
In short: RAG allows AI assistants to tap into external knowledge sources—typically a company’s private data—to operate with more precision and relevance.
---
### Two Main Ways to Integrate RAG into an AI Assistant
One of the defining features of RAG is that it's a **concept**, not a rigid standard. It can be implemented in various ways, depending on the architecture and goals.
#### 1. **Direct Connection**
The RAG system is called **before** the LLM prompt is submitted. It retrieves relevant documents that are injected into the prompt context.
* The request method depends on the RAG system's API—some use REST endpoints, others use SDKs or local modules.
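To make the flow concrete, here is a minimal sketch of the direct approach in Go. The `searchKnowledge` and `askLLM` functions are hypothetical placeholders for a RAG client and an LLM client (they are not part of CleverChatty); the only point is the order of the calls: retrieve first, then prompt.

```golang
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for a real RAG client and LLM client.
var searchKnowledge = func(ctx context.Context, query string, num int) (string, error) {
	return "summaries of the retrieved documents", nil
}
var askLLM = func(ctx context.Context, prompt string) (string, error) {
	return "answer grounded in the injected context", nil
}

// answerDirect implements the direct connection: the RAG system is always called
// before the LLM, and the retrieved documents are injected into the prompt.
func answerDirect(ctx context.Context, userPrompt string) (string, error) {
	docs, err := searchKnowledge(ctx, userPrompt, 4)
	if err != nil {
		return "", err
	}
	prompt := "Context: " + docs + "\n\n" + userPrompt
	return askLLM(ctx, prompt)
}

func main() {
	answer, _ := answerDirect(context.Background(), "What do you know about recent events?")
	fmt.Println(answer)
}
```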
#### 2. **Indirect (Tool-Based) Connection**
The RAG system is implemented as a **tool**—for example, an MCP server. The LLM autonomously decides when and how to invoke it.
* This method enables the LLM to call the RAG system dynamically, only when needed.
* The easiest implementation is to wrap the RAG system as an MCP server.
* However, because the model decides whether to invoke it, there’s always a chance it might not use the RAG tool at all. This behavior depends on the model’s training and configuration.
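For comparison, here is an equally rough sketch of the tool-based flow. `chatWithLLM` and `callMCPTool` are again hypothetical placeholders (a real orchestrator's tool-call loop is more involved); what matters is that retrieval happens only if the model asks for it.

```golang
package main

import (
	"context"
	"fmt"
)

// ToolCall is a simplified stand-in for an LLM tool-call request.
type ToolCall struct {
	Name  string
	Query string
}

// chatWithLLM returns either a final answer or a tool call (hypothetical placeholder).
var chatWithLLM = func(ctx context.Context, prompt, toolResult string) (string, *ToolCall, error) {
	if toolResult == "" {
		// The model may decide it needs external knowledge.
		return "", &ToolCall{Name: "knowledge_search", Query: prompt}, nil
	}
	return "answer grounded in: " + toolResult, nil, nil
}

// callMCPTool invokes the requested tool on the RAG MCP server (hypothetical placeholder).
var callMCPTool = func(ctx context.Context, call *ToolCall) (string, error) {
	return "summaries returned by knowledge_search", nil
}

// answerWithTool implements the tool-based connection: the model decides whether to retrieve.
func answerWithTool(ctx context.Context, userPrompt string) (string, error) {
	answer, call, err := chatWithLLM(ctx, userPrompt, "")
	if err != nil || call == nil {
		return answer, err // the model answered without using the RAG tool
	}
	docs, err := callMCPTool(ctx, call) // the model asked for knowledge_search
	if err != nil {
		return "", err
	}
	// Feed the tool result back so the model can produce the final, grounded answer.
	answer, _, err = chatWithLLM(ctx, userPrompt, docs)
	return answer, err
}

func main() {
	answer, _ := answerWithTool(context.Background(), "What do you know about recent events?")
	fmt.Println(answer)
}
```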
---
### Trade-offs Between Approaches
* **Direct approach**: More predictable. You ensure the model sees the exact context you want by fetching it ahead of time.
* **Tool-based approach**: More flexible. It allows for dynamic, multi-turn interactions where the model chooses when and how to gather more information.
---
## MCP as a Universal Protocol for RAG
MCP is a natural choice when using RAG as a tool, but it can also serve as a **connector** for direct integration. Since most AI orchestrators and assistants already support MCP, using MCP to connect to a RAG system means you **don’t need to implement a separate API client**—you can reuse the existing MCP client.
This becomes even more powerful if we **standardize the MCP interface** for RAG systems. With a shared interface, you could easily swap one RAG backend for another, or assign different RAG systems to different use cases.
I’ve previously shared the idea of **MCP interfaces** in several communities and received encouraging feedback, though no official standard exists yet. For example, I defined an MCP interface for AI memory in this [post](/blog/post/implementing-ai-chat-memory-with-mcp/).
I propose a similar approach for a **RAG MCP interface**, which could be as simple as a single method:
```plain
knowledge_search(query: string, num: int) -> string
```
Where:
* `query` is the user's search query
* `num` is the number of results to return
* The return value is a string containing summaries of the retrieved pages
In short, if your RAG system works over MCP and implements this interface, it becomes easy to replace it with another RAG system or use it as a tool in any MCP-based AI orchestrator. You don't need to modify the orchestrator or the LLM code—just update the MCP server configuration.
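In Go terms, the contract an orchestrator could code against might look like the small interface below. This is only an illustration of the proposed interface, not an existing CleverChatty type; any MCP server exposing a `knowledge_search` tool could be wrapped to satisfy it.

```golang
package rag

import "context"

// KnowledgeSource is a hypothetical Go view of the proposed RAG MCP interface:
// a single knowledge_search(query, num) -> string method.
type KnowledgeSource interface {
	// KnowledgeSearch returns summaries of the most relevant documents for the query.
	KnowledgeSearch(ctx context.Context, query string, num int) (string, error)
}
```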
## Example of a RAG System Connected to CleverChatty
Here’s a simple example of a RAG system integrated with **CleverChatty**. It’s an MCP server that provides Wikipedia search results. This server can be used in both **direct connection** and **tool-based** modes.
It also supports two different **MCP transport protocols**: `stdio` and `sse`. This makes it a good demonstration of **multi-transport support** in MCP servers.
For more details on how to implement support for multiple transports in MCP servers, check out this post: [Easily Switch Transport Protocols in MCP Servers](/blog/post/easily-switch-transport-protocols-in-mcp-servers/).
```golang
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/mark3labs/mcp-go/mcp"
	"github.com/mark3labs/mcp-go/server"
	wiki "github.com/trietmn/go-wiki"
	"github.com/trietmn/go-wiki/page"
)

var requireAuth = false

func main() {
	if len(os.Args) < 2 {
		binaryName := filepath.Base(os.Args[0])
		fmt.Println("Usage: ./" + binaryName + " [OPTIONS]")
		fmt.Println("Usage: ./" + binaryName + " sse HOST:PORT")
		fmt.Println("Usage: ./" + binaryName + " stdio")
		os.Exit(1)
	}

	transport := os.Args[1]

	switch transport {
	case "stdio":
		runAsStdio()
	case "sse":
		if len(os.Args) < 3 {
			fmt.Println("sse transport requires a host and port argument (e.g., 0.0.0.0:8080)")
			os.Exit(1)
		}
		host_and_port := os.Args[2]
		runAsSSE(host_and_port)
	default:
		fmt.Printf("Unknown transport: %s\n", transport)
		os.Exit(3)
	}
}

func runAsStdio() {
	if err := server.ServeStdio(createServer()); err != nil {
		fmt.Printf("😡 Server error: %v\n", err)
	}
}

func runAsSSE(host_and_port string) {
	// for SSE we require auth
	requireAuth = true
	// Start the SSE server
	sObj := server.NewSSEServer(createServer())
	fmt.Println("🚀 Server started")
	if err := sObj.Start(host_and_port); err != nil {
		fmt.Printf("😡 Server error: %v\n", err)
	}
	fmt.Println("👋 Server stopped")
}

// All the code below is not related to the MCP server transport

func createServer() *server.MCPServer {
	// Create MCP server
	s := server.NewMCPServer(
		"Server to provide Wikipedia search results",
		"1.0.0",
	)

	execTool := mcp.NewTool("knowledge_search",
		mcp.WithDescription("Search Wikipedia for a given query and return the summaries of the pages. Use this tool to get additional context for your command."),
		mcp.WithString("query",
			mcp.Required(),
			mcp.Description("The query to search for documents and information"),
		),
		mcp.WithNumber("num",
			mcp.Description("The number of results to return"),
		),
	)

	s.AddTool(execTool, cmdKnowledgeSearch)

	return s
}

func cmdKnowledgeSearch(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	args, ok := request.Params.Arguments.(map[string]interface{})
	if !ok {
		return mcp.NewToolResultError("invalid arguments"), nil
	}

	query, ok := args["query"].(string)
	if !ok || query == "" {
		return mcp.NewToolResultError("query is required"), nil
	}

	num, ok := args["num"].(int)
	if !ok {
		numf, ok := args["num"].(float64)
		if !ok {
			num = 4
		} else {
			num = int(numf)
		}
	}
	if num <= 0 {
		num = 4
	}

	pages, err := getPagesFromWiki(query, num)
	if err != nil {
		return mcp.NewToolResultError(fmt.Sprintf("failed to get pages: %v", err)), nil
	}
	if len(pages) == 0 {
		return mcp.NewToolResultError("no pages found"), nil
	}

	// Create a string with the page summaries
	summaries := ""
	for _, p := range pages {
		summaries += p + "\n\n"
	}

	// Return the summaries as a text result
	return mcp.NewToolResultText(summaries), nil
}

func getPagesFromWiki(query string, num int) (pages []string, err error) {
	// Perform a Wikipedia search
	results, _, err := wiki.Search(query, num, false)
	if err != nil {
		fmt.Println("Error searching Wikipedia: ", err)
		return
	}
	if len(results) == 0 {
		err = fmt.Errorf("no results found for query: %s", query)
		return
	}

	pages = []string{}
	for _, title := range results {
		var p page.WikipediaPage
		p, pageErr := wiki.GetPage(title, -1, false, true)
		if pageErr != nil {
			continue
		}
		summary, sumErr := p.GetSummary()
		if sumErr != nil {
			continue
		}
		// remove any empty lines from the summary
		lines := []string{}
		for _, line := range strings.Split(summary, "\n") {
			if strings.TrimSpace(line) != "" {
				lines = append(lines, line)
			}
		}
		summary = strings.Join(lines, "\n")
		pages = append(pages, summary)
	}
	return
}
```
## Using a RAG System with CleverChatty
### Direct Connection
The configuration of **CleverChatty CLI** should include the RAG system as an MCP server under the `mcpServers` section. For example:
```json
{
  "mcpServers": {
    "RAG_Server": {
      "url": "http://localhost:8002/sse",
      "headers": [],
      "interface": "rag"
    },
    .... other MCP servers
  }
}
```
The `"interface": "rag"` setting is key here — it tells **CleverChatty CLI** that this MCP server is a RAG system and that it should use the `knowledge_search` method to retrieve relevant documents **before** processing the user's prompt.
When the user asks a question, CleverChatty CLI will automatically call the RAG system, fetch the most relevant documents, and inject them into the context before sending the prompt to the language model.
**Note:** Your RAG MCP server can also run over `STDIO`. In that case, the configuration should look like this:
```json
{
  "mcpServers": {
    "RAG_Server": {
      "command": "rag_server_binary_name",
      "args": ["stdio"],
      "interface": "rag"
    },
    .... other MCP servers
  },
  .........
}
```
### Tool-Based Connection
The configuration remains the same, but **you should not include** `"interface": "rag"` in the MCP server settings. In this mode, the RAG system is used as a **tool**, and there's nothing special about it—it's simply another MCP server.
The key difference is that the **LLM decides** whether and when to use the tool. That’s why, unlike in the direct connection approach, the **tool description becomes crucial**. It should be clear and concise, helping the LLM understand what the tool does and how to use it effectively.
```json
{
  "mcpServers": {
    "RAG_Server": {
      "url": "http://localhost:8002/sse",
      "headers": []
    },
    .... other MCP servers
  }
}
```
### More Configuration Options
RAG usage in **CleverChatty CLI** can be fine-tuned using the `rag_settings` section in the configuration file. These settings apply **only to the direct connection mode**. Tool-based connections do not require any special configuration.
```json
{
  "rag_settings": {
    "context_prefix": "Context: ",
    "require_preprocessing": true,
    "preprocessing_prompt": "Extract the most relevant key phrase/key word from the following text"
  },
  "mcpServers": {
    ...
  }
}
```
* **`context_prefix`** – This prefix is added to the RAG context before the user's prompt. It helps the LLM distinguish between retrieved context and user input. The default is `"Context: "`, but you can customize it to fit your prompt formatting.
* **`require_preprocessing`** – If set to `true`, CleverChatty CLI will **preprocess the user's prompt** using the LLM before calling the RAG system. This can improve retrieval quality by simplifying or refining the query—for example, extracting a key phrase or keyword. It’s especially useful if your RAG backend does not support complex queries (by default, the full user prompt is sent as the query).
* **`preprocessing_prompt`** – This is the prompt used to instruct the LLM on how to preprocess the user's input. It is only used if `require_preprocessing` is `true`.
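Putting these options together, the direct-connection pipeline behaves roughly like the sketch below. The function names (`askLLM`, `knowledgeSearch`) and the exact prompt layout are assumptions made for illustration, not CleverChatty's actual internals; the three settings map to the marked steps.

```golang
package main

import (
	"context"
	"fmt"
)

// Illustrative placeholders for the LLM call and the knowledge_search MCP tool call.
var askLLM = func(ctx context.Context, prompt string) (string, error) { return "key phrase", nil }
var knowledgeSearch = func(ctx context.Context, query string, num int) (string, error) {
	return "retrieved summaries", nil
}

// buildPrompt mimics how rag_settings drive the direct connection (a sketch, not the real code).
func buildPrompt(ctx context.Context, userPrompt, contextPrefix, preprocessingPrompt string, requirePreprocessing bool) (string, error) {
	query := userPrompt
	// require_preprocessing: let the LLM reduce the prompt to a simpler search query first.
	if requirePreprocessing {
		if refined, err := askLLM(ctx, preprocessingPrompt+"\n\n"+userPrompt); err == nil && refined != "" {
			query = refined
		}
	}
	// Call the RAG system (knowledge_search over MCP) before the main LLM request.
	docs, err := knowledgeSearch(ctx, query, 4)
	if err != nil {
		return "", err
	}
	// context_prefix: mark the retrieved text so the model can tell it apart from the user's input.
	return contextPrefix + docs + "\n\n" + userPrompt, nil
}

func main() {
	prompt, _ := buildPrompt(context.Background(),
		"What do you know about the latest release?",
		"Context: ",
		"Extract the most relevant key phrase/key word from the following text",
		true,
	)
	fmt.Println(prompt)
}
```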
---
## Example: Using a RAG System with CleverChatty CLI
In this example, I'm using the lightweight local LLM **Qwen 2.5**. It's a small model with limited world knowledge and outdated training data, so it doesn't have information about recent events. This makes it a perfect use case for supplementing knowledge with a RAG system.
### Test Chat Without RAG Connected
```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The Trump–Zelenskyy Oval Office meeting was an historic encounter that
took place on April 24, 2019, in the East Room of the White House. At this
meeting, then-U.S. President Donald Trump and Ukrainian President
Volodymyr Zelensky discussed a range of issues, including anti-corruption
efforts, Russian interference in the U.S. presidential election, and
NATO's stance on Russia.
Some key points about this meeting include:
1. Purpose: The meeting was initiated by Trump to address concerns raised
by Ukraine's parliament regarding corruption and Russian interference
in Ukraine's 2019 presidential elections.
2. Content of Discussion: Both leaders discussed various issues including
anti-corruption efforts, the role of NATO forces in Europe, and
Russia's actions during the 2016 U.S. presidential election.
......... truncated for brevity .........
```
As we can see, the LLM does not have any information about this meeting in its memory, so it provides a very poor answer.
### Test Chat With RAG Connected Directly
When RAG is enabled, the response is much better. Even though this historical event is not in the LLM's memory, it can still produce a useful response.
I received similar results both with and without extra preprocessing of the prompt.
```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The Trump–Zelenskyy Oval Office meeting on February 28, 2025, was a highly
contentious and controversial event that took place during the ongoing
Russo-Ukrainian War. Here are some key points regarding this meeting:
1. Purpose: The meeting aimed to discuss continued U.S. support for
Ukraine in repelling Russia's invasion of the country.
2. Content and Tone: It became known as a confrontation between President
Donald Trump of the United States and President Volodymyr Zelenskyy of
Ukraine, with Trump and his surrogates, such as Vice President Joe
Biden (through personal lawyer Rudy Giuliani), criticizing Zelenskyy
multiple times.
3. Criticism from Zelenskyy: Zelenskyy repeatedly stated that he was only
supporting the Ukrainian people and fighting against Russian aggression
on their behalf.
4. Media Description: Media outlets described it as an unprecedented
public confrontation between an American president and a foreign head
of state, highlighting its stark nature compared to previous diplomatic
exchanges.
......... truncated for brevity .........
```
This worked exactly as expected. Before processing the user's prompt with the LLM, we called the RAG system to retrieve the most relevant documents. It found the page with information about the meeting, we included it in the context before the prompt, and the LLM was able to provide a very good answer.
### Test Chat With RAG Connected as a Tool
And one more example. This time the RAG system is connected as a "normal" MCP server: we do not call it automatically before the prompt, but allow the LLM to call it as a tool when it needs more information. So the LLM decides by itself.
```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The meeting you are referring to is the controversial Oval Office meeting
held between President Donald Trump and Ukrainian President Volodymyr
Zelensky on February 28, 2025. This meeting took place in the Oval Office
at the White House in Washington D.C., and was televised live.
Key points from this meeting include:
• The main goal of the meeting was to discuss continued U.S. support for
Ukraine during its ongoing conflict with Russia.
• There were tensions between Trump's administration and Zelensky’s
government regarding security guarantees against future Russian
aggression.
• During the meeting, there were criticisms directed at Zelensky by both
Trump and Vance, often drowning out his voice. The confrontation ended
abruptly without a clear resolution.
......... truncated for brevity .........
```
As we can see, the LLM made the correct decision to call the RAG system to get more information about the meeting. It invoked it as a tool and received the information from the RAG system. This makes it possible to get a correct answer even though the LLM does not have this information in its memory.
In the logs of my RAG MCP server, I can see that it was called with the expected query. So the LLM was smart enough to invoke the RAG system with the correct query.
## Try it yourself
* Prepare an MCP server with the interface described above: it must have the `knowledge_search` method. You can simply copy the code above and run it as an MCP server; it will provide Wikipedia search results.
* Prepare the config file, like this one (assuming you have Ollama installed and the llama3.1 model available):
```json
{
  "log_file_path": "",
  "model": "ollama:llama3.1:latest",
  "system_instruction": "",
  "mcpServers": {
    "RAG_Server": {
      "url": "http://localhost:8002/sse",
      "headers": [],
      "interface": "rag"
    }
  },
  "rag_settings": {
    "context_prefix": "Context: ",
    "require_preprocessing": true,
    "preprocessing_prompt": "Extract the most relevant key phrase/key word from the following text"
  }
}
```
Save this to a file, for example `config.json`.
* Then run the RAG MCP server and CleverChatty CLI with this config file. If you use the Wikipedia server from this post, start it with the `sse` transport on the address the config points to (for example, `sse localhost:8002`).
To run CleverChatty CLI you need to have Go installed. Then you can run it with the following command:
```bash
go run github.com/gelembjuk/cleverchatty-cli@latest --config config.json
```
Then you can ask questions and see how it works.