Adding Support for Retrieval-Augmented Generation (RAG) to AI Orchestrator


Good news! I've extended my lightweight AI orchestrator, **CleverChatty**, to support Retrieval-Augmented Generation (RAG) by integrating it using the **Model Context Protocol (MCP)**.

### Quick Recap

* **RAG (Retrieval-Augmented Generation)** is an AI technique that enhances language models by retrieving relevant external documents (e.g., from databases or vector stores) based on a user’s query. These documents are then used as additional context during response generation, enabling more accurate, up-to-date, and grounded outputs.
* **MCP (Model Context Protocol)** is a standard for how external systems—such as tools, memory, or document retrievers—communicate with language models. It enables structured, portable, and extensible context exchange, making it ideal for building complex AI systems like assistants, copilots, or agents.
* **[CleverChatty](https://github.com/Gelembjuk/cleverchatty)** is a simple AI orchestrator that connects LLMs with tools over MCP and supports external memory. My goal is to expand it to work with modern AI infrastructure—RAG, memory, tools, agent-to-agent (A2A) interaction, and beyond. It’s provided as a library, and you can explore it via the CLI interface: [CleverChatty CLI](https://github.com/Gelembjuk/cleverchatty-cli).

In short: RAG allows AI assistants to tap into external knowledge sources—typically a company’s private data—to operate with more precision and relevance.

---

### Two Main Ways to Integrate RAG into an AI Assistant

One of the defining features of RAG is that it's a **concept**, not a rigid standard. It can be implemented in various ways, depending on the architecture and goals.

#### 1. **Direct Connection**

The RAG system is called **before** the LLM prompt is submitted. It retrieves relevant documents that are injected into the prompt context.

* The request method depends on the RAG system's API—some use REST endpoints, others use SDKs or local modules.

#### 2. **Indirect (Tool-Based) Connection**

The RAG system is implemented as a **tool**—for example, an MCP server. The LLM autonomously decides when and how to invoke it.

* This method enables the LLM to call the RAG system dynamically, only when needed.
* The easiest implementation is to wrap the RAG system as an MCP server.
* However, because the model decides whether to invoke it, there’s always a chance it might not use the RAG tool at all. This behavior depends on the model’s training and configuration.

---

### Trade-offs Between Approaches

* **Direct approach**: More predictable. You ensure the model sees the exact context you want by fetching it ahead of time.
* **Tool-based approach**: More flexible. It allows for dynamic, multi-turn interactions where the model chooses when and how to gather more information.

---

## MCP as a Universal Protocol for RAG

MCP is a natural choice when using RAG as a tool, but it can also serve as a **connector** for direct integration. Since most AI orchestrators and assistants already support MCP, using MCP to connect to a RAG system means you **don’t need to implement a separate API client**—you can reuse the existing MCP client.

This becomes even more powerful if we **standardize the MCP interface** for RAG systems. With a shared interface, you could easily swap one RAG backend for another, or assign different RAG systems to different use cases.

I’ve previously shared the idea of **MCP interfaces** in several communities and received encouraging feedback, though no official standard exists yet.
For example, I defined an MCP interface for AI memory in this [post](/blog/post/implementing-ai-chat-memory-with-mcp/). I propose a similar approach for a **RAG MCP interface**, which could be as simple as a single method:

```plain
knowledge_search(query: string, num: int) -> string
```

Where:

* `query` is the user's search query
* `num` is the number of results to return
* The return value is a string containing summaries of the retrieved pages

In short, if your RAG system works over MCP and implements this interface, it becomes easy to replace it with another RAG system or to use it as a tool in any MCP-based AI orchestrator. You don't need to modify the orchestrator or the LLM code—just update the MCP server configuration.

## Example of a RAG System Connected to CleverChatty

Here’s a simple example of a RAG system integrated with **CleverChatty**. It’s an MCP server that provides Wikipedia search results. This server can be used in both **direct connection** and **tool-based** modes.

It also supports two different **MCP transport protocols**: `stdio` and `sse`. This makes it a good demonstration of **multi-transport support** in MCP servers. For more details on how to implement support for multiple transports in MCP servers, check out this post: [Easily Switch Transport Protocols in MCP Servers](/blog/post/easily-switch-transport-protocols-in-mcp-servers/).

```golang
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/mark3labs/mcp-go/mcp"
	"github.com/mark3labs/mcp-go/server"
	wiki "github.com/trietmn/go-wiki"
	"github.com/trietmn/go-wiki/page"
)

var requireAuth = false

func main() {
	if len(os.Args) < 2 {
		binaryName := filepath.Base(os.Args[0])
		fmt.Println("Usage: ./" + binaryName + " [OPTIONS]")
		fmt.Println("Usage: ./" + binaryName + " sse HOST:PORT")
		fmt.Println("Usage: ./" + binaryName + " stdio")
		os.Exit(1)
	}

	transport := os.Args[1]

	switch transport {
	case "stdio":
		runAsStdio()
	case "sse":
		if len(os.Args) < 3 {
			fmt.Println("sse transport requires a host and port argument (e.g., 0.0.0.0:8080)")
			os.Exit(1)
		}
		host_and_port := os.Args[2]
		runAsSSE(host_and_port)
	default:
		fmt.Printf("Unknown transport: %s\n", transport)
		os.Exit(3)
	}
}

func runAsStdio() {
	if err := server.ServeStdio(createServer()); err != nil {
		fmt.Printf("😡 Server error: %v\n", err)
	}
}

func runAsSSE(host_and_port string) {
	// for SSE we require auth
	requireAuth = true

	// Start the SSE server
	sObj := server.NewSSEServer(createServer())

	fmt.Println("🚀 Server started")

	if err := sObj.Start(host_and_port); err != nil {
		fmt.Printf("😡 Server error: %v\n", err)
	}

	fmt.Println("👋 Server stopped")
}

// All the code below is not related to the MCP server transport

func createServer() *server.MCPServer {
	// Create MCP server
	s := server.NewMCPServer(
		"Server to provide Wikipedia search results",
		"1.0.0",
	)

	execTool := mcp.NewTool("knowledge_search",
		mcp.WithDescription("Search Wikipedia for a given query and return the summaries of the pages. Use this tool to get additional context for your command."),
		mcp.WithString("query",
			mcp.Required(),
			mcp.Description("The query to search for documents and information"),
		),
		mcp.WithNumber("num",
			mcp.Description("The number of results to return"),
		),
	)

	s.AddTool(execTool, cmdKnowledgeSearch)

	return s
}

func cmdKnowledgeSearch(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	args, ok := request.Params.Arguments.(map[string]interface{})
	if !ok {
		return mcp.NewToolResultError("invalid arguments"), nil
	}

	query, ok := args["query"].(string)
	if !ok || query == "" {
		return mcp.NewToolResultError("query is required"), nil
	}

	num, ok := args["num"].(int)
	if !ok {
		numf, ok := args["num"].(float64)
		if !ok {
			num = 4
		} else {
			num = int(numf)
		}
	}
	if num <= 0 {
		num = 4
	}

	pages, err := getPagesFromWiki(query, num)
	if err != nil {
		return mcp.NewToolResultError(fmt.Sprintf("failed to get pages: %v", err)), nil
	}

	if len(pages) == 0 {
		return mcp.NewToolResultError("no pages found"), nil
	}

	// Create a string with the page summaries
	summaries := ""
	for _, page := range pages {
		summaries += page + "\n\n"
	}

	// Return the summaries as a text result
	return mcp.NewToolResultText(summaries), nil
}

func getPagesFromWiki(query string, num int) (pages []string, err error) {
	// Perform a Wikipedia search
	results, _, err := wiki.Search(query, num, false)
	if err != nil {
		fmt.Println("Error searching Wikipedia: ", err)
		return
	}

	if len(results) == 0 {
		err = fmt.Errorf("no results found for query: %s", query)
		return
	}

	pages = []string{}

	for _, title := range results {
		var p page.WikipediaPage
		p, err = wiki.GetPage(title, -1, false, true)
		if err != nil {
			continue
		}
		summary, err := p.GetSummary()
		if err != nil {
			continue
		}
		// remove any empty lines in this summary
		lines := []string{}
		for _, line := range strings.Split(summary, "\n") {
			if strings.TrimSpace(line) != "" {
				lines = append(lines, line)
			}
		}
		summary = strings.Join(lines, "\n")
		pages = append(pages, summary)
	}

	return
}
```

## Using the RAG System with CleverChatty

### Direct connection

The configuration of **CleverChatty CLI** should include the RAG system as an MCP server under the `mcpServers` section. For example:

```json
{
    "mcpServers": {
        "RAG_Server": {
            "url": "http://localhost:8002/sse",
            "headers": [],
            "interface": "rag"
        },
        .... other MCP servers
    },
}
```

The `"interface": "rag"` setting is key here — it tells **CleverChatty CLI** that this MCP server is a RAG system and that it should use the `knowledge_search` method to retrieve relevant documents **before** processing the user's prompt. When the user asks a question, CleverChatty CLI will automatically call the RAG system, fetch the most relevant documents, and inject them into the context before sending the prompt to the language model.

**Note:** Your RAG MCP server can also run over `STDIO`. In that case, the configuration should look like this:

```json
{
    "mcpServers": {
        "RAG_Server": {
            "command": "rag_server_binary_name",
            "args": ["stdio"],
            "interface": "rag"
        },
        .... other MCP servers
    },
    .........
}
```
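Before moving on to the tool-based mode, here is a rough sketch of what the direct mode boils down to inside an orchestrator. The `RAGClient` and `LLM` interfaces and the `AnswerWithRAG` function are hypothetical names used only for illustration, not CleverChatty's actual internals; the `"Context: "` prefix corresponds to the `context_prefix` setting described later in this post.

```golang
package ragdirect

import (
	"context"
	"fmt"
)

// RAGClient abstracts any MCP server that implements the proposed RAG interface
// (i.e., exposes a knowledge_search tool).
type RAGClient interface {
	// KnowledgeSearch calls knowledge_search and returns page summaries as text.
	KnowledgeSearch(ctx context.Context, query string, num int) (string, error)
}

// LLM abstracts whichever language model backend the orchestrator talks to.
type LLM interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// AnswerWithRAG shows the direct mode in a nutshell: retrieve first, then prompt.
func AnswerWithRAG(ctx context.Context, rag RAGClient, llm LLM, userPrompt string) (string, error) {
	// 1. Fetch relevant documents before the LLM ever sees the prompt.
	docs, err := rag.KnowledgeSearch(ctx, userPrompt, 4)
	if err != nil {
		return "", fmt.Errorf("knowledge_search failed: %w", err)
	}

	// 2. Inject the retrieved context ahead of the user's question.
	prompt := "Context: " + docs + "\n\n" + userPrompt

	// 3. The model answers with the extra context already in place.
	return llm.Complete(ctx, prompt)
}
```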
### Tool-Based Connection

The configuration remains the same, but **you should not include** `"interface": "rag"` in the MCP server settings. In this mode, the RAG system is used as a **tool**, and there's nothing special about it—it's simply another MCP server.

The key difference is that the **LLM decides** whether and when to use the tool. That’s why, unlike in the direct connection approach, the **tool description becomes crucial**. It should be clear and concise, helping the LLM understand what the tool does and how to use it effectively.

```json
{
    "mcpServers": {
        "RAG_Server": {
            "url": "http://localhost:8002/sse",
            "headers": []
        },
        .... other MCP servers
    },
}
```

### More Configuration Options

RAG usage in **CleverChatty CLI** can be fine-tuned using the `rag_settings` section in the configuration file. These settings apply **only to the direct connection mode**. Tool-based connections do not require any special configuration.

```json
{
    "rag_settings": {
        "context_prefix": "Context: ",
        "require_preprocessing": true,
        "preprocessing_prompt": "Extract the most relevant key phrase/key word from the following text"
    },
    "mcpServers": {
        ...
    }
}
```

* **`context_prefix`** – This prefix is added to the RAG context before the user's prompt. It helps the LLM distinguish between retrieved context and user input. The default is `"Context: "`, but you can customize it to fit your prompt formatting.
* **`require_preprocessing`** – If set to `true`, CleverChatty CLI will **preprocess the user's prompt** using the LLM before calling the RAG system. This can improve retrieval quality by simplifying or refining the query—for example, extracting a key phrase or keyword. It’s especially useful if your RAG backend does not support complex queries (by default, the full user prompt is sent as the query).
* **`preprocessing_prompt`** – This is the prompt used to instruct the LLM on how to preprocess the user's input. It is only used if `require_preprocessing` is `true`.

---

## Example: Using a RAG System with CleverChatty CLI

In this example, I'm using the lightweight local LLM **Qwen 2.5**. It's a small model with limited world knowledge and outdated training data, so it doesn't have information about recent events. This makes it a perfect use case for supplementing its knowledge with a RAG system.

### Test Chat Without RAG connected

```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The Trump–Zelenskyy Oval Office meeting was an historic encounter that took place on April 24, 2019, in the East Room of the White House. At this meeting, then-U.S. President Donald Trump and Ukrainian President Volodymyr Zelensky discussed a range of issues, including anti-corruption efforts, Russian interference in the U.S. presidential election, and NATO's stance on Russia.

Some key points about this meeting include:

1. Purpose: The meeting was initiated by Trump to address concerns raised by Ukraine's parliament regarding corruption and Russian interference in Ukraine's 2019 presidential elections.

2. Content of Discussion: Both leaders discussed various issues including anti-corruption efforts, the role of NATO forces in Europe, and Russia's actions during the 2016 U.S. presidential election.

......... truncated for brevity .........
```

As we can see, the LLM has no information about this meeting in its training data, so it provides a very poor answer.

### Test Chat With RAG connected with direct connection

When RAG is enabled, the response is much better. Even though this historical event is not part of the LLM's training data, it can now give a useful response. I received similar results with and without the extra prompt preprocessing.

```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The Trump–Zelenskyy Oval Office meeting on February 28, 2025, was a highly contentious and controversial event that took place during the ongoing Russo-Ukrainian War. Here are some key points regarding this meeting:

1. Purpose: The meeting aimed to discuss continued U.S. support for Ukraine in repelling Russia's invasion of the country.

2. Content and Tone: It became known as a confrontation between President Donald Trump of the United States and President Volodymyr Zelenskyy of Ukraine, with Trump and his surrogates, such as Vice President Joe Biden (through personal lawyer Rudy Giuliani), criticizing Zelenskyy multiple times.

3. Criticism from Zelenskyy: Zelenskyy repeatedly stated that he was only supporting the Ukrainian people and fighting against Russian aggression on their behalf.

4. Media Description: Media outlets described it as an unprecedented public confrontation between an American president and a foreign head of state, highlighting its stark nature compared to previous diplomatic exchanges.

......... truncated for brevity .........
```

This worked exactly as expected. Before processing the user's prompt with the LLM, we called the RAG system to get the most relevant documents. It found the page with the information about the meeting, we included it in the context before the prompt, and the LLM was able to provide a much better answer.

### Test Chat With RAG connected as a tool

And one more example. This time the RAG system is connected as a "normal" MCP server. We do not call it automatically before the prompt; instead, the LLM decides on its own when to call it as a tool to get more information.

```
You: What do you know about Trump–Zelenskyy Oval Office meeting
Assistant: The meeting you are referring to is the controversial Oval Office meeting held between President Donald Trump and Ukrainian President Volodymyr Zelensky on February 28, 2025. This meeting took place in the Oval Office at the White House in Washington D.C., and was televised live.

Key points from this meeting include:

• The main goal of the meeting was to discuss continued U.S. support for Ukraine during its ongoing conflict with Russia.
• There were tensions between Trump's administration and Zelensky’s government regarding security guarantees against future Russian aggression.
• During the meeting, there were criticisms directed at Zelensky by both Trump and Vance, often drowning out his voice. The confrontation ended abruptly without a clear resolution.

......... truncated for brevity .........
```

As we can see, the LLM made the correct decision to call the RAG system to get more information about the meeting. It invoked it as a tool and received the information from the RAG system, which makes a correct answer possible even though the LLM does not have this information in its training data.

In the logs of my RAG MCP server, I can see that it was called with the expected query, so the LLM was smart enough to query the RAG system correctly.

## Try it yourself

* Prepare an MCP server with the interface described above: it must expose the `knowledge_search` method. You can simply copy the code above and run it as an MCP server; it will provide Wikipedia search results.
* Prepare the config file, like this one (assuming you have Ollama installed and the llama3.1 model is available):

```json
{
    "log_file_path": "",
    "model": "ollama:llama3.1:latest",
    "system_instruction": "",
    "mcpServers": {
        "RAG_Server": {
            "url": "http://localhost:8002/sse",
            "headers": [],
            "interface": "rag"
        }
    },
    "rag_settings": {
        "context_prefix": "Context: ",
        "require_preprocessing": true,
        "preprocessing_prompt": "Extract the most relevant key phrase/key word from the following text"
    }
}
```

Save this to a file, for example `config.json`.
* Then run the RAG MCP server and CleverChatty CLI with this config file (example commands for running the Wikipedia RAG server follow at the end of this section). To run CleverChatty CLI you need to have Go installed; then you can start it with the following command:

```bash
go run github.com/gelembjuk/cleverchatty-cli@latest --config config.json
```

Then you can ask questions and see how it works.
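For reference, here is one way to build and start the example Wikipedia RAG MCP server with the SSE transport so that it matches the `http://localhost:8002/sse` URL in `config.json`. This assumes you saved the server code above as `main.go` in its own directory; the module name `wikiragserver` is an arbitrary choice.

```bash
# In the directory containing the Wikipedia RAG server's main.go:
go mod init wikiragserver   # module name is arbitrary
go mod tidy                 # fetches the mcp-go and go-wiki dependencies

# Start the MCP server with the SSE transport on port 8002,
# matching the "url" in config.json:
go run . sse 0.0.0.0:8002

# Then, in another terminal, start CleverChatty CLI as shown above.
```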
