Building an AI Agent with llama.cpp

Image by Author

llama.cpp is the original, high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can minimize overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more configurable.

In this tutorial, I will guide you through building AI applications using llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, integrating it with LangChain, and building a ReAct agent capable of using tools like web search and a Python REPL.

 

1. Setting up the llama.cpp Server

 
This section covers the installation of llama.cpp and its dependencies, configuring it for CUDA support, building the necessary binaries, and running the server.

Note: we are using an NVIDIA RTX 4090 graphics card on a Linux system with the CUDA toolkit pre-configured. If you don’t have access to similar local hardware, you can rent an affordable GPU instance from Vast.ai.

 

Screenshot from Vast.ai | Console

 

  1. Update your system’s package list and install essential tools like build-essential, cmake, curl, and git. pciutils is included for hardware information, and libcurl4-openssl-dev is needed for llama.cpp to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y

 

  2. Clone the official llama.cpp repository from GitHub and use cmake to configure the build with CUDA support enabled.
# Clone llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp

# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON

 

  3. Compile llama.cpp and all its tools, including the server. For convenience, copy all the compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.
# Build all necessary binaries including server
cmake --build llama.cpp/build --config Release -j --clean-first

# Copy all binaries to main directory
cp llama.cpp/build/bin/* llama.cpp/

 

  4. Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model. The -hf flag downloads the quantized model directly from Hugging Face, and --jinja enables the model’s chat template, which matters for reliable tool calling later on.
./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
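
Once the server is up, it exposes an OpenAI-compatible API on port 8000, along with a /health endpoint you can poll to check whether the model has finished loading. The snippet below is a minimal sketch of such a readiness check; it assumes the server runs on localhost:8000 and that the requests package is installed.

import time
import requests  # assumption: `pip install requests`

# Poll the llama.cpp server's /health endpoint until the model is loaded
for _ in range(30):
    try:
        resp = requests.get("http://localhost:8000/health", timeout=2)
        if resp.status_code == 200:
            print("llama.cpp server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # server is not accepting connections yet
    time.sleep(2)
else:
    print("Server did not become ready in time")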

 

  5. You can test if the server is running correctly by sending a POST request using curl.
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }'

 

Output:

{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"\nOkay, user greeted me with a simple "Hello! How are you today?" \n\nHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes \"being\" but in a friendly way. \n\nI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic. \n\nSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI. \n\nI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}],"created":1749319250,"model":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","usage":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}

 

2. Building an AI Agent with LangGraph and llama.cpp

 
Now, let’s use LangGraph and LangChain to interact with the llama.cpp server and build a multi-tool AI agent.

  1. Set your Tavily API key for search capabilities.
  2. For LangChain to work with the local llama.cpp server (which exposes an OpenAI-compatible API), set OPENAI_API_KEY to local or any other non-empty string; the base_url setting will direct all requests to the local server.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
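
If you are working in a notebook rather than a shell, you can set the same variables from Python with os.environ before importing the LangChain packages (the Tavily key below is a placeholder).

import os

# Same environment variables, set from Python instead of the shell
os.environ["TAVILY_API_KEY"] = "your_api_key_here"  # placeholder value
os.environ["OPENAI_API_KEY"] = "local"  # any non-empty string works for the local server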

 

  3. Install the necessary Python libraries: langgraph for creating agents, tavily-python for the Tavily search tool, and various langchain packages for LLM interactions and tools.
%%capture
!pip install -U \
    langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai

 

  4. Configure ChatOpenAI from LangChain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",   
    temperature=0.6,
    base_url="http://localhost:8000/v1",         
)
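
Before wiring the model into an agent, it is worth a quick sanity check that the LangChain wrapper can actually reach the server; a minimal sketch:

# Quick sanity check: send a single prompt through the LangChain wrapper
reply = llm.invoke("Reply with the single word: ready")
print(reply.content)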

 

  5. Set up the tools that your agent will be able to use.
    • TavilySearchResults: Allows the agent to search the web.
    • PythonREPLTool: Provides the agent with a Python Read-Eval-Print Loop to execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool

search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool   = PythonREPLTool()

tools = [search_tool, code_tool]
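
You can also call each tool directly, which is a handy way to confirm the Tavily API key and the REPL work before handing them to the agent. The sketch below assumes both tools were created as above; the exact shape of the Tavily results may vary by version.

# Run the Python REPL tool directly to confirm it executes code
print(code_tool.run("print(sum(range(10)))"))  # should print 45

# Run a single web search to confirm the Tavily API key is valid
results = search_tool.invoke("llama.cpp GitHub repository")
print(results[0]["url"] if results else "no results")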

 

  6. Use LangGraph’s prebuilt create_react_agent function to create an agent that can reason and act (the ReAct framework) using the LLM and the defined tools.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
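
The object returned by create_react_agent is a compiled LangGraph graph, so you can call invoke on it directly with a list of messages. A minimal smoke test, assuming the server and tools configured above:

# Smoke test: a simple question that should not require any tool
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Say hello in one short sentence."}]}
)
print(result["messages"][-1].content)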

 

3. Testing the AI Agent with Example Queries

 
Now, we will test the AI agent and also display which tools the agent uses.

  1. This helper function extracts the names of the tools used by the agent from the conversation history. This is useful for understanding the agent’s decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    """Collect the names of all tools invoked anywhere in the conversation."""
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        # Tool calls may live on the message object, in a plain dict,
        # or inside additional_kwargs, depending on how the message was produced.
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            # Each call is either {'name': ...} or an OpenAI-style {'function': {'name': ...}}
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)

 

  2. Define a function to run the agent with a given question and return the tools used along with the final answer.
def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer

 

  3. Let’s ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)

 

Output:

Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:

1.  **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday, and the retrieval of a Thai hostage's body.
2.  **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3.  **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4.  **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5.  **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.

 

  4. Let’s ask the agent to write and execute Python code for the Fibonacci series. It should use the Python_REPL tool.
tools, answer = run_agent(
    "Write a code for the Fibonacci series and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)

 

Output:

Tools used ➡️ ['Python_REPL']
The Fibonacci series up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
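
The intermediate tool call is not shown in the output above, but the code the agent executes inside the Python REPL typically looks something like the following (an illustrative reconstruction, not the agent’s exact output):

# Illustrative example of the kind of snippet the agent runs in the Python REPL
def fibonacci(n: int) -> list[int]:
    series = [0, 1]
    while len(series) < n:
        series.append(series[-1] + series[-2])
    return series[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]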

 

 

Final Thoughts

 
In this guide, I have used a small quantized LLM, which sometimes struggles with accuracy, especially when it comes to selecting tools. If your goal is to build production-ready AI agents, I highly recommend running the latest, full-sized models with llama.cpp. Larger and more recent models generally provide better results and more reliable outputs.

It’s important to note that setting up llama.cpp can be more challenging compared to user-friendly tools like Ollama. However, if you are willing to invest the time to debug, optimize, and tailor llama.cpp for your specific hardware, the performance gains and flexibility are well worth it.

One of the biggest advantages of llama.cpp is its efficiency: you don’t need high-end hardware to get started. It runs well on regular CPUs and laptops without dedicated GPUs, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU instance from a cloud provider.

 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.