The Beginner’s Guide to Tracking Token Usage in LLM Apps




Image by Author | Ideogram.ai

 

Introduction

 
When building large language model applications, tokens are money. If you’ve ever worked with an LLM like GPT-4, you’ve probably had that moment where you check the bill and think, “How did it get this high?!” Each API call you make consumes tokens, which directly impacts both latency and cost. But without tracking them, you have no idea where they’re being spent or how to optimize.

That’s where LangSmith comes in. It not only traces your LLM calls but also lets you log, monitor, and visualize token usage for every step in your workflow. In this guide, we’ll cover:

  1. Why token tracking matters
  2. How to set up logging
  3. How to visualize token consumption in the LangSmith dashboard

 

Why does Token Tracking Matter?

 
Token tracking matters because every interaction with a large language model has a direct cost tied to the number of tokens processed, both in your inputs and the model’s outputs. Without monitoring, small inefficiencies in prompts, unnecessary context, or redundant requests can silently inflate your bill and slow down performance.

By tracking tokens, you gain visibility into exactly where they’re being consumed, so you can optimize prompts, streamline workflows, and keep costs under control. For example, if your chatbot is using 1,500 tokens per request, reducing that to 800 tokens cuts costs almost in half. Conceptually, token tracking works like this:
 
Why does Token Tracking Matter?
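
To put rough numbers on the example above, here is a tiny cost sketch; the per-1K-token prices below are hypothetical placeholders, not real pricing:

# Rough cost estimate per request; prices are hypothetical placeholders
PRICE_PER_1K_INPUT = 0.0005   # USD per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1K output tokens (placeholder)

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(estimate_cost(1200, 300))  # ~1,500 tokens per request
print(estimate_cost(600, 200))   # ~800 tokens per request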

 

Setting Up LangSmith for Token Logging

 

// Step 1: Install Required Packages

pip3 install langchain langsmith transformers accelerate langchain_community

 

// Step 2: Make all necessary imports

import os
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langsmith import traceable

 

// Step 3: Configure LangSmith

Set your API key and project name:

# Replace with your API key
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "HF_FLAN_T5_Base_Demo"
os.environ["LANGCHAIN_TRACING_V2"] = "true"


# Optional: disable tokenizer parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

 

// Step 4: Load a Hugging Face Model

Use a CPU-friendly model like google/flan-t5-base and enable sampling for more natural outputs:

model_name = "google/flan-t5-base"
pipe = pipeline(
   "text2text-generation",
   model=model_name,
   tokenizer=model_name,
   device=-1,      # CPU
   max_new_tokens=60,
   do_sample=True, # enable sampling
   temperature=0.7
)
llm = HuggingFacePipeline(pipeline=pipe)
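
Before wiring the pipeline into a chain, you can optionally give it a quick smoke test on its own; this is just a sanity check and is not part of the traced workflow:

# Direct call to the raw pipeline; returns a list of dicts with "generated_text"
print(pipe("Explain gravity in one sentence.")[0]["generated_text"])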

 

// Step 5: Create a Prompt and Chain

Define a prompt template and connect it with your Hugging Face pipeline using LLMChain:

prompt_template = PromptTemplate.from_template(
   "Explain gravity to a 10-year-old in about 20 words using a fun analogy."
)


chain = LLMChain(llm=llm, prompt=prompt_template)
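
Note that this template has no input variables, which is why the chain is later run with an empty dict. If you want a reusable version, a hypothetical variant with a placeholder could look like this (illustrative only, not part of the original walkthrough):

# Variant template with an input variable (illustrative)
reusable_prompt = PromptTemplate.from_template(
    "Explain {concept} to a 10-year-old in about 20 words using a fun analogy."
)
reusable_chain = LLMChain(llm=llm, prompt=reusable_prompt)
# reusable_chain.run({"concept": "gravity"})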

 

// Step 6: Make the Function Traceable with LangSmith

Use the @traceable decorator to automatically log inputs, outputs, token usage, and runtime:

@traceable(name="HF Explain Gravity")
def explain_gravity():
   return chain.run({})
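
If you want runs to be easier to filter later in the dashboard, recent langsmith versions also let you attach tags and metadata to the decorator; a small sketch (the tag and metadata values here are arbitrary examples):

# Optional: tag runs so they are easier to filter in the LangSmith UI
@traceable(name="HF Explain Gravity", tags=["flan-t5", "demo"], metadata={"model": model_name})
def explain_gravity_tagged():
    return chain.run({})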

 

// Step 7: Run the Function and Print Results

answer = explain_gravity()
print("\n=== Hugging Face Model Answer ===")
print(answer)

 

Output:

=== Hugging Face Model Answer ===
Gravity is a measure of mass of an object.

 

// Step 8: Check the LangSmith Dashboard

Go to smith.langchain.com → Tracing Projects. You will see something like this:
 
LangSmith Dashboard - Tracing Projects
 
You can even see the cost associated with each project, which lets you analyse your billing. To see token usage and other insights, click on your project, and you will see:
 
LangSmith Dashboard - Number of Runs
 
The red box highlights the number of runs you have made in your project. Click on any run and you will see:
 
LangSmith Dashboard - Token Insights
 

You can see various metrics here, such as total tokens and latency. Click on the dashboard as shown below:
 
LangSmith Dashboard
 

Now you can view graphs over time to track token usage trends, check average latency per request, compare input vs. output tokens, and identify peak usage periods. These insights help optimize prompts, manage costs, and improve model performance.
 
LangSmith Dashboard - Graph
 

Scroll down to view all the graphs associated with your project.

 

// Step 9: Explore the LangSmith Dashboard

You can explore plenty of insights, such as:

  • View Example Traces: Click on a trace to see detailed execution, including raw input, generated output, and performance metrics
  • Inspect Individual Traces: For each trace, you can explore every step of execution, seeing prompts, outputs, token usage, and latency
  • Check Token Usage & Latency: Detailed token counts and processing times help identify bottlenecks and optimize performance
  • Evaluation Chains: Use LangSmith’s evaluation tools to test scenarios, track model performance, and compare outputs
  • Experiment in Playground: Adjust parameters such as temperature, prompt templates, or sampling settings to fine-tune your model’s behavior

With this setup, you now have full visibility of your Hugging Face model runs, token usage, and overall performance in the LangSmith dashboard.
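
If you prefer pulling these numbers programmatically rather than reading them off the UI, the langsmith Client can also list recent runs for a project. A minimal sketch (exact field names can differ slightly across langsmith versions):

from langsmith import Client

client = Client()
for run in client.list_runs(project_name="HF_FLAN_T5_Base_Demo", limit=5):
    # prompt_tokens / completion_tokens / total_tokens may be None for runs without usage data
    print(run.name, run.prompt_tokens, run.completion_tokens, run.total_tokens)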

 

How To Spot and Fix Token Hogs?

 
Once you’ve got logging in place, you can:

  • See if prompts are too long
  • Identify calls where the model is over-generating
  • Switch to smaller models for cheaper tasks
  • Cache responses to avoid duplicate requests (see the caching sketch after this list)
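
For that last point, here is a minimal in-memory caching sketch; it assumes the llm object from the earlier steps and uses functools.lru_cache, whereas a production setup would more likely use a persistent cache such as Redis or LangChain’s built-in caching:

from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(prompt_text: str) -> str:
    # Only the first call for a given prompt hits the model; repeats are served from memory
    return llm.invoke(prompt_text)

cached_answer("Explain gravity to a 10-year-old in about 20 words.")  # calls the model
cached_answer("Explain gravity to a 10-year-old in about 20 words.")  # returned from cache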

This is gold for debugging long chains or agents. Find the step eating the most tokens and fix it.

 

Wrapping Up

 
This is how you can set up and use LangSmith. Logging token usage isn’t just about saving money; it’s about building smarter, more efficient LLM apps. This guide provides a foundation; you can learn more by exploring, experimenting, and analyzing your own workflows.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.