As AI applications move from experiments to production systems, one problem becomes unavoidable: how do you know your system is actually improving?
A prompt that works well today may degrade tomorrow. A model upgrade may silently introduce hallucinations. A new tool integration may cause your agent to behave unpredictably.
This is where evaluation systems (evals) become critical.
In traditional software engineering, we rely on automated tests to ensure reliability. AI systems require a similar discipline. Instead of testing deterministic outputs, we evaluate task performance, reasoning quality, tool usage, and safety behaviors.
This article walks through a practical approach to implementing evals using LangSmith, LangChain's evaluation and observability platform.
The goal is not to build the most sophisticated evaluation framework. The goal is to create a repeatable system that tells you whether your AI application is getting better or worse over time.
## The Example We Will Use
To make this discussion concrete, let's assume we are building a customer support assistant for a SaaS company.
The assistant answers product questions by retrieving information from documentation using a RAG pipeline.
The flow looks like this:
- User asks a question
- The system retrieves relevant documentation
- The LLM generates an answer grounded in those documents
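The three-step flow above can be sketched as a single pipeline function. Everything here is a placeholder: `retrieve_docs` and `generate_answer` are hypothetical stand-ins for a real vector-store retriever and LLM call, and the one hard-coded document exists only to make the sketch self-contained.

```python
# Minimal sketch of the support-assistant RAG flow.
# retrieve_docs and generate_answer are hypothetical stand-ins
# for a real retriever and a real LLM client.

def retrieve_docs(question: str) -> list[dict]:
    """Placeholder retriever: return matching documentation snippets."""
    docs = {
        "reset my password": {
            "id": "auth-01",
            "text": "Go to Settings > Security > Reset Password.",
        },
    }
    return [doc for key, doc in docs.items() if key in question.lower()]

def generate_answer(question: str, docs: list[dict]) -> str:
    """Placeholder LLM call: answer grounded in the retrieved docs."""
    if not docs:
        return "I could not find that in the documentation."
    return f"{docs[0]['text']} (source: {docs[0]['id']})"

def answer_question(question: str) -> dict:
    docs = retrieve_docs(question)            # retrieve relevant documentation
    answer = generate_answer(question, docs)  # generate a grounded answer
    return {"answer": answer, "sources": [d["id"] for d in docs]}
```

Returning the sources alongside the answer pays off later: evaluators can check grounding without re-running retrieval.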
At first glance, the system may appear to work well. However, several hidden failure modes can appear in production:
- the system retrieves the wrong document
- the model hallucinates unsupported information
- the answer is technically correct but incomplete
- the system fails when question phrasing changes
Without structured evaluation, these problems remain invisible.
This is where LangSmith evaluation workflows become useful.
## Step 1: Define What "Success" Means
Before writing evaluation code, the most important step is defining what a correct result actually looks like.
Many teams skip this step and jump directly into testing prompts. But evaluation works best when you clearly define success criteria first.
For our support assistant example, a good response should:
- answer the user's question
- rely on retrieved documentation
- avoid hallucinated information
- remain concise and readable
LangSmith evaluations work by measuring these criteria through custom evaluators and grading functions.
You can learn more about evaluation concepts in the LangSmith documentation.
## Step 2: Create a Dataset of Real Test Cases
Next, we create a dataset of representative tasks.
LangSmith evaluations are dataset-driven. Instead of manually testing prompts, you define a dataset of inputs and run your system against it repeatedly.
LangSmith datasets allow you to store test examples that include:
- user inputs
- reference outputs
- metadata
- evaluation criteria
You can learn how to create and manage datasets in the LangSmith documentation.
For our support assistant, the dataset might include questions such as:
- "How do I reset my password?"
- "What integrations does your platform support?"
- "Can I export reports to CSV?"
- "Is there an API rate limit?"
In the beginning, 10–20 well-chosen examples are usually enough to build a useful evaluation dataset. Over time, teams add more examples based on real production failures.
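A starter dataset can be sketched like this. The reference answers and the dataset name are invented for illustration; `Client.create_dataset` and `Client.create_examples` are the LangSmith SDK calls for uploading examples, and the upload is guarded behind an API-key check so the sketch also runs offline.

```python
import os

# Each example pairs a user input with a reference output. The field
# names under "inputs"/"outputs" are your choice; they just need to
# match what your target function and evaluators expect.
examples = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Go to Settings > Security > Reset Password."},
    },
    {
        "inputs": {"question": "Can I export reports to CSV?"},
        "outputs": {"answer": "Yes, use the Export button on any report page."},
    },
    {
        "inputs": {"question": "Is there an API rate limit?"},
        "outputs": {"answer": "The API allows a fixed number of requests per minute."},
    },
]

# Upload only when credentials are available, so the sketch runs offline.
if os.environ.get("LANGSMITH_API_KEY"):
    from langsmith import Client

    client = Client()
    dataset = client.create_dataset(
        dataset_name="support-assistant-evals",  # assumed dataset name
        description="Representative support questions with reference answers.",
    )
    client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id,
    )
```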
## Step 3: Run Your Application as the Target Function
LangSmith evaluates systems by executing a target function.
The target function is simply the workflow you want to test. In our example, the function performs:
- document retrieval
- LLM response generation
- returning the final answer
LangSmith runs this function against every dataset example and records the output.
The evaluation framework then compares results across multiple runs.
The official guide for evaluating an application using a target function can be found in the LangSmith documentation.
This design is powerful because it evaluates the entire system, not just the prompt.
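In the Python SDK, a target function takes an example's `inputs` dict and returns a dict of outputs, and `langsmith.evaluate` runs it over every dataset example. The sketch below uses placeholder `retrieve` and `generate` functions, and the dataset name is the assumed one from the previous step; the `evaluate` call is guarded since it needs credentials.

```python
import os

# Placeholder implementations so the sketch is self-contained.
def retrieve(question: str) -> list[dict]:
    return [{"id": "doc-42", "text": "Reports can be exported to CSV."}]

def generate(question: str, docs: list[dict]) -> str:
    return docs[0]["text"] if docs else "No documentation found."

def target(inputs: dict) -> dict:
    """Run the full pipeline for one dataset example."""
    question = inputs["question"]
    docs = retrieve(question)          # document retrieval
    answer = generate(question, docs)  # LLM response generation
    return {"answer": answer, "sources": [d["id"] for d in docs]}

if os.environ.get("LANGSMITH_API_KEY"):
    from langsmith import evaluate

    evaluate(
        target,
        data="support-assistant-evals",  # assumed dataset name
        experiment_prefix="baseline",
    )
```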
## Step 4: Implement Deterministic Evaluators
The first evaluators you should implement are deterministic checks.
These are evaluation rules written in code instead of using another AI model.
Examples include checking whether:
- the response includes document citations
- the output format is valid
- the response length stays within limits
- restricted phrases are avoided
LangSmith allows developers to implement custom evaluators that score outputs programmatically.
Deterministic checks should always be the foundation of an evaluation framework because they are objective and repeatable.
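Deterministic evaluators can be plain Python functions that receive the run's outputs and return a score dict. The checks below mirror the bullets above; the length limit and restricted-phrase list are assumptions chosen for illustration.

```python
MAX_ANSWER_CHARS = 1200          # assumed limit for this example
BANNED_PHRASES = ("as an ai",)   # assumed restricted phrases

def has_citation(outputs: dict) -> dict:
    """Pass if the response names at least one source document."""
    return {"key": "has_citation", "score": bool(outputs.get("sources"))}

def within_length(outputs: dict) -> dict:
    """Pass if the answer stays within the length limit."""
    return {
        "key": "within_length",
        "score": len(outputs.get("answer", "")) <= MAX_ANSWER_CHARS,
    }

def avoids_banned_phrases(outputs: dict) -> dict:
    """Pass if no restricted phrase appears in the answer."""
    text = outputs.get("answer", "").lower()
    return {
        "key": "avoids_banned_phrases",
        "score": not any(p in text for p in BANNED_PHRASES),
    }
```

Functions like these are passed to the evaluation run via the `evaluators` argument of `evaluate`, and because they contain no model calls, they score the same output the same way every time.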
## Step 5: Use LLM-as-Judge for Complex Evaluation
Some response qualities are difficult to measure using code.
For example:
- Is the answer helpful?
- Does it fully address the user's question?
- Is it grounded in context?
In these cases we can use LLM-as-judge evaluators.
In this approach, another model evaluates the response using a structured rubric.
LangSmith supports building evaluators that use language models as judges.
A typical judge prompt might look like:
"Given the user question, retrieved context, and generated response, determine whether the response is grounded in the provided documents."
This approach allows teams to evaluate qualitative dimensions such as clarity and correctness.
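A groundedness judge has two halves: building the rubric prompt and parsing the model's verdict into a score. The sketch below shows both; the actual model call is omitted because it can be any chat-model client, and the verdict labels are an assumed convention.

```python
# Rubric prompt for an LLM judge. The GROUNDED/NOT_GROUNDED labels
# are an assumed convention, not a LangSmith requirement.
JUDGE_PROMPT = """\
You are grading a support assistant's answer.

User question:
{question}

Retrieved context:
{context}

Generated response:
{response}

Is the response grounded in the provided documents?
Reply with exactly GROUNDED or NOT_GROUNDED, then one sentence of reasoning.
"""

def build_judge_prompt(question: str, context: str, response: str) -> str:
    return JUDGE_PROMPT.format(question=question, context=context, response=response)

def parse_verdict(judge_reply: str) -> dict:
    """Turn the judge model's reply into a score dict."""
    grounded = judge_reply.strip().upper().startswith("GROUNDED")
    return {"key": "groundedness", "score": grounded}
```

Forcing the judge to emit a fixed label before its reasoning keeps parsing trivial and makes disagreements between judge runs easy to audit.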
## Step 6: Evaluate Intermediate Steps
When building agent systems or complex pipelines, evaluating the final answer alone may not be sufficient.
A system could accidentally produce the correct answer while taking an incorrect reasoning path.
LangSmith allows developers to evaluate intermediate steps in execution traces.
This enables teams to verify:
- whether the correct documents were retrieved
- whether tools were used correctly
- whether the reasoning path followed expected behavior
Trace inspection is one of the most valuable capabilities when debugging AI systems.
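As one concrete example, a step-level evaluator can check whether the retrieval stage surfaced the documents the reference answer depends on. The `trace` dict below is a simplified stand-in for a LangSmith run tree; a real evaluator would walk the run's child runs instead.

```python
def retrieval_recall(trace: dict, expected_doc_ids: set[str]) -> dict:
    """Score an intermediate step: did retrieval surface the documents
    the reference answer depends on?

    `trace` is a simplified stand-in for a traced run; each step dict
    has a "type" and, for retriever steps, the "doc_ids" it returned.
    """
    retrieval_steps = [s for s in trace["steps"] if s["type"] == "retriever"]
    retrieved = {doc_id for step in retrieval_steps for doc_id in step["doc_ids"]}
    recall = len(retrieved & expected_doc_ids) / max(len(expected_doc_ids), 1)
    return {"key": "retrieval_recall", "score": recall}
```

A score below 1.0 here tells you the failure is in retrieval, not generation, which is exactly the distinction final-answer checks cannot make.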
## Step 7: Turn Evals Into a Continuous Development Loop
The real power of evals comes from running them continuously as the system evolves.
Every time the application changes — whether through prompt adjustments, model upgrades, or retriever improvements — the evaluation dataset should be executed again.
LangSmith makes it easy to compare runs and track improvements over time.
You can learn more about evaluation workflows in the LangSmith documentation.
This approach allows teams to detect:
- regressions
- hallucination increases
- retrieval failures
- performance improvements
Over time, the evaluation dataset becomes a quality benchmark for your AI system.
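One lightweight way to close the loop is a regression gate in CI: compare the latest experiment's mean evaluator scores against a stored baseline and fail the build when a metric drops. The tolerance and metric names below are illustrative assumptions.

```python
REGRESSION_TOLERANCE = 0.02  # assumed: tolerate small run-to-run noise

def find_regressions(baseline: dict, candidate: dict,
                     tolerance: float = REGRESSION_TOLERANCE) -> list[str]:
    """Return the metrics where the candidate run scored meaningfully
    below the baseline. Inputs map metric name -> mean evaluator score."""
    return [
        metric
        for metric, base_score in baseline.items()
        if candidate.get(metric, 0.0) < base_score - tolerance
    ]

baseline = {"has_citation": 0.95, "groundedness": 0.90}
candidate = {"has_citation": 0.96, "groundedness": 0.80}
# find_regressions(baseline, candidate) -> ["groundedness"]
```

In practice the two score dicts would come from aggregating the feedback of two LangSmith experiment runs; the comparison logic itself stays this simple.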
## Final Thoughts
AI systems behave differently from traditional deterministic software. Their behavior can shift as prompts, models, or surrounding systems evolve.
Evaluation frameworks like LangSmith introduce the discipline needed to manage that complexity.
A practical evaluation workflow typically includes:
- defining clear success criteria
- building a dataset of representative tasks
- evaluating the entire application pipeline
- combining deterministic checks with LLM judges
- running evaluations continuously during development
By adopting this approach, AI applications stop behaving like unpredictable prototypes and begin to behave like reliable production systems.