Evaluating Skills
📰 LangChain Blog
Evaluating skills for coding agents like Claude Code requires a structured approach to confirm that they actually improve agent performance.
Action Steps
- Define tasks for the agent to complete
- Create skills to aid in task completion
- Test the agent with and without skills
- Compare performance and iterate on skill development
- Set up a clean testing environment using tools like Docker or Harbor
Who Needs to Know This
Developers and engineers working with coding agents and LLMs can use this evaluation pipeline to improve agent performance and scalability.
Key Insight
💡 A clean testing environment is crucial for reproducible and accurate skill evaluation
Full Article
Published Time: 2026-03-05T18:00:49.000Z
# Evaluating Skills
7 min read Mar 5, 2026
_By Robert Xu_
Recently at LangChain we’ve been building skills to help coding agents like Codex, Claude Code, and Deep Agents CLI work with our ecosystem: namely, LangChain and [LangSmith](https://www.langchain.com/langsmith/evaluation?ref=blog.langchain.com). This effort is not unique to us: most (if not all) companies are exploring how to create skills for coding agents. A key part of building these skills is making sure they actually work. In this post, we cover lessons learned and best practices for evaluating skills as you create them.
## What are Skills?
Skills are curated instructions, scripts, and resources that improve agent performance in specialized domains. Importantly, skills are dynamically loaded through progressive disclosure — the agent only retrieves a skill when it’s relevant to the task at hand. This helps agents scale their performance; historically, giving too many tools to an [agent would cause its performance to degrade](https://blog.langchain.com/react-agent-benchmarking/).
In practice, skills can be thought of as prompts that are dynamically loaded when the agent needs them. Like any prompt, they can impact agent behavior in unexpected ways. Consequently, skills need to be tested, just as you would test your LLM prompts. Which skills improve coding agent performance? Which content changes result in the most improvement?
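To make progressive disclosure concrete, here is a toy sketch. It is not how Claude Code or any other agent implements skill loading: the `Skill` class, the `build_context` function, and the keyword-match relevance check are all hypothetical stand-ins for a decision that a real agent makes itself. The only point it illustrates is that short descriptions are always visible, while full skill bodies enter the context only when relevant.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # short summary that is always visible to the agent
    body: str         # full instructions, loaded only on demand


def build_context(task: str, skills: list[Skill]) -> str:
    """Return prompt context: every description, plus bodies of relevant skills only."""
    lines = ["Available skills:"]
    for skill in skills:
        lines.append(f"- {skill.name}: {skill.description}")
    # Hypothetical relevance check: a naive keyword match on the skill name.
    # A real agent decides for itself when to load a skill.
    for skill in skills:
        if skill.name.lower() in task.lower():
            lines.append(f"\n[Loaded skill: {skill.name}]\n{skill.body}")
    return "\n".join(lines)


skills = [
    Skill("langsmith", "Tracing and evaluation helpers", "Full LangSmith instructions..."),
    Skill("docker", "Container setup tips", "Full Docker instructions..."),
]
context = build_context("Add LangSmith tracing to my app", skills)
```

Here the context lists both skills by description but contains only the full body of the `langsmith` skill, so unrelated instructions never reach the prompt.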
## The Basic Evaluation Pipeline
Our basic approach for testing skills:
1. Define tasks you want Claude Code to successfully complete
2. Define skills that help with the tasks
3. Run Claude Code on the tasks without skills
4. Run Claude Code on the tasks with skills
5. Compare performance and iterate on your skill
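
The five steps above can be sketched as a small comparison harness. This is a minimal illustration under stated assumptions, not the harness used in the post: `Task`, `pass_rate`, `compare`, and the `run_agent(prompt, use_skills)` callable are hypothetical names, and `fake_agent` is a stand-in for an actual Claude Code invocation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent output passes


def pass_rate(
    tasks: list[Task], run_agent: Callable[[str, bool], str], use_skills: bool
) -> float:
    """Run every task once and return the fraction that pass."""
    passed = sum(task.check(run_agent(task.prompt, use_skills)) for task in tasks)
    return passed / len(tasks)


def compare(tasks: list[Task], run_agent: Callable[[str, bool], str]) -> dict:
    """Run the same tasks without and with skills and report the difference."""
    baseline = pass_rate(tasks, run_agent, use_skills=False)
    with_skills = pass_rate(tasks, run_agent, use_skills=True)
    return {"baseline": baseline, "with_skills": with_skills, "delta": with_skills - baseline}


# Stand-in agent for illustration only: it "succeeds" only when skills are enabled.
def fake_agent(prompt: str, use_skills: bool) -> str:
    return "traced" if use_skills else "untraced"


tasks = [Task("add-tracing", "Add tracing to app.py", lambda out: out == "traced")]
result = compare(tasks, fake_agent)
```

Because agent runs are not deterministic, a real harness would likely run each task several times and average the results before comparing; a single run per task is used here only to keep the sketch short.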
Below, we share some best practices from our experiences on creating your own evaluation pipeline.
## Step 1: Set Up a Clean Testing Environment
Skills are commonly used with coding agents like Claude Code, or harnesses like Deep Agents. When you’re testing a skill, you’re really testing if these powerful agents can use the skill information effectively. You’re testing if the agent’s performance improves — so in practice, you’re testing the coding agent itself.
Coding agents and harnesses have a large action space they can operate over. They are also sensitive to starting conditions: Claude Code will often explore your directory before it starts working, and what it finds will shape its approach. This means that when testing skills, it’s critical to prepare a consistent, clean environment for the agent using your skills. Doing so maximizes the reproducibility of your tests.
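Independently of which sandbox you choose, one simple way to get consistent starting conditions is to give every run a fresh copy of a fixture directory. The sketch below assumes a hypothetical `clean_workspace` helper and a fixture directory that you maintain; it isolates only the working directory and is not a substitute for a container or sandbox.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def clean_workspace(fixture_dir: Path):
    """Yield a fresh copy of fixture_dir and delete it afterwards."""
    tmp_root = Path(tempfile.mkdtemp(prefix="skill-eval-"))
    workdir = tmp_root / "workspace"
    try:
        # Each run starts from an identical copy, so leftovers from
        # earlier runs cannot influence what the agent discovers.
        shutil.copytree(fixture_dir, workdir)
        yield workdir
    finally:
        shutil.rmtree(tmp_root, ignore_errors=True)
```

Wrapping each agent run in `with clean_workspace(fixture) as workdir:` means files written during one run are discarded before the next, and the fixture itself is never modified.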
In our testing, we ran Claude Code inside a lightweight Docker scaffold. Alternatives include [Harbor](https://github.com/laude-institute/harbor?ref=blog.langchain.com) or a sandbox of your choice.
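```python
import subprocess
from pathlib import Path


def run_claude_in_docker(
    test_dir: Path, prompt: str, timeout: int = 300, model: str | None = None
) -> subprocess.CompletedProcess:
    """Run Claude CLI in Docker. Returns CompletedProcess."""
    # check_docker_available() belongs to the original scaffold and is not shown in this excerpt.
    if not check_docker_available():
        raise RuntimeError("Docker not available")
    # The source excerpt is cut off partway through this command list; the remaining arguments are not shown.
    cmd = ["run-claude", ...]
```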