Terminal-Bench 2.0 is the official benchmark for evaluating AI coding agents, and Harbor is its official harness. This guide shows you how to run Terminal-Bench evaluations locally and in the cloud.
What is Terminal-Bench?
Terminal-Bench 2.0 is a comprehensive benchmark that evaluates AI agents’ ability to:
Complete real-world coding tasks
Navigate complex software environments
Use command-line tools effectively
Debug and fix issues autonomously
The benchmark includes diverse tasks across multiple programming languages and domains.
Quick Start
Set up your API key
Export your Anthropic API key (or another provider's key): export ANTHROPIC_API_KEY=your-api-key-here
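Before launching a run, a quick sanity check that the key is actually exported can save a failed job. A minimal shell sketch (the message wording is illustrative, not Harbor output):

```shell
# Warn if the API key is missing from the environment.
if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
else
    echo "API key found (${#ANTHROPIC_API_KEY} characters)"
fi
```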
Run the evaluation
Execute Terminal-Bench with Claude Code: harbor run --dataset terminal-bench@2.0 \
--agent claude-code \
--model anthropic/claude-opus-4-1 \
--n-concurrent 4
This runs the benchmark locally using Docker with 4 parallel tasks.
Monitor progress
Harbor displays real-time progress: Running 250 trials across 1 agent(s) and 250 task(s)
Progress: 12/250 (4.8%) | Success: 8/12 (66.7%)
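The success figure is simply successes over completed trials; reproducing the percentage from the progress line above:

```shell
# 8 successes out of 12 completed trials, as in the progress line above.
awk 'BEGIN { printf "Success: %d/%d (%.1f%%)\n", 8, 12, 8 / 12 * 100 }'
```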
View results
When complete, results are saved to jobs/<job-id>/: harbor view jobs/<job-id>
Cloud Execution
For faster evaluation at scale, run on cloud providers like Daytona:
export ANTHROPIC_API_KEY=your-anthropic-key
export DAYTONA_API_KEY=your-daytona-key
harbor run --dataset terminal-bench@2.0 \
--agent claude-code \
--model anthropic/claude-opus-4-1 \
--n-concurrent 100 \
--env daytona
Cloud execution allows you to run 100+ tasks in parallel, dramatically reducing evaluation time.
Configuration Options
Agent Selection
Evaluate different agents on Terminal-Bench:
# OpenHands
harbor run -d terminal-bench@2.0 -a openhands -m anthropic/claude-opus-4-1
# Aider
harbor run -d terminal-bench@2.0 -a aider -m anthropic/claude-opus-4-1
# Goose
harbor run -d terminal-bench@2.0 -a goose -m anthropic/claude-opus-4-1
Task Filtering
Run a subset of tasks:
# Run first 10 tasks only
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-1 --max-tasks 10
# Run specific task by ID
harbor trials start -p path/to/terminal-bench/task-id -a claude-code -m anthropic/claude-opus-4-1
Timeout Configuration
# Double the per-task agent timeout
harbor run -d terminal-bench@2.0 \
-a claude-code \
-m anthropic/claude-opus-4-1 \
--agent-timeout-multiplier 2.0
Multiple Attempts
Run multiple attempts per task for statistical significance:
harbor run -d terminal-bench@2.0 \
-a claude-code \
-m anthropic/claude-opus-4-1 \
--attempts 3
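With several attempts per task you can report a mean and a spread rather than a single number. A small sketch over per-attempt success rates (the 0.66/0.70/0.68 figures are made up for illustration):

```shell
# Mean and population standard deviation of per-attempt success rates.
printf '%s\n' 0.66 0.70 0.68 |
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; printf "mean=%.3f sd=%.3f\n", m, sqrt(ss / n - m * m) }'
```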
Understanding Results
After evaluation completes, Harbor generates comprehensive results:
Job Summary
{
  "id": "job-abc123",
  "status": "completed",
  "stats": {
    "total_trials": 250,
    "completed": 250,
    "success_rate": 0.68,
    "mean_reward": 0.68,
    "total_cost": 45.32
  }
}
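To pull the headline numbers out of a summary programmatically, standard JSON tooling works. A sketch assuming the summary lives at jobs/<job-id>/result.json (the exact filename and layout may differ in your Harbor version):

```shell
# Print success rate and total cost from a Harbor job summary.
# The result.json path is an assumption; adjust to your job directory.
summary="jobs/job-abc123/result.json"
if [ -f "$summary" ]; then
    python3 - "$summary" <<'PY'
import json, sys

with open(sys.argv[1]) as f:
    stats = json.load(f)["stats"]
print(f"success_rate={stats['success_rate']:.2f} total_cost=${stats['total_cost']:.2f}")
PY
else
    echo "no summary at $summary"
fi
```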
Per-Task Results
Each task generates a trial_result.json:
{
  "reward": 1.0,
  "status": "completed",
  "timing": {
    "agent_time_sec": 145.2,
    "verifier_time_sec": 8.1
  },
  "usage_info": {
    "prompt_tokens": 12500,
    "completion_tokens": 3200,
    "total_cost": 0.18
  }
}
Viewing Traces
View agent trajectories in the web UI:
harbor view jobs/<job-id>
Or export to ATIF format for analysis:
harbor traces export jobs/<job-id> --output traces.jsonl
Typical execution times for Terminal-Bench 2.0 (250 tasks):
| Environment  | Concurrency | Time      | Cost (Claude Opus) |
|--------------|-------------|-----------|--------------------|
| Local Docker | 4           | ~18 hours | ~$50               |
| Local Docker | 16          | ~5 hours  | ~$50               |
| Daytona      | 50          | ~2 hours  | ~$50 + compute     |
| Daytona      | 100         | ~1 hour   | ~$50 + compute     |
| Modal        | 100         | ~1 hour   | ~$50 + compute     |
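These figures follow from a simple back-of-envelope model: wall-clock time is roughly tasks / concurrency * average minutes per task. A sketch (the ~17-minute average task time is an assumption inferred from the table, not a published figure):

```shell
# Rough wall-clock estimate for a run: tasks / concurrency * avg task minutes.
tasks=250 concurrency=16 avg_min=17
awk -v t="$tasks" -v c="$concurrency" -v m="$avg_min" \
    'BEGIN { printf "~%.1f hours\n", t / c * m / 60 }'
```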
Start with a small subset (10-20 tasks) to validate your setup before running the full benchmark.
Troubleshooting
If Docker image builds time out, increase the build timeout: harbor run -d terminal-bench@2.0 \
--build-timeout-multiplier 2.0 \
-a claude-code -m anthropic/claude-opus-4-1
If you hit provider rate limits or local resource contention, reduce concurrency: harbor run -d terminal-bench@2.0 \
--n-concurrent 2 \
-a claude-code -m anthropic/claude-opus-4-1
If disk space runs low, clean up Docker resources: harbor cache clean --all
docker system prune -a
Next Steps
SWE-Bench: run software engineering benchmarks
Custom Benchmark: create your own benchmark
RL Optimization: generate rollouts for RL training
Parameter Sweeps: optimize agent parameters