Harbor’s sweep functionality allows you to systematically explore parameter spaces to find optimal agent configurations. This guide shows you how to run parameter sweeps and analyze results.
Overview
Parameter sweeps help you:
Find optimal agent configurations
Compare different models and temperatures
Test retry strategies
Identify performance bottlenecks
Validate hyperparameter sensitivity
Quick Start
Define sweep configuration
Create a sweep config file:
dataset:
  registry:
    name: terminal-bench@2.0
  max_tasks: 50  # Use subset for faster iteration
orchestrator:
  n_concurrent_trials: 10
# Parameter grid
sweep:
  agent:
    name: ["claude-code", "aider", "openhands"]
    model:
      - "anthropic/claude-opus-4-1"
      - "anthropic/claude-sonnet-4"
    temperature: [0.0, 0.3, 0.7]
Run sweep
export ANTHROPIC_API_KEY=your-key
harbor sweeps run --config sweep-config.yaml
This runs all combinations: 3 agents × 2 models × 3 temperatures = 18 experiments
Analyze results
View results table:
harbor sweeps summarize <sweep-id>
Sweep Configuration
Grid Search
Test all combinations of parameters:
sweep:
  agent:
    name: ["claude-code", "aider"]
    model:
      - "anthropic/claude-opus-4-1"
      - "anthropic/claude-sonnet-4"
    temperature: [0.0, 0.5, 1.0]
    max_tokens: [4000, 8000]
# Total: 2 × 2 × 3 × 2 = 24 experiments
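The experiment count for a grid like the one above can be checked in a few lines of Python before launching anything. This is a minimal sketch of grid semantics only, not Harbor's implementation; `expand_grid` is a hypothetical helper name, and the flat parameter names are an assumption made for illustration.

```python
from itertools import product

def expand_grid(grid):
    """Return every combination of a {parameter: [values]} mapping as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Same values as the grid search example above (flattened for illustration).
grid = {
    "agent": ["claude-code", "aider"],
    "model": ["anthropic/claude-opus-4-1", "anthropic/claude-sonnet-4"],
    "temperature": [0.0, 0.5, 1.0],
    "max_tokens": [4000, 8000],
}

experiments = expand_grid(grid)
print(len(experiments))  # 24
print(experiments[0])
```

Because the count is a product, each extra value on any axis multiplies the total, so it is worth running this kind of check before adding another parameter.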
Conditional Parameters
Different parameters per agent:
sweep:
  - agent:
      name: claude-code
      model: "anthropic/claude-opus-4-1"
      temperature: [0.0, 0.3, 0.7]
  - agent:
      name: aider
      model: "anthropic/claude-sonnet-4"
      edit_format: ["whole", "diff", "udiff"]
  - agent:
      name: openhands
      model: "anthropic/claude-opus-4-1"
      max_iterations: [10, 20, 30]
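One way to read the conditional form is that each list entry is its own small grid, and the sweep is the concatenation of those grids rather than their product. The sketch below illustrates that reading under the assumption that list values are swept and scalar values are held fixed; `expand_entry` is a hypothetical helper, not part of Harbor.

```python
from itertools import product

def expand_entry(entry):
    """Expand one sweep entry: list values are swept, scalar values are held fixed."""
    keys = list(entry)
    axes = [v if isinstance(v, list) else [v] for v in entry.values()]
    return [dict(zip(keys, combo)) for combo in product(*axes)]

# Same values as the conditional example above (flattened for illustration).
entries = [
    {"agent": "claude-code", "model": "anthropic/claude-opus-4-1",
     "temperature": [0.0, 0.3, 0.7]},
    {"agent": "aider", "model": "anthropic/claude-sonnet-4",
     "edit_format": ["whole", "diff", "udiff"]},
    {"agent": "openhands", "model": "anthropic/claude-opus-4-1",
     "max_iterations": [10, 20, 30]},
]

# Entries are expanded independently and then concatenated.
experiments = [exp for entry in entries for exp in expand_entry(entry)]
print(len(experiments))  # 9
```

Under this reading the example yields 3 + 3 + 3 = 9 experiments, and agent-specific parameters such as `edit_format` never appear on other agents.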
Timeout Multipliers
sweep:
  agent:
    name: ["claude-code"]
    model: ["anthropic/claude-opus-4-1"]
  timeout_multipliers:
    agent: [1.0, 1.5, 2.0]
    verifier: [1.0, 2.0]
    build: [1.0, 1.5]
Analysis
Results Table
harbor sweeps summarize <sweep-id>
Output:
┌─────────────┬──────────┬──────┬─────────┬──────────┬──────────┐
│ Agent       │ Model    │ Temp │ Success │ Avg Time │ Avg Cost │
├─────────────┼──────────┼──────┼─────────┼──────────┼──────────┤
│ claude-code │ opus-4-1 │ 0.0  │ 72%     │ 145s     │ $0.42    │
│ claude-code │ opus-4-1 │ 0.3  │ 68%     │ 152s     │ $0.45    │
│ claude-code │ sonnet-4 │ 0.0  │ 65%     │ 128s     │ $0.18    │
│ aider       │ opus-4-1 │ 0.0  │ 58%     │ 182s     │ $0.38    │
└─────────────┴──────────┴──────┴─────────┴──────────┴──────────┘
Export Results
Export to CSV for analysis:
harbor sweeps summarize <sweep-id> --format csv > results.csv
Load in Python:
import pandas as pd
import matplotlib.pyplot as plt

# Load results
df = pd.read_csv("results.csv")

# Plot success rate by temperature
for agent in df["agent"].unique():
    agent_df = df[df["agent"] == agent]
    plt.plot(agent_df["temperature"], agent_df["success_rate"],
             label=agent, marker="o")

plt.xlabel("Temperature")
plt.ylabel("Success Rate")
plt.legend()
plt.title("Success Rate vs Temperature")
plt.savefig("temperature_sweep.png")
Statistical Significance
Compare configurations:
import json
from scipy import stats

def load_trial_rewards(job_id):
    """Load all trial rewards from a job."""
    # Use a context manager so the file handle is closed after reading
    with open(f"jobs/{job_id}/job_result.json") as f:
        job_result = json.load(f)
    rewards = []
    for trial in job_result["trials"]:
        if trial["status"] == "completed":
            rewards.append(trial["reward"])
    return rewards

# Compare two configurations
config_a_rewards = load_trial_rewards("job-abc")
config_b_rewards = load_trial_rewards("job-xyz")

# T-test
t_stat, p_value = stats.ttest_ind(config_a_rewards, config_b_rewards)
print(f"Config A mean: {sum(config_a_rewards) / len(config_a_rewards):.3f}")
print(f"Config B mean: {sum(config_b_rewards) / len(config_b_rewards):.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference")
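When rewards are mostly 0 or 1 and the number of trials is small, the normality assumption behind a t-test can be shaky. A permutation test is one alternative that makes fewer assumptions. The following is a minimal standard-library sketch, not something Harbor provides; `permutation_test` is a hypothetical helper that takes two reward lists like those returned by `load_trial_rewards` above.

```python
import random

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean reward."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign rewards to the two groups at random
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Illustrative inputs only; replace with real trial rewards.
p_value = permutation_test([1] * 10, [0] * 10)
print(f"Permutation p-value: {p_value:.4f}")
```

The returned value estimates how often a random relabeling of the trials produces a gap at least as large as the observed one, so it can be read against the same 0.05 threshold used above.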
Common Sweep Patterns
Model Comparison
sweep:
  agent:
    name: ["claude-code"]
    model:
      - "anthropic/claude-opus-4-1"
      - "anthropic/claude-sonnet-4"
      - "openai/gpt-4"
      - "openai/gpt-4-turbo"
      - "google/gemini-pro"
    temperature: [0.0]
Temperature Tuning
sweep:
  agent:
    name: ["claude-code"]
    model: ["anthropic/claude-opus-4-1"]
    temperature: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Retry Strategy
sweep:
  agent:
    name: ["claude-code"]
    model: ["anthropic/claude-opus-4-1"]
  orchestrator:
    retry:
      max_attempts: [1, 2, 3, 5]
      on_agent_error: [true]
      on_verifier_error: [false]
      backoff_factor: [1.0, 2.0]
Concurrency Optimization
sweep:
  agent:
    name: ["claude-code"]
    model: ["anthropic/claude-opus-4-1"]
  orchestrator:
    n_concurrent_trials: [1, 2, 4, 8, 16, 32]
  environment:
    type: ["docker", "daytona"]
Advanced Techniques
Multi-Objective Optimization
Balance success rate, cost, and time:
import pandas as pd
import numpy as np

df = pd.read_csv("sweep_results.csv")

# Normalize metrics to 0-1 scale
df["success_norm"] = df["success_rate"]
df["cost_norm"] = 1 - (df["avg_cost"] / df["avg_cost"].max())
df["time_norm"] = 1 - (df["avg_time"] / df["avg_time"].max())

# Combined score (weights: 50% success, 30% cost, 20% time)
df["score"] = (
    0.5 * df["success_norm"] +
    0.3 * df["cost_norm"] +
    0.2 * df["time_norm"]
)

# Find best configuration
best = df.loc[df["score"].idxmax()]
print("Best configuration:")
print(f"  Agent: {best['agent']}")
print(f"  Model: {best['model']}")
print(f"  Temperature: {best['temperature']}")
print(f"  Score: {best['score']:.3f}")
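A weighted score collapses the trade-off into one number, and the result depends on the weights chosen. A complementary view is the Pareto front: the configurations that no other configuration beats on every metric at once. The sketch below is a plain-Python illustration, not a Harbor feature; `pareto_front` is a hypothetical helper, and the rows reuse the sample values from the results table earlier in this guide.

```python
def pareto_front(rows):
    """Return rows not dominated by any other row.

    Higher success_rate is better; lower avg_cost and avg_time are better.
    """
    def dominates(a, b):
        no_worse = (a["success_rate"] >= b["success_rate"]
                    and a["avg_cost"] <= b["avg_cost"]
                    and a["avg_time"] <= b["avg_time"])
        strictly_better = (a["success_rate"] > b["success_rate"]
                           or a["avg_cost"] < b["avg_cost"]
                           or a["avg_time"] < b["avg_time"])
        return no_worse and strictly_better

    return [r for r in rows if not any(dominates(o, r) for o in rows)]

# Sample rows taken from the example results table.
rows = [
    {"agent": "claude-code", "model": "opus-4-1", "temperature": 0.0,
     "success_rate": 0.72, "avg_cost": 0.42, "avg_time": 145},
    {"agent": "claude-code", "model": "opus-4-1", "temperature": 0.3,
     "success_rate": 0.68, "avg_cost": 0.45, "avg_time": 152},
    {"agent": "claude-code", "model": "sonnet-4", "temperature": 0.0,
     "success_rate": 0.65, "avg_cost": 0.18, "avg_time": 128},
    {"agent": "aider", "model": "opus-4-1", "temperature": 0.0,
     "success_rate": 0.58, "avg_cost": 0.38, "avg_time": 182},
]

front = pareto_front(rows)
for r in front:
    print(r["agent"], r["model"], r["temperature"])
```

With these sample rows, two configurations remain: claude-code with opus-4-1 at temperature 0.0 (highest success) and claude-code with sonnet-4 at 0.0 (cheapest and fastest). The other two are each worse on all three metrics than one of those.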
Bayesian Optimization
For expensive sweeps, use Bayesian optimization:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
import numpy as np

def objective_function(params):
    """Run Harbor job with given params and return success rate."""
    temp, max_tokens = params
    # Run job (run_harbor_job is a placeholder; supply your own job launcher)
    result = run_harbor_job(
        agent="claude-code",
        model="anthropic/claude-opus-4-1",
        temperature=temp,
        max_tokens=int(max_tokens),
    )
    return result["stats"]["mean_reward"]

# Bayesian optimization loop
X_observed = []  # Parameter combinations tried
y_observed = []  # Observed success rates
for iteration in range(20):
    # Fit GP model (an unfitted GP predicts from its prior on the first pass)
    gp = GaussianProcessRegressor(kernel=RBF())
    if len(X_observed) > 0:
        gp.fit(X_observed, y_observed)

    # Acquisition function (upper confidence bound)
    def acquisition(x):
        mu, sigma = gp.predict([x], return_std=True)
        return mu[0] + 2 * sigma[0]  # Exploration factor

    # Sample next point
    candidates = np.random.rand(1000, 2)
    candidates[:, 0] *= 1.0  # Temperature 0-1
    candidates[:, 1] = candidates[:, 1] * 8000 + 2000  # Tokens 2000-10000
    scores = [acquisition(c) for c in candidates]
    next_params = candidates[np.argmax(scores)]

    # Evaluate
    result = objective_function(next_params)
    X_observed.append(next_params)
    y_observed.append(result)
    print(f"Iteration {iteration}: temp={next_params[0]:.2f}, "
          f"tokens={int(next_params[1])}, success={result:.3f}")

# Best found
best_idx = np.argmax(y_observed)
print(f"\nBest: temp={X_observed[best_idx][0]:.2f}, "
      f"tokens={int(X_observed[best_idx][1])}, "
      f"success={y_observed[best_idx]:.3f}")
Adaptive Sweeps
Focus on promising regions:
import numpy as np

def adaptive_sweep():
    # run_sweep is a placeholder; it is expected to return a DataFrame
    # with "temperature" and "success_rate" columns.
    # Phase 1: Coarse grid
    coarse_temps = [0.0, 0.5, 1.0]
    coarse_results = run_sweep(temperatures=coarse_temps)

    # Find best region
    best_temp = coarse_results.loc[coarse_results["success_rate"].idxmax()]["temperature"]

    # Phase 2: Fine-grained search around best
    fine_temps = np.linspace(max(0, best_temp - 0.3),
                             min(1, best_temp + 0.3),
                             num=11)
    fine_results = run_sweep(temperatures=fine_temps)
    return fine_results
Cost Optimization
Subset Evaluation
Test on small subset first:
# Quick sweep on 20 tasks
dataset:
  registry:
    name: terminal-bench@2.0
  max_tasks: 20
sweep:
  agent:
    name: ["claude-code", "aider", "openhands"]
    model:
      - "anthropic/claude-opus-4-1"
      - "anthropic/claude-sonnet-4"
Then run full evaluation on top performers only.
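Picking the top performers from the subset run can be as simple as sorting the summary rows. The sketch below assumes the rows carry a `success_rate` field like the CSV export shown earlier; `top_performers` is a hypothetical helper, and the numbers are placeholders for illustration, not measured results.

```python
def top_performers(results, k=2, metric="success_rate"):
    """Return the k configurations with the highest value of `metric`."""
    return sorted(results, key=lambda r: r[metric], reverse=True)[:k]

# Placeholder subset results, for illustration only (not real measurements).
subset_results = [
    {"agent": "claude-code", "model": "anthropic/claude-opus-4-1", "success_rate": 0.70},
    {"agent": "claude-code", "model": "anthropic/claude-sonnet-4", "success_rate": 0.60},
    {"agent": "aider", "model": "anthropic/claude-opus-4-1", "success_rate": 0.55},
    {"agent": "openhands", "model": "anthropic/claude-sonnet-4", "success_rate": 0.40},
]

finalists = top_performers(subset_results, k=2)
for cfg in finalists:
    print(cfg["agent"], cfg["model"])
```

With only 20 tasks, small gaps between configurations are within noise, so it is safer to carry forward a few more finalists than the ranking alone suggests.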
Early Stopping
Stop poor configurations early:
def run_with_early_stopping(config, eval_tasks, threshold=0.3):
    # run_harbor_job is a placeholder; supply your own job launcher.
    # Run on first 20% of tasks (at least one task)
    sample_size = max(1, int(len(eval_tasks) * 0.2))
    sample_result = run_harbor_job(config, tasks=eval_tasks[:sample_size])

    # Early stop if poor performance
    sample_reward = sample_result["stats"]["mean_reward"]
    if sample_reward < threshold:
        print(f"Early stopping config {config}: {sample_reward:.2f} < {threshold}")
        return sample_result

    # Continue with full evaluation
    return run_harbor_job(config, tasks=eval_tasks)
Best Practices
Start small: Use a subset of tasks for initial sweeps
One variable at a time: Change a single parameter per sweep when possible
Multiple seeds: Run key configs several times to measure variance
Document findings: Track insights in a markdown file or notebook
Version control: Save sweep configs and results
Monitor costs: Track spending during sweeps
Use cloud: Leverage parallelization for faster sweeps
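To make the "multiple seeds" advice concrete, the spread across repeated runs of the same configuration gives a rough noise floor for comparing configurations. This is a minimal standard-library sketch; `summarize_runs` is a hypothetical helper and the success rates are placeholder values, not measured results.

```python
from statistics import mean, stdev

def summarize_runs(success_rates):
    """Return the mean and sample standard deviation across repeated runs."""
    return mean(success_rates), stdev(success_rates)

# Placeholder success rates from three repeats of one config (illustration only).
runs = [0.70, 0.74, 0.66]
avg, spread = summarize_runs(runs)
print(f"mean={avg:.3f}, std={spread:.3f}")
```

If two configurations differ by less than this spread, the gap is hard to distinguish from run-to-run noise.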
Next Steps
RL Optimization: Use sweep results for RL training
Custom Metrics: Define custom success metrics
Parallel Execution: Optimize sweep performance
Cloud Execution: Scale sweeps to the cloud