Experiments allow you to systematically evaluate and compare different runnables using a set of evaluators. Each experiment is associated with a specific runnable and can test multiple candidate variations against defined evaluation criteria.

Creating an Experiment

To create an experiment, you’ll need:
  • A base runnable to test variations against
  • Previous runs of the base runnable, used as a source of realistic inputs
  • A set of automated evaluators and/or gold labels from previous runs
  • Configuration for candidate runnables to test
import { Experiment } from "@re-factor/sdk/experiment";

const myCompletionRunnableId = "...";
const summaryLengthEvaluatorId = "...";
const coherenceEvaluatorId = "...";

const experiment = await Experiment.construct({
  name: "Model Comparison Test",
  runnable_type: "completion",
  runnable_id: myCompletionRunnableId,
  description: "Comparing different LLM models for summarization",
  evaluator_ids: [summaryLengthEvaluatorId, coherenceEvaluatorId],
  candidate_runnables: [{
    name: "Claude 3.5 Sonnet",
    llms: [{
      name: "claude-sonnet-3.5",
      provider: "anthropic",
      model: "claude-3.5-sonnet",
      default: true
    }],
  }, {
    name: "Gemini 2.0 Flash Experimental",
    llms: [{
      name: "gemini-2.0-flash-exp",
      provider: "google",
      model: "gemini-2.0-flash-exp",
      default: true
    }],
  }],
  timeout_seconds: 3600
});

await experiment.start();

Experiment Configuration

Key fields in the experiment configuration:
  • runnable_id: The ID of the base runnable being tested
  • evaluator_ids: Array of evaluator IDs to use for assessment
  • run_filters: Optional filters to select specific runs for evaluation
  • candidate_runnables: Configuration for variations to test
  • timeout_seconds: Maximum time allowed per run (-1 for no timeout)
  • max_runs: Maximum number of runs to evaluate
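
As a rough illustration of how these fields constrain one another, the sketch below checks a configuration object before it is submitted. The `ExperimentConfigSketch` interface and the `checkExperimentConfig` helper are hypothetical names introduced here and are not part of the SDK. The rule that max_runs must be a positive integer is also an assumption. Only the -1 convention for timeout_seconds comes from the field list above.

```typescript
// Hypothetical local mirror of a few documented fields; NOT the SDK's own type.
interface ExperimentConfigSketch {
  runnable_id: string;
  timeout_seconds: number; // -1 means no timeout, per the field list above
  max_runs?: number;       // assumed here to be a positive integer when set
}

// Returns a list of problems found; an empty list means these checks passed.
function checkExperimentConfig(config: ExperimentConfigSketch): string[] {
  const problems: string[] = [];

  if (config.runnable_id.trim() === "") {
    problems.push("runnable_id is empty");
  }

  const t = config.timeout_seconds;
  const timeoutOk = t === -1 || (Number.isFinite(t) && t > 0);
  if (!timeoutOk) {
    problems.push("timeout_seconds must be -1 or a positive number");
  }

  const m = config.max_runs;
  if (m !== undefined && !(Number.isInteger(m) && m > 0)) {
    problems.push("max_runs must be a positive integer");
  }

  return problems;
}
```

Running a check like this before constructing the experiment can surface a mistyped timeout or run limit locally, before any runs are started.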

Managing Experiments

List experiments for a runnable:
import { Experiment } from "@re-factor/sdk/experiment";

const runnableId = "...";
const experiments = await Experiment.list({
  runnable_id: runnableId
});

Load a specific experiment:
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);

Cancel a specific experiment:
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);
await experiment.cancel();

Experiment Results

Results are available through the run evaluations associated with each experiment run. You can analyze these to compare performance across different configurations.
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);

// Get aggregated results for the base runnable and the candidates
const evaluationResults = await experiment.getEvaluationResults();

// Get individual runs, with outputs and evaluation values for
// the base runnable and the candidates
const runs = await experiment.listRuns();
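
One way to compare configurations is to average each evaluator's score per candidate. The sketch below assumes the evaluation values have already been flattened into rows of candidate name, evaluator name, and numeric score. The `ScoreRow` shape and the `meanScoreByCandidate` helper are hypothetical, and the actual objects returned by `listRuns()` and `getEvaluationResults()` may look quite different.

```typescript
// Hypothetical flattened row; the SDK's real return types are not assumed here.
interface ScoreRow {
  candidate: string;
  evaluator: string;
  score: number;
}

// Computes the mean score per candidate for a single evaluator.
function meanScoreByCandidate(rows: ScoreRow[], evaluator: string): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();

  for (const row of rows) {
    if (row.evaluator !== evaluator) continue; // ignore other evaluators
    const entry = sums.get(row.candidate) ?? { total: 0, count: 0 };
    entry.total += row.score;
    entry.count += 1;
    sums.set(row.candidate, entry);
  }

  const means = new Map<string, number>();
  sums.forEach((entry, candidate) => {
    means.set(candidate, entry.total / entry.count);
  });
  return means;
}
```

Candidates with no rows for the chosen evaluator are simply absent from the returned map, which keeps missing data distinct from a score of zero.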

Best Practices

  1. Evaluator Selection: Choose evaluators that measure relevant aspects of performance for your use case.
  2. Timeout Configuration: Set appropriate timeouts based on expected processing time.
  3. Run Filters: Use filters to focus evaluation on specific types of inputs or scenarios.
  4. Resource Management: Monitor resource usage when running large experiments.
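
For the timeout recommendation, one simple heuristic is to take the slowest duration observed in previous runs and multiply it by a safety factor. The helper below is a hypothetical sketch of that heuristic, not an SDK feature. The default factor of 2 is an arbitrary assumption that should be tuned to your workload.

```typescript
// Hypothetical helper: suggests a timeout_seconds value from past run durations.
// The safety factor default (2) is an assumption, not an SDK recommendation.
function suggestTimeoutSeconds(durationsSeconds: number[], safetyFactor: number = 2): number {
  if (durationsSeconds.length === 0) {
    throw new Error("need at least one observed duration");
  }
  if (safetyFactor < 1) {
    throw new Error("safetyFactor should be at least 1");
  }
  // Find the slowest observed run.
  const slowest = durationsSeconds.reduce((max, d) => (d > max ? d : max), durationsSeconds[0]);
  // Round up to whole seconds so the suggestion is never below slowest * safetyFactor.
  return Math.ceil(slowest * safetyFactor);
}
```

The returned value could then be passed as timeout_seconds when constructing the experiment.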