Experiments allow you to systematically evaluate and compare different runnables using a set of evaluators. Each experiment is associated with a specific runnable and can test multiple candidate variations against defined evaluation criteria.

Creating an Experiment

To create an experiment, you’ll need:
  • A base runnable to test variations against
  • Previous runs of the base runnable from which to source realistic input resources
  • A set of automated evaluators and/or gold labels from previous runs
  • Configuration for candidate runnables to test
import { Experiment } from "@re-factor/sdk/experiment";

const myCompletionRunnableId = "...";
const summaryLengthEvaluatorId = "...";
const coherenceEvaluatorId = "...";

const experiment = await Experiment.construct({
  name: "Model Comparison Test",
  runnable_type: "completion",
  runnable_id: myCompletionRunnableId,
  description: "Comparing different LLM models for summarization",
  evaluator_ids: [summaryLengthEvaluatorId, coherenceEvaluatorId],
  candidate_runnables: [{
    name: "Claude 3.5 Sonnet",
    llms: [{
      name: "claude-sonnet-3.5",
      provider: "anthropic",
      model: "claude-3.5-sonnet",
      default: true
    }],
  }, {
    name: "Gemini 2.0 Flash Experimental",
    llms: [{
      name: "gemini-2.0-flash-exp",
      provider: "google",
      model: "gemini-2.0-flash-exp",
      default: true
    }],
  }],
  timeout_seconds: 3600
});

await experiment.start();

Experiment Configuration

Key fields in the experiment configuration (the optional fields are illustrated in the sketch after this list):
  • runnable_id: The ID of the base runnable being tested
  • evaluator_ids: Array of evaluator IDs to use for assessment
  • run_filters: Optional filters to select specific runs for evaluation
  • candidate_runnables: Configuration for variations to test
  • timeout_seconds: Maximum time allowed per run (-1 for no timeout)
  • max_runs: Maximum number of runs to evaluate
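For example, a configuration that narrows the source runs with run_filters and caps evaluation with max_runs might look like the sketch below. The construct call mirrors the earlier example, but the exact shape of run_filters is not documented here, so the filter fields shown are hypothetical.
import { Experiment } from "@re-factor/sdk/experiment";

const myCompletionRunnableId = "...";
const coherenceEvaluatorId = "...";

const experiment = await Experiment.construct({
  name: "Filtered Coherence Check",
  runnable_type: "completion",
  runnable_id: myCompletionRunnableId,
  evaluator_ids: [coherenceEvaluatorId],
  // Hypothetical filter fields: restrict evaluation to recent production runs
  run_filters: {
    created_after: "2024-01-01T00:00:00Z",
    environment: "production"
  },
  max_runs: 100,       // evaluate at most 100 runs
  timeout_seconds: -1, // no per-run timeout
  candidate_runnables: [{
    name: "Claude 3.5 Sonnet",
    llms: [{
      name: "claude-sonnet-3.5",
      provider: "anthropic",
      model: "claude-3.5-sonnet",
      default: true
    }],
  }],
});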

Managing Experiments

List experiments for a runnable:
import { Experiment } from "@re-factor/sdk/experiment";

const runnableId = "...";
const experiments = await Experiment.list({
  runnable_id: runnableId
});
Load a specific experiment:
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);
Cancel a specific experiment:
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);
await experiment.cancel();

Experiment Results

Results are available through the evaluations attached to each experiment run. You can analyze these to compare performance across the base runnable and each candidate configuration.
import { Experiment } from "@re-factor/sdk/experiment";

const experimentId = "...";
const experiment = await Experiment.load(experimentId);

// Get aggregated results for the current runnable and candidates
const evaluationResults = await experiment.getEvaluationResults();

// Get individual runs with their outputs and evaluation values
// for the current runnable and candidates
const runs = await experiment.listRuns();
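
Continuing from the snippet above, the exact shape of the returned objects is not shown in this section. As a rough sketch, assuming each aggregated result exposes the candidate name, the evaluator name, and an average score (hypothetical field names), a comparison could look like:
// Field names below are assumed for illustration; adjust to the actual result shape
for (const result of evaluationResults) {
  console.log(`${result.candidate_name} / ${result.evaluator_name}: ${result.average_score}`);
}

// Individual runs (fields likewise assumed) help explain a surprising average
for (const run of runs) {
  console.log(run.id, run.output, run.evaluations);
}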

Best Practices

  1. Evaluator Selection: Choose evaluators that measure relevant aspects of performance for your use case.
  2. Timeout Configuration: Set appropriate timeouts based on expected processing time.
  3. Run Filters: Use filters to focus evaluation on specific types of inputs or scenarios.
  4. Resource Management: Monitor resource usage when running large experiments.