Motivation

While AI brings powerful automation benefits to your business, its outputs are not always correct. As businesses automate more of their processes, they sometimes find that those processes start to fail in unexpected ways, different from the failure modes they saw with human operators. Continuously evaluating the points where your business and AI intersect is a powerful way to unlock the efficiencies of AI while limiting the risk of error. With re-factor, we aim to give you the tools you need to continuously monitor and evaluate the performance of your AI-enabled processes.

Concepts

Evaluation Labels

Evaluation labels define the schema and constraints for evaluating outputs in your AI workflows. They provide a structured way to specify what aspects of the output should be evaluated and how they should be measured.

Value Types

Evaluation labels support four different value types to accommodate various evaluation needs:
  • Binary (binary): Simple yes/no or true/false evaluations. Useful for checking if an output meets a specific criterion.
    {
      "name": "Factual Accuracy",
      "description": "Does the response contain only factually accurate information?",
      "value_type": "binary"
    }
    
  • Continuous (continuous): Numeric scores within a defined range. Useful for rating quality on a scale.
    {
      "name": "Response Quality",
      "description": "Overall quality of the response",
      "value_type": "continuous",
      "min_value": 1,
      "max_value": 10
    }
    
  • Categorical (categorical): Fixed set of categories or labels. Useful for classifying outputs into distinct categories.
    {
      "name": "Tone",
      "description": "The tone of the response",
      "value_type": "categorical",
      "categories": ["formal", "casual", "technical", "friendly"]
    }
    
  • Ordinal (ordinal): Ordered categories. Useful when categories have a natural order.
    {
      "name": "Complexity Level",
      "description": "The complexity level of the response",
      "value_type": "ordinal",
      "categories": ["beginner", "intermediate", "advanced", "expert"]
    }
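
Taken together, these schemas can be modeled as a small discriminated union. The sketch below uses local, illustrative types and a local validity check inferred from the examples above; none of these names are exported by @re-factor/sdk.

// Local, illustrative types inferred from the schemas above; not part of @re-factor/sdk.
type BinaryLabel = { name: string; description: string; value_type: 'binary' };
type ContinuousLabel = { name: string; description: string; value_type: 'continuous'; min_value: number; max_value: number };
type CategoricalLabel = { name: string; description: string; value_type: 'categorical'; categories: string[] };
type OrdinalLabel = { name: string; description: string; value_type: 'ordinal'; categories: string[] };
type EvaluationLabelSchema = BinaryLabel | ContinuousLabel | CategoricalLabel | OrdinalLabel;

// Check that a proposed evaluation value conforms to its label's value type.
function isValidValue(label: EvaluationLabelSchema, value: unknown): boolean {
  switch (label.value_type) {
    case 'binary':
      return typeof value === 'boolean';
    case 'continuous':
      return typeof value === 'number' && value >= label.min_value && value <= label.max_value;
    case 'categorical':
    case 'ordinal':
      return typeof value === 'string' && label.categories.includes(value);
    default:
      return false;
  }
}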
    

Using Evaluation Labels

Evaluation labels are used in conjunction with evaluators to assess the quality of AI outputs. Here’s an example of creating an evaluation label:
import { EvaluationLabel } from '@re-factor/sdk/evaluations';

// Create an evaluation label
const factualAccuracy = await EvaluationLabel.create({
  name: 'Factual Accuracy',
  description: 'Evaluates if the response contains only factually accurate information',
  value_type: 'binary'
});
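
Type-specific fields from the schemas above are assumed to be passed through the same create call; for example, a categorical label might be created like this:

// Sketch: a categorical label. Assumes the categories field is accepted by
// EvaluationLabel.create just as it appears in the schema above.
const tone = await EvaluationLabel.create({
  name: 'Tone',
  description: 'The tone of the response',
  value_type: 'categorical',
  categories: ['formal', 'casual', 'technical', 'friendly']
});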

Best Practices

  1. Clear Names and Descriptions: Use descriptive names and detailed descriptions to ensure evaluators understand what they’re measuring.
  2. Appropriate Value Types: Choose the most appropriate value type for your evaluation needs:
    • Use binary for clear yes/no criteria
    • Use continuous for nuanced quality assessments
    • Use categorical for distinct, unordered classifications
    • Use ordinal when categories have a natural progression
  3. Consistent Ranges: For continuous values, establish consistent ranges across related evaluations (e.g., always use 1-10 or 0-100).
  4. Validation Rules: Include any necessary validation rules in the metadata to ensure consistent evaluation.
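
As an illustration of points 3 and 4, the sketch below keeps two related quality labels on the same 1-10 scale. The metadata field is an assumption about where a validation rule might live; its exact shape is not specified in this document.

// Shared range so related quality labels stay comparable (best practice 3).
const SHARED_RANGE = { min_value: 1, max_value: 10 };

const clarity = await EvaluationLabel.create({
  name: 'Clarity',
  description: 'How clearly the response is written, on a 1-10 scale',
  value_type: 'continuous',
  ...SHARED_RANGE,
  // Hypothetical metadata carrying a validation rule (best practice 4);
  // the exact metadata shape is not specified in this document.
  metadata: { validation: 'Round to the nearest integer before saving' }
});

const helpfulness = await EvaluationLabel.create({
  name: 'Helpfulness',
  description: 'How helpful the response is, on the same 1-10 scale',
  value_type: 'continuous',
  ...SHARED_RANGE
});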

Run Evaluations

Run evaluations represent assessments of specific outputs or aspects of a run. Each evaluation is associated with an evaluation label that defines the type of value being measured (binary, continuous, categorical, or ordinal).

Creating Run Evaluations

Manual run evaluations can be created in two ways:
  1. API Based Manual Evaluation: Users can directly create evaluations through the API:
    // Create a manual evaluation for output quality
    // (RunEvaluation is assumed to come from the same module as EvaluationLabel)
    import { RunEvaluation } from '@re-factor/sdk/evaluations';

    const evaluation = await RunEvaluation.create({
      run_id: "run_123",
      evaluation_label_id: "label_456",
      output_name: "response",
      value: 0.85
    });
    
  2. UI Based Manual Evaluation: Users can also create evaluations through the re-factor UI:
    1. Click on a run in the UI
    2. Click on the “Add Evaluation” button
    3. Select or create the evaluation label
    4. Enter the evaluation value
    5. Click on the “Save” button

Automated Evaluators

Run evaluations can also be created automatically by Evaluators. Here’s an example of creating an evaluator:
// Configure an evaluator to assess factual accuracy
// (Evaluator is assumed to come from the same module as EvaluationLabel)
import { Evaluator } from '@re-factor/sdk/evaluations';

const evaluator = await Evaluator.create({
  name: "Fact Checker",
  evaluation_label_id: "label_789",
  config: {
    prompt_template: "Evaluate the factual accuracy...",
    evaluation_criteria: [...]
  }
});

// The evaluator will create run evaluations automatically
await evaluator.evaluate(run);
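
Since evaluator.evaluate takes a single run, evaluating a batch is just a loop over runs you already have in hand. In the sketch below, the runs array is assumed to have been fetched elsewhere.

// Sketch: run the same evaluator over a batch of runs.
// `runs` is assumed to be an array of run objects fetched elsewhere.
const evaluations = [];
for (const run of runs) {
  evaluations.push(await evaluator.evaluate(run)); // one run evaluation per run
}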

Evaluation Fields

  • output_name: Name of the specific output being evaluated (optional)
  • field_path: JSON path to a specific field within the output (optional)
  • value: The evaluation value, must conform to the evaluation label’s type
  • created_by_evaluator_id: ID of the evaluator if created automatically
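
For example, output_name and field_path can be combined to score a single field of a structured output. The identifiers and JSON path below are illustrative.

// Sketch: target one field of a structured output via field_path.
// IDs and the JSON path are illustrative.
const fieldEvaluation = await RunEvaluation.create({
  run_id: "run_123",
  evaluation_label_id: "label_456",
  output_name: "extracted_info",
  field_path: "$.name",   // JSON path to the field being evaluated
  value: 0.92             // must conform to the evaluation label's type
});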

Gold Labels

Gold labels represent the ground truth or expected outputs for a run. They are essential for:
  • Training and fine-tuning models
  • Evaluating model performance
  • Creating test suites
  • Benchmarking different model configurations

Types of Gold Labels

  1. Text Gold Labels: Simple text values representing the expected output:
    // (RunGoldLabel is assumed to come from the same module as EvaluationLabel)
    import { RunGoldLabel } from '@re-factor/sdk/evaluations';

    await RunGoldLabel.create({
      run_id: "run_123",
      output_resource_name: "translation",
      text_value: "Bonjour le monde"
    });
    
  2. Object Gold Labels: Structured data representing complex expected outputs:
    await RunGoldLabel.create({
      run_id: "run_123",
      output_resource_name: "extracted_info",
      object_value: {
        name: "John Doe",
        age: 30,
        occupation: "Software Engineer"
      }
    });
    
  3. Partial Gold Labels: Object gold labels that only specify some of the expected fields:
    await RunGoldLabel.create({
      run_id: "run_123",
      output_resource_name: "extracted_info",
      object_value: {
        name: "John Doe"
      },
      object_value_is_partial: true
    });
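
A partial gold label only constrains the fields it includes; extra fields in the actual output are ignored. The local helper below (a plain deep-subset check, not part of @re-factor/sdk) makes that comparison semantics concrete.

// Local illustration of partial-gold-label semantics: every field present in the
// (partial) expected object must match the actual output; extra output fields are ignored.
// This is a plain helper for illustration, not part of @re-factor/sdk.
function matchesPartial(expected: unknown, actual: unknown): boolean {
  if (typeof expected !== 'object' || expected === null) {
    return expected === actual;
  }
  if (typeof actual !== 'object' || actual === null) {
    return false;
  }
  return Object.entries(expected).every(([key, value]) =>
    matchesPartial(value, (actual as Record<string, unknown>)[key])
  );
}

// { name: 'John Doe' } matches an output that also contains age and occupation.
matchesPartial({ name: 'John Doe' }, { name: 'John Doe', age: 30, occupation: 'Software Engineer' }); // true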
    

Using Gold Labels with Evaluators

Gold labels can be used by evaluators to automatically assess model outputs:
const evaluator = await Evaluator.create({
  name: "Translation Accuracy",
  evaluation_label_id: "label_789",
  config: {
    prompt_template: "Compare the translation with the gold standard...",
    evaluation_criteria: [
      "Accuracy of meaning",
      "Grammar correctness",
      "Style preservation"
    ]
  }
});

// The evaluator will compare the run output with the gold label
const evaluation = await evaluator.evaluate(run);
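
The returned evaluation carries the value field described under Evaluation Fields, so a follow-up check might look like this (the 0.8 threshold is illustrative):

// Sketch: act on the evaluation's value (see Evaluation Fields above).
// The 0.8 threshold is illustrative; choose one that fits your label's range.
if (typeof evaluation.value === 'number' && evaluation.value < 0.8) {
  console.warn('Translation fell below the accuracy threshold; flag it for human review');
}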
