Motivation
While AI brings powerful automation benefits to your business, its outputs are not always correct. As businesses automate more and more processes, those processes sometimes start to fail in unexpected ways, or in ways that differ from the failure modes they experienced with human operators. Continuously evaluating the points where AI touches your business is a powerful way to unlock the efficiencies of AI while limiting the risk of error. With re-factor, we seek to provide you with the tools you need to continuously monitor and evaluate the performance of your AI-enabled processes.

Concepts
Evaluation Labels
Evaluation labels define the schema and constraints for evaluating outputs in your AI workflows. They provide a structured way to specify what aspects of the output should be evaluated and how they should be measured.

Value Types
Evaluation labels support four different value types to accommodate various evaluation needs:
- Binary (`binary`): Simple yes/no or true/false evaluations. Useful for checking if an output meets a specific criterion.
- Continuous (`continuous`): Numeric scores within a defined range. Useful for rating quality on a scale.
- Categorical (`categorical`): Fixed set of categories or labels. Useful for classifying outputs into distinct categories.
- Ordinal (`ordinal`): Ordered categories. Useful when categories have a natural order.
Using Evaluation Labels
Evaluation labels are used in conjunction with evaluators to assess the quality of AI outputs. Here’s an example of creating an evaluation label:
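The following is a minimal sketch, assuming a hypothetical `refactor` Python client with a `create_evaluation_label` method; the client, method, and parameter names are illustrative rather than the actual SDK.

```python
# Hypothetical re-factor client; names are illustrative, not the actual SDK.
from refactor import Client

client = Client(api_key="YOUR_API_KEY")

# A continuous quality score on a fixed 1-10 scale.
quality_label = client.create_evaluation_label(
    name="response_quality",
    description="Overall quality of the generated response, from 1 (poor) to 10 (excellent).",
    value_type="continuous",
    metadata={"min": 1, "max": 10},
)

# A categorical label for classifying the type of failure, if any.
failure_label = client.create_evaluation_label(
    name="failure_mode",
    description="Category of failure observed in the output.",
    value_type="categorical",
    metadata={"categories": ["none", "hallucination", "formatting", "refusal"]},
)
```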
Best Practices
- Clear Names and Descriptions: Use descriptive names and detailed descriptions to ensure evaluators understand what they’re measuring.
- Appropriate Value Types: Choose the most appropriate value type for your evaluation needs:
  - Use `binary` for clear yes/no criteria
  - Use `continuous` for nuanced quality assessments
  - Use `categorical` for distinct, unordered classifications
  - Use `ordinal` when categories have a natural progression
- Consistent Ranges: For continuous values, establish consistent ranges across related evaluations (e.g., always use 1-10 or 0-100)
- Validation Rules: Include any necessary validation rules in the metadata to ensure consistent evaluation
Run Evaluations
Run evaluations represent assessments of specific outputs or aspects of a run. Each evaluation is associated with an evaluation label that defines the type of value being measured (binary, continuous, categorical, or ordinal).

Creating Run Evaluations
Manual run evaluations can be created in two ways:
- API-Based Manual Evaluation: Users can create evaluations directly through the API, as shown in the sketch after this list.
- UI-Based Manual Evaluation: Users can also create evaluations through the re-factor UI:
  - Click on a run in the UI
  - Click on the “Add Evaluation” button
  - Select or create the evaluation label
  - Enter the evaluation value
  - Click on the “Save” button
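For the API-based path, here is a minimal sketch, again assuming the hypothetical `refactor` Python client and a `create_run_evaluation` method; the run ID, label name, and field values are illustrative.

```python
# Hypothetical client; IDs, method, and field names are illustrative.
from refactor import Client

client = Client(api_key="YOUR_API_KEY")

# Attach a manual evaluation to an existing run, scored against a
# continuous "response_quality" label (1-10 scale).
evaluation = client.create_run_evaluation(
    run_id="run_abc123",
    evaluation_label="response_quality",
    output_name="answer",       # optional: which output is being judged
    field_path="$.summary",     # optional: JSON path to a specific field
    value=8,                    # must conform to the label's value type
)
```

The fields used here correspond to the evaluation fields documented below.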
Automated Evaluators
Run evaluations can also be created automatically by Evaluators. Here’s an example of creating an evaluator:
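As a rough sketch, assuming the same hypothetical `refactor` client and a `create_evaluator` method; the LLM-as-judge style configuration shown is an assumption for illustration, not the actual interface.

```python
# Hypothetical client; the evaluator configuration is illustrative only.
from refactor import Client

client = Client(api_key="YOUR_API_KEY")

# An automated evaluator that scores new runs against the
# "response_quality" label using judging instructions.
evaluator = client.create_evaluator(
    name="quality_judge",
    evaluation_label="response_quality",
    instructions=(
        "Rate the response from 1 to 10 for factual accuracy, "
        "completeness, and tone."
    ),
)
```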
Evaluation Fields
- `output_name`: Name of the specific output being evaluated (optional)
- `field_path`: JSON path to a specific field within the output (optional)
- `value`: The evaluation value; must conform to the evaluation label’s type
- `created_by_evaluator_id`: ID of the evaluator, if created automatically
Gold Labels
Gold labels represent the ground truth or expected outputs for a run. They are essential for:
- Training and fine-tuning models
- Evaluating model performance
- Creating test suites
- Benchmarking different model configurations
Types of Gold Labels
- Text Gold Labels: Simple text values representing the expected output.
- Object Gold Labels: Structured data representing complex expected outputs.
- Partial Gold Labels: Object gold labels that specify only some of the expected fields.
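The sketch below illustrates all three kinds, assuming the hypothetical `refactor` client and a `create_gold_label` method; the method, field names, and the `partial` flag are assumptions for illustration.

```python
# Hypothetical client; method and field names are illustrative.
from refactor import Client

client = Client(api_key="YOUR_API_KEY")

# Text gold label: the exact expected output string.
client.create_gold_label(
    run_id="run_abc123",
    output_name="answer",
    value="Your order #1042 ships on Friday.",
)

# Object gold label: structured expected output.
client.create_gold_label(
    run_id="run_def456",
    output_name="extracted_invoice",
    value={"invoice_number": "INV-2031", "total": 149.99, "currency": "USD"},
)

# Partial gold label: only the fields that must match are specified.
client.create_gold_label(
    run_id="run_def456",
    output_name="extracted_invoice",
    value={"total": 149.99},
    partial=True,
)
```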
Using Gold Labels with Evaluators
Gold labels can be used by evaluators to automatically assess model outputs:
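A rough sketch of wiring an evaluator to gold labels, again assuming the hypothetical `refactor` client; the `comparison` and `use_gold_labels` parameters are assumptions for illustration.

```python
# Hypothetical client; the comparison strategy shown is illustrative.
from refactor import Client

client = Client(api_key="YOUR_API_KEY")

# An evaluator that compares each run's output to its gold label and
# records a binary pass/fail evaluation against a "matches_gold_label" label.
evaluator = client.create_evaluator(
    name="gold_label_match",
    evaluation_label="matches_gold_label",  # a binary label
    use_gold_labels=True,
    comparison="exact_match",  # e.g. a partial-match strategy for partial gold labels
)
```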
Best Practices
- Clear Names and Descriptions: Use descriptive names and detailed descriptions to ensure evaluators understand what they’re measuring.
- Consistent Ranges: For continuous values, establish consistent ranges across related evaluations (e.g., always use 1-10 or 0-100)
- Validation Rules: Include any necessary validation rules in the metadata to ensure consistent evaluation

