Similarity Evaluator

The Similarity Evaluator measures text similarity between actual and expected outputs using a choice of algorithms. It is the default evaluator and the one most commonly used for content quality assessment.

🎯 Overview

The similarity evaluator compares your agent's output against expected results using configurable similarity algorithms. It's ideal for:

  • Content quality assessment
  • Paraphrasing validation
  • Translation quality
  • Text generation evaluation

⚙️ Configuration

Global Configuration (config.yaml)

evaluators:
  - name: 'similarity'
    type: 'string_similarity'
    config:
      method: 'cosine' # cosine, levenshtein, jaccard, exact
      threshold: 0.8 # Pass threshold (0-1)
    weight: 1.0
    enabled: true

Per-Test Configuration

@agent_test(criteria=['similarity'])
def test_with_custom_config():
    return {
        "input": "Test input",
        "actual": "Agent response",
        "expected": "Expected response",
        "similarity_config": {
            "method": "levenshtein",
            "threshold": 0.9
        }
    }

📊 Similarity Methods

Cosine Similarity (Default)

Best for: General text similarity, semantic comparison

config:
  method: 'cosine'
  threshold: 0.8

  • Uses TF-IDF vectorization
  • Range: 0.0 to 1.0
  • Good for semantic similarity
  • Handles different text lengths well
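
For intuition, here is a minimal sketch of TF-IDF cosine similarity using scikit-learn; the evaluator's internal implementation may differ:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(actual: str, expected: str) -> float:
    # Fit a shared TF-IDF vocabulary over both texts, then compare the vectors
    vectors = TfidfVectorizer().fit_transform([actual, expected])
    # cosine_similarity returns a 2x2 matrix; [0, 0] is actual vs. expected
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])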

Example:

@agent_test(criteria=['similarity'])
def test_semantic_similarity():
    return {
        "input": "What is machine learning?",
        "actual": "ML is a subset of AI that enables computers to learn from data",
        "expected": "Machine learning allows computers to learn patterns from data automatically"
        # Cosine similarity: ~0.85 (high semantic similarity despite different wording)
    }

Levenshtein Distance

Best for: Exact matching with tolerance for typos

config:
  method: 'levenshtein'
  threshold: 0.9

  • Edit distance based
  • Range: 0.0 to 1.0
  • Good for exact matches with minor variations
  • Sensitive to character-level changes
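
A minimal pure-Python sketch of how a normalized score can be derived from edit distance (production code would typically use an optimized library):

def levenshtein_score(a: str, b: str) -> float:
    # Classic dynamic-programming edit distance, computed one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    # Normalize edit distance to a 0-1 similarity score
    return 1.0 - prev[-1] / max(len(a), len(b), 1)

print(levenshtein_score("SUCCESS", "SUCCESS"))  # 1.0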

Example:

@agent_test(criteria=['similarity'])
def test_exact_format():
    return {
        "input": "Format as JSON",
        "actual": '{"name": "John", "age": 30}',
        "expected": '{"name": "John", "age": 30}',
        "similarity_config": {"method": "levenshtein", "threshold": 0.95}
        # Levenshtein: 1.0 (exact match)
    }

Jaccard Similarity

Best for: Keyword similarity, set-based comparison

config:
  method: 'jaccard'
  threshold: 0.7

  • Set intersection over union
  • Range: 0.0 to 1.0
  • Good for keyword presence
  • Order-independent
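
A minimal sketch over lowercased, whitespace-split token sets (the evaluator's actual tokenization may differ, e.g. stripping punctuation):

def jaccard_score(actual: str, expected: str) -> float:
    a = set(actual.lower().split())
    b = set(expected.lower().split())
    if not a and not b:
        return 1.0  # two empty texts count as identical
    # Size of the intersection divided by size of the union
    return len(a & b) / len(a | b)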

Example:

@agent_test(criteria=['similarity'])
def test_keyword_coverage():
    return {
        "input": "List programming languages",
        "actual": "Python, Java, JavaScript, C++, Go",
        "expected": "JavaScript, Python, Java, C++, Rust",
        "similarity_config": {"method": "jaccard", "threshold": 0.6}
        # Jaccard: ~0.67 (4 common tokens out of 6 unique)
    }

Exact Match

Best for: Precise output validation

config:
  method: 'exact'
  threshold: 1.0

  • Exact string comparison
  • Range: 0.0 or 1.0 only
  • Case-sensitive
  • No tolerance for differences
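
The comparison itself is a plain equality check; this sketch just makes the case sensitivity explicit:

def exact_score(actual: str, expected: str) -> float:
    # Binary outcome: any difference, including letter case, scores 0.0
    return 1.0 if actual == expected else 0.0

print(exact_score("SUCCESS", "SUCCESS"))  # 1.0
print(exact_score("SUCCESS", "success"))  # 0.0 (case-sensitive)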

Example:

@agent_test(criteria=['similarity'])
def test_exact_output():
    return {
        "input": "Return exactly: 'SUCCESS'",
        "actual": "SUCCESS",
        "expected": "SUCCESS",
        "similarity_config": {"method": "exact"}
        # Exact: 1.0 (perfect match)
    }

📈 Advanced Usage

Multiple Test Cases

@agent_test(criteria=['similarity'])
def test_batch_evaluation():
    test_cases = [
        {
            "input": "Question 1",
            "actual": "Answer 1",
            "expected": "Expected 1"
        },
        {
            "input": "Question 2",
            "actual": "Answer 2",
            "expected": "Expected 2"
        }
    ]

    return {"test_cases": test_cases}

Dynamic Threshold Adjustment

@agent_test(criteria=['similarity'])
def test_adaptive_threshold():
    difficulty = "high"  # Based on input complexity
    threshold = 0.9 if difficulty == "easy" else 0.7

    return {
        "input": complex_prompt,
        "actual": agent_response,
        "expected": reference_response,
        "similarity_config": {
            "method": "cosine",
            "threshold": threshold
        }
    }

Preprocessing Options

@agent_test(criteria=['similarity'])
def test_with_preprocessing():
    import re

    def clean_text(text):
        # Remove extra whitespace and normalize
        return re.sub(r'\s+', ' ', text.strip().lower())

    actual_cleaned = clean_text(agent_response)
    expected_cleaned = clean_text(reference_response)

    return {
        "input": prompt,
        "actual": actual_cleaned,
        "expected": expected_cleaned
    }

📋 Best Practices

Method Selection Guide

Use Case                Recommended Method    Threshold Range
Semantic similarity     cosine                0.7 - 0.9
Exact formatting        levenshtein           0.9 - 1.0
Keyword presence        jaccard               0.6 - 0.8
Precise validation      exact                 1.0

Threshold Guidelines

  • 0.9-1.0: Very strict, near-exact matching
  • 0.8-0.9: High similarity, good paraphrasing
  • 0.7-0.8: Moderate similarity, semantic equivalence
  • 0.6-0.7: Loose similarity, general topic match
  • <0.6: Very permissive, basic relevance

Common Patterns

Content Generation

@agent_test(criteria=['similarity'])
def test_content_generation():
    return {
        "input": "Write about renewable energy",
        "actual": agent_content,
        "expected": reference_content,
        "similarity_config": {
            "method": "cosine",
            "threshold": 0.75  # Allow creative variation
        }
    }

Translation Quality

@agent_test(criteria=['similarity'])
def test_translation():
    return {
        "input": "Translate: 'Hello world'",
        "actual": agent_translation,
        "expected": "Hola mundo",
        "similarity_config": {
            "method": "levenshtein",
            "threshold": 0.9  # High accuracy for translation
        }
    }

Code Generation

@agent_test(criteria=['similarity'])
def test_code_generation():
    return {
        "input": "Write a Python function to sort a list",
        "actual": agent_code,
        "expected": reference_code,
        "similarity_config": {
            "method": "jaccard",
            "threshold": 0.8  # Focus on keyword/structure similarity
        }
    }

🔧 Troubleshooting

Common Issues

Low Similarity Scores

  • Check if method matches your use case
  • Adjust threshold based on expected variation
  • Consider preprocessing to normalize text
  • Use multiple evaluators for comprehensive assessment

Inconsistent Results

  • Ensure consistent text formatting
  • Remove or normalize special characters
  • Consider case sensitivity
  • Check for encoding issues
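
If you suspect formatting or encoding drift, a small normalization pass before comparison often stabilizes scores; a sketch using only the standard library:

import unicodedata

def normalize(text: str) -> str:
    # NFKC folds compatibility characters (e.g. full-width forms, ligatures)
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace and fold case for a consistent comparison basis
    return " ".join(text.split()).casefold()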

Performance Issues

  • Cosine similarity scales well to long texts (vectorization cost grows roughly linearly with length)
  • Levenshtein is quadratic in the string lengths and can be slow for very long strings
  • Consider text length limits for batch processing (see the sketch below)
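
If long inputs are unavoidable, one pragmatic workaround is to cap string length before a Levenshtein comparison (a sketch; the cap value is arbitrary and should be tuned for your workload):

def truncate_for_levenshtein(text: str, max_len: int = 2000) -> str:
    # Levenshtein cost grows with len(a) * len(b); bounding length bounds cost
    return text[:max_len]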

Debug Information

The similarity evaluator provides detailed debug information:

# Result details include:
{
    "passed": True,
    "score": 0.85,
    "threshold": 0.8,
    "details": {
        "actual": "Agent response",
        "expected": "Expected response",
        "similarity": 0.85,
        "method": "cosine"
    }
}

📚 Examples

See Basic Examples for more similarity evaluator usage patterns.