NLP Metrics Evaluator

The NLP Metrics Evaluator provides standard evaluation metrics for text generation tasks: ROUGE, BLEU, and METEOR. These metrics are widely used in academic research and industry to evaluate summarization, translation, and text generation quality.

🎯 Overview

The metrics evaluator implements established NLP evaluation metrics for quantitative assessment of text generation quality. It's ideal for:

  • Summarization quality assessment
  • Translation evaluation
  • Text generation benchmarking
  • Academic research and comparison
  • Automated quality scoring

βš™οΈ Configuration

Global Configuration (config.yaml)

evaluators:
  - name: 'metrics'
    type: 'nlp_metrics'
    config:
      rouge:
        variants: ['rouge1', 'rouge2', 'rougeL']
        min_score: 0.4
        use_stemmer: true
      bleu:
        n_grams: [1, 2, 3, 4]
        min_score: 0.3
        smoothing: true
      meteor:
        min_score: 0.5
        alpha: 0.9
        beta: 3.0
        gamma: 0.5
    weight: 1.0
    enabled: true

Per-Test Configuration

@agent_test(criteria=['metrics'])
def test_with_custom_metrics():
    return {
        "input": "Summarize this article",
        "actual": agent_summary,
        "reference": gold_standard_summary,
        "metrics": {
            "rouge": {
                "variants": ["rouge1", "rougeL"],
                "min_score": 0.4
            },
            "bleu": {
                "min_score": 0.3,
                "smoothing": True
            }
        }
    }

📊 Available Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures n-gram and subsequence overlap between generated and reference texts, with an emphasis on recall (how much of the reference content the generated text covers).

ROUGE Variants

| Variant | Description | Best For |
| ------- | ----------- | -------- |
| rouge1 | Unigram (single-word) overlap | Basic content overlap |
| rouge2 | Bigram (two-word) overlap | Phrase-level quality |
| rougeL | Longest Common Subsequence | Sentence structure |
| rougeW | Weighted Longest Common Subsequence | Word-order importance |
| rougeS | Skip-bigram overlap | Non-consecutive word pairs |
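
To make these variants concrete, here is a minimal sketch using the rouge-score package (an assumed backend; the evaluator's internals may differ):

# Minimal ROUGE sketch, assuming the rouge-score package as backend.
from rouge_score import rouge_scorer

# use_stemmer mirrors the use_stemmer config option below.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# score() takes the reference first, then the candidate.
scores = scorer.score(reference, candidate)
for variant, result in scores.items():
    # Each result carries precision, recall, and F1 (fmeasure).
    print(variant, round(result.fmeasure, 2))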

Configuration Options

rouge:
  variants: ['rouge1', 'rouge2', 'rougeL']
  min_score: 0.4
  use_stemmer: true # Apply Porter stemming
  remove_stopwords: false # Remove common words
  alpha: 0.5 # F-score balance (precision vs recall)

Usage Example

@agent_test(criteria=['metrics'])
def test_summarization_rouge():
    """Test summarization using ROUGE metrics."""
    article = "Long article text..."
    summary = summarization_agent(article)
    reference = "Gold standard summary..."

    return {
        "input": article,
        "actual": summary,
        "reference": reference,
        "metrics": {
            "rouge": {
                "variants": ["rouge1", "rouge2", "rougeL"],
                "min_score": 0.4
            }
        }
    }

BLEU (Bilingual Evaluation Understudy)

BLEU measures precision-based n-gram overlap with a brevity penalty for overly short outputs; it was originally designed for machine translation.

Configuration Options

bleu:
  n_grams: [1, 2, 3, 4] # N-gram levels to evaluate
  min_score: 0.3
  smoothing: true # Apply smoothing for short texts
  auto_reweigh: true # Automatically adjust weights
  weights: [0.25, 0.25, 0.25, 0.25] # Custom n-gram weights
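
For reference, a minimal sketch of smoothed BLEU using NLTK (an assumed backend; the evaluator may use a different implementation):

# Minimal BLEU sketch, assuming NLTK as backend.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["hola", "como", "estas", "hoy"]        # tokenized reference
candidate = ["hola", "que", "tal", "estas", "hoy"]  # tokenized output

# Equal weights over 1- to 4-grams, matching the config above.
weights = (0.25, 0.25, 0.25, 0.25)

# Smoothing avoids zero scores when a higher-order n-gram has no match,
# which is common for short texts.
smoother = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smoother)
print(round(score, 3))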

Usage Example

@agent_test(criteria=['metrics'])
def test_translation_bleu():
    """Test translation quality using BLEU."""
    source = "Hello, how are you today?"
    translation = translation_agent(source, target_lang="es")
    reference = "Hola, ΒΏcΓ³mo estΓ‘s hoy?"

    return {
        "input": source,
        "actual": translation,
        "reference": reference,
        "metrics": {
            "bleu": {
                "n_grams": [1, 2, 3, 4],
                "min_score": 0.3,
                "smoothing": True
            }
        }
    }

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR matches words by exact form, stems, and synonyms and penalizes fragmented word order, giving a more semantically aware evaluation than pure n-gram overlap.

Configuration Options

meteor:
  min_score: 0.5
  alpha: 0.9 # Precision-recall balance (higher favors recall)
  beta: 3.0 # Fragmentation penalty exponent
  gamma: 0.5 # Fragmentation penalty weight
  wordnet: true # Use WordNet for synonyms
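
A minimal sketch using NLTK's meteor_score (an assumed backend; newer NLTK versions require pre-tokenized input):

# Minimal METEOR sketch, assuming NLTK as backend.
# One-time setup: nltk.download('wordnet') and nltk.download('punkt').
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

reference = word_tokenize("The quick brown fox jumps over the lazy dog")
hypothesis = word_tokenize("A fast brown fox leaps over a lazy dog")

# alpha, beta, and gamma correspond to the config options above.
score = meteor_score([reference], hypothesis, alpha=0.9, beta=3.0, gamma=0.5)
print(round(score, 3))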

Usage Example

@agent_test(criteria=['metrics'])
def test_paraphrasing_meteor():
    """Test paraphrasing quality using METEOR."""
    original = "The quick brown fox jumps over the lazy dog"
    paraphrase = paraphrasing_agent(original)

    return {
        "input": original,
        "actual": paraphrase,
        "reference": original,
        "metrics": {
            "meteor": {
                "min_score": 0.5,
                "alpha": 0.9
            }
        }
    }

💡 Advanced Usage

Multi-Reference Evaluation

@agent_test(criteria=['metrics'])
def test_multi_reference():
    """Evaluate against multiple reference texts."""

    references = [
        "First reference summary...",
        "Second reference summary...",
        "Third reference summary..."
    ]

    return {
        "input": source_text,
        "actual": agent_summary,
        "references": references,  # Multiple references
        "metrics": {
            "rouge": {
                "variants": ["rouge1", "rougeL"],
                "min_score": 0.4
            }
        }
    }
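
Multi-reference scoring is commonly implemented as the best score over all references; a hypothetical sketch of that logic:

# Hypothetical multi-reference scoring: best F1 across references.
from rouge_score import rouge_scorer

def best_rouge_f1(candidate, references, variant="rouge1"):
    scorer = rouge_scorer.RougeScorer([variant], use_stemmer=True)
    return max(scorer.score(ref, candidate)[variant].fmeasure for ref in references)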

Comparative Evaluation

@agent_test(criteria=['metrics'])
def test_model_comparison():
    """Compare multiple models using metrics."""

    summary_a = model_a.summarize(article)
    summary_b = model_b.summarize(article)

    return {
        "input": article,
        "actual": summary_a,
        "alternatives": [summary_b],
        "reference": gold_summary,
        "metrics": {
            "rouge": {"variants": ["rouge1", "rouge2", "rougeL"]},
            "bleu": {"n_grams": [1, 2, 3, 4]},
            "meteor": {"min_score": 0.5}
        }
    }

Domain-Specific Configuration

@agent_test(criteria=['metrics'])
def test_domain_specific():
    """Configure metrics for specific domains."""

    # For technical documentation
    if domain == "technical":
        config = {
            "rouge": {
                "variants": ["rouge1", "rougeL"],
                "min_score": 0.6,  # Higher threshold for technical accuracy
                "use_stemmer": False  # Preserve technical terms
            }
        }

    # For creative writing
    elif domain == "creative":
        config = {
            "meteor": {
                "min_score": 0.4,  # Lower threshold for creativity
                "alpha": 0.8,      # Favor recall over precision
                "wordnet": True    # Consider synonyms
            }
        }

    # Fallback so config is always defined
    else:
        config = {"rouge": {"variants": ["rouge1", "rougeL"], "min_score": 0.4}}

    return {
        "input": prompt,
        "actual": agent_output,
        "reference": reference_output,
        "metrics": config
    }

Batch Evaluation

@agent_test(criteria=['metrics'])
def test_batch_metrics():
    """Evaluate multiple text pairs efficiently."""

    test_pairs = [
        {"actual": summary1, "reference": ref1},
        {"actual": summary2, "reference": ref2},
        {"actual": summary3, "reference": ref3}
    ]

    return {
        "test_cases": test_pairs,
        "metrics": {
            "rouge": {"variants": ["rouge1", "rougeL"]},
            "bleu": {"n_grams": [1, 2, 3, 4]}
        }
    }
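
If you need the same scoring outside the evaluator, a simple batch loop over pairs looks like this (illustrative sketch; the helper name is hypothetical):

from rouge_score import rouge_scorer

def batch_rouge(pairs, variants=("rouge1", "rougeL")):
    """Score a list of {'actual': ..., 'reference': ...} pairs."""
    scorer = rouge_scorer.RougeScorer(list(variants), use_stemmer=True)
    results = []
    for pair in pairs:
        scores = scorer.score(pair["reference"], pair["actual"])
        results.append({v: scores[v].fmeasure for v in variants})
    return results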

📈 Metric Selection Guide

By Task Type

| Task Type | Recommended Metrics | Configuration |
| --------- | ------------------- | ------------- |
| Summarization | ROUGE-1, ROUGE-L | High recall focus |
| Translation | BLEU, METEOR | Precision and fluency |
| Paraphrasing | METEOR, ROUGE-L | Synonym awareness |
| Question Answering | ROUGE-1, BLEU-1 | Exact match and overlap |
| Content Generation | METEOR, ROUGE-2 | Creativity with coherence |

By Evaluation Focus

| Focus Area | Primary Metric | Secondary Metrics |
| ---------- | -------------- | ----------------- |
| Content Overlap | ROUGE-1 | ROUGE-2, BLEU-1 |
| Phrase Quality | ROUGE-2 | BLEU-2, BLEU-3 |
| Structure | ROUGE-L | METEOR |
| Fluency | BLEU | METEOR |
| Semantic Similarity | METEOR | ROUGE-L |

πŸ” Result Format

The metrics evaluator returns detailed scores:

{
    "passed": True,
    "score": 0.65,  # Overall combined score
    "threshold": 0.4,
    "details": {
        "rouge": {
            "rouge1": {
                "precision": 0.72,
                "recall": 0.68,
                "f1": 0.70
            },
            "rouge2": {
                "precision": 0.65,
                "recall": 0.61,
                "f1": 0.63
            },
            "rougeL": {
                "precision": 0.69,
                "recall": 0.64,
                "f1": 0.66
            }
        },
        "bleu": {
            "bleu1": 0.68,
            "bleu2": 0.45,
            "bleu3": 0.32,
            "bleu4": 0.24,
            "overall": 0.42
        },
        "meteor": {
            "score": 0.58,
            "precision": 0.72,
            "recall": 0.49,
            "fragmentation": 0.23
        }
    }
}

📋 Best Practices

Threshold Selection

Conservative (High Quality)

thresholds = {
    "rouge1": 0.6,
    "rouge2": 0.4,
    "rougeL": 0.5,
    "bleu": 0.4,
    "meteor": 0.6
}

Moderate (Balanced)

thresholds = {
    "rouge1": 0.4,
    "rouge2": 0.2,
    "rougeL": 0.3,
    "bleu": 0.2,
    "meteor": 0.4
}

Permissive (High Recall)

thresholds = {
    "rouge1": 0.3,
    "rouge2": 0.1,
    "rougeL": 0.2,
    "bleu": 0.1,
    "meteor": 0.3
}
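
A small helper (hypothetical, not part of the evaluator API) can apply one of these threshold profiles to the "details" block shown in the Result Format section above:

# Hypothetical helper: check per-metric scores against a threshold profile.
def passes_thresholds(details, thresholds):
    checks = {
        "rouge1": details["rouge"]["rouge1"]["f1"],
        "rouge2": details["rouge"]["rouge2"]["f1"],
        "rougeL": details["rouge"]["rougeL"]["f1"],
        "bleu": details["bleu"]["overall"],
        "meteor": details["meteor"]["score"],
    }
    return all(checks[name] >= minimum for name, minimum in thresholds.items())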

Preprocessing Recommendations

def preprocess_for_metrics(text):
    """Recommended preprocessing for metric evaluation."""
    import re

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    # Lowercase for case-insensitive comparison
    text = text.lower()

    # Strip punctuation except periods (optional)
    text = re.sub(r'[^\w\s.]', '', text)

    return text

@agent_test(criteria=['metrics'])
def test_with_preprocessing():
    actual_clean = preprocess_for_metrics(agent_output)
    reference_clean = preprocess_for_metrics(reference_text)

    return {
        "input": prompt,
        "actual": actual_clean,
        "reference": reference_clean,
        "metrics": {"rouge": {"variants": ["rouge1", "rougeL"]}}
    }

Combining with Other Evaluators

@agent_test(criteria=['metrics', 'similarity', 'llm_judge'])
def test_comprehensive_evaluation():
    """Combine metrics with other evaluators for complete assessment."""

    return {
        "input": prompt,
        "actual": agent_output,
        "expected": reference_output,  # For similarity
        "reference": reference_output,  # For metrics
        "evaluation_criteria": ["coherence", "fluency"],  # For LLM judge
        "metrics": {
            "rouge": {"variants": ["rouge1", "rougeL"]},
            "meteor": {"min_score": 0.4}
        }
    }

🚨 Troubleshooting

Common Issues

Low ROUGE Scores

  • Check for text preprocessing differences
  • Verify reference quality and relevance
  • Consider using ROUGE-L for structural similarity
  • Adjust stemming and stopword settings

BLEU Score Issues

  • Ensure adequate text length (BLEU requires sufficient n-grams)
  • Enable smoothing for short texts
  • Check for exact match requirements
  • Consider alternative metrics for creative tasks

METEOR Installation

  • Requires NLTK data: nltk.download('wordnet')
  • May need Java runtime for full functionality
  • Consider simplified configuration if installation fails
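
A one-time setup snippet for the NLTK data mentioned above ('punkt' is only needed if you also tokenize with NLTK):

import nltk

nltk.download('wordnet')   # synonym matching for METEOR
nltk.download('omw-1.4')   # WordNet data required by newer NLTK releases
nltk.download('punkt')     # tokenizer models, if you tokenize with NLTK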

Performance Optimization

# Cache preprocessing results
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_preprocess(text):
    return preprocess_for_metrics(text)

# Batch processing for efficiency
@agent_test(criteria=['metrics'])
def test_batch_optimized():
    return {
        "test_cases": large_test_set,
        "metrics": {
            "rouge": {"variants": ["rouge1"]},  # Limit variants for speed
            "batch_size": 50  # Process in batches
        }
    }