LLM Evaluation
Evaluating LLMs is challenging because their outputs are open-ended and often subjective. Here are the key metrics and methods used in practice.
Evaluation Metrics
Perplexity
Measures how well the model predicts text. Lower is better.
PPL = exp(average negative log-likelihood per token)
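A minimal sketch of the computation, assuming you already have per-token log-probabilities (natural log) from the model:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-likelihoods.

    token_logprobs: list of log P(token_i | tokens_<i), one entry per token.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
    return math.exp(avg_nll)

# Example: three tokens the model assigned probabilities 0.5, 0.25, 0.8
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.8)]))  # ~2.15
```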
BLEU Score
Compares generated text against reference translations via n-gram precision. Used for machine translation (MT) tasks.
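A short sketch using the sacrebleu package (an assumption; any BLEU implementation works), which reports corpus-level BLEU on a 0-100 scale:

```python
import sacrebleu  # pip install sacrebleu (assumed available)

hypotheses = ["the cat sat on the mat"]      # system outputs, one string per segment
references = [["the cat is on the mat"]]     # outer list = reference sets, inner = segments

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")            # corpus-level score, 0-100
```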
ROUGE Score
Measures n-gram and longest-common-subsequence overlap with reference summaries. Used for summarization.
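A quick sketch assuming the rouge-score package is available:

```python
from rouge_score import rouge_scorer  # pip install rouge-score (assumed available)

reference = "the president announced a new climate policy on monday"
summary = "a new climate policy was announced monday"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, summary)    # reference first, then the candidate summary
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```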
Human Evaluation
The gold standard: human raters score outputs on helpfulness, accuracy, and safety.
Benchmark Datasets
MMLU
Massive Multitask Language Understanding. Multiple-choice questions across 57 subjects, from elementary to professional level.
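MMLU-style benchmarks are scored as plain multiple-choice accuracy. A minimal sketch, where `model_answer` is a hypothetical callable that returns the model's chosen option letter:

```python
def accuracy(questions, model_answer):
    """Fraction of multiple-choice questions answered correctly.

    questions: list of dicts with "question", "choices", and "answer" (gold letter, e.g. "A").
    model_answer: callable (question, choices) -> predicted letter.
    """
    correct = sum(
        1 for q in questions
        if model_answer(q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

# Example with a dummy "model" that always answers "A":
qs = [{"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": "A"}]
print(accuracy(qs, lambda question, choices: "A"))  # 1.0
```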
HumanEval
Code generation benchmark. 164 programming problems, each checked by running unit tests and reported as pass@k.
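The sketch below uses the standard unbiased pass@k estimator from the HumanEval paper, computed from n generated samples per problem of which c pass the unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper).

    n: total samples generated for the problem
    c: samples that passed the unit tests
    k: sample budget the metric assumes
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples for a problem, 12 of which pass the tests
print(pass_at_k(n=200, c=12, k=1))  # = 0.06
```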
TruthfulQA
Tests whether models give truthful answers to questions designed to elicit common misconceptions.
MT-Bench
Multi-turn conversation benchmark that tests instruction following, typically scored by a strong LLM judge such as GPT-4.
LLM-as-Judge
Use a powerful LLM (such as GPT-4) to score other models' outputs against a rubric. A fast and scalable alternative to human evaluation.
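A rough sketch of a judge call, assuming the OpenAI Python client; the model name and rubric wording are illustrative, not prescribed:

```python
from openai import OpenAI  # pip install openai (assumed available); requires OPENAI_API_KEY

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a 1-10 scale, considering relevance, accuracy, coherence,
helpfulness, and safety. Reply with "Rating: <score>" followed by a short reason.

[Question]
{question}

[Answer]
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> str:
    """Return the judge model's raw verdict for one (question, answer) pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return response.choices[0].message.content
```

In practice the numeric score is parsed out of the judge's reply, and pairwise setups randomize the order of the two answers being compared to reduce position bias.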
Evaluation Criteria
- ✓ Relevance: Does it answer the question?
- ✓ Accuracy: Is the information correct?
- ✓ Coherence: Is it well-structured?
- ✓ Helpfulness: Is it useful to the user?
- ✓ Safety: No harmful/biased content?
Best Practice: Combine automated metrics with human evaluation for comprehensive assessment.