LLM Evaluation
Evaluating LLMs is challenging because their outputs are open-ended and often subjective. Here are the key metrics and methods used in practice.
Evaluation Metrics
Perplexity
Measures how well the model predicts text. Lower is better.
PPL = exp(average negative log-likelihood per token)
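A minimal sketch of the computation, assuming you already have per-token log-probabilities (natural log) from the model:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-likelihoods.

    token_logprobs: list of log P(token_i | tokens_<i), one entry per token.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
    return math.exp(avg_nll)

# Example: three tokens the model assigned probabilities 0.5, 0.25, 0.8
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.8)]))  # ~2.15
```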
BLEU Score
Compares generated text against reference translations via n-gram precision. Used for machine translation (MT) tasks.
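A short sketch using the sacrebleu package (an assumption; any BLEU implementation works), which reports corpus-level BLEU on a 0-100 scale:

```python
import sacrebleu  # pip install sacrebleu (assumed available)

hypotheses = ["the cat sat on the mat"]      # system outputs, one string per segment
references = [["the cat is on the mat"]]     # outer list = reference sets, inner = segments

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")            # corpus-level score, 0-100
```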
ROUGE Score
Measures n-gram and longest-common-subsequence overlap with reference summaries. Used for summarization.
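A quick sketch assuming the rouge-score package is available:

```python
from rouge_score import rouge_scorer  # pip install rouge-score (assumed available)

reference = "the president announced a new climate policy on monday"
summary = "a new climate policy was announced monday"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, summary)    # reference first, then the candidate summary
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```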
Human Evaluation
The gold standard: human raters score outputs on helpfulness, accuracy, and safety.
Benchmark Datasets
MMLU
Massive Multitask Language Understanding. Multiple-choice questions across 57 subjects, from elementary to professional level.
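MMLU-style benchmarks are scored as plain multiple-choice accuracy. A minimal sketch, where `model_answer` is a hypothetical callable that returns the model's chosen option letter:

```python
def accuracy(questions, model_answer):
    """Fraction of multiple-choice questions answered correctly.

    questions: list of dicts with "question", "choices", and "answer" (gold letter, e.g. "A").
    model_answer: callable (question, choices) -> predicted letter.
    """
    correct = sum(
        1 for q in questions
        if model_answer(q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

# Example with a dummy "model" that always answers "A":
qs = [{"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": "A"}]
print(accuracy(qs, lambda question, choices: "A"))  # 1.0
```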
HumanEval
Code generation benchmark. 164 programming problems, each checked by running unit tests and reported as pass@k.
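The sketch below uses the standard unbiased pass@k estimator from the HumanEval paper, computed from n generated samples per problem of which c pass the unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper).

    n: total samples generated for the problem
    c: samples that passed the unit tests
    k: sample budget the metric assumes
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples for a problem, 12 of which pass the tests
print(pass_at_k(n=200, c=12, k=1))  # = 0.06
```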
TruthfulQA
Tests whether models give truthful answers to questions designed to elicit common misconceptions.
MT-Bench
Multi-turn conversation benchmark that tests instruction following, typically scored by a strong LLM judge such as GPT-4.
LLM-as-Judge
Use a powerful LLM (such as GPT-4) to score other models' outputs against a rubric. A fast and scalable alternative to human evaluation.
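A rough sketch of a judge call, assuming the OpenAI Python client; the model name and rubric wording are illustrative, not prescribed:

```python
from openai import OpenAI  # pip install openai (assumed available); requires OPENAI_API_KEY

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a 1-10 scale, considering relevance, accuracy, coherence,
helpfulness, and safety. Reply with "Rating: <score>" followed by a short reason.

[Question]
{question}

[Answer]
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> str:
    """Return the judge model's raw verdict for one (question, answer) pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return response.choices[0].message.content
```

In practice the numeric score is parsed out of the judge's reply, and pairwise setups randomize the order of the two answers being compared to reduce position bias.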
Evaluation Criteria
- ✓ Relevance: Does it answer the question?
- ✓ Accuracy: Is the information correct?
- ✓ Coherence: Is it well-structured?
- ✓ Helpfulness: Is it useful to the user?
- ✓ Safety: No harmful/biased content?
Best Practice: Combine automated metrics with human evaluation for comprehensive assessment.