Data Cleaning
Real-world data is messy. Data cleaning is the process of detecting and correcting errors, handling missing values, and preparing data for analysis. It is often cited as the most time-consuming part of a data science project, with 80% of the work being a commonly quoted figure.
Remember: Garbage in, garbage out. Clean data is essential for accurate models.
Common Data Issues
Missing Values
NULL, NaN, empty strings
Outliers
Extreme values that skew analysis
Duplicates
Repeated rows or entries
Inconsistent Format
Mixed date formats, capitalization
Invalid Data
Negative ages, impossible dates
Wrong Data Types
Numbers stored as strings
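A quick audit pass can surface several of these issues before any fixing starts. The sketch below uses only the Python standard library; the sample records, the field names (`name`, `age`, `signup`), and the `audit` helper are made up for illustration.

```python
from datetime import datetime

# Hypothetical sample records showing several of the issues listed above.
records = [
    {"name": "Alice", "age": "34", "signup": "2023-01-15"},
    {"name": "alice", "age": "34", "signup": "2023-01-15"},  # duplicate apart from capitalization
    {"name": "Bob", "age": "-5", "signup": "2023-02-30"},    # negative age, impossible date
    {"name": "Cara", "age": "", "signup": "2023-03-01"},     # missing age (empty string)
]

def audit(rows):
    """Return a list of (row_index, issue) pairs found in rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Inconsistent format: normalize capitalization before comparing rows.
        key = (row["name"].strip().lower(), row["age"], row["signup"])
        if key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)

        # Missing value and wrong data type: age arrives as a string.
        if row["age"] == "":
            issues.append((i, "missing age"))
        elif int(row["age"]) < 0:
            issues.append((i, "invalid age"))

        # Invalid data: strptime rejects dates that do not exist.
        try:
            datetime.strptime(row["signup"], "%Y-%m-%d")
        except ValueError:
            issues.append((i, "invalid date"))
    return issues

found = audit(records)
print(found)
# [(1, 'duplicate'), (2, 'invalid age'), (2, 'invalid date'), (3, 'missing age')]
```

The audit only reports problems; deciding whether to drop, fix, or flag each one is a separate step.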
Handling Missing Values
1. Remove (Deletion)
Drop rows or columns with missing values.
✗ Loses information, reduces dataset size
Use when: Few missing values, large dataset
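As a minimal sketch of row deletion, assuming the data is a list of dictionaries and that `None`, NaN, and empty strings all count as missing (the sample rows and the helper names `is_missing` and `drop_missing` are hypothetical):

```python
import math

# Hypothetical sample rows; None, NaN, and "" all stand for a missing value.
rows = [
    {"id": 1, "temp": 21.5, "city": "Oslo"},
    {"id": 2, "temp": None, "city": "Lima"},
    {"id": 3, "temp": float("nan"), "city": "Pune"},
    {"id": 4, "temp": 19.0, "city": ""},
]

def is_missing(value):
    """Treat None, NaN, and empty strings as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def drop_missing(data):
    """Keep only rows where every field is present."""
    return [row for row in data if not any(is_missing(v) for v in row.values())]

complete = drop_missing(rows)
lost = len(rows) - len(complete)
print(len(complete), lost)  # 1 3
```

Here three of four rows are dropped, which shows why deletion suits datasets with few gaps. In pandas, `DataFrame.dropna()` does the same job for `None` and NaN, though it does not treat empty strings as missing.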
2. Imputation (Fill)
Replace missing values with estimates.
• Forward/Backward fill (time series)
• Model-based imputation (KNN, regression)
Use when: Many missing values, can't afford to lose data
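Forward and backward fill can be sketched in a few lines of plain Python, assuming an ordered series where `None` marks a gap. The `readings` list and function names are illustrative; model-based imputation is not covered here.

```python
def forward_fill(values):
    """Replace each None with the most recent earlier observation."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Replace each None with the next later observation."""
    return forward_fill(values[::-1])[::-1]

readings = [None, 10.0, None, None, 13.0, None]

ffilled = forward_fill(readings)   # the leading gap has nothing before it, so it stays None
bfilled = backward_fill(readings)  # the trailing gap has nothing after it, so it stays None
both = backward_fill(ffilled)      # forward fill first, then backward fill the leading gap

print(ffilled)  # [None, 10.0, 10.0, 10.0, 13.0, 13.0]
print(bfilled)  # [10.0, 10.0, 13.0, 13.0, 13.0, None]
print(both)     # [10.0, 10.0, 10.0, 10.0, 13.0, 13.0]
```

Both fills assume neighboring observations are similar, which is why they are usually reserved for time series. The pandas equivalents are `Series.ffill()` and `Series.bfill()`.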
3. Flag as Missing
Create an indicator variable that records whether the value was missing.
Use when: Missingness itself is informative
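One way to flag missingness, sketched under the assumption that rows are dictionaries and `None` marks a gap. The `income` column, the `_missing` suffix, and the fill value of 0 are illustrative choices, not a prescribed convention.

```python
# Hypothetical sample rows with one missing income.
rows = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 3, "income": 61000},
]

def add_missing_flag(data, column, fill_value):
    """Return new rows with a 0/1 indicator column and the gap filled."""
    flagged = []
    for row in data:
        new_row = dict(row)  # copy so the input rows stay untouched
        missing = row[column] is None
        new_row[column + "_missing"] = int(missing)
        if missing:
            new_row[column] = fill_value
        flagged.append(new_row)
    return flagged

flagged = add_missing_flag(rows, "income", fill_value=0)
print(flagged[1])  # {'id': 2, 'income': 0, 'income_missing': 1}
```

Keeping the indicator next to the filled value lets a model tell a real 0 apart from a filled-in gap.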
Outlier Detection
Outliers can skew analysis and degrade model performance. Detect them first, then decide how to handle them.
Statistical Methods
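- Z-score: Values more than 3 standard deviations from the mean
- IQR: Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
- Modified Z-score: Uses the median and median absolute deviation instead of the mean and standard deviation, so extreme values distort it less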
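The three statistical methods can be compared side by side with the standard library's `statistics` module. This is a sketch under stated assumptions: the sample data is made up, the z-score uses the population standard deviation, and the modified z-score uses the conventional 0.6745 constant with a 3.5 cutoff, neither of which comes from the text above.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    # 0.6745 and the 3.5 cutoff are common conventions, assumed here.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]  # hypothetical sample

print(zscore_outliers(data))           # []
print(iqr_outliers(data))              # [95]
print(modified_zscore_outliers(data))  # [95]
```

In this small sample the plain z-score flags nothing: the value 95 inflates both the mean and the standard deviation, so its own z-score lands just under 3. The IQR and modified z-score methods rely on quartiles and medians, which the extreme value barely moves, so both flag it.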
Visual Methods
- Box plots: Show quartiles and outliers
- Scatter plots: Identify unusual patterns
- Histograms: Spot extreme values
Data Cleaning Checklist
Best Practice: Always keep a copy of raw data. Document every cleaning step. Clean data incrementally and validate at each stage.
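A minimal sketch of that practice, assuming each cleaning step is a function paired with a validation check. The `run_pipeline` helper, the step names, and the sample data are hypothetical.

```python
import copy

def run_pipeline(raw, steps):
    """Apply cleaning steps to a copy of raw, validating and logging each one."""
    data = copy.deepcopy(raw)  # the raw input is never modified
    log = []
    for name, step, check in steps:
        before = len(data)
        data = step(data)
        if not check(data):
            raise ValueError(f"validation failed after step: {name}")
        log.append(f"{name}: {before} -> {len(data)} rows")
    return data, log

raw = [{"age": "34"}, {"age": "-5"}, {"age": "34"}, {"age": "41"}]

# Each entry is (description, cleaning function, validation function).
steps = [
    ("cast age to int",
     lambda rows: [{**r, "age": int(r["age"])} for r in rows],
     lambda rows: all(isinstance(r["age"], int) for r in rows)),
    ("drop negative ages",
     lambda rows: [r for r in rows if r["age"] >= 0],
     lambda rows: all(r["age"] >= 0 for r in rows)),
]

clean, log = run_pipeline(raw, steps)
print(log)
# ['cast age to int: 4 -> 4 rows', 'drop negative ages: 4 -> 3 rows']
```

The log doubles as documentation of what was done, and the untouched `raw` list makes it possible to rerun the pipeline after changing a step.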