Data Cleaning

Real-world data is messy. Data cleaning is the process of detecting and correcting errors, handling missing values, and preparing data for analysis. It is often said to take up as much as 80% of the work in a data science project.

Remember: Garbage in, garbage out. Clean data is essential for accurate models.

Common Data Issues

• Missing Values: NULL, NaN, empty strings
• Outliers: Extreme values that skew analysis
• Duplicates: Repeated rows or entries
• Inconsistent Format: Mixed date formats, capitalization
• Invalid Data: Negative ages, impossible dates
• Wrong Data Types: Numbers stored as strings

Handling Missing Values

1. Remove (Deletion)

Drop rows or columns with missing values.

✓ Simple and fast
✗ Loses information, reduces dataset size

Use when: Few missing values, large dataset
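As a rough illustration of deletion, the sketch below uses pandas (an assumption; the text does not name a library) on a small made-up table. The column names, the data, and the "more than half missing" cutoff are all hypothetical choices for the example.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Oslo", "Lima", None, "Pune"],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

# Drop columns where more than half of the values are missing.
# `thresh` is the minimum number of non-null values a column needs to be kept.
min_non_null = math.ceil(len(df) / 2)
fewer_cols = df.dropna(axis="columns", thresh=min_non_null)

# Then drop every remaining row that still contains a missing value.
complete_rows = fewer_cols.dropna(axis="index", how="any")

print(complete_rows)
```

Dropping the mostly empty column first keeps more rows than dropping rows straight away would: here two complete rows survive instead of one.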

2. Imputation (Fill)

Replace missing values with estimates.

• Mean/Median (numerical), Mode (categorical)
• Forward/Backward fill (time series)
• Model-based imputation (KNN, regression)

Use when: Many missing values, can't afford to lose data
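The simpler imputation options above might look like this in pandas (an assumed library choice). The table and its column names are invented for the example, and model-based imputation is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 300.0],
    "segment": ["a", "b", None, "b"],
    "sensor": [1.0, np.nan, np.nan, 4.0],
})

imputed = df.copy()

# Numerical column: the median is less affected by extreme values than the mean.
imputed["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with the most frequent value (the mode).
imputed["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])

# Ordered (time series) column: carry the last known value forward.
imputed["sensor"] = df["sensor"].ffill()

print(imputed)
```

Working on a copy keeps the original values available, so the filled table can be compared against the unfilled one.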

3. Flag as Missing

Create an indicator variable for missingness.

Add column: is_missing = True/False

Use when: Missingness itself is informative
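A minimal sketch of flagging, again assuming pandas. The `income` column and the `income_is_missing` name are hypothetical; the point is that the flag has to be created before any filling happens.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan]})

# Record which rows were missing BEFORE imputing; afterwards the
# information can no longer be recovered from the column itself.
df["income_is_missing"] = df["income"].isna()

# The flag can be combined with imputation so that no rows are lost.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```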

Outlier Detection

Outliers can skew analysis and hurt model performance. Detect them before deciding how to handle them.

Statistical Methods

  • Z-score: Values more than 3 standard deviations from the mean
  • IQR: Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
  • Modified Z-score: Based on the median and median absolute deviation (MAD), so it is robust to outliers
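The three statistical rules can be sketched with Python's standard `statistics` module alone. The function names and sample values are made up for illustration. The 0.6745 constant and the 3.5 cutoff in the modified Z-score are common conventions rather than anything stated in the text, and the quartiles use the `inclusive` interpolation method, which is one of several definitions.

```python
import statistics

# Hypothetical sample; 95 is the planted outlier.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]


def zscore_outliers(data, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    if sd == 0:
        return []
    return [x for x in data if abs(x - mean) / sd > threshold]


def iqr_outliers(data, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def modified_zscore_outliers(data, threshold=3.5):
    """Z-score variant built on the median and median absolute deviation."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    if mad == 0:
        return []  # the score is undefined when more than half the values are identical
    return [x for x in data if abs(0.6745 * (x - med) / mad) > threshold]


print("z-score:", zscore_outliers(values))
print("IQR:", iqr_outliers(values))
print("modified z-score:", modified_zscore_outliers(values))
```

On this ten-point sample the plain Z-score rule flags nothing: the value 95 inflates the standard deviation it is measured against, which leaves it under the 3.0 threshold. The IQR and modified Z-score rules both flag it, which is the practical meaning of "robust" here.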

Visual Methods

  • Box plots: Show quartiles and outliers
  • Scatter plots: Identify unusual patterns
  • Histograms: Spot extreme values

Data Cleaning Checklist

• Check for missing values and decide on strategy
• Identify and handle outliers appropriately
• Remove duplicate rows
• Standardize formats (dates, text, categories)
• Validate data types and convert if needed
• Check for logical inconsistencies
• Handle special characters and encoding issues
• Document all cleaning steps for reproducibility
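Several checklist items can be chained into one small pass, sketched here with pandas (an assumed library choice). The table, its column names, and the 0 to 120 age range are hypothetical. The sketch covers format standardization, type conversion, duplicate removal, and one logical check.

```python
import pandas as pd

# Hypothetical raw data: stray whitespace, mixed capitalization,
# numbers stored as strings, an unparseable date, and a negative age.
raw = pd.DataFrame({
    "name": [" Alice", "alice ", "Bob", "Cara"],
    "age": ["34", "34", "-5", "41"],
    "signup": ["2023-01-15", "2023-01-15", "2023-02-01", "not a date"],
})

clean = raw.copy()  # leave the raw data untouched

# Standardize text formats.
clean["name"] = clean["name"].str.strip().str.title()

# Convert data types; values that cannot be parsed become NaN / NaT.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
clean["signup"] = pd.to_datetime(clean["signup"], format="%Y-%m-%d", errors="coerce")

# Remove duplicates after standardizing, so " Alice" and "alice " collapse.
clean = clean.drop_duplicates().reset_index(drop=True)

# Logical check: flag ages outside an assumed plausible range.
invalid_age = ~clean["age"].between(0, 120)
flagged = clean[invalid_age]

print(clean)
print(flagged)
```

The order matters: removing duplicates before standardizing the names would leave both Alice rows in place, because the raw strings differ.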

Best Practice: Always keep a copy of raw data. Document every cleaning step. Clean data incrementally and validate at each stage.