Data Cleaning

Real-world data is messy. Data cleaning is the process of detecting and correcting errors, handling missing values, and preparing data for analysis. It is often said to account for as much as 80% of the work in a data science project.

Remember: Garbage in, garbage out. Clean data is essential for accurate models.

Common Data Issues

• Missing Values: NULL, NaN, empty strings
• Outliers: Extreme values that skew analysis
• Duplicates: Repeated rows or entries
• Inconsistent Format: Mixed date formats, capitalization
• Invalid Data: Negative ages, impossible dates
• Wrong Data Types: Numbers stored as strings

Handling Missing Values

1. Remove (Deletion)

Drop rows or columns with missing values.

✓ Simple and fast
✗ Loses information, reduces dataset size

Use when: Few missing values, large dataset
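
As a minimal sketch of deletion, the snippet below drops every row that has at least one missing field. It uses only the Python standard library. The sample records, the `is_missing` helper, and the decision to treat whitespace-only strings as missing are assumptions made for illustration.

```python
import math

def is_missing(value):
    """Treat None, NaN, and empty or whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def drop_incomplete_rows(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]

# Hypothetical sample records, made up for illustration.
rows = [
    {"name": "Ana", "age": 34, "city": "Lisbon"},
    {"name": "Ben", "age": None, "city": "Oslo"},
    {"name": "Cy", "age": 29, "city": ""},
    {"name": "Dee", "age": float("nan"), "city": "Rome"},
    {"name": "Eli", "age": 41, "city": "Kyiv"},
]

complete = drop_incomplete_rows(rows)
print(f"kept {len(complete)} of {len(rows)} rows")  # kept 2 of 5 rows
```

Note how quickly the data shrinks: three of five rows are lost here, which is why deletion suits datasets where missing values are rare.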

2. Imputation (Fill)

Replace missing values with estimates.

• Mean/Median (numerical), Mode (categorical)
• Forward/Backward fill (time series)
• Model-based imputation (KNN, regression)

Use when: Many missing values, can't afford to lose data
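
Two of the simpler imputation strategies can be sketched in plain Python: median fill for a numerical column and forward fill for a time series. The function names and sample values are hypothetical, and missing entries are assumed to be represented as `None`.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value has been seen
    return filled

ages = [34, None, 29, 41, None]             # hypothetical numerical column
readings = [None, 20.5, None, None, 21.0]   # hypothetical time series

ages_filled = impute_median(ages)
readings_filled = forward_fill(readings)

print(ages_filled)      # [34, 34, 29, 41, 34]
print(readings_filled)  # [None, 20.5, 20.5, 20.5, 21.0]
```

Forward fill cannot fill a gap at the very start of a series, so the leading `None` survives. A backward fill, or a separate rule for the first value, would be needed to cover it.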

3. Flag as Missing

Create an indicator variable for missingness.

Add column: is_missing = True/False

Use when: Missingness itself is informative
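
A rough sketch of the flagging approach: the original value is left untouched and a boolean column records whether it was missing. The `income` column and the `<column>_is_missing` naming scheme are illustrative assumptions, as is treating an absent key the same as `None`.

```python
def add_missing_flag(rows, column):
    """Return copies of the rows with a boolean '<column>_is_missing' field."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw rows are not modified
        new_row[f"{column}_is_missing"] = row.get(column) is None
        flagged.append(new_row)
    return flagged

# Hypothetical records: the third one has no "income" key at all.
rows = [{"id": 1, "income": 52000}, {"id": 2, "income": None}, {"id": 3}]

flagged = add_missing_flag(rows, "income")
print([r["income_is_missing"] for r in flagged])  # [False, True, True]
```

The flag can be combined with imputation: fill the value, keep the indicator, and let the model learn whether missingness carries signal.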

Outlier Detection

Outliers can skew analysis and degrade model performance. Detect them first, then decide how to handle them.

Statistical Methods

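Z-score: Values more than 3 standard deviations from the mean
IQR: Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
Modified Z-score: Uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, so the score itself is robust to outliers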
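
The three statistical rules above can be sketched with the standard `statistics` module. The sample data is made up. The modified Z-score constant 0.6745 and cutoff 3.5 follow a common convention and are not taken from this lesson. The "inclusive" quartile method is one of several reasonable choices.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def modified_zscore_outliers(values, threshold=3.5):
    """Like the Z-score, but built on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]  # hypothetical sample

print(zscore_outliers(data))           # []
print(iqr_outliers(data))              # [95]
print(modified_zscore_outliers(data))  # [95]
```

On this small sample the plain Z-score misses the obvious outlier: the value 95 inflates the standard deviation so much that its own score stays just under 3. The IQR and modified Z-score rules are built on quartiles and medians, which the extreme value barely moves, so both flag it.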

Visual Methods

Box plots: Show quartiles and outliers
Scatter plots: Identify unusual patterns
Histograms: Spot extreme values
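
Plotting libraries are the usual tool for these charts. As a dependency-free sketch of the histogram idea, the snippet below prints a text histogram with equal-width bins. The function name, bin count, and sample data are assumptions for illustration.

```python
def text_histogram(values, bins=5, width=40):
    """Return one line per equal-width bin, each with a bar of '#' marks."""
    low, high = min(values), max(values)
    step = (high - low) / bins or 1  # fall back to 1 if all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / step), bins - 1)  # maximum goes in last bin
        counts[index] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        left = low + i * step
        bar = "#" * round(width * count / peak)
        lines.append(f"{left:8.1f} to {left + step:8.1f} | {bar} {count}")
    return lines

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]  # hypothetical sample

for line in text_histogram(data):
    print(line)
```

With this sample, nine values land in the first bin, the middle bins are empty, and a single value sits alone in the last bin. That isolated bar is the visual signature of an extreme value.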

Data Cleaning Checklist

✓ Check for missing values and decide on strategy
✓ Identify and handle outliers appropriately
✓ Remove duplicate rows
✓ Standardize formats (dates, text, categories)
✓ Validate data types and convert if needed
✓ Check for logical inconsistencies
✓ Handle special characters and encoding issues
✓ Document all cleaning steps for reproducibility
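
Several checklist items can be combined into one small pass: standardizing text and dates, converting types, checking a logical constraint, and removing duplicates. This is a sketch under stated assumptions, not a general-purpose cleaner. The field names, the accepted date formats, and the 0 to 120 age range are all hypothetical.

```python
from datetime import datetime

# Assumed input formats. Day-first is an assumption; real data may be ambiguous.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def parse_date(text):
    """Try each assumed format; return an ISO date string, or None on failure."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_record(raw):
    """Standardize text and dates, convert age to int, and validate its range."""
    try:
        age = int(raw["age"])
    except (TypeError, ValueError):
        age = None  # not a number: treat as missing
    if age is not None and not 0 <= age <= 120:  # assumed plausible range
        age = None
    return {
        "name": raw["name"].strip().title(),
        "age": age,
        "signup": parse_date(raw["signup"]),
    }

def clean(rows):
    """Clean every row, then drop rows that are duplicates after cleaning."""
    seen, result = set(), []
    for raw in rows:
        record = clean_record(raw)
        key = tuple(record.items())
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result

raw_rows = [  # hypothetical raw input
    {"name": " alice ", "age": "34", "signup": "2023-01-15"},
    {"name": "ALICE", "age": "34", "signup": "15/01/2023"},
    {"name": "bob", "age": "-5", "signup": "03.03.2023"},
    {"name": "carol", "age": "n/a", "signup": "2023-13-40"},
]

cleaned = clean(raw_rows)
for record in cleaned:
    print(record)
```

The two "alice" rows only become identical after standardization, so duplicates are removed last. Invalid values are set to `None` rather than guessed, which leaves the choice of missing-value strategy open.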

Best Practice: Always keep a copy of raw data. Document every cleaning step. Clean data incrementally and validate at each stage.