Feature Engineering

Feature engineering is the art of creating better input features for machine learning models. Good features can make the difference between a mediocre model and a great one.

Key Insight: Better features beat better algorithms. A simple model with great features often outperforms a complex model with poor features.

Feature Encoding

Convert categorical variables into numerical format that models can understand.

Label Encoding

Assign each category a unique integer.

['red', 'blue', 'green'] → [0, 1, 2]

⚠️ Use for ordinal data only: the integer codes imply an order the model may exploit
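
A minimal sketch in pandas, using an explicit mapping so the integer codes match the intended order (the size categories here are illustrative):

```python
import pandas as pd

# An explicit mapping preserves the intended order: small < medium < large.
# Generic encoders (e.g. scikit-learn's LabelEncoder) assign integers
# alphabetically, which may not match the real ordinal ranking.
order = {'small': 0, 'medium': 1, 'large': 2}
sizes = pd.Series(['small', 'medium', 'large', 'medium'])
print(sizes.map(order).tolist())  # [0, 1, 2, 1]
```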

One-Hot Encoding

Create a binary column for each category.

'red' → [1, 0, 0]
'blue' → [0, 1, 0]
'green' → [0, 0, 1]

✓ No ordinal assumption, works for nominal data
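
A minimal sketch with pandas.get_dummies (scikit-learn's OneHotEncoder does the same job with fit/transform semantics):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# One binary indicator column per category:
# color_blue, color_green, color_red
print(pd.get_dummies(df, columns=['color']))
```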

Target Encoding

Replace each category with the mean of the target variable for that category.

'NYC' → average salary in NYC
'LA' → average salary in LA

✓ Captures relationship with target, ⚠️ risk of overfitting
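
A minimal sketch using a pandas groupby; the city/salary numbers are made up for illustration, and production code usually adds smoothing or out-of-fold encoding to reduce overfitting:

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['NYC', 'NYC', 'LA', 'LA', 'LA'],
    'salary': [120_000, 140_000, 100_000, 110_000, 90_000],
})

# Mean target value per category. In practice, compute these means on
# the training split only, to limit leakage and overfitting.
city_means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(city_means)
print(df)
```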

Feature Scaling

Normalize features to similar ranges so no single feature dominates.

Standardization (Z-score)

Mean = 0, Std = 1

x_scaled = (x - mean) / std

Use for: most ML algorithms, especially when features are roughly normally distributed
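
A minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # applies (x - mean) / std per column
print(X_scaled.ravel())             # mean 0, standard deviation 1
```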

Normalization (Min-Max)

Scale to [0, 1] range

x_scaled = (x - min) / (max - min)

Use for: neural networks, or when you need values bounded to a fixed range
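
A minimal sketch with scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])
scaler = MinMaxScaler()  # defaults to the [0, 1] range
print(scaler.fit_transform(X).ravel())  # [0.  0.5 1. ]
```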

Feature Creation

Create new features from existing ones to capture relationships the raw inputs don't express directly (a combined sketch follows the list below).

Polynomial Features: x², x³, x₁×x₂ for non-linear relationships
Binning: group continuous values into categories
Date Features: extract day, month, year, day of week
Text Features: length, word count, TF-IDF
Aggregations: sum, mean, max by group
Ratios: price/sqft, clicks/impressions
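
A combined sketch of a few of these in pandas (the housing columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date':  pd.to_datetime(['2024-01-15', '2024-06-30']),
    'price': [300_000, 450_000],
    'sqft':  [1_500, 2_000],
})

# Date features
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Ratio feature
df['price_per_sqft'] = df['price'] / df['sqft']

# Binning a continuous value into categories
df['size_bin'] = pd.cut(df['sqft'], bins=[0, 1_000, 1_800, float('inf')],
                        labels=['small', 'medium', 'large'])
print(df)
```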

Feature Selection

Choose the most important features to reduce dimensionality and improve performance.

Filter Methods

Select features using statistical measures computed independently of any model (correlation, chi-square, mutual information)
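
A minimal sketch using SelectKBest with mutual information on scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask: which 2 of the 4 features were kept
print(X_selected.shape)        # (150, 2)
```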

Wrapper Methods

Search feature subsets using model performance (forward selection, backward elimination, recursive feature elimination)
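
A minimal sketch of recursive feature elimination (RFE) wrapping a logistic regression:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # True for the features that survived elimination
print(rfe.ranking_)  # 1 = selected; higher ranks were eliminated earlier
```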

Embedded Methods

Feature selection happens as part of model training (Lasso's L1 penalty, tree-based feature importance). Note that Ridge shrinks coefficients but does not zero them out, so it does not select features.
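
A minimal sketch using Lasso's L1 penalty on scikit-learn's built-in diabetes data; coefficients driven to exactly zero are effectively deselected:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zero coefficients mark features the model dropped
```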

Best Practices

Understand your data before engineering features
Create features based on domain knowledge
Avoid data leakage (don't use future information)
Scale features after the train/test split: fit the scaler on training data only, then transform the test data (see the sketch after this list)
Document feature engineering steps
Test feature importance and remove low-value features
Iterate: create, test, refine
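
A minimal sketch of leakage-free scaling, fitting on the training split only (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; no peeking at test data
```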

Key Takeaway: Feature engineering is both art and science. Combine domain expertise with experimentation to create features that help your model learn better.