Feature Engineering

Feature engineering is the art of creating better input features for machine learning models. Good features can make the difference between a mediocre model and a great one.

Key Insight: Better features beat better algorithms. A simple model with great features often outperforms a complex model with poor features.

Feature Encoding

Convert categorical variables into numerical format that models can understand.

Label Encoding

Assign each category a unique integer.

['red', 'blue', 'green'] → [0, 1, 2]

⚠️ Use for ordinal data only: the integer codes imply an order the model may exploit
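
A minimal sketch in pandas, using an explicit mapping so the integer codes match the intended order (the size categories here are illustrative):

```python
import pandas as pd

# An explicit mapping preserves the intended order: small < medium < large.
# Generic encoders (e.g. scikit-learn's LabelEncoder) assign integers
# alphabetically, which may not match the real ordinal ranking.
order = {'small': 0, 'medium': 1, 'large': 2}
sizes = pd.Series(['small', 'medium', 'large', 'medium'])
print(sizes.map(order).tolist())  # [0, 1, 2, 1]
```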

One-Hot Encoding

Create a binary column for each category.

'red' → [1, 0, 0]
'blue' → [0, 1, 0]
'green' → [0, 0, 1]

✓ No ordinal assumption, works for nominal data
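
A minimal sketch with pandas.get_dummies (scikit-learn's OneHotEncoder does the same job with fit/transform semantics):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# One binary indicator column per category:
# color_blue, color_green, color_red
print(pd.get_dummies(df, columns=['color']))
```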

Target Encoding

Replace each category with the mean of the target variable for that category.

'NYC' → average salary in NYC
'LA' → average salary in LA

✓ Captures relationship with target, ⚠️ risk of overfitting
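
A minimal sketch using a pandas groupby; the city/salary numbers are made up for illustration, and production code usually adds smoothing or out-of-fold encoding to reduce overfitting:

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['NYC', 'NYC', 'LA', 'LA', 'LA'],
    'salary': [120_000, 140_000, 100_000, 110_000, 90_000],
})

# Mean target value per category. In practice, compute these means on
# the training split only, to limit leakage and overfitting.
city_means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(city_means)
print(df)
```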

Feature Scaling

Normalize features to similar ranges so no single feature dominates.

Standardization (Z-score)

Mean = 0, Std = 1

x_scaled = (x - mean) / std

Use for: most ML algorithms, especially when features are roughly normally distributed
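
A minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # applies (x - mean) / std per column
print(X_scaled.ravel())             # mean 0, standard deviation 1
```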

Normalization (Min-Max)

Scale to [0, 1] range

x_scaled = (x - min) / (max - min)

Use for: neural networks, or when you need values bounded to a fixed range
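
A minimal sketch with scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])
scaler = MinMaxScaler()  # defaults to the [0, 1] range
print(scaler.fit_transform(X).ravel())  # [0.  0.5 1. ]
```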

Feature Creation

Create new features from existing ones to capture relationships the raw inputs don't express directly (a combined sketch follows the list below).

Polynomial Features: x², x³, x₁×x₂ for non-linear relationships
Binning: group continuous values into categories
Date Features: extract day, month, year, day of week
Text Features: length, word count, TF-IDF
Aggregations: sum, mean, max by group
Ratios: price/sqft, clicks/impressions
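
A combined sketch of a few of these in pandas (the housing columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date':  pd.to_datetime(['2024-01-15', '2024-06-30']),
    'price': [300_000, 450_000],
    'sqft':  [1_500, 2_000],
})

# Date features
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Ratio feature
df['price_per_sqft'] = df['price'] / df['sqft']

# Binning a continuous value into categories
df['size_bin'] = pd.cut(df['sqft'], bins=[0, 1_000, 1_800, float('inf')],
                        labels=['small', 'medium', 'large'])
print(df)
```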

Feature Selection

Choose the most important features to reduce dimensionality and improve performance.

Filter Methods

Select features using statistical measures computed independently of any model (correlation, chi-square, mutual information)
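
A minimal sketch using SelectKBest with mutual information on scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask: which 2 of the 4 features were kept
print(X_selected.shape)        # (150, 2)
```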

Wrapper Methods

Search feature subsets using model performance (forward selection, backward elimination, recursive feature elimination)
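
A minimal sketch of recursive feature elimination (RFE) wrapping a logistic regression:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # True for the features that survived elimination
print(rfe.ranking_)  # 1 = selected; higher ranks were eliminated earlier
```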

Embedded Methods

Feature selection happens as part of model training (Lasso's L1 penalty, tree-based feature importance). Note that Ridge shrinks coefficients but does not zero them out, so it does not select features.
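
A minimal sketch using Lasso's L1 penalty on scikit-learn's built-in diabetes data; coefficients driven to exactly zero are effectively deselected:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zero coefficients mark features the model dropped
```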

Best Practices

Understand your data before engineering features
Create features based on domain knowledge
Avoid data leakage (don't use future information)
Scale features after the train/test split: fit the scaler on training data only, then transform the test data (see the sketch after this list)
Document feature engineering steps
Test feature importance and remove low-value features
Iterate: create, test, refine
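
A minimal sketch of leakage-free scaling, fitting on the training split only (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; no peeking at test data
```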

Key Takeaway: Feature engineering is both art and science. Combine domain expertise with experimentation to create features that help your model learn better.