ml
production

Credit Risk Classifier — Random Forest

A binary classifier that predicts loan default probability, trained on 32K credit records. It engineers features (DTI ratio, income buckets, age groups), trains Logistic Regression and Random Forest models with class balancing, and evaluates them with AUC-ROC, the KS statistic, and a confusion matrix. Random Forest AUC-ROC on the held-out test set: 0.93.

Python, scikit-learn, Random Forest, pandas, EDA

Architecture

Raw Data (32K records) → Feature Engineering → Train / Evaluate → AUC 0.93 / F1 0.82

Code Snippet

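import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Feature engineering
# Loan amount as a share of annual income
df['dti_ratio']      = df['loan_amnt'] / df['person_income']
# Income quartiles (equal-frequency buckets)
df['income_bucket']  = pd.qcut(df['person_income'], q=4, labels=['low','mid','high','top'])
# Age bands; pd.cut bins are right-inclusive, and ages above 100 become NaN
df['age_group']      = pd.cut(df['person_age'], bins=[0,25,35,50,100],
                               labels=['<25','25-35','35-50','50+'])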

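# Train / test split (70/30, stratified on the default label)
# X and y are assumed to be the feature matrix and target built from df beforehand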
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Random Forest — class_weight balances the 78/22 split
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    class_weight='balanced',
    random_state=42,
)
rf.fit(X_train, y_train)

# Results on held-out test set
# AUC-ROC  : 0.9280   (industry threshold: > 0.80)
# F1-Score : 0.8162
# Sensitivity: 70.11% (defaults correctly flagged)
# Specificity: 99.53% (good clients correctly approved)
# Top features: loan_percent_income 20.6%, person_income 16.7%, loan_int_rate 13.8%
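
The metrics listed above (AUC-ROC, KS statistic, sensitivity, specificity, F1) can be computed roughly as follows. This is a minimal sketch, not the project's actual evaluation code: the data is a synthetic stand-in generated with make_classification, and the 0.5 decision cut-off is an assumption, so the printed numbers will not match the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (assumption): roughly 78/22 class split.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.78, 0.22], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Same hyperparameters as the snippet above.
rf = RandomForestClassifier(
    n_estimators=100, max_depth=15, min_samples_split=10,
    class_weight="balanced", random_state=42,
).fit(X_train, y_train)

# Predicted probability of the positive (default) class.
proba = rf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)  # 0.5 cut-off is an assumption

# AUC-ROC is threshold-free; KS is the largest gap between TPR and FPR
# over all score thresholds.
auc = roc_auc_score(y_test, proba)
fpr, tpr, _ = roc_curve(y_test, proba)
ks = float(np.max(tpr - fpr))

# Confusion-matrix metrics at the chosen cut-off.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
sensitivity = tp / (tp + fn)  # share of defaults correctly flagged
specificity = tn / (tn + fp)  # share of good clients correctly approved
f1 = f1_score(y_test, pred)

print(f"AUC-ROC     : {auc:.4f}")
print(f"KS statistic: {ks:.4f}")
print(f"F1-Score    : {f1:.4f}")
print(f"Sensitivity : {sensitivity:.2%}")
print(f"Specificity : {specificity:.2%}")
```

Sensitivity, specificity, and F1 depend on the cut-off, while AUC-ROC and KS are computed from the full score distribution, so moving the threshold trades sensitivity against specificity without changing the latter two.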

Detailed write-up, screenshots, and metrics coming in Phase 4.
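
For reference, here is what class_weight='balanced' works out to for the 78/22 split mentioned in the Random Forest comment. scikit-learn documents the balanced weights as n_samples / (n_classes * np.bincount(y)). The 100-row label vector below is a toy stand-in chosen to reproduce the 78/22 ratio, not the project's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with the same 78/22 ratio as the credit data (assumption).
y = np.array([0] * 78 + [1] * 22)
classes = np.array([0, 1])

# Manual formula: n_samples / (n_classes * count per class).
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights via scikit-learn's helper.
sk = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual.round(3))  # -> [0.641 2.273]
```

Under this weighting each default counts about 3.5 times as much as a non-default during training, which is how the minority class is kept from being drowned out.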