💜 Matchmaking Predictor

Will this relationship last 10 years or more? (~0.9% of couples do)


📊 Behind the Prediction

This app is powered by a machine learning model trained to predict long-lasting relationships. The final model — a Logistic Regression with 6 selected features — reaches an AUC of 0.75. Here's the full analytical journey, from raw data to production.

🔬 The ML Pipeline — Step by Step
01
Data Audit
data_audit.ipynb
Exploration

Initial deep-dive into the raw dataset: column types, distributions, and missing values. This step established the data quality baseline and confirmed the data was clean enough to proceed — no major imputation needed, no data leakage concerns.

Key finding: the dataset was well-structured and complete. No significant cleaning was required, which allowed moving directly into feature engineering.
Distribution of relationship longevity (months) — only 0.9% of couples reach 120 months (10 years)
02
Feature Engineering
features_engineering.ipynb
Engineering

Individual features from partner A and partner B in isolation showed no exploitable signal — raw personality scores alone couldn't predict anything meaningful. This led to the core insight: create relational features that measure compatibility and difference between the two partners rather than individual traits.

Examples: absolute difference in emotional expressiveness, same/different love language flag, compatibility score on openness. These relational features became the model's actual input space — because it's not who you are individually, it's how you fit together.
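The relational features described above can be sketched with pandas. Column names and the 0–10 score scale here are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw columns for partners A and B (names are illustrative).
df = pd.DataFrame({
    "a_emotional_expressiveness": [7, 3, 9],
    "b_emotional_expressiveness": [6, 8, 9],
    "a_love_language": ["quality_time", "acts_of_service", "words"],
    "b_love_language": ["quality_time", "words", "words"],
    "a_openness": [8, 4, 6],
    "b_openness": [7, 5, 9],
})

# Relational features: measure the pair, not the individuals.
df["diff_emotional_expressiveness"] = (
    df["a_emotional_expressiveness"] - df["b_emotional_expressiveness"]
).abs()
df["same_love_language"] = (
    df["a_love_language"] == df["b_love_language"]
).astype(int)
# Compatibility as 1 minus the normalized gap (scores assumed on 0-10).
df["openness_compatibility"] = 1 - (df["a_openness"] - df["b_openness"]).abs() / 10
```

Each engineered column depends on both partners at once, which is the whole point: a couple with identical raw scores and a couple with opposite ones become distinguishable.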
03
Modeling — Regression Benchmark
modeling_longevity_monthly_target.ipynb
Dead end → pivot

First attempt: predict the exact duration of the relationship in months (regression). Multiple algorithms were benchmarked — Linear Regression, Ridge, Random Forest, LightGBM.

Result: R² ≈ 0.11, MAE ≈ 1 year. Models barely explained variance — predicting an exact duration is too noisy. Decision: reformulate as binary classification (will this couple last 10+ years, yes or no?).
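The reformulation itself is a one-liner: threshold the monthly target at 120 months. The sample values below are made up for illustration; the real target comes from the dataset:

```python
import numpy as np

# Hypothetical relationship durations in months (illustrative values).
longevity_months = np.array([14, 36, 130, 8, 121, 60])

# Reframe regression as classification: did the couple reach the 10-year mark?
THRESHOLD_MONTHS = 120  # 10 years
y = (longevity_months >= THRESHOLD_MONTHS).astype(int)
print(y)  # -> [0 0 1 0 1 0]
```

A binary target trades granularity for learnability: the model no longer has to explain month-level noise, only which side of the threshold a couple lands on.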
04
Modeling — Binary Classification
modeling_longevity_binary_target.ipynb
✓ Production model chosen

The problem was reframed: predict whether a relationship will last 10 years or more (binary yes/no). A full model comparison was run: Logistic Regression, Decision Tree, Random Forest, LightGBM — all with class_weight="balanced" to handle class imbalance.

Winner: Logistic Regression (AUC = 0.75). Counter-intuitively, the simpler linear model outperformed the tree-based methods, likely because the signal is genuinely linear on this small, noisy dataset. Interpretability comes as a bonus.
ROC curves — LR vs Random Forest vs LightGBM vs Baseline
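A minimal sketch of this benchmark, using a synthetic imbalanced dataset as a stand-in (random features, so the AUC numbers printed here are illustrative only, not the project's 0.75):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: heavily imbalanced, echoing the rare-positive setting.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.97, 0.03], random_state=42
)

# class_weight="balanced" reweights the rare positive class so the model
# cannot win by predicting "no" for everyone.
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {results[name]:.3f}")
```

AUC is the right scoring choice here: unlike accuracy, it is insensitive to the 97/3 class split.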
05
Feature Importance — SHAP Analysis
feature_importance.ipynb
Interpretability

SHAP (SHapley Additive exPlanations) values were computed to understand why the model makes each prediction, and which features actually drive the outcome.

Features with near-zero SHAP contributions were flagged as candidates for removal. This produced a human-readable ranking of which compatibility factors matter most, and which are noise.
SHAP value distribution per feature — impact on model output
Permutation importance — Logistic Regression
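The permutation-importance side of this analysis can be sketched with scikit-learn alone (the SHAP computation uses the separate `shap` library). Synthetic data stands in for the engineered relational features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature set.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in AUC: features whose
# shuffling barely moves the score contribute little and can be dropped.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

Permutation importance and SHAP answer slightly different questions (global score impact vs per-prediction attribution), which is why the notebook reports both.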
06
Feature Selection — Sequential Backward Elimination
optimize_feature_selection.ipynb
6 features retained

Rather than relying on SHAP alone, a more rigorous Sequential Backward Elimination approach was applied: iteratively remove the least useful feature and measure the impact on AUC. This gives a holistic view rather than feature-by-feature inspection.

Result: 6 features retained out of the full engineered set — with no AUC degradation. Fewer features = simpler model, less overfitting, and easier to interpret for the end user.
Sequential Backward Elimination — AUC vs number of features remaining
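Backward elimination down to 6 features is available off the shelf in scikit-learn. A sketch on synthetic data standing in for the ~15 engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the full engineered feature set (~15 features).
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    n_features_to_select=6,
    direction="backward",   # start from all features, drop the least useful
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the original 15 features
```

Because each elimination step is scored with cross-validated AUC, the procedure directly optimizes the metric that matters, rather than a proxy like coefficient magnitude.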
07
Model Export
model_export.ipynb
✓ Deployed

The final production pipeline — StandardScaler + Logistic Regression — was retrained on the full dataset (not just the train split) using the 6 selected features, then serialized to disk.

This .pkl file is what the FastAPI backend loads at startup to serve real-time predictions. The pipeline handles scaling automatically, so the API just receives raw scores.
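The export step can be sketched as a two-stage scikit-learn Pipeline serialized with joblib (file path and synthetic data are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 6 selected relational features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so callers can pass raw scores.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(X, y)  # in production: retrained on the full dataset

path = os.path.join(tempfile.gettempdir(), "model.pkl")
joblib.dump(pipeline, path)

# What the backend does at startup:
model = joblib.load(path)
proba = model.predict_proba(X[:1])[0, 1]
print(f"P(lasts 10+ years) = {proba:.2f}")
```

Bundling the scaler and classifier into one object prevents train/serve skew: the exact scaling parameters fitted at training time are replayed on every incoming request.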
08
Paired Category Features (optional)
paired_features.ipynb
Exploration

An experimental notebook testing whether specific pair combinations of categorical features (e.g. "Quality Time × Acts of Service" as a single feature) provide richer signal than the simple binary same_love_language flag.

Result: the binary flag wins. Love pair columns (15 combos): AUC −0.003 vs baseline. All pairs — love + career + location (81 features): AUC −0.014 vs baseline. Verdict: Binary flag is sufficient. The only pairs that carry signal are the homogeneous ones (same × same), which is exactly what the binary flag already captures — expanding to 15+ columns just dilutes the signal for a linear model.
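The pair-column expansion tested here can be sketched with pandas: cross the two partners' categories into one order-independent label, then one-hot encode it (column names are illustrative; with 5 love languages this yields the 15 unordered combos mentioned above):

```python
import pandas as pd

# Hypothetical love-language columns (illustrative names and values).
df = pd.DataFrame({
    "a_love_language": ["quality_time", "acts_of_service", "words"],
    "b_love_language": ["quality_time", "words", "words"],
})

# Order-independent pair label, e.g. "acts_of_service|words", then one
# one-hot column per combination observed in the data.
pair = df[["a_love_language", "b_love_language"]].apply(
    lambda r: "|".join(sorted(r)), axis=1
)
pairs_onehot = pd.get_dummies(pair, prefix="love_pair")
print(pairs_onehot.columns.tolist())
```

Sorting before joining is what makes the feature symmetric: (A, B) and (B, A) map to the same column, halving the combinatorial blow-up that hurt the linear model.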
🧠 Key Modelling Decisions
🔄
Regression → Classification — Predicting exact duration in months proved too noisy (R² ≈ 0.11). Reformulating as a binary "10-year threshold" gave meaningful, actionable signal.
⚖️
Class imbalance handling — Long-lasting couples are rare in the data. Using class_weight="balanced" prevented the model from simply predicting "no" for everyone.
🏆
Why Logistic Regression beat LightGBM — On a small, noisy, relationship dataset, simpler linear models generalize better. Tree models overfit to noise that linear models ignore.
✂️
Feature reduction without accuracy loss — Sequential Backward Elimination reduced the feature space from ~15 to 6 with zero AUC impact. Simpler model, cleaner UX, less maintenance.
🔗 Want to dig deeper? Browse the full code & notebooks, or connect on LinkedIn.

Built by Céline Apéry   GitHub   LinkedIn  ·  Model trained on the Cupid's Algorithm — Kaggle dataset