💜 Matchmaking Predictor

Will this relationship last 10 years or more? (~0.9% of couples do)


📊 Behind the Prediction

This app is powered by a machine learning model trained to predict long-lasting relationships. The final model — a Logistic Regression with 6 selected features — reaches an AUC of 0.75. Here's the full analytical journey, from raw data to production.

🔬 The ML Pipeline — Step by Step
01
Data Audit
data_audit.ipynb
Exploration

Initial deep-dive into the raw dataset: column types, distributions, and missing values. This step established the data quality baseline and confirmed the data was clean enough to proceed — no major imputation needed, no data leakage concerns.

Key finding: the dataset was well-structured and complete. No significant cleaning was required, which allowed moving directly into feature engineering.
Distribution of relationship longevity (months) — only 0.9% of couples reach 120 months (10 years)
02
Feature Engineering
features_engineering.ipynb
Engineering

Individual features from partner A and partner B in isolation showed no exploitable signal — raw personality scores alone couldn't predict anything meaningful. This led to the core insight: create relational features that measure compatibility and difference between the two partners rather than individual traits.

Examples: absolute difference in emotional expressiveness, same/different love language flag, compatibility score on openness. These relational features became the model's actual input space — because it's not who you are individually, it's how you fit together.
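The relational features described above can be sketched with pandas. Column names and the 0–10 score scale here are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw columns for partners A and B (names are illustrative).
df = pd.DataFrame({
    "a_emotional_expressiveness": [7, 3, 9],
    "b_emotional_expressiveness": [6, 8, 9],
    "a_love_language": ["quality_time", "acts_of_service", "words"],
    "b_love_language": ["quality_time", "words", "words"],
    "a_openness": [8, 4, 6],
    "b_openness": [7, 5, 9],
})

# Relational features: measure the pair, not the individuals.
df["diff_emotional_expressiveness"] = (
    df["a_emotional_expressiveness"] - df["b_emotional_expressiveness"]
).abs()
df["same_love_language"] = (
    df["a_love_language"] == df["b_love_language"]
).astype(int)
# Compatibility as 1 minus the normalized gap (scores assumed on 0-10).
df["openness_compatibility"] = 1 - (df["a_openness"] - df["b_openness"]).abs() / 10
```

Each engineered column depends on both partners at once, which is the whole point: a couple with identical raw scores and a couple with opposite ones become distinguishable.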
03
Modeling — Regression Benchmark
modeling_longevity_monthly_target.ipynb
Dead end → pivot

First attempt: predict the exact duration of the relationship in months (regression). Multiple algorithms were benchmarked — Linear Regression, Ridge, Random Forest, LightGBM.

Result: R² ≈ 0.11, MAE ≈ 1 year. Models barely explained variance — predicting an exact duration is too noisy. Decision: reformulate as binary classification (will this couple last 10+ years, yes or no?).
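The reformulation itself is a one-liner: threshold the monthly target at 120 months. The sample values below are made up for illustration; the real target comes from the dataset:

```python
import numpy as np

# Hypothetical relationship durations in months (illustrative values).
longevity_months = np.array([14, 36, 130, 8, 121, 60])

# Reframe regression as classification: did the couple reach the 10-year mark?
THRESHOLD_MONTHS = 120  # 10 years
y = (longevity_months >= THRESHOLD_MONTHS).astype(int)
print(y)  # -> [0 0 1 0 1 0]
```

A binary target trades granularity for learnability: the model no longer has to explain month-level noise, only which side of the threshold a couple lands on.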
04
Modeling — Binary Classification
modeling_longevity_binary_target.ipynb
✓ Production model chosen

The problem was reframed: predict whether a relationship will last 10 years or more (binary yes/no). A full model comparison was run: Logistic Regression, Decision Tree, Random Forest, LightGBM — all with class_weight="balanced" to handle class imbalance.

Winner: Logistic Regression (AUC = 0.75). Counter-intuitively, the simpler linear model outperformed the tree-based methods, likely because the signal is genuinely linear on this small, noisy dataset. Interpretability comes as a bonus.
ROC curves — LR vs Random Forest vs LightGBM vs Baseline
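A minimal sketch of this benchmark, using a synthetic imbalanced dataset as a stand-in (random features, so the AUC numbers printed here are illustrative only, not the project's 0.75):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: heavily imbalanced, echoing the rare-positive setting.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.97, 0.03], random_state=42
)

# class_weight="balanced" reweights the rare positive class so the model
# cannot win by predicting "no" for everyone.
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {results[name]:.3f}")
```

AUC is the right scoring choice here: unlike accuracy, it is insensitive to the 97/3 class split.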
05
Feature Importance — SHAP Analysis
feature_importance.ipynb
Interpretability

SHAP (SHapley Additive exPlanations) values were computed to understand why the model makes each prediction, and which features actually drive the outcome.

Features with near-zero SHAP contributions were flagged as candidates for removal. This produced a human-readable ranking of which compatibility factors matter most, and which are noise.
SHAP value distribution per feature — impact on model output
Permutation importance — Logistic Regression
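The permutation-importance side of this analysis can be sketched with scikit-learn alone (the SHAP computation uses the separate `shap` library). Synthetic data stands in for the engineered relational features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature set.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in AUC: features whose
# shuffling barely moves the score contribute little and can be dropped.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

Permutation importance and SHAP answer slightly different questions (global score impact vs per-prediction attribution), which is why the notebook reports both.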
06
Feature Selection — Sequential Backward Elimination
optimize_feature_selection.ipynb
6 features retained

Rather than relying on SHAP alone, a more rigorous Sequential Backward Elimination approach was applied: iteratively remove the least useful feature and measure the impact on AUC. This gives a holistic view rather than feature-by-feature inspection.

Result: 6 features retained out of the full engineered set — with no AUC degradation. Fewer features = simpler model, less overfitting, and easier to interpret for the end user.
Sequential Backward Elimination — AUC vs number of features remaining
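Backward elimination down to 6 features is available off the shelf in scikit-learn. A sketch on synthetic data standing in for the ~15 engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the full engineered feature set (~15 features).
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    n_features_to_select=6,
    direction="backward",   # start from all features, drop the least useful
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the original 15 features
```

Because each elimination step is scored with cross-validated AUC, the procedure directly optimizes the metric that matters, rather than a proxy like coefficient magnitude.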
07
Model Export
model_export.ipynb
✓ Deployed

The final production pipeline — StandardScaler + Logistic Regression — was retrained on the full dataset (not just the train split) using the 6 selected features, then serialized to disk.

This .pkl file is what the FastAPI backend loads at startup to serve real-time predictions. The pipeline handles scaling automatically, so the API just receives raw scores.
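The export step can be sketched as a two-stage scikit-learn Pipeline serialized with joblib (file path and synthetic data are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 6 selected relational features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so callers can pass raw scores.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(X, y)  # in production: retrained on the full dataset

path = os.path.join(tempfile.gettempdir(), "model.pkl")
joblib.dump(pipeline, path)

# What the backend does at startup:
model = joblib.load(path)
proba = model.predict_proba(X[:1])[0, 1]
print(f"P(lasts 10+ years) = {proba:.2f}")
```

Bundling the scaler and classifier into one object prevents train/serve skew: the exact scaling parameters fitted at training time are replayed on every incoming request.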
08
Paired Category Features (optional)
paired_features.ipynb
Exploration

An experimental notebook testing whether specific pair combinations of categorical features (e.g. "Quality Time × Acts of Service" as a single feature) provide richer signal than the simple binary same_love_language flag.

Result: the binary flag wins. Love pair columns (15 combos): AUC −0.003 vs baseline. All pairs — love + career + location (81 features): AUC −0.014 vs baseline. Verdict: Binary flag is sufficient. The only pairs that carry signal are the homogeneous ones (same × same), which is exactly what the binary flag already captures — expanding to 15+ columns just dilutes the signal for a linear model.
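The pair-column expansion tested here can be sketched with pandas: cross the two partners' categories into one order-independent label, then one-hot encode it (column names are illustrative; with 5 love languages this yields the 15 unordered combos mentioned above):

```python
import pandas as pd

# Hypothetical love-language columns (illustrative names and values).
df = pd.DataFrame({
    "a_love_language": ["quality_time", "acts_of_service", "words"],
    "b_love_language": ["quality_time", "words", "words"],
})

# Order-independent pair label, e.g. "acts_of_service|words", then one
# one-hot column per combination observed in the data.
pair = df[["a_love_language", "b_love_language"]].apply(
    lambda r: "|".join(sorted(r)), axis=1
)
pairs_onehot = pd.get_dummies(pair, prefix="love_pair")
print(pairs_onehot.columns.tolist())
```

Sorting before joining is what makes the feature symmetric: (A, B) and (B, A) map to the same column, halving the combinatorial blow-up that hurt the linear model.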
🧠 Key Modelling Decisions
🔄
Regression → Classification — Predicting exact duration in months proved too noisy (R² ≈ 0.11). Reformulating as a binary "10-year threshold" gave meaningful, actionable signal.
⚖️
Class imbalance handling — Long-lasting couples are rare in the data. Using class_weight="balanced" prevented the model from simply predicting "no" for everyone.
🏆
Why Logistic Regression beat LightGBM — On a small, noisy, relationship dataset, simpler linear models generalize better. Tree models overfit to noise that linear models ignore.
✂️
Feature reduction without accuracy loss — Sequential Backward Elimination reduced the feature space from ~15 to 6 with zero AUC impact. Simpler model, cleaner UX, less maintenance.
🔗 Want to dig deeper? Browse the full code & notebooks, or connect on LinkedIn.

Built by Céline Apéry   GitHub   LinkedIn  ·  Model trained on the Cupid's Algorithm — Kaggle dataset