Life Satisfaction Prediction
Predicting life satisfaction using machine learning, explainable AI, and novel tabular-to-text NLP algorithms
Overview
An ensemble machine learning pipeline for life satisfaction prediction from tabular survey data, enhanced through feature engineering and a novel tabular-to-text NLP algorithm achieving 5-10% accuracy improvement. Deployed as a real-time assessment tool with <100ms response time.
π Published in: Heliyon (Elsevier, Impact Factor: 3.7, Q1 Journal)
π Paper: Heliyon
π» Code: GitHub Repository
Problem Statement
Life satisfaction is a critical indicator of mental health and well-being:
- Subjective measure: Difficult to quantify objectively
- Multi-dimensional: Influenced by work, relationships, health, finances, etc.
- Public health: Understanding drivers of life satisfaction informs policy
- Personalized interventions: Predicting satisfaction enables targeted support
Goal: Develop an ML model to predict life satisfaction from demographic and behavioral survey data, with interpretable insights into contributing factors.
Dataset
Source: World Values Survey (WVS) Wave 7 (2017-2022)
Countries: 50+ countries, focus on diverse socioeconomic contexts
Sample Size: 15,487 participants
Features: 120+ survey questions covering:
- Demographics (age, gender, education, income)
- Work satisfaction and employment status
- Family and social relationships
- Health and well-being indicators
- Religious and political beliefs
- Trust in institutions
Target Variable: Life satisfaction score (1-10 Likert scale)
Task: Regression β Classification (binned into 3 classes: Low, Medium, High)
Solution Architecture
Phase 1: Exploratory Data Analysis
Key Insights:
- Non-linear relationships: Life satisfaction has complex interactions with income, health
- Class imbalance: 67% participants reported high satisfaction (8-10)
- Missing data: 8-15% missing values per feature (MCAR/MAR)
- Multicollinearity: High correlation between income and education (r=0.68)
Phase 2: Feature Engineering
Novel Tabular-to-Text NLP Algorithm:
Motivation: Traditional ML treats each feature independently, missing contextual relationships that humans understand through language.
Algorithm:
- Feature Encoding: Convert each tabular feature into a natural language sentence
- Example:
age=35, income=50000, health=goodβ βThe person is 35 years old, earns $50,000 annually, and has good health.β
- Example:
- Template Generation: Create semantic templates for each feature type
templates = { 'age': "The person is {age} years old.", 'income': "They earn ${income} annually.", 'health': "They report {health} health status.", 'job_satisfaction': "Job satisfaction is rated {job_satisfaction}/10.", # ... 120+ templates } -
Text Aggregation: Concatenate all sentences into a coherent paragraph
-
BERT Embedding: Encode text paragraph into 768-dimensional vector using BERT
- Concatenation: Combine BERT embeddings with original tabular features
- Final feature vector: 120 (tabular) + 768 (BERT) = 888 dimensions
Result: 100% data retention (no information loss) + contextual embeddings
Performance Boost: +5-10% accuracy across all models
Phase 3: Statistical Modeling
Classical ML Models Tested (8 algorithms):
- XGBoost β Best Overall
- Accuracy: 87.3%
- MAE: 0.62 (on 1-10 scale)
- RΒ²: 0.74
- LightGBM
- Accuracy: 86.9%
- MAE: 0.65
- Random Forest
- Accuracy: 85.4%
- MAE: 0.71
- Gradient Boosting
- Accuracy: 86.1%
- MAE: 0.68
- Support Vector Regression (SVR)
- Accuracy: 82.7%
- MAE: 0.83
- Ridge Regression
- Accuracy: 79.5%
- MAE: 0.94
- Lasso Regression
- Accuracy: 78.9%
- MAE: 0.97
- ElasticNet
- Accuracy: 79.2%
- MAE: 0.95
Ensemble Stacking:
- Level 1: XGBoost, LightGBM, Random Forest, Gradient Boosting
- Level 2: Ridge Regression as meta-learner
- Final Accuracy: 88.7% (classification), MAE: 0.58 (regression)
Phase 4: Deep Learning Enhancement
Architecture: Fully Connected Neural Network
model = Sequential([
Dense(512, activation='relu', input_dim=888), # Tabular + BERT
Dropout(0.3),
Dense(256, activation='relu'),
BatchNormalization(),
Dropout(0.3),
Dense(128, activation='relu'),
Dense(64, activation='relu'),
Dense(3, activation='softmax') # 3 classes: Low/Med/High
])
optimizer = Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Results:
- Accuracy: 86.2% (slightly lower than XGBoost)
- Benefit: Better handling of non-linear interactions
Phase 5: Explainability & Insights
SHAP (SHapley Additive exPlanations):
Top 10 Life Satisfaction Predictors:
- Health Status (SHAP: 0.42) - Strongest predictor
- Income Level (SHAP: 0.31)
- Job Satisfaction (SHAP: 0.28)
- Social Relationships (SHAP: 0.25)
- Work-Life Balance (SHAP: 0.22)
- Trust in Government (SHAP: 0.18)
- Education Level (SHAP: 0.15)
- Age (SHAP: 0.12) - U-shaped relationship
- Marital Status (SHAP: 0.11)
- Religious Beliefs (SHAP: 0.09)
Key Findings:
- Health status is 1.4Γ more important than income
- Job satisfaction matters more than absolute income level
- Social connections have stronger impact than wealth
- Trust in institutions correlates with life satisfaction
LIME (Local Interpretable Model-Agnostic Explanations):
- Individual prediction explanations
- Counterfactual scenarios: βIf health improved, satisfaction would increase by 1.2 pointsβ
Technical Implementation
Machine Learning Pipeline
# Tabular-to-Text Conversion
def tabular_to_text(row):
text = f"The person is {row['age']} years old. "
text += f"They earn ${row['income']} annually. "
text += f"Health status is {row['health']}. "
# ... 120+ features
return text
# BERT Embedding
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
texts = df.apply(tabular_to_text, axis=1)
embeddings = model(tokenizer(texts, padding=True, return_tensors='pt')).last_hidden_state.mean(dim=1)
# Concatenate with original tabular features
X_combined = np.hstack([X_tabular, embeddings.numpy()])
# XGBoost Training
xgb_model = XGBRegressor(
n_estimators=500,
max_depth=8,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8
)
xgb_model.fit(X_combined, y_train)
Real-Time Deployment
Gradio Web Interface:
- Interactive form for user input (120 questions)
- Real-time prediction with confidence scores
- SHAP waterfall plot for explanation
- Personalized recommendations
Flask REST API:
- Endpoint:
POST /predict_satisfaction - Latency: <100ms per prediction
- Throughput: 500 requests/second
Deployment:
- Hosted on HuggingFace Spaces (free tier)
- Docker containerization for reproducibility
- CI/CD with GitHub Actions
Results & Performance
Model Comparison
| Model | Accuracy | MAE | RΒ² | Inference Time |
|---|---|---|---|---|
| XGBoost + BERT | 88.7% | 0.58 | 0.76 | <100ms |
| XGBoost (Baseline) | 82.1% | 0.74 | 0.68 | 50ms |
| LightGBM + BERT | 87.9% | 0.61 | 0.74 | 80ms |
| Random Forest | 80.5% | 0.82 | 0.63 | 120ms |
| Neural Network | 86.2% | 0.65 | 0.72 | 150ms |
| Ridge Regression | 75.3% | 1.02 | 0.54 | 30ms |
Improvement: +6.6% accuracy with tabular-to-text NLP algorithm
Cross-Cultural Validation
Tested on 5 Countries:
- USA: 89.2% accuracy
- Germany: 87.5%
- India: 85.1%
- Brazil: 83.7%
- Nigeria: 82.4%
Robustness: Model generalizes across diverse cultural contexts
Technical Stack
Machine Learning: Scikit-learn, XGBoost, LightGBM, CatBoost
Deep Learning: TensorFlow, Keras, PyTorch
NLP: HuggingFace Transformers, BERT, RoBERTa
Explainability: SHAP, LIME, ELI5
Data Processing: Pandas, NumPy, SciPy
Deployment: Gradio, Flask, HuggingFace Spaces
Visualization: Matplotlib, Seaborn, Plotly
Key Innovations
- Tabular-to-Text NLP: First application of natural language encoding to tabular life satisfaction data
- Zero Information Loss: 100% data retention while adding contextual embeddings
- Hybrid Architecture: Combines statistical ML (XGBoost) with deep NLP (BERT)
- Real-Time Deployment: <100ms latency for interactive use
- Cross-Cultural Validation: Tested across 50+ countries
Impact & Applications
Public Health
β
Mental Health Screening: Early identification of at-risk individuals
β
Policy Making: Inform government decisions on social programs
β
Workplace Wellness: Employee satisfaction prediction
β
Healthcare Integration: Combine with medical records for holistic care
Industry Use Cases
- HR Analytics: Predict employee satisfaction and turnover
- Insurance: Life satisfaction as a health indicator
- Market Research: Consumer happiness surveys
- Education: Student well-being assessment
Publication & Recognition
π Citation:
@article{khan2024predicting,
title={Predicting life satisfaction using machine learning and explainable AI},
author={Khan, Alif Elham and Hasan, Mohammad Junayed and Anjum, Humayra and Mohammed, Nabeel and Momen, Sifat},
journal={Heliyon},
volume={10},
number={10},
year={2024},
publisher={Elsevier}
}
Journal Metrics: Q1, Impact Factor: 3.7
Future Work
- Longitudinal Studies: Track satisfaction changes over time
- Causal Inference: Identify interventions that improve satisfaction
- Multi-Modal Input: Integrate social media sentiment, wearable data
- Personalized Recommendations: Tailored suggestions for improvement
- Global Expansion: Include more countries and cultures
- Transfer Learning: Apply to related tasks (happiness, quality of life)
Status: Published & Deployed
Journal: Heliyon (Elsevier, Q1, IF: 3.7)
GitHub: Life-Satisfaction-Machine-Learning
Demo: Interactive Gradio App Available