Life Satisfaction Prediction

Predicting life satisfaction using machine learning, explainable AI, and novel tabular-to-text NLP algorithms

Overview

An ensemble machine learning pipeline that predicts life satisfaction from tabular survey data, enhanced through feature engineering and a novel tabular-to-text NLP algorithm that yields a 5-10% accuracy improvement. Deployed as a real-time assessment tool with <100ms response time.

📄 Published in: Heliyon (Elsevier, Impact Factor: 3.7, Q1 Journal)
🔗 Paper: Heliyon
💻 Code: GitHub Repository

Problem Statement

Life satisfaction is a critical indicator of mental health and well-being:

  • Subjective measure: Difficult to quantify objectively
  • Multi-dimensional: Influenced by work, relationships, health, finances, etc.
  • Public health: Understanding drivers of life satisfaction informs policy
  • Personalized interventions: Predicting satisfaction enables targeted support

Goal: Develop an ML model to predict life satisfaction from demographic and behavioral survey data, with interpretable insights into contributing factors.

Dataset

Source: World Values Survey (WVS) Wave 7 (2017-2022)
Countries: 50+ countries, focus on diverse socioeconomic contexts
Sample Size: 15,487 participants
Features: 120+ survey questions covering:

  • Demographics (age, gender, education, income)
  • Work satisfaction and employment status
  • Family and social relationships
  • Health and well-being indicators
  • Religious and political beliefs
  • Trust in institutions

Target Variable: Life satisfaction score (1-10 Likert scale)
Task: Regression → Classification (binned into 3 classes: Low, Medium, High; a binning sketch follows)
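
A minimal sketch of the class binning with pandas, assuming a df with a life_satisfaction column; the Low/Medium cut points are illustrative assumptions, while the High bin (8-10) matches the class-imbalance note in the EDA below:

import pandas as pd

# Bin the 1-10 satisfaction score into 3 ordinal classes.
# Low/Medium cut points are assumed; High (8-10) matches the EDA below.
df['satisfaction_class'] = pd.cut(
    df['life_satisfaction'],
    bins=[0, 4, 7, 10],                 # (0,4] Low, (4,7] Medium, (7,10] High
    labels=['Low', 'Medium', 'High']
)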

Solution Architecture

Phase 1: Exploratory Data Analysis

Key Insights:

  • Non-linear relationships: Life satisfaction has complex, non-linear interactions with income and health
  • Class imbalance: 67% participants reported high satisfaction (8-10)
  • Missing data: 8-15% missing values per feature (MCAR/MAR)
  • Multicollinearity: High correlation between income and education (r=0.68)
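
A minimal sketch of the checks behind these insights, assuming the WVS dataframe df and the column names used here:

# Per-feature missing-value rates (expected: 8-15% per feature)
missing_rates = df.isna().mean().sort_values(ascending=False)

# Pairwise correlation to flag multicollinearity (e.g. income vs. education)
r = df[['income', 'education']].corr().iloc[0, 1]
print(f"income-education correlation: r={r:.2f}")

# Class balance after binning (expect ~67% in the High class)
print(df['satisfaction_class'].value_counts(normalize=True))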

Phase 2: Feature Engineering

Novel Tabular-to-Text NLP Algorithm:

Motivation: Traditional tabular ML treats features as independent columns, missing the contextual relationships a human reader would infer from a narrative description.

Algorithm:

  1. Feature Encoding: Convert each tabular feature into a natural language sentence
    • Example: age=35, income=50000, health=good → "The person is 35 years old, earns $50,000 annually, and has good health."
  2. Template Generation: Create semantic templates for each feature type
    templates = {
        'age': "The person is {age} years old.",
        'income': "They earn ${income} annually.",
        'health': "They report {health} health status.",
        'job_satisfaction': "Job satisfaction is rated {job_satisfaction}/10.",
        # ... 120+ templates
    }
    
  3. Text Aggregation: Concatenate all sentences into a coherent paragraph

  4. BERT Embedding: Encode text paragraph into 768-dimensional vector using BERT

  5. Concatenation: Combine BERT embeddings with original tabular features
    • Final feature vector: 120 (tabular) + 768 (BERT) = 888 dimensions

Result: 100% data retention (no information loss) + contextual embeddings

Performance Boost: +5-10% accuracy across all models
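
A minimal sketch of steps 1-3 (template filling and aggregation) using the templates dict above; row_to_text is a hypothetical helper name:

import pandas as pd

def row_to_text(row, templates):
    """Fill each feature's template and join the sentences into one paragraph (steps 1-3)."""
    sentences = [template.format(**{name: row[name]})
                 for name, template in templates.items()
                 if pd.notna(row[name])]        # skip features with missing values
    return " ".join(sentences)

paragraphs = df.apply(lambda r: row_to_text(r, templates), axis=1)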

Phase 3: Statistical Modeling

Classical ML Models Tested (8 algorithms):

  1. XGBoost ⭐ Best Overall
    • Accuracy: 87.3%
    • MAE: 0.62 (on 1-10 scale)
    • R²: 0.74
  2. LightGBM
    • Accuracy: 86.9%
    • MAE: 0.65
  3. Random Forest
    • Accuracy: 85.4%
    • MAE: 0.71
  4. Gradient Boosting
    • Accuracy: 86.1%
    • MAE: 0.68
  5. Support Vector Regression (SVR)
    • Accuracy: 82.7%
    • MAE: 0.83
  6. Ridge Regression
    • Accuracy: 79.5%
    • MAE: 0.94
  7. Lasso Regression
    • Accuracy: 78.9%
    • MAE: 0.97
  8. ElasticNet
    • Accuracy: 79.2%
    • MAE: 0.95

Ensemble Stacking:

  • Level 1: XGBoost, LightGBM, Random Forest, Gradient Boosting
  • Level 2: Ridge Regression as meta-learner
  • Final Accuracy: 88.7% (classification), MAE: 0.58 (regression)
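
A minimal sketch of the two-level stack with scikit-learn's StackingRegressor; the hyperparameters are illustrative assumptions, not the paper's exact configuration:

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Level 1: four tree-based base learners; Level 2: Ridge meta-learner
stack = StackingRegressor(
    estimators=[
        ('xgb', XGBRegressor(n_estimators=500, learning_rate=0.05)),
        ('lgbm', LGBMRegressor(n_estimators=500)),
        ('rf', RandomForestRegressor(n_estimators=300)),
        ('gb', GradientBoostingRegressor()),
    ],
    final_estimator=Ridge(),
    cv=5,   # meta-learner is trained on out-of-fold predictions
)
stack.fit(X_combined, y_train)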

Phase 4: Deep Learning Enhancement

Architecture: Fully Connected Neural Network

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Input(shape=(888,)),               # 120 tabular + 768 BERT dimensions
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(3, activation='softmax')     # 3 classes: Low/Med/High
])

optimizer = Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
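
A minimal training call for the network above; X_combined comes from the pipeline below, y_class is an assumed integer-encoded class label, and the early-stopping settings are illustrative:

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

y_onehot = to_categorical(y_class, num_classes=3)   # one-hot targets for categorical_crossentropy
model.fit(
    X_combined, y_onehot,
    validation_split=0.2, epochs=50, batch_size=64,
    callbacks=[EarlyStopping(patience=5, restore_best_weights=True)]
)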

Results:

  • Accuracy: 86.2% (slightly lower than XGBoost)
  • Benefit: Better handling of non-linear interactions

Phase 5: Explainability & Insights

SHAP (SHapley Additive exPlanations):

Top 10 Life Satisfaction Predictors:

  1. Health Status (SHAP: 0.42) - Strongest predictor
  2. Income Level (SHAP: 0.31)
  3. Job Satisfaction (SHAP: 0.28)
  4. Social Relationships (SHAP: 0.25)
  5. Work-Life Balance (SHAP: 0.22)
  6. Trust in Government (SHAP: 0.18)
  7. Education Level (SHAP: 0.15)
  8. Age (SHAP: 0.12) - U-shaped relationship
  9. Marital Status (SHAP: 0.11)
  10. Religious Beliefs (SHAP: 0.09)

Key Findings:

  • Health status is 1.4× more important than income
  • Job satisfaction matters more than absolute income level
  • Social connections have stronger impact than wealth
  • Trust in institutions correlates with life satisfaction
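
A minimal sketch of how such global importances are computed with the shap library; feature_names is an assumed list of the 888 feature labels:

import shap

# TreeExplainer is efficient for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_combined)

# Global importance = mean |SHAP value| per feature; summary_plot ranks them
shap.summary_plot(shap_values, X_combined, feature_names=feature_names)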

LIME (Local Interpretable Model-Agnostic Explanations):

  • Individual prediction explanations
  • Counterfactual scenarios: "If health improved, satisfaction would increase by 1.2 points"
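
A minimal sketch of a per-instance explanation with the lime library; clf, X_train, X_test, and feature_names are assumed to come from the pipeline below:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=['Low', 'Medium', 'High'], mode='classification'
)
# Explain one participant's prediction using the top 10 local features
exp = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=10)
exp.show_in_notebook()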

Technical Implementation

Machine Learning Pipeline

import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from xgboost import XGBRegressor

# Tabular-to-Text Conversion
def tabular_to_text(row):
    text = f"The person is {row['age']} years old. "
    text += f"They earn ${row['income']} annually. "
    text += f"Health status is {row['health']}. "
    # ... 120+ features
    return text

# BERT Embedding (mean-pooled over tokens -> one 768-d vector per participant)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

texts = df.apply(tabular_to_text, axis=1).tolist()
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():   # inference only; batch the forward pass in practice for ~15k rows
    embeddings = bert(**encoded).last_hidden_state.mean(dim=1)

# Concatenate with original tabular features: 120 + 768 = 888 dimensions
X_combined = np.hstack([X_tabular, embeddings.numpy()])

# XGBoost Training
xgb_model = XGBRegressor(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8
)
xgb_model.fit(X_combined, y_train)

Real-Time Deployment

Gradio Web Interface:

  • Interactive form for user input (120 questions)
  • Real-time prediction with confidence scores
  • SHAP waterfall plot for explanation
  • Personalized recommendations
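
A minimal sketch of the Gradio wiring (inputs abbreviated to three of the 120 questions; build_feature_vector and clf are hypothetical names for the preprocessing helper and trained classifier):

import gradio as gr

def predict_fn(age, income, health):   # abbreviated: the real form takes 120 inputs
    x = build_feature_vector(age, income, health)   # hypothetical preprocessing helper
    proba = clf.predict_proba(x)[0]
    return {c: float(p) for c, p in zip(['Low', 'Medium', 'High'], proba)}

demo = gr.Interface(
    fn=predict_fn,
    inputs=[gr.Number(label='Age'), gr.Number(label='Income'),
            gr.Dropdown(['poor', 'fair', 'good', 'excellent'], label='Health')],
    outputs=gr.Label(num_top_classes=3),   # confidence score per class
)
demo.launch()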

Flask REST API:

  • Endpoint: POST /predict_satisfaction
  • Latency: <100ms per prediction
  • Throughput: 500 requests/second
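
A minimal sketch of the endpoint (route name from above; the JSON payload schema and the build_feature_vector / clf names are assumptions):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict_satisfaction', methods=['POST'])
def predict_satisfaction():
    features = request.get_json()                 # e.g. {"age": 35, "income": 50000, ...}
    x = build_feature_vector(**features)          # hypothetical preprocessing helper
    pred = clf.predict(x)[0]
    confidence = float(clf.predict_proba(x)[0].max())
    return jsonify({'satisfaction_class': str(pred), 'confidence': confidence})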

Deployment:

  • Hosted on HuggingFace Spaces (free tier)
  • Docker containerization for reproducibility
  • CI/CD with GitHub Actions

Results & Performance

Model Comparison

Model              | Accuracy | MAE  | R²   | Inference Time
XGBoost + BERT     | 88.7%    | 0.58 | 0.76 | <100ms
XGBoost (Baseline) | 82.1%    | 0.74 | 0.68 | 50ms
LightGBM + BERT    | 87.9%    | 0.61 | 0.74 | 80ms
Random Forest      | 80.5%    | 0.82 | 0.63 | 120ms
Neural Network     | 86.2%    | 0.65 | 0.72 | 150ms
Ridge Regression   | 75.3%    | 1.02 | 0.54 | 30ms

Improvement: +6.6 percentage points in accuracy (88.7% vs. the 82.1% baseline) from the tabular-to-text NLP algorithm

Cross-Cultural Validation

Tested on 5 Countries:

  • USA: 89.2% accuracy
  • Germany: 87.5%
  • India: 85.1%
  • Brazil: 83.7%
  • Nigeria: 82.4%

Robustness: Model generalizes across diverse cultural contexts

Technical Stack

Machine Learning: Scikit-learn, XGBoost, LightGBM, CatBoost
Deep Learning: TensorFlow, Keras, PyTorch
NLP: HuggingFace Transformers, BERT, RoBERTa
Explainability: SHAP, LIME, ELI5
Data Processing: Pandas, NumPy, SciPy
Deployment: Gradio, Flask, HuggingFace Spaces
Visualization: Matplotlib, Seaborn, Plotly

Key Innovations

  1. Tabular-to-Text NLP: First application of natural language encoding to tabular life satisfaction data
  2. Zero Information Loss: 100% data retention while adding contextual embeddings
  3. Hybrid Architecture: Combines statistical ML (XGBoost) with deep NLP (BERT)
  4. Real-Time Deployment: <100ms latency for interactive use
  5. Cross-Cultural Validation: Trained on data from 50+ countries, with per-country accuracy validated on five diverse contexts

Impact & Applications

Public Health

✅ Mental Health Screening: Early identification of at-risk individuals
✅ Policy Making: Inform government decisions on social programs
✅ Workplace Wellness: Employee satisfaction prediction
✅ Healthcare Integration: Combine with medical records for holistic care

Industry Use Cases

  • HR Analytics: Predict employee satisfaction and turnover
  • Insurance: Life satisfaction as a health indicator
  • Market Research: Consumer happiness surveys
  • Education: Student well-being assessment

Publication & Recognition

📄 Citation:

@article{khan2024predicting,
  title={Predicting life satisfaction using machine learning and explainable AI},
  author={Khan, Alif Elham and Hasan, Mohammad Junayed and Anjum, Humayra and Mohammed, Nabeel and Momen, Sifat},
  journal={Heliyon},
  volume={10},
  number={10},
  year={2024},
  publisher={Elsevier}
}

Journal Metrics: Q1, Impact Factor: 3.7

Future Work

  • Longitudinal Studies: Track satisfaction changes over time
  • Causal Inference: Identify interventions that improve satisfaction
  • Multi-Modal Input: Integrate social media sentiment, wearable data
  • Personalized Recommendations: Tailored suggestions for improvement
  • Global Expansion: Include more countries and cultures
  • Transfer Learning: Apply to related tasks (happiness, quality of life)

Status: Published & Deployed
Journal: Heliyon (Elsevier, Q1, IF: 3.7)
GitHub: Life-Satisfaction-Machine-Learning
Demo: Interactive Gradio App Available
