Keywords: facial reaction, question, answer, landmarks, feature selection, FaceMesh, dynamics, time-series classification, transformer, model pruning, detection, robustness, ROC-AUC, PR-AUC, SHAP, explainable AI
TL;DR: We used SHAP to identify and prune low-impact facial dynamics features in a transformer time-series model, improving interpretability and reducing complexity while maintaining reaction-detection performance.
Abstract: Facial micro-reactions provide a rich, non-invasive signal for detecting short-lived emotional responses during constrained "question-answer" episodes. However, modern landmark-based pipelines (e.g., FaceMesh) yield high-dimensional, strongly correlated inputs, which can lead to unnecessary computational cost and reduced interpretability when models are trained directly on raw FaceMesh points. In this work, we studied an interpretable time-series baseline that maps FaceMesh trajectories to a compact set of 17 semantically meaningful facial dynamics features (e.g., eye openness and blink-related dynamics, mouth-related dynamics, and symmetry and motion-derived descriptors). We trained a transformer-based classifier to predict a binary target ("reaction" vs. "no reaction"). To make the model interpretable and to identify which facial cues drive decisions, we adopted the SHAP method, grounded in cooperative game theory via Shapley values. We then focused on the additive feature attribution formulation, in which an explanation model is expressed as a linear function of binary feature-presence variables. In particular, we used the model-agnostic Kernel SHAP approach to estimate each feature's marginal contribution to the model output under different feature coalitions.
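For reference, this is the standard additive feature attribution form of SHAP (Lundberg & Lee, 2017); the notation below is the generic one and is not specific to our implementation:

    g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i,  with  z'_i \in \{0, 1\},

where M is the number of input features (17 in the baseline), z'_i indicates whether feature i is present in a coalition, and \phi_i is the Shapley value of feature i, i.e., its contribution averaged over all coalitions S of the feature set F:

    \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \, [ f(S \cup \{i\}) - f(S) ],

where f(S) denotes the expected model output given only the features in S. Kernel SHAP estimates these values by sampling coalitions rather than enumerating all 2^{|F|} of them.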
Beyond explanation, we proposed an iterative SHAP-driven feature selection loop (sketched below):
1. Train the baseline model.
2. Compute feature attributions and global importance (e.g., mean absolute SHAP values).
3. Reduce the feature space by removing low-contribution and unstable features (17 --> k).
4. Retrain the model and evaluate both predictive performance and stability.
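A minimal sketch of one iteration of this loop, assuming the trained model is exposed as a predict_fn over (n_samples, n_features) arrays; shap_prune, keep_k, and the data variables are illustrative names, not part of our released code:

import numpy as np
import shap  # model-agnostic Kernel SHAP (Lundberg & Lee, 2017)

def shap_prune(predict_fn, X_background, X_explain, feature_names, keep_k):
    # Step 2: estimate per-sample, per-feature attributions on held-out data.
    explainer = shap.KernelExplainer(predict_fn, X_background)
    shap_values = np.asarray(explainer.shap_values(X_explain))  # (n_samples, n_features)
    # Global importance: mean absolute SHAP value per feature.
    importance = np.abs(shap_values).mean(axis=0)
    # Step 3: keep the keep_k highest-impact features, drop the rest.
    order = np.argsort(importance)[::-1]
    kept = sorted(order[:keep_k])
    print("pruned features:", [feature_names[i] for i in order[keep_k:]])
    return kept  # column indices to retain for retraining (step 4)

Re-running this on the retrained model also lets one check whether the importance ranking of the surviving features is stable across iterations.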
Concretely, we adopted a staged pruning procedure: the first selection step targets a reduced set of ~10 features; SHAP attributions are then re-estimated on the pruned model and the set is further reduced to ~7 features; and the same cycle repeats to assess convergence of feature rankings and performance trade-offs as dimensionality decreases. The evaluation protocol reports Accuracy, Precision (PPV), Recall (TPR/Sensitivity), Specificity (TNR), Balanced Accuracy, Matthews Correlation Coefficient (MCC), ROC-AUC, PR-AUC, and Brier score, and additionally tracks training/inference time to quantify the computational gains from feature reduction.
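A sketch of this metric suite with scikit-learn, assuming y_true holds binary labels and y_prob the predicted probability of "reaction"; average precision is used here as the PR-AUC estimate, and the 0.5 decision threshold is illustrative:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             balanced_accuracy_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score,
                             brier_score_loss)

def evaluate(y_true, y_prob, threshold=0.5):
    # Threshold the probabilities to get hard predictions.
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_ppv": precision_score(y_true, y_pred),
        "recall_tpr": recall_score(y_true, y_pred),
        "specificity_tnr": recall_score(y_true, y_pred, pos_label=0),  # TNR = recall of negatives
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }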
The outcome is a more compact configuration that maintains (or improves) predictive performance while reducing input dimensionality and inference cost. For the 17-feature model, the evaluation metrics were as follows: Accuracy = 0.8409, Precision (PPV) = 0.6, Recall (TPR/Sensitivity) ≈ 0.6667, Specificity (TNR) ≈ 0.8857, Balanced Accuracy ≈ 0.7762, Matthews Correlation Coefficient (MCC) ≈ 0.5317, ROC-AUC ≈ 0.7714, PR-AUC ≈ 0.5947, and Brier score ≈ 0.1301.
The experiments were conducted on a train/test split of raw session videos (≈600 MB per recording), where each recording was annotated with ~30 labeled “question–answer” episodes used to form fixed-duration clips for training and evaluation. The baseline classifier was a multi-head Transformer Encoder with a projection layer, a learnable CLS token, sinusoidal positional encoding, and padding-based length alignment with an attention mask for variable-length sequences. In our PyTorch implementation, the encoder used d_model = 128, h = 4 attention heads, 2 encoder layers, a feed-forward hidden layer size of 256, dropout = 0.1, pre-layer normalization, and GELU activations; the classification head was a lightweight MLP (Linear --> GELU --> Dropout --> Linear) producing a single logit.
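A self-contained PyTorch sketch matching the stated hyperparameters; the class and argument names are ours for illustration, and details not given above (max_len, the head's hidden width) are assumptions:

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    # Fixed sinusoidal positional encoding (Vaswani et al., 2017).
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

class ReactionTransformer(nn.Module):
    # d_model=128, 4 heads, 2 layers, FFN 256, dropout 0.1, pre-LN, GELU.
    def __init__(self, n_features: int = 17, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 2,
                 d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)            # projection layer
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable CLS token
        self.pos = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)  # pre-LN
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(                            # Linear -> GELU -> Dropout -> Linear
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_model, 1))

    def forward(self, x, pad_mask):
        # x: (B, T, F) per-frame features; pad_mask: (B, T), True at padded frames.
        h = self.pos(self.proj(x))
        h = torch.cat([self.cls.expand(x.size(0), -1, -1), h], dim=1)
        # The prepended CLS position is never masked.
        mask = torch.cat([torch.zeros(x.size(0), 1, dtype=torch.bool,
                                      device=x.device), pad_mask], dim=1)
        h = self.encoder(h, src_key_padding_mask=mask)
        return self.head(h[:, 0]).squeeze(-1)                 # single logit from CLS

Padding-based length alignment then amounts to right-padding each clip's feature sequence to the batch maximum and setting pad_mask to True at the padded frames.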
Our implementation was based on Python 3.10.x with PyTorch for model training and inference, Hydra for experiment configuration, and DVC for DAG definition, artifact tracking, and reproducibility. Data preparation and I/O relied on OpenCV and FFmpeg for clip extraction and transformations; face detection used YOLOv8 (Ultralytics), followed by per-frame landmark extraction with MediaPipe FaceMesh; NumPy/SciPy and pandas supported numerical and tabular processing; scikit-learn was used for metric computation (including ROC/PR curves and calibration-related scores); and matplotlib/seaborn were used for visualizations. Artifacts and intermediate results were serialized via pickle, JSON, and NumPy npy/npz formats.
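A minimal sketch of the per-frame landmark step of this pipeline, assuming the face region has already been located by the detector; the clip path is a placeholder:

import cv2
import mediapipe as mp

# One FaceMesh instance reused across frames of a clip.
mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                       max_num_faces=1,
                                       refine_landmarks=True)
cap = cv2.VideoCapture("clip.mp4")  # placeholder clip path
trajectories = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_face_landmarks:
        lms = result.multi_face_landmarks[0].landmark
        trajectories.append([(p.x, p.y, p.z) for p in lms])  # normalized 3D points
cap.release()
mesh.close()

The resulting per-frame landmark trajectories are then reduced to the 17 facial dynamics features described above.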
Submission Number: 16