Abstract: Culture significantly influences communication between humans and intelligent systems, yet embedding culturally adaptive behaviors, such as co-speech gestures and speech, in artificial agents remains a challenging task. In this study, we present three key contributions based on an analysis of a multi-modal interaction dataset. First, we examined high-level textual and gestural features to identify cultural differences among three distinct cultural groups. Second, we classified culture from different features using Fully Connected Neural Network (FCNN) and Random Forest (RF) classifiers, employing both subject-dependent and subject-independent data splits. Third, we employed adversarial learning techniques to improve culture classification by developing speaker-invariant representations. Our findings indicate that both RF and FCNN models achieved high accuracy with subject-dependent data splits but faced significant challenges with subject-independent data splits, highlighting issues in generalization to unseen speakers. Enhancing FCNN models with adversarial learning showed partial improvement in generalization, yet further research is necessary to achieve robust cultural representation in intelligent systems.
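The distinction between subject-dependent and subject-independent evaluation is central to the generalization gap reported above. As a minimal sketch (the sample format, function names, and split fraction are illustrative assumptions, not the paper's actual protocol), the two splits can be contrasted as follows:

```python
# Each sample is (speaker_id, feature_vector, culture_label) -- a hypothetical format.

def subject_independent_split(samples, held_out_speakers):
    """Hold out ALL samples from the given speakers: the test set contains
    only speakers never seen during training."""
    train = [s for s in samples if s[0] not in held_out_speakers]
    test = [s for s in samples if s[0] in held_out_speakers]
    return train, test

def subject_dependent_split(samples, test_fraction=0.2):
    """Split WITHIN each speaker, so every speaker appears in both sets --
    the easier setting in which both classifiers scored highly."""
    by_speaker = {}
    for s in samples:
        by_speaker.setdefault(s[0], []).append(s)
    train, test = [], []
    for records in by_speaker.values():
        k = max(1, int(len(records) * test_fraction))
        test.extend(records[:k])
        train.extend(records[k:])
    return train, test
```

Under the subject-dependent split a model can exploit speaker-specific cues correlated with culture, which helps explain the high accuracy there and the drop on unseen speakers.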
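A common way to obtain speaker-invariant representations with adversarial learning is gradient reversal: a speaker-classification head is trained on the encoder's features, but its gradient is negated before reaching the encoder, so the encoder learns features that make speaker identification harder. The following one-weight numeric sketch illustrates only that sign flip; the scalar "encoder", values, and hyperparameters are assumptions for illustration, not the paper's architecture:

```python
LAMBDA = 1.0  # strength of the reversed gradient (hypothetical hyperparameter)

w = 0.5   # single "encoder" weight: feature = w * x
x = 2.0   # one input sample
target_speaker_score = 2.0  # what the speaker head tries to predict

# Forward pass: gradient reversal is the identity.
feature = w * x
speaker_loss = (feature - target_speaker_score) ** 2

# Backward pass: gradient of the speaker loss w.r.t. the feature...
d_loss_d_feature = 2.0 * (feature - target_speaker_score)
# ...is negated before it reaches the encoder, pushing the encoder to make
# the speaker head's job HARDER (i.e., toward speaker invariance).
d_loss_d_w_reversed = -LAMBDA * d_loss_d_feature * x

lr = 0.1
w_adversarial = w - lr * d_loss_d_w_reversed   # encoder update WITH reversal
w_cooperative = w - lr * d_loss_d_feature * x  # update WITHOUT reversal

# The two updates move w in opposite directions from its starting value.
```

In a full model the culture-classification loss is backpropagated normally alongside this reversed speaker loss, so the shared encoder keeps culture-relevant information while discarding speaker-specific cues.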