Keywords: Code representations, Brain representations, Cognitive neuroscience, Neuroimaging, Multivoxel pattern analysis, Representation decoding analysis, Representation similarity analysis, fMRI analysis, ML for PL/SE, ML4code
TL;DR: An analysis to determine whether ML models trained on code corpora learn the same information that our brains learn when we comprehend code.
Abstract: What aspects of computer programs are represented by the human brain during comprehension? We leverage brain recordings derived from functional magnetic resonance imaging (fMRI) studies of programmers comprehending Python code to evaluate the properties and code-related information encoded in the neural signal. We first evaluate a selection of static and dynamic code properties, such as abstract syntax tree (AST)-related and runtime-related metrics. Then, to learn whether brain representations encode fine-grained information about computer programs, we train a probe to align brain recordings with representations learned by a suite of ML models. We find that both the Multiple Demand and Language systems, two brain systems responsible for very different cognitive tasks, encode specific code properties and uniquely align with machine-learned representations of code. These findings suggest at least two distinct neural mechanisms mediating computer program comprehension and evaluation, prompting the design of code model objectives that go beyond static language modeling. We make all the corresponding code, data, and analysis publicly available at https://github.com/ALFA-group/code-representations-ml-brain
Supplementary Material: pdf
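
To make the "static code properties" mentioned in the abstract concrete, here is a minimal sketch of how AST-related metrics could be computed for a Python snippet using only the standard library. The function name `ast_metrics`, the particular metrics (node count, maximum tree depth, loop and branch counts), and the example snippet are illustrative assumptions; they are not taken from the paper or its repository and may differ from the properties the authors actually evaluate.

```python
import ast


def ast_metrics(source: str) -> dict:
    """Compute a few simple static properties of a Python program.

    Hypothetical helper for illustration; the metric set is an assumption.
    """
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at `node`, counting the node itself.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    nodes = list(ast.walk(tree))
    return {
        "node_count": len(nodes),
        "max_depth": depth(tree),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "branches": sum(isinstance(n, ast.If) for n in nodes),
    }


# Illustrative stimulus-like snippet (not from the study's materials).
snippet = (
    "total = 0\n"
    "for i in range(3):\n"
    "    if i % 2 == 0:\n"
    "        total += i\n"
)
metrics = ast_metrics(snippet)
print(metrics)
```

In a decoding analysis of the kind the abstract describes, values like these would serve as targets that a probe tries to predict from brain recordings; that step is not shown here.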