Paper Title: Gibbs Sampling with Simulated Annealing K-Means for Mixture Regression
Anonymity Note: This repository and its contents have been anonymized to comply with the ICML double-blind review process. All identifying information, author details, and affiliations will be added to the public version of this repository upon the paper’s acceptance.
This repository contains the R code to reproduce all simulation studies presented in our paper, “Gibbs Sampling with Simulated Annealing K-Means for Mixture Regression”. The code (1) generates simulated datasets from the parameters described in the “Simulation setup” section, (2) runs our proposed Gibbs sampling with simulated annealing K-means clustering algorithm (Algorithm 1) to obtain the results, and (3) plots these results to produce the figures presented in the paper.
This section details the necessary hardware and software to run our experiments.

### 2.1. Hardware

* CPU: Any modern multi-core CPU is sufficient.
* GPU: Not required. All experiments are run on CPU.
* RAM: 32GB is sufficient.

### 2.2. Software Environment
The code was developed and tested in RStudio 2025.05.0+496 with R version 4.4.3. All required R packages and their specific versions are managed by the renv package and are listed in the renv.lock file. Here is the detailed session information:
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8 LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8 LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.4.3 fastmap_1.2.0 cli_3.6.3 htmltools_0.5.8.1
## [5] tools_4.4.3 rstudioapi_0.17.1 yaml_2.3.10 rmarkdown_2.29
## [9] knitr_1.50 xfun_0.52 digest_0.6.37 rlang_1.1.4
## [13] renv_1.1.5 evaluate_1.0.5
We strongly recommend using the renv package for
dependency management to ensure a fully reproducible environment.
Restoring the library from the renv.lock file reports:
## - The library is already synchronized with the lockfile.
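As a minimal sketch of restoring the environment, assuming renv is already installed and the R session's working directory is the project root, the following checks both conditions before calling `renv::restore()`. The variable name `can_restore` is introduced here for illustration only.

```r
# Sketch only: assumes renv is installed and the working directory is the project root.
can_restore <- requireNamespace("renv", quietly = TRUE) && file.exists("renv.lock")

if (can_restore) {
  # Install the package versions recorded in renv.lock without an interactive prompt.
  renv::restore(prompt = FALSE)
} else {
  message("renv or renv.lock not found; run install.packages('renv') and ",
          "set the working directory to the project root.")
}
```

If the library already matches the lockfile, `renv::restore()` makes no changes.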
The code repository is organized as follows:
├── plotting_accuracy.R # Script to generate the figure of classification accuracy on the training set
├── plotting_matrix.R # Script to generate the figure of estimation error
├── plotting_metric.R # Script to generate the figure of WCSS on the training set
├── plotting_test_accuracy.R # Script to generate the figure of classification accuracy on the testing set
├── plotting_test_metric.R # Script to generate the figure of WCSS on the testing set
├── README.md # The generated Markdown README
├── README.Rmd # The R Markdown source for this README
├── README.html
├── renv/ # renv project folder
├── renv.lock # R environment lockfile for reproducibility
├── results_with_D=20,p=35,q=2,K=4.csv # All 16 CSV files are simulation results produced by simulate_study_program.R
├── results_with_D=20,p=35,q=3,K=4.csv
├── results_with_D=20,p=50,q=2,K=3.csv
├── results_with_D=20,p=50,q=2,K=4.csv
├── results_with_D=20,p=50,q=3,K=3.csv
├── results_with_D=20,p=50,q=3,K=4.csv
├── results_with_D=20,p=70,q=2,K=3.csv
├── results_with_D=20,p=70,q=3,K=3.csv
├── results_with_D=40,p=35,q=2,K=4.csv
├── results_with_D=40,p=35,q=3,K=4.csv
├── results_with_D=40,p=50,q=2,K=3.csv
├── results_with_D=40,p=50,q=2,K=4.csv
├── results_with_D=40,p=50,q=3,K=3.csv
├── results_with_D=40,p=50,q=3,K=4.csv
├── results_with_D=40,p=70,q=2,K=3.csv
├── results_with_D=40,p=70,q=3,K=3.csv
├── simulate_study_program.R # Main R program implementing the Gibbs sampling with simulated annealing K-means clustering algorithm (Algorithm 1)
└── The_code_of_simulation_studies.Rproj
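To check that all 16 result files are in place before plotting, the parameter groups can be enumerated from the file names in the listing above. This is a sketch: the grid is read off that listing rather than from the simulation script, and the names `pk_pairs`, `grid`, `result_files`, and `missing_files` are introduced here for illustration.

```r
# The (p, K) pairs and the D and q values are inferred from the CSV names listed above.
pk_pairs <- data.frame(p = c(35, 50, 50, 70), K = c(4, 3, 4, 3))
dq_grid  <- expand.grid(D = c(20, 40), q = c(2, 3))

# merge() on data frames with no shared columns returns their Cartesian product.
grid <- merge(dq_grid, pk_pairs)

result_files <- with(grid, sprintf("results_with_D=%d,p=%d,q=%d,K=%d.csv",
                                   as.integer(D), as.integer(p),
                                   as.integer(q), as.integer(K)))

# Report any result file that is absent from the current working directory.
missing_files <- result_files[!file.exists(result_files)]
if (length(missing_files) > 0) {
  message("Missing result files:\n", paste(missing_files, collapse = "\n"))
}
```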
We provide two pathways for reproducing our results. We highly recommend reviewers start with the “Fast Verification” path.
The Fast Verification path (Path A) regenerates the paper’s main figures using the pre-computed data from the .csv files in the root directory. This process is fast and should only take a few seconds.
To regenerate Figure 1 (estimation error) from the main body of the paper, run the plotting_matrix.R script from the project’s root directory.
This will generate 16 different .png files in the matrix/ directory.
The four other figures in the appendix can be generated by running their corresponding plotting_*.R scripts in a similar fashion.
Note that as these five plotting scripts are executed, they will also calculate and save the summary data for the five tables presented in the appendix into a single .csv file in the corresponding directory.
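The plotting steps above can be sketched as a single loop run from an R session, assuming the working directory is the project root. The script names come from the repository listing. Sourcing each script into a fresh environment is a choice made here to keep the scripts' variables separate, and it assumes the scripts do not depend on objects in the global environment.

```r
# The five plotting scripts named in the repository listing.
plot_scripts <- c("plotting_matrix.R",
                  "plotting_accuracy.R",
                  "plotting_metric.R",
                  "plotting_test_accuracy.R",
                  "plotting_test_metric.R")

for (script in plot_scripts) {
  if (file.exists(script)) {
    message("Running ", script)
    # Evaluate each script in its own environment so variables do not leak between runs.
    source(script, local = new.env())
  } else {
    message("Script not found (is the working directory the project root?): ", script)
  }
}
```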
The full reproduction path re-runs the entire experimental pipeline, including data generation and model fitting.
WARNING: This process is computationally expensive.
Configure Parameters: To run the simulation for a specific parameter group, you must first manually edit the parameters at the top of the simulate_study_program.R script. Open the file and modify the values for D, p, q, and k (on lines 19-22).
Run the Main Simulation: After saving your changes to the script, run it from the project’s root directory.
Estimated runtime: Approximately 4 hours per parameter group on an Intel Core i9-14900K CPU.
This process will generate or overwrite the .csv file in the main directory corresponding to the parameter group you selected in the script. After it completes, you can follow the steps in Path A to generate the figures.
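Because a full run overwrites the shipped results for the selected parameter group, it can be worth backing up the existing file first. The following is a sketch under stated assumptions: `result_file()` is a hypothetical helper, not part of the repository, which rebuilds the output name from the naming pattern of the shipped CSV files. The parameter values passed to it must match those edited into simulate_study_program.R, and the script is assumed to run correctly when sourced into a fresh environment.

```r
# Hypothetical helper: expected output name for one parameter group,
# following the naming pattern of the shipped CSV files.
result_file <- function(D, p, q, K) {
  sprintf("results_with_D=%d,p=%d,q=%d,K=%d.csv",
          as.integer(D), as.integer(p), as.integer(q), as.integer(K))
}

# Example values; they must match the values set in simulate_study_program.R.
target     <- result_file(D = 20, p = 35, q = 2, K = 4)
sim_script <- "simulate_study_program.R"

if (file.exists(sim_script)) {
  # Keep a copy of the pre-computed results, since the run overwrites them.
  if (file.exists(target)) {
    file.copy(target, paste0(target, ".bak"), overwrite = TRUE)
  }
  timing <- system.time(source(sim_script, local = new.env()))
  message(sprintf("Finished in %.2f hours; output present: %s",
                  timing[["elapsed"]] / 3600, file.exists(target)))
} else {
  message("simulate_study_program.R not found; ",
          "set the working directory to the project root.")
}
```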
Upon acceptance, the code will be made available under an MIT License.