Abstract: In causal effect estimation, determining the appropriate sample size is critical for ensuring reliability and validity in both experimental and observational studies, a challenge closely tied to robust model generalization under limited data in machine learning. This paper addresses these challenges by leveraging Probably Approximately Correct (PAC) theory to establish a theoretically grounded framework for determining sampling bounds. We use Hoeffding's inequality and the Vapnik–Chervonenkis (VC) dimension to derive upper bounds on dataset adequacy in three scenarios: no confounders, confounders with a finite hypothesis space, and confounders with an infinite hypothesis space. Our work guarantees that if the dataset size exceeds the upper bound, the error probability of the estimated causal effect stays within a specified threshold at the given confidence level. Additionally, we show that when the dataset size is inadequate, the error of the estimated average treatment effect is bounded by the estimation error of the outcome variable, which provides the theoretical basis for data augmentation strategies that improve the accuracy of causal effect estimation. Extensive experiments on synthetic and semi-synthetic datasets validate the correctness of the proposed sampling upper bounds under different error and confidence-level constraints. Our findings not only offer a systematic and reliable method for determining sample size in causal effect estimation but also provide actionable guidance for developing causal inference models in data-scarce environments, enhancing their applicability and robustness across fields such as healthcare, social sciences, and policy evaluation.
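The abstract's PAC-style sample-size bounds can be illustrated with standard concentration results. The sketch below is not the paper's derivation; it shows the textbook Hoeffding bound for outcomes bounded in [0, 1] and its union-bound extension to a finite hypothesis space, which correspond in spirit to the paper's "no confounders" and "finite hypothesis space" scenarios. Function names and the [0, 1] boundedness assumption are illustrative choices, not taken from the paper.

```python
import math


def hoeffding_sample_bound(epsilon: float, delta: float) -> int:
    """Smallest n such that, for i.i.d. outcomes bounded in [0, 1],
    P(|empirical mean - true mean| > epsilon) <= delta.

    From Hoeffding's inequality, 2 * exp(-2 * n * epsilon^2) <= delta
    gives n >= ln(2 / delta) / (2 * epsilon^2).
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def finite_hypothesis_bound(epsilon: float, delta: float, h_size: int) -> int:
    """Union-bound version over a finite hypothesis space of size |H| = h_size:
    n >= ln(2 * |H| / delta) / (2 * epsilon^2) guarantees every hypothesis's
    empirical risk is within epsilon of its true risk with probability 1 - delta.
    """
    return math.ceil(math.log(2.0 * h_size / delta) / (2.0 * epsilon ** 2))


# Example: error tolerance 0.05 at 95% confidence.
n_single = hoeffding_sample_bound(0.05, 0.05)
n_finite = finite_hypothesis_bound(0.05, 0.05, h_size=1000)
print(n_single, n_finite)
```

Tightening epsilon dominates the required sample size (quadratic dependence), while the hypothesis-space size and confidence level enter only logarithmically; this is the usual reason PAC-style bounds remain usable even for large finite hypothesis classes.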
Supplementary Material: pdf
Submission Number: 126