Structured Pre-training for Edge-Deployable Language Models: A Data-Centric Approach to Resource-Constrained AI
Abstract: Deploying language models on edge devices faces significant constraints from limited computational resources and memory budgets. While Large Language Models demonstrate conversational capabilities, their resource requirements often preclude on-device inference,
and existing small models often struggle to achieve reliable interactive behavior. This work investigates whether structured pre-training data formats can improve learning efficiency in resource-constrained settings. We conduct a systematic study of pre-training
a 0.12B-parameter model exclusively on structured Question-Answer pairs, using only a single consumer-grade GPU. Across three token budgets (100M, 500M, 1B) and multiple baseline formats, structured Q&A pre-training yields lower perplexity (68.3% reduction at
100M tokens), reduced gradient variance (47.8%), and improved performance on Q&A tasks compared to unstructured text pre-training and masked-loss supervised fine-tuning formats. Ablation studies show that full-sequence Q&A learning achieves substantially better cross-domain generalization (average perplexity 6.83 vs. 21–246 for alternative formats), and these advantages persist over extended multi-epoch training. Despite having only 0.12B parameters, the resulting model achieves 82–99% of 1B-parameter baseline scores on Q&A-style semantic metrics when evaluated on conversational benchmarks (OpenAssistant-OASST1, Natural Questions, TruthfulQA, MS MARCO), while requiring approximately 25% of their memory footprint. Under identical decoding settings on consumer-grade hardware (RTX 2000 Ada), our model demonstrates approximately 85× higher throughput than Llama-3.2-1B on structured Q&A tasks (3,869 vs. 43 tokens/sec). For completeness, we note that throughput can increase further on datacenter-grade hardware with optimized driver modes,
but all reported results use consumer hardware to reflect realistic deployment conditions. These findings indicate that data structure may play a meaningful role in enabling practical conversational competence in resource-constrained environments, and highlight structured
Q&A pre-training as a promising direction for edge-focused language models.
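To make the data-format distinction studied in the paper concrete, the minimal sketch below illustrates, under assumed conventions from common causal-LM training frameworks (the helper name and token IDs are hypothetical, and next-token position shifting is assumed to happen inside the model or training loop), how a full-sequence Q&A example differs from a masked-loss SFT example in which question tokens are excluded from the loss via the usual ignore-index convention. The only difference between the two formats is which positions contribute to the loss.

```python
# Minimal sketch (not the paper's released code) contrasting the two formats compared above:
# full-sequence Q&A learning vs. a masked-loss SFT-style format.

IGNORE_INDEX = -100  # convention used by common frameworks to exclude a position from the loss


def build_example(question_ids, answer_ids, mask_question_loss):
    """Concatenate a Q&A pair and build per-token labels.

    mask_question_loss=False -> full-sequence Q&A learning: loss on every token.
    mask_question_loss=True  -> masked-loss SFT format: loss only on answer tokens.
    """
    input_ids = question_ids + answer_ids
    if mask_question_loss:
        labels = [IGNORE_INDEX] * len(question_ids) + answer_ids
    else:
        labels = list(input_ids)
    return input_ids, labels


# Hypothetical token IDs standing in for a tokenized "Q: ... A: ..." pair.
q = [101, 7, 42, 9]     # question tokens (placeholder IDs)
a = [55, 13, 88, 102]   # answer tokens (placeholder IDs)

_, full_labels = build_example(q, a, mask_question_loss=False)
_, sft_labels = build_example(q, a, mask_question_loss=True)
print(full_labels)  # loss computed on all positions
print(sft_labels)   # question positions ignored: [-100, -100, -100, -100, 55, 13, 88, 102]
```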
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the reviewer for the insightful questions, which helped clarify the scope and positioning of our contribution.
Concern #1: Performance at Extended Data Scales (10-20× volume)
Original concern:
"Please discuss or test whether the performance gap persists if the unstructured baseline is trained on significantly more data (e.g., 10x the volume)."
Our response:
We have added a discussion paragraph in Section 5.4.1 (pages 24-25) explicitly addressing this question.
Key clarifications:
- We explicitly state that we do not claim asymptotic superiority of our approach
- We clarify that our results are confined to the studied regime (≤1B tokens, single consumer GPU), which reflects resource-constrained settings relevant to edge deployment
- We position systematic scaling studies beyond this regime as important future work
Text added (Section 5.4.1):
"An important question is whether unstructured baselines trained on orders-of-magnitude more data (e.g., 10-20B tokens) would eventually match or exceed structured Q&A performance. We do not claim asymptotic superiority of our approach. Our results are confined to the studied regime (≤1B tokens, single consumer GPU), which reflects resource-constrained settings relevant to edge deployment. Systematic scaling studies beyond this regime remain important future work."
Rationale:
This addresses the reviewer's concern by (1) acknowledging the question's importance, (2) explicitly disclaiming asymptotic
superiority claims, and (3) clearly scoping our contribution to resource-constrained settings where extreme-scale training is not
feasible.
-----
Concern #2: Knowledge Distillation Comparison
Original concern:
"Include a comparison with knowledge distillation. Since the objective is to obtain the strongest possible 0.12B model, the paper should
clearly justify why Structured Pre-training is preferable to distilling a larger model into a 0.12B architecture."
Our response:
We have made two additions to clarify our relationship to knowledge distillation:
Addition 1: Section 3.2 (Dataset Description, page 6)
Added clarification of data sources:
"Note that our Structured Q&A corpus includes both LLM-generated responses (Open-Orca, UltraChat) and human-curated Q&A pairs (Dolly-15k, Natural Questions, QASC)."
Addition 2: Section 5.4.1 (Limitations, page 25)
Added comprehensive discussion paragraph:
"Our work is complementary to knowledge distillation. While our Structured Q&A corpus includes teacher-generated responses (e.g.,
Open-Orca from GPT-4) and expert-curated answers (e.g., Natural Questions), we do not perform training-time distillation with a
running teacher model. Our focus is on how data structure affects learning efficiency in small models. Systematic comparison with
training-time knowledge distillation (Hinton et al., 2015) remains valuable future work and would provide insights into whether soft-label
matching offers additional benefits beyond structured hard-label training."
Key clarifications:
- Our approach incorporates teacher knowledge through data curation (LLM-generated + human expert responses) rather than training-time soft-label matching (see the sketch after this list)
- We position our work as complementary to training-time knowledge distillation, not as a replacement
- We focus on demonstrating that data structure is a critical variable in knowledge transfer efficiency
- We explicitly propose systematic comparison with training-time KD as valuable future work
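For clarity, the minimal sketch below (assuming PyTorch; the shapes, temperature, and random logits are illustrative placeholders, not values from our pipeline) contrasts the hard-label next-token objective used in our structured pre-training with the soft-label distillation objective of Hinton et al. (2015), which we leave to future work. It is meant only to show why the two are complementary rather than interchangeable: the KD term requires a running teacher model at training time, whereas our approach encodes teacher knowledge in the curated data itself.

```python
# Illustrative sketch: (a) hard-label cross-entropy on curated structured Q&A data
# vs. (b) training-time soft-label KD against a live teacher (Hinton et al., 2015).

import torch
import torch.nn.functional as F

vocab, T = 32000, 2.0                            # vocabulary size and KD temperature (placeholders)
student_logits = torch.randn(4, vocab)           # student predictions for 4 token positions
hard_labels = torch.randint(0, vocab, (4,))      # next-token IDs from the curated corpus

# (a) Structured pre-training: standard next-token cross-entropy on hard labels.
ce_loss = F.cross_entropy(student_logits, hard_labels)

# (b) Training-time KD: KL divergence between temperature-softened teacher and
#     student distributions; requires forward passes through a larger teacher.
teacher_logits = torch.randn(4, vocab)           # would come from the teacher model in practice
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

loss = ce_loss  # our setting uses (a) only; a KD study would blend ce_loss and kd_loss
```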
Rationale:
This addresses the reviewer's concern by (1) clarifying that our corpus already incorporates teacher knowledge from both LLM and human
sources, (2) distinguishing our approach from training-time distillation while positioning them as complementary, and (3) acknowledging
systematic KD comparison as important future work rather than claiming our method is universally superior.
Summary of Changes for Reviewer
Sections modified:
1. Section 3.2 (page 6): Added 1 sentence clarifying data composition
2. Section 5.4.1 (pages 24-25): Added 2 paragraphs addressing both concerns
Impact on paper:
These minimal additions directly address both concerns while maintaining our core contribution within its appropriate scope. The
changes clarify limitations and position future work without undermining our demonstrated findings within the resource-constrained
regime we investigate.
Assigned Action Editor: ~Yoshitomo_Matsubara1
Submission Number: 6013