Structured Pre-training for Edge-Deployable Language Models: A Data-Centric Approach to Resource-Constrained AI
Abstract: Deploying language models on edge devices faces significant constraints from limited computational resources and memory budgets. While Large Language Models demonstrate conversational capabilities, their resource requirements often preclude on-device inference,
and existing small models often struggle to achieve reliable interactive behavior. This work investigates whether structured pre-training data formats can improve learning efficiency in resource-constrained settings. We conduct a systematic study of pre-training
a 0.12B-parameter model exclusively on structured Question-Answer pairs, using only a single consumer-grade GPU. Across three token budgets (100M, 500M, 1B) and multiple baseline formats, structured Q&A pre-training yields lower perplexity (68.3% reduction at
100M tokens), reduced gradient variance (47.8% reduction), and improved performance on Q&A tasks compared to unstructured text pre-training and masked-loss supervised fine-tuning formats. Ablation studies show that full-sequence Q&A learning achieves substantially better cross-domain generalization (average perplexity 6.83 vs. 21–246 for alternative formats), and these advantages persist over extended multi-epoch training. Despite having only 0.12B parameters, the resulting model achieves 82–99% of 1B-parameter baseline scores on Q&A-style semantic metrics when evaluated on conversational benchmarks (OpenAssistant-OASST1, Natural Questions, TruthfulQA, MS MARCO), while requiring approximately 25% of their memory footprint. Under identical decoding settings on consumer-grade hardware (RTX 2000 Ada), our model demonstrates approximately 85× higher throughput than Llama-3.2-1B on structured Q&A tasks (3,869 vs. 43 tokens/sec). For completeness, we note that throughput can increase further on datacenter-grade hardware with optimized driver modes,
but all reported results use consumer hardware to reflect realistic deployment conditions. These findings indicate that data structure may play a meaningful role in enabling practical conversational competence in resource-constrained environments, and highlight structured
Q&A pre-training as a promising direction for edge-focused language models.
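For clarity, the central formatting distinction drawn in the abstract (full-sequence Q&A learning vs. masked-loss SFT-style formats) can be illustrated with a minimal, hypothetical PyTorch sketch; this is not the authors' training code, and all names are placeholders:

    # Illustrative only: contrast a full-sequence Q&A loss with an SFT-style
    # prompt-masked loss. -100 is PyTorch's ignore_index convention.
    import torch
    import torch.nn.functional as F

    def qa_loss(logits, input_ids, prompt_len, mask_prompt):
        # logits: [batch, seq, vocab]; input_ids: [batch, seq]
        labels = input_ids.clone()
        if mask_prompt:                      # SFT-style: ignore question tokens
            labels[:, :prompt_len] = -100
        shift_logits = logits[:, :-1, :]     # predict token t+1 from position t
        shift_labels = labels[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )

In the full-sequence setting (mask_prompt=False) every token of the Question-Answer pair contributes to the loss; in the masked variant only answer tokens do.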
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the Reviewers for their detailed guidance.
This second revision substantially expands and clarifies the manuscript, resolving the remaining concerns regarding evaluation rigor, deployment claims, hardware transparency, and reproducibility.
1. New Section on Edge Deployment Feasibility (Section 4.2.2)
We added a new subsection addressing deployment evidence and fairness of comparison:
• Introduces a task-dependent throughput analysis comparing structured Q&A with unstructured text inputs.
• Shows a 57× specialization effect unique to the Structured Q&A model (not observed in Pythia baselines).
• Provides a controlled comparison with Llama-3.2-1B under identical decoding and hardware settings.
• Demonstrates that our 0.12B model achieves 82–99% of Llama-3.2-1B’s semantic scores while offering 85× higher throughput and 10× lower VRAM usage.
• Clarifies that all measurements reflect consumer hardware (RTX 2000 Ada, Windows/WDDM), not datacenter configurations.
This directly addresses concerns regarding deployment claims, fairness of comparison, and potential overstatement.
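To make the throughput comparison concrete, a minimal measurement sketch is given below; it assumes a standard Hugging Face generation loop with greedy decoding, and the generation length is an illustrative placeholder rather than the released evaluation protocol:

    # Hedged sketch of a tokens/sec measurement under fixed decoding settings.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def tokens_per_sec(model_id, prompt, max_new_tokens=128, device="cuda"):
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()
        inputs = tok(prompt, return_tensors="pt").to(device)
        with torch.no_grad():                                  # warm-up pass
            model.generate(**inputs, max_new_tokens=8, do_sample=False)
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)              # greedy decoding
        torch.cuda.synchronize()
        new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
        return new_tokens / (time.perf_counter() - start)

Running the same function on both models, on the same GPU and with identical decoding settings, is the pattern behind the tokens/sec figures reported above.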
2. Hardware Microbenchmarking and Driver-Level Analysis (Appendix A.5)
To ensure complete transparency of efficiency results, we added a dedicated appendix with raw logs and reproducibility scripts. Appendix A.5 now includes:
• Kernel-launch latency benchmarks (10,000 no-op kernels), showing that WDDM incurs 2.7–3.8× higher launch and synchronization overhead than TCC.
• Memory-bandwidth measurements and CUDA synchronization benchmarks quantifying platform-level constraints.
• Profiling of 400-kernel MLP blocks, illustrating per-token computational structure in small transformers.
• Cross-platform verification: RTX 3090 (Linux/TCC) vs. RTX 2000 Ada (Windows/WDDM), explaining the ~50× absolute throughput gap via driver overhead, scheduler interference, and bandwidth differences.
• Explanation of why WDDM disproportionately affects small-model inference (kernel fragmentation, increased launch latency, reduced effective bandwidth).
These analyses show that the observed differences originate from hardware/driver characteristics, not model-specific artifacts.
All raw logs are included in WDDM_benchmarks_supplementary.txt, and the scripts used for replication are provided as hardware_performance.py and hardware_verification.py.
We additionally verified that when both models are run on the same RTX 3090/TCC setup, the relative speed advantage of our model remains consistent, confirming that the ~50× absolute gap is environmental rather than methodological.
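As a concrete illustration of the launch-latency probe described above, the following sketch shows one way such a measurement can be taken in PyTorch; the released hardware_performance.py and hardware_verification.py scripts remain authoritative, and this fragment only approximates a "no-op" kernel with a trivial elementwise operation:

    # Hedged sketch: per-launch overhead for many tiny kernels, with and
    # without per-launch synchronization (the quantity that differs most
    # between WDDM and TCC driver modes).
    import time
    import torch

    def launch_latency_us(n_launches=10_000, sync_each=False, device="cuda"):
        x = torch.ones(1, device=device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_launches):
            x = x + 1.0                      # near-no-op elementwise kernel
            if sync_each:
                torch.cuda.synchronize()     # fold sync cost into each launch
        torch.cuda.synchronize()
        return 1e6 * (time.perf_counter() - start) / n_launches  # µs per launch

    # Example: compare queued-launch cost vs. launch+sync cost on one platform.
    # print(launch_latency_us(sync_each=False), launch_latency_us(sync_each=True))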
3. Evaluation Dataset Transparency (Appendix A.1.2)
To ensure evaluation clarity and avoid dataset-alignment concerns:
• We documented all datasets used in Figures 2–3 and Tables 2–3: OASST1, MS MARCO, TruthfulQA, Natural Questions, and SmolLM.
• Specified exactly which dataset corresponds to each experiment.
• Emphasized that all datasets are public and no custom data were used.
This supports full reproducibility and evaluation neutrality.
4. Integration of Edge-Oriented Related Work (Appendix A.7)
We added a comparison to TinyLlama, Phi-1.5, MobileLLM, and Llama-3.2-1B, clarifying:
• Our contribution is data-centric, focused on structured pre-training.
• It complements architecture-centric and post-training methods.
This resolves concerns about over-claiming and strengthens the contextualization of our contributions.
5. Refinement of Writing Tone
Following reviewer feedback:
• Reduced rhetorical phrasing.
• Replaced strong claims with measured academic language.
• Clarified that efficiency advantages depend on task structure and hardware settings.
6. Clarification of Practical Viability (no fixed threshold)
We removed the earlier “Semantic Similarity ≥ 0.17” threshold.
Viability is now framed as empirical conversational competence demonstrated on public benchmarks rather than a fixed cutoff.
7. Reproducibility Enhancements
We added:
• Exact PyTorch inference script used in all evaluations.
• Explicit decoding settings and measurement protocols.
• HF IDs for all datasets.
• Documentation for the expected output format of the provided .py files.
These allow full replication of all training-dynamics figures, semantic-metric results, Pythia baselines, and hardware microbenchmarks.
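As an orientation aid (the exact released inference script remains authoritative), the sketch below shows the general shape of such an evaluation: load a public benchmark by its Hugging Face ID and generate with fixed greedy decoding. The dataset and model IDs here are assumptions based on the canonical public names; the appendix lists the IDs actually used:

    # Hedged reproducibility sketch, not the released inference script.
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dataset = load_dataset("truthful_qa", "generation", split="validation")
    model_id = "EleutherAI/pythia-160m"        # placeholder baseline model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda").eval()

    question = dataset[0]["question"]
    inputs = tok(f"Question: {question}\nAnswer:", return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True)
    print(answer)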
8. Structural Revision for Clarity (Section 5.1)
As requested, we significantly shortened Section 5.1 by moving low-level hardware details to Appendix A.5.
The main text now presents only the conceptual explanation, improving clarity without removing technical depth.
Assigned Action Editor: ~Yoshitomo_Matsubara1
Submission Number: 6013