MultiChar: A Resource-Efficient Character and Subword Model for Multilingual Web Automation

ACL ARR 2025 July Submission 704 Authors

28 Jul 2025 (modified: 31 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: We present MultiChar, a resource-efficient neural framework for multilingual web form filling, data extraction, navigation and question answering. Our approach combines masked character-level and subword-level processing with a modular architecture designed to support any language, although demonstrated on German, French, Arabic and English as a proof of concept due to resource constraints. The system features a character-level masked model for robust handling of morphologically rich languages, language-specific adapters for cross-lingual transfer, and a universal form analyzer for dynamic web form processing. We introduce a learned model selector framework that dynamically chooses between character and subword representations based on input characteristics. Our experiments show that MultiChar achieves promising results in web form filling (83-89% precision), data extraction (>90% precision) and website navigation (88-95% success rate), while maintaining efficiency with only 2.1M parameters. In particular, our language-specific adapters yield a 14.2% improvement over language-agnostic approaches. This work establishes a foundation for resource-efficient cross-lingual web automation, demonstrating scalability to diverse languages and domains without requiring massive computational resources.
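For readers unfamiliar with the selector idea mentioned in the abstract, the following minimal sketch illustrates how a learned model selector might route an input to a character-level or subword-level path based on surface features. The feature set, weights, and function names here are illustrative assumptions for exposition only, not the MultiChar implementation described in the paper.

```python
# Illustrative sketch only: a tiny "model selector" that routes an input string
# to a character-level or subword-level encoder based on surface features.
# Features, weights, and names are hypothetical, not the authors' method.
from dataclasses import dataclass
import math
import unicodedata


@dataclass
class SelectorWeights:
    # Hypothetical parameters of a logistic gate:
    # score maps to P(character path); > 0.5 routes to the character model.
    w_oov_rate: float = 2.0       # share of tokens missing from a subword-friendly vocabulary
    w_nonlatin: float = 1.5       # share of non-Latin letters (e.g. Arabic script)
    w_avg_token_len: float = 0.1  # average token length as a rough morphology proxy
    bias: float = -1.5


def input_features(text: str, vocab: set[str]) -> tuple[float, float, float]:
    """Compute the three toy features used by the gate."""
    tokens = text.split()
    if not tokens:
        return 0.0, 0.0, 0.0
    oov_rate = sum(t.lower() not in vocab for t in tokens) / len(tokens)
    letters = [c for c in text if c.isalpha()]
    nonlatin = (
        sum("LATIN" not in unicodedata.name(c, "") for c in letters) / len(letters)
        if letters else 0.0
    )
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    return oov_rate, nonlatin, avg_len


def select_path(text: str, vocab: set[str], w: SelectorWeights = SelectorWeights()) -> str:
    """Return "character" or "subword" for a given input string."""
    oov_rate, nonlatin, avg_len = input_features(text, vocab)
    score = (w.bias + w.w_oov_rate * oov_rate
             + w.w_nonlatin * nonlatin + w.w_avg_token_len * avg_len)
    prob_char = 1.0 / (1.0 + math.exp(-score))
    return "character" if prob_char > 0.5 else "subword"


if __name__ == "__main__":
    toy_vocab = {"please", "fill", "the", "form", "with", "your", "name"}
    print(select_path("please fill the form with your name", toy_vocab))  # likely "subword"
    print(select_path("يرجى ملء النموذج باسمك الكامل", toy_vocab))         # likely "character"
```

In the paper's setting the gate would be trained rather than hand-set, but the routing interface (features in, discrete path out) stays the same.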
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual representations, cross-lingual transfer, data augmentation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English, German, French, Arabic
Previous URL: https://openreview.net/forum?id=UhKhvxe247
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 9 (Potential Risks and Ethical Considerations) discusses potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1 cites OPUS-100 translation pairs and FQuAD (French QA) dataset. Section 8 (References) provides citations for these datasets.
B2 Discuss The License For Artifacts: N/A
B2 Elaboration: Not applicable as we used only publicly available datasets under their standard terms.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: While we used existing datasets (OPUS-100, FQuAD) for their intended research purposes, we did not explicitly discuss this consistency in the paper.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We used synthetic data generated specifically for this work, eliminating PII concerns.
B5 Documentation Of Artifacts: No
B5 Elaboration: Our synthetic forms cover standard web form types; detailed demographic analysis was not applicable to our approach.
B6 Statistics For Data: Yes
B6 Elaboration: Section 4.1, Section 4.2, and Appendix C.4 with Table 9. Details include 50,000 examples per language for character/subword models, train/test/dev splits, and comprehensive dataset statistics across all four languages.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.1.1 (2.1M parameters), Section 4.2, and Appendix C.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1.1, Section 4.2, and Appendix A. Detailed hyperparameters including learning rates (3e-5 for models, 5e-6 for adapters), batch sizes (64), epochs (5), attention heads (16), hidden sizes, and complete architectural specifications.
C3 Descriptive Statistics: Yes
C3 Elaboration: Table 2 reports standard deviations for masked character prediction. Section 5.1 discusses variability across character positions and morphological contexts. Multiple tables report averages and individual language performance across tasks.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3.1.2 (SentencePiece tokenization), Section 3.4 (BeautifulSoup), Section 3.6 (Playwright browser automation), and Appendix E.1 (SentencePiece vocabulary with 32,000 tokens). Implementation details provided for all major packages used.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 704