Abstract: In domains like economics, health, and journalism, text embedded with numerical data is common, yet readers often struggle to derive insights from it. Converting such texts into charts enhances comprehension but is typically labor-intensive and domain-dependent. Despite progress in English, there is no prior dataset for Bengali. To fill this gap, we introduce BETAR, a BEngali Text-to-chARt dataset comprising 3,519 annotated texts. We also propose BN-GraBERTNet, a grapheme-aware hybrid model that combines BanglaBERT, BiLSTM, and a fully connected layer to identify x-axis and y-axis entities in text. To handle complex numerical reasoning, we selectively employ open-source large language models (LLMs) to simplify sentences when necessary. These simplified sentences are then processed by our sequence tagging model. The primary goal of this work is to develop a lightweight, user-friendly, and cost-free Bengali text-to-chart system that performs competitively. Although we also evaluated open-source, purely LLM-based approaches, our proposed architecture outperformed them, achieving a weighted average F1 score of 0.93 on the test set.
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: parameter-efficient-training, named entity recognition and relation extraction, zero/few-shot extraction
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: Bengali
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: No
B1 Elaboration: The creators of the artifacts are the authors themselves.
B2 Discuss The License For Artifacts: N/A
B2 Elaboration: The artifacts, such as the dataset and the architecture, were created by us. We have acknowledged the pretrained models.
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3
B6 Statistics For Data: Yes
B6 Elaboration: Appendix B
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix A
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Yes, only hyperparameter values are stated. Section 4
C3 Descriptive Statistics: Yes
C3 Elaboration: We have provided a confusion matrix to identify misclassifications across classes. Section 5
C4 Parameters For Packages: Yes
C4 Elaboration: We have used the ROUGE score. Section 5
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Section 3.2
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Section 7
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 1197