Capturing Dialectal Variation in Code-Mixed Banglish through Multi-Task Transformers
Abstract
Code-mixed languages, especially Banglish (a mix of Bengali and English), pose significant challenges for natural language processing (NLP) due to syntactic irregularities, lexical borrowing, and regional variation. Existing methods typically address isolated tasks such as translation or region classification and rarely incorporate features of regional speech that affect model performance. In this work, we present a Region-Aware Multi-Task Transformer that jointly performs region classification and translation quality prediction on Banglish-English parallel data. The model combines separate Banglish and English BERT encoders, attention-based pooling, and cross-attention fusion to capture both intra-lingual and cross-lingual contextual dependencies. Our model achieves 83% accuracy and a macro F1 score of 0.84 on region classification, while the translation quality prediction task achieves a Pearson correlation of 0.78. Both results significantly outperform traditional machine learning baselines (TF–IDF + Logistic Regression) and neural sequence models (BiLSTM). These results show that region-aware multi-task learning improves representation learning and enhances generalization across the dialectal variation of Banglish. This research takes one step closer toward building contextualized, robust NLP systems for low-resource, code-mixed languages.
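The architecture described in the abstract (dual encoders, attention pooling, cross-attention fusion, and two task heads) can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the stand-in `nn.TransformerEncoder` stacks replace the paper's pretrained Banglish and English BERT encoders, and all dimensions, layer counts, and the number of regions are hypothetical.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Pool token states into one vector with learned attention weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); mask: (B, T), 1 for real tokens, 0 for padding
        w = self.score(x).squeeze(-1).masked_fill(mask == 0, -1e9)
        w = torch.softmax(w, dim=-1)
        return torch.einsum("bt,btd->bd", w, x)


class RegionAwareMultiTask(nn.Module):
    """Sketch of a region-aware multi-task model: separate encoders for
    Banglish and English, cross-attention fusion, and two task heads
    (region classification + translation-quality regression).
    All sizes below are illustrative assumptions."""

    def __init__(self, vocab: int = 30000, dim: int = 128, n_regions: int = 4):
        super().__init__()
        self.emb_bn = nn.Embedding(vocab, dim)
        self.emb_en = nn.Embedding(vocab, dim)
        # Stand-ins for the pretrained Banglish / English BERT encoders
        self.enc_bn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.enc_en = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pool = AttentionPooling(dim)
        self.region_head = nn.Linear(2 * dim, n_regions)  # classification logits
        self.quality_head = nn.Linear(2 * dim, 1)         # quality score (regression)

    def forward(self, bn_ids, bn_mask, en_ids, en_mask):
        h_bn = self.enc_bn(self.emb_bn(bn_ids),
                           src_key_padding_mask=(bn_mask == 0))
        h_en = self.enc_en(self.emb_en(en_ids),
                           src_key_padding_mask=(en_mask == 0))
        # Cross-attention fusion: Banglish tokens attend over English tokens
        fused, _ = self.cross(h_bn, h_en, h_en,
                              key_padding_mask=(en_mask == 0))
        # Concatenate intra-lingual and cross-lingual pooled representations
        z = torch.cat([self.pool(h_bn, bn_mask), self.pool(fused, bn_mask)], dim=-1)
        return self.region_head(z), self.quality_head(z).squeeze(-1)


if __name__ == "__main__":
    model = RegionAwareMultiTask()
    bn = torch.randint(0, 30000, (2, 10))
    en = torch.randint(0, 30000, (2, 12))
    logits, quality = model(bn, torch.ones(2, 10), en, torch.ones(2, 12))
    print(logits.shape, quality.shape)  # per-region logits, per-pair quality score
```

In training, the two heads would be optimized jointly, e.g. with a weighted sum of cross-entropy (region) and mean-squared-error (quality) losses; the weighting scheme is not specified in the abstract.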
Keywords
Publication Details
- Type of Publication:
- Conference Name: 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering 2025 (IEEE WIECON-ECE 2025)
- Date of Conference: 21/12/2025
- Venue: Long Beach Hotel, Cox’s Bazar, Bangladesh
- Organizer: IEEE Bangladesh Section and IEEE WIE