BRDD: A Transformer-Based Approach for Region-Specific Dialect Detection in Banglish Using Pretrained Embeddings
Students & Supervisors
Student Authors
Supervisors
Abstract
Detecting regional dialects in Banglish (romanized Bangla) is challenging because the text is noisy, romanization is inconsistent, and dialects share overlapping linguistic features. In this paper, we introduce BRDD (Bangla Romanized Dialect Detector), a Transformer-based model that classifies Banglish text by regional dialect of Bangladesh, including Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet. BRDD builds on pretrained word embeddings to cope with spelling variation and inconsistent romanization, and uses a fine-tuned Transformer encoder to process the romanized text. To sharpen dialect discrimination, we propose a prototypical classifier that assigns sentence embeddings to region-specific prototypes, so that the model learns to distinguish subtle regional features. We further apply data augmentation that simulates romanization noise, which improves robustness to spelling variations and diverse writing styles. Experiments on a custom Banglish dialect dataset show that BRDD outperforms traditional methods in both accuracy and macro-F1 score. The model is robust to noisy inputs and interpretable, which makes it suitable for real-world applications such as social media monitoring, regional text classification, and other Bangla NLP tasks. By combining pretrained embeddings with a Transformer architecture, BRDD advances multilingual NLP for under-resourced languages.
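The prototype assignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each region prototype is the mean of that region's training sentence embeddings and that similarity is measured by cosine; the abstract specifies neither. The three-dimensional vectors are toy stand-ins for the encoder's sentence embeddings, and the function names `build_prototypes` and `classify` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototypes(labelled_embeddings):
    """Assumption: one prototype per region, the mean of its embeddings."""
    by_region = {}
    for region, vector in labelled_embeddings:
        by_region.setdefault(region, []).append(vector)
    return {region: mean_vector(vecs) for region, vecs in by_region.items()}


def classify(embedding, prototypes):
    """Assign the embedding to the region with the most similar prototype."""
    return max(prototypes, key=lambda region: cosine(embedding, prototypes[region]))


# Toy embeddings (hypothetical values, not taken from the paper's dataset).
train = [
    ("Sylhet", [0.9, 0.1, 0.0]),
    ("Sylhet", [0.8, 0.2, 0.1]),
    ("Chittagong", [0.1, 0.9, 0.2]),
    ("Chittagong", [0.0, 0.8, 0.3]),
    ("Barishal", [0.1, 0.2, 0.9]),
]
prototypes = build_prototypes(train)
prediction = classify([0.85, 0.15, 0.05], prototypes)
print(prediction)
```

In a full pipeline, the toy vectors would be replaced by sentence embeddings from the fine-tuned Transformer encoder, and the prototypes could be learned jointly with the encoder instead of computed as fixed means.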
Keywords
Publication Details
- Type of Publication:
- Conference Name: 7th International Conference on Integrated Sciences (ICIS) 2025
- Date of Conference: 25/10/2025
- Venue: Eastern University Campus, Ashulia Model Town, Dhaka, Bangladesh
- Organizer: Eastern University, Bangladesh