Discovering Research Trends in AI Subdomains Using TF-IDF and SciBERT-Based Clustering.

Students & Supervisors

Student Authors
Raisa Jarin
Bachelor of Science in Computer Science & Engineering, FST
Md. Nooruzzaman
Bachelor of Science in Computer Science & Engineering, FST
Sheikh Sajjad Hossain
Bachelor of Science in Computer Science & Engineering, FST
Supervisors
Tohedul Islam
Assistant Professor, Faculty, FST

Abstract

The rapid growth in the number of research papers has made it increasingly challenging for researchers to keep track of new themes, and older methods such as keyword indexing or manual categorization no longer work well. Text clustering provides an unsupervised way to organize document sets and uncover hidden thematic structures. However, most past studies have focused on either sparse features such as TF–IDF or dense embeddings such as BERT in isolation, resulting in limited comparative evaluation across methods and clustering algorithms. To fill this gap, we collected and preprocessed 1,800 research abstracts from agriculture, education, and health, and represented them with both TF–IDF and SciBERT embeddings, applying dimensionality reduction for visualization. Clustering was applied to each representation using K-means, agglomerative hierarchical clustering, and DBSCAN, and evaluated with silhouette scores, elbow diagnostics, and 2-D plots. The results show that the representation matters more than the algorithm: in the SciBERT space, K-means achieved a silhouette score of 0.385 and hierarchical clustering 0.311, indicating clear separation, while TF–IDF gave silhouette values of zero across methods. DBSCAN failed to produce meaningful clusters under default settings, showing its sensitivity to parameters. These findings demonstrate that contextual embeddings significantly outperform sparse term-based features for abstract-level clustering and that simple centroid- and variance-based algorithms remain effective when applied to these representations. The study further contributes a reproducible framework for comparing classical and modern methods, providing practical insights into large-scale literature organization. On this corpus, the SciBERT with K-means configuration provided the best performance among all representation and algorithm combinations evaluated.
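
The silhouette score that the abstract uses to compare representations can be illustrated with a small, dependency-free sketch. This is not the authors' code: it follows the standard definition s(i) = (b - a) / max(a, b) with Euclidean distance, and the function names (`euclidean`, `silhouette_score`) and the toy 2-D points are assumptions made for illustration only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(toy_points, [0, 0, 1, 1])
mixed = silhouette_score(toy_points, [0, 1, 0, 1])
print(f"well separated: {well_separated:.3f}, mixed: {mixed:.3f}")
```

Scores near 1 indicate compact, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster. In the toy data, the labeling that matches the two groups scores high, while the labeling that splits each group scores below zero.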

Keywords

Text Clustering, Research Abstracts, Unsupervised Learning, K-Means, Hierarchical Clustering, DBSCAN, TF–IDF, SciBERT

Publication Details

  • Type of Publication:
  • Conference Name: 28th ICCIT 2025
  • Date of Conference: 19/12/2025 - 19/12/2025
  • Venue: Long Beach Hotel, Cox's Bazar, Bangladesh
  • Organizer: IEEE Bangladesh Section