MARKELA MUÇA, BORA LAMAJ (MYRTO), FREDERIK DARA

Abstract

This research investigates the efficacy of the Naïve Bayes classification method combined with n-gram language models for text categorization in the Albanian language. Although Naïve Bayes is valued for its simplicity, computational efficiency, and competitive performance in natural language processing tasks, its application to low-resource and morphologically rich languages such as Albanian has not been thoroughly examined. This research addresses that gap by implementing and evaluating Naïve Bayes models with unigram, bigram, trigram, and hybrid n-gram tokenization approaches. The influence of different n-gram representations on classification performance is evaluated using standard measures: accuracy, precision, recall, and F1-score. This study advances machine learning approaches for under-resourced languages and offers empirical evidence to support the development of computational linguistics resources for Albanian. The experimental results reveal that hybrid models combining unigrams and bigrams outperform single-order n-gram configurations, achieving the highest accuracy and F1-score. Conversely, trigram-based models exhibit performance degradation due to data sparsity, highlighting the trade-off between contextual richness and feature dimensionality in small datasets. Furthermore, the study demonstrates that traditional machine learning approaches remain robust in low-resource settings, offering competitive results without the computational overhead of deep learning models. Beyond classification accuracy, this research emphasizes practical implications for sentiment analysis applications, such as identifying recurring negative themes in Albanian-language reviews to support decision-making for businesses and content creators. The findings contribute to the advancement of NLP for under-resourced languages and provide methodological guidelines for optimizing feature representation in similar linguistic contexts.
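The best-performing configuration reported above (Naïve Bayes over a hybrid unigram+bigram feature space) can be sketched in a few lines with scikit-learn. This is an illustrative sketch only, not the authors' exact pipeline: the toy Albanian sentences and labels are invented placeholders, whereas the paper's experiments use its own review dataset and evaluation splits.

```python
# Sketch: Multinomial Naive Bayes over hybrid unigram+bigram counts,
# the configuration the abstract reports as best-performing.
# Toy data below is a hypothetical stand-in for the paper's corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "shërbimi ishte shumë i mirë",    # "the service was very good"
    "produkti ishte i shkëlqyer",     # "the product was excellent"
    "përvoja ishte shumë e keqe",     # "the experience was very bad"
    "shërbimi ishte i tmerrshëm",     # "the service was terrible"
]
train_labels = ["pos", "pos", "neg", "neg"]

# ngram_range=(1, 2) builds the hybrid unigram+bigram feature space;
# alpha=1.0 applies Laplace smoothing, which mitigates the sparsity
# problem the abstract notes for higher-order n-grams.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(alpha=1.0),
)
model.fit(train_texts, train_labels)

pred = model.predict(["shërbimi ishte i shkëlqyer"])[0]
print(pred)  # → pos
```

Switching to pure trigrams would only require `ngram_range=(3, 3)`, which makes the sparsity trade-off discussed in the abstract easy to reproduce on a small corpus.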

Keywords: text classification, Naïve Bayes, n-grams.
