Performance Comparison of Data Balancing Techniques on Hate Speech Detection in Turkish

Karayiğit, Habibe; Akdagli, Ali; ACI, Çiğdem

Year: 2024


Pamukkale University Journal of Engineering Sciences [Pamukkale Univ Muh Bilim Derg]

Pamukkale Univ Muh Bilim Derg. Ahead of Print: PAJES-40072 | DOI: 10.5505/pajes.2023.40072


Habibe Karayiğit¹, Ali Akdagli², Çiğdem ACI³
¹Ministry of National Education, Adana, Turkey
²Department of Electrical and Electronics Engineering, Mersin University, Mersin, Turkey
³Department of Computer Engineering, Mersin University, Mersin, Turkey

The rise of hate speech on social media platforms causes psychological disorders and other deep, negative effects, so automatic language classification models are needed to detect it. A common obstacle when testing language models for hate speech is dataset imbalance, where one class is represented far more frequently than the other. On an imbalanced dataset, a classifier may become biased towards the majority class and perform poorly on the minority class, which can lead to incorrect or unreliable classification results. To address this, data-level balancing methods such as oversampling or undersampling are applied to balance the class distribution before the dataset is classified. This study aims to find a successful combination of data-level balancing method and classification model for detecting hate speech. To that end, eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and a clustering-based method) were applied to the Abusive Turkish Comments (ATC) dataset, which was collected from Instagram and has an imbalanced label distribution. The classification performance of the balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. The CBoW+CNN model based on the TextGAN data-level balancing method, as well as the Skip-gram CNN model, exhibited the best classification performance, with a macro-averaged F1 score of 0.972.

Keywords: Data balancing, Social media, Machine learning, Deep learning, Natural language processing, Hate speech
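
The abstract names random oversampling as one of the balancing methods and the macro-averaged F1 score as the evaluation metric. The sketch below illustrates both ideas on toy data using only the Python standard library. It is not the authors' implementation: the function names `random_oversample` and `macro_f1`, the toy comments, and the 8-to-2 class split are assumptions made for illustration, and the other methods listed in the abstract (SMOTE, TextGAN, and so on) are not covered.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement from the class's own items to fill the gap.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data (assumed for illustration): 8 neutral, 2 abusive.
texts = [f"comment {i}" for i in range(10)]
labels = ["neutral"] * 8 + ["abusive"] * 2

bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))  # both classes now have 8 samples

# A classifier that always predicts the majority class is 80% accurate
# on the original data, but its macro-averaged F1 is much lower because
# the minority class contributes an F1 of 0.
always_neutral = ["neutral"] * len(labels)
print(round(macro_f1(labels, always_neutral), 3))
```

Because each class counts equally in the macro average, the metric penalizes a model that ignores the minority class, which is why it suits imbalanced data. In practice, balancing is normally applied to the training split only, so that duplicated samples do not leak into the test set.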

Turkish title: Türkçe Nefret Söylemi Tespitinde Veri Dengeleme Tekniklerinin Performans Karşılaştırması


Corresponding Author: Ali Akdagli, Türkiye
Manuscript Language: English
