Bulletin of Abai KazNPU. Series of Physical and mathematical sciences

IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS

Published March 2023


I.M. Ualiyeva
al-Farabi Kazakh National University, Almaty
R.R. Mussabayev
The Institute of Information and Computational Technologies, Almaty
Abstract

The number of documents, including online news, that require deeper understanding and analysis grows every year. Machine learning algorithms help to classify texts accurately. However, finding suitable structures and techniques for text, including feature extraction, remains difficult for researchers. This paper addresses the task of identifying and analyzing features that distinguish different genres of text. We studied the main characteristics of each genre of news text (news, articles, interviews, and blogs) to obtain more informative features. We built our own data set by collecting texts from open-access official information portals. Our analysis of this data set indicates that features capturing structural complexity, detail, and imaginative content in a text help to distinguish the genres. In particular, we use complexity features (lexical diversity, lexical density, punctuation, average sentence length, number of personal pronouns, readability index), detail features (number of proper nouns in the text, numbers, month-related words), and imaginative features (PoS tags, quantifier words, plural nouns). Our results suggest that these features provide an effective representation for distinguishing news texts from articles, blogs/opinions, and interviews with high accuracy.
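
As an illustration of the kind of surface features listed in the abstract, the following is a minimal Python sketch that computes a few of them for a single text. It is not the authors' implementation: the function name `complexity_features`, the regex-based tokenization, the naive sentence splitting, and the short pronoun list are all assumptions made for this example. Lexical diversity is taken here as the type-token ratio, which is one common definition and may differ from the one used in the paper.

```python
import re
import string

# Assumption: a short illustrative pronoun list, not the authors' lexicon.
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him",
    "she", "her", "it", "they", "them",
}


def complexity_features(text):
    """Compute a few simple surface features for one English text."""
    # Naive sentence split on runs of ., ! or ?; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)
    return {
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        # Average number of word tokens per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Count of ASCII punctuation characters.
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        # Count of tokens found in the illustrative pronoun list.
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        # Count of digit runs, a rough proxy for the "numbers" detail feature.
        "number_tokens": len(re.findall(r"\d+", text)),
    }
```

Calling `complexity_features(text)` on each document yields a dictionary that can be turned into one row of a feature matrix for a classifier. Features such as proper nouns, PoS tags, and plural nouns would additionally need a part-of-speech tagger, which this sketch leaves out.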

Language

English

How to Cite

[1]
Ualiyeva, I. and Mussabayev, R. 2023. IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS. Bulletin of Abai KazNPU. Series of Physical and mathematical sciences. 81, 1 (Mar. 2023), 178–185. DOI:https://doi.org/10.51889/2959-5894.2023.81.1.020.