Volume 32, Issue 6 e5140
SPECIAL ISSUE PAPER

Research on improved text classification method based on combined weighted model

Yongchang Wang

Corresponding Author

Communication University of China, Beijing, China

Yongchang Wang, Communication University of China, No. 1 Dingfuzhuang East Street, Chaoyang District, Beijing 100024, China.

Email: [email protected]

Ligu Zhu

Communication University of China, Beijing, China

First published: 20 January 2019
Citations: 10

Summary

Text classification is central to information retrieval, but traditional text classification models suffer from problems such as the curse of dimensionality in the feature space and a lack of semantic features. To address these problems, this paper proposes an improved TFIDF model combined with the Word2vec model for weighting word vectors. Because the Word2vec model cannot distinguish how important a word is within a text, TFIDF is introduced to weight the Word2vec word vectors, yielding a weighted Word2vec classification model. For data preprocessing, we optimize the traditional StringToWordVector algorithm; the main improvement is the introduction of a new stemming algorithm. The paper first briefly describes the basic steps and algorithms of traditional text classification, and then presents the ideas and steps of the improved StringToWordVector algorithm. Finally, the improved algorithm is evaluated on four data sets (WEBO_SINA and three standard UCI data sets). The experimental results show that the improved StringToWordVector algorithm, used together with the combined weighted model, achieves higher classification accuracy, recall, and F1 values than traditional text classification models that use only Word2vec or only TFIDF.
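The weighting idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not specify the improved TFIDF formula, so the code below assumes the textbook form (term frequency times log inverse document frequency), assumes that a document vector is the weight-normalized sum of its word vectors, and uses a hypothetical `toy_embeddings` table of 2-dimensional vectors in place of trained Word2vec embeddings. The function names `tfidf_weights` and `weighted_doc_vector` are likewise illustrative.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    Assumption: textbook TF-IDF, not the paper's improved variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def weighted_doc_vector(doc_weights, embeddings, dim):
    """TF-IDF-weighted average of the word vectors of one document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, w in doc_weights.items():
        vec = embeddings.get(term)
        if vec is None:
            continue  # skip out-of-vocabulary terms
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no informative terms: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical stand-in for trained Word2vec vectors (assumption).
toy_embeddings = {
    "the": [5.0, 5.0],
    "match": [1.0, 0.0],
    "ended": [0.0, 1.0],
    "market": [1.0, 1.0],
    "fell": [-1.0, 1.0],
}

docs = [["the", "match", "ended"], ["the", "market", "fell"]]
weights = tfidf_weights(docs)
doc_vectors = [weighted_doc_vector(w, toy_embeddings, dim=2) for w in weights]
print(doc_vectors[0])
```

In this toy corpus, "the" occurs in every document, so its inverse document frequency is log(1) = 0 and its large vector contributes nothing. A plain unweighted average of Word2vec vectors would let it dominate, which is the limitation the summary attributes to Word2vec used alone.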
