Detection of Spam Pages Using XGBoost Algorithm
Subject Areas : electrical and computer engineering
Reyhane Rashidpour 1 , Ali-Mohammad Zareh-Bidoki 2
1 - Dept. of Comp. Eng., Yazd University, Yazd, Iran
2 - Dept. of Comp. Eng., Yazd University, Yazd, Iran
Keywords: Web spam, XGBoost classification algorithm, data balancing, machine learning
Abstract :
Today, search engines are the gateway to the web. As the web has grown in popularity, attempts to exploit it for commercial, social, and political purposes have also increased, making it harder for search engines to distinguish good content from spam. The concept of web spam was first introduced in 1996 and was quickly recognized as one of the key challenges for the search engine industry. Spam arises primarily because a significant portion of web page visits comes from search engines, and users tend to check only the first few results. The goal of spam detection is to ensure that spam pages cannot achieve high rankings through deceptive strategies. We aim to provide an effective method for identifying spam pages, thereby reducing the presence of spam among the top search results. In this article, two methods for combating web spam are proposed. The first, XGspam, identifies spam pages using the XGBoost learning algorithm and achieves an accuracy of 94.27%. The second, XGSspam, addresses the challenge of imbalanced web data by combining the SMOTE oversampling algorithm with the XGBoost classification model, achieving an accuracy of 95.44% in identifying spam pages.
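The data-balancing step behind the second method can be illustrated with a minimal sketch of SMOTE-style oversampling. The abstract does not give the authors' configuration, so everything here is an assumption for illustration: the function names `smote` and `balance`, the neighbour count `k=5`, the use of label `1` for the minority (spam) class, and the toy two-feature data. The sketch uses only the Python standard library and covers the oversampling step alone; the XGBoost classifier that would be trained on the balanced data is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(X) - len(minority)) - len(minority)
    if n_needed <= 0:
        return list(X), list(y)
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return list(X) + new_points, list(y) + [minority_label] * len(new_points)


# Toy data (assumed): 3 spam pages (label 1) and 6 normal pages (label 0).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
     [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [5.5, 5.5], [6.5, 5.0]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = balance(X, y)
```

In practice this step would more likely use an existing implementation such as `SMOTE` from the imbalanced-learn package, with the resampled training split then passed to `XGBClassifier` from the xgboost package. Oversampling is normally applied to the training data only, so that evaluation reflects the original class distribution.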