Antyscam – practical web spam classifier
Abstract
To prevent web spam from manipulating search engine results, anti-spam systems use machine learning techniques to detect spam. However, if the learning set for the system is out of date, the quality of classification falls rapidly. We present a web spam recognition system that periodically refreshes the learning set to create an adequate classifier. A new classifier is trained exclusively on data collected during the last period. We have shown that this strategy is better than incrementally extending the learning set. The system solves the start-up problem of gaps in the learning set by minimising the number of learning examples required and by using external data sets. The system was tested on real data from spam traps and from commonly known web services: Quora, Reddit, and Stack Overflow. The tests, performed over ten months, show the stability of the system and an improvement of the results of up to 60 percent at the end of the examined period.
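The refresh strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The record fields (`seen`, `text`, `label`), the 30-day period, the sample data, and the token-counting "classifier" are all assumptions chosen only to show how a model can be retrained exclusively on the most recent period instead of on an ever-growing learning set.

```python
from collections import Counter
from datetime import date


def last_period(examples, today, period_days=30):
    """Keep only examples collected within the last `period_days` days."""
    return [e for e in examples if 0 <= (today - e["seen"]).days < period_days]


def train(examples):
    """Toy stand-in for a real learner: count tokens seen in each class."""
    counts = {"spam": Counter(), "ham": Counter()}
    for e in examples:
        counts[e["label"]].update(e["text"].lower().split())
    return counts


def classify(model, text):
    """Label text by which class has seen its tokens more often."""
    tokens = text.lower().split()
    spam_score = sum(model["spam"][t] for t in tokens)
    ham_score = sum(model["ham"][t] for t in tokens)
    return "spam" if spam_score > ham_score else "ham"


def refresh(examples, today, period_days=30):
    """Train a new classifier exclusively on the most recent period."""
    return train(last_period(examples, today, period_days))


# Hypothetical learning examples; older ones fall outside the window.
examples = [
    {"seen": date(2016, 1, 5), "text": "cheap pills online", "label": "spam"},
    {"seen": date(2016, 1, 6), "text": "python list sorting question", "label": "ham"},
    {"seen": date(2016, 3, 2), "text": "free casino bonus", "label": "spam"},
    {"seen": date(2016, 3, 3), "text": "cheap flights question", "label": "ham"},
]

model = refresh(examples, today=date(2016, 3, 10))
recent_label = classify(model, "casino bonus today")
```

In this sketch the January examples are discarded at refresh time, so the model reflects only what spam and legitimate content looked like in the latest period. An incremental strategy would instead call `train(examples)` on the full history.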
License
Copyright (c) 2019 International Journal of Electronics and Telecommunications

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
1. License
The non-commercial use of the article will be governed by the Creative Commons Attribution license as currently displayed on https://creativecommons.org/licenses/by/4.0/.
2. Author’s Warranties
The author warrants that the article is original, written by stated author/s, has not been published before, contains no unlawful statements, does not infringe the rights of others, is subject to copyright that is vested exclusively in the author and free of any third party rights, and that any necessary written permissions to quote from other sources have been obtained by the author/s. The undersigned also warrants that the manuscript (or its essential substance) has not been published other than as an abstract or doctorate thesis and has not been submitted for consideration elsewhere, for print, electronic or digital publication.
3. User Rights
Under the Creative Commons Attribution license, the author(s) and users are free to share (copy, distribute and transmit the contribution) under the following conditions: 1. they must attribute the contribution in the manner specified by the author or licensor, 2. they may alter, transform, or build upon this work, 3. they may use this contribution for commercial purposes.
4. Rights of Authors
Authors retain the following rights:
- copyright, and other proprietary rights relating to the article, such as patent rights,
- the right to use the substance of the article in own future works, including lectures and books,
- the right to reproduce the article for own purposes, provided the copies are not offered for sale,
- the right to self-archive the article
- the right to supervision over the integrity of the content of the work and its fair use.
5. Co-Authorship
If the article was prepared jointly with other authors, the signatory of this form warrants that he/she has been authorized by all co-authors to sign this agreement on their behalf, and agrees to inform his/her co-authors of the terms of this agreement.
6. Termination
This agreement can be terminated by the author or the Journal Owner upon two months’ notice where the other party has materially breached this agreement and failed to remedy such breach within a month of being given the terminating party’s notice requesting such breach to be remedied. No breach or violation of this agreement will cause this agreement or any license granted in it to terminate automatically or affect the definition of the Journal Owner. The author and the Journal Owner may agree to terminate this agreement at any time. This agreement or any license granted in it cannot be terminated otherwise than in accordance with this section 6. This License shall remain in effect throughout the term of copyright in the Work and may not be revoked without the express written consent of both parties.
7. Royalties
This agreement entitles the author to no royalties or other fees. To such extent as legally permissible, the author waives his or her right to collect royalties relative to the article in respect of any use of the article by the Journal Owner or its sublicensee.
8. Miscellaneous
The Journal Owner will publish the article (or have it published) in the Journal if the article’s editorial process is successfully completed and the Journal Owner or its sublicensee has become obligated to have the article published. Where such obligation depends on the payment of a fee, it shall not be deemed to exist until such time as that fee is paid. The Journal Owner may conform the article to a style of punctuation, spelling, capitalization and usage that it deems appropriate. The Journal Owner will be allowed to sublicense the rights that are licensed to it under this agreement. This agreement will be governed by the laws of Poland.
By signing this License, Author(s) warrant(s) that they have the full power to enter into this agreement. This License shall remain in effect throughout the term of copyright in the Work and may not be revoked without the express written consent of both parties.