This study focuses on enhancing the detection and reporting of phishing emails and URLs through semi-supervised learning, a machine learning approach that leverages both a small set of labeled data and a larger pool of unlabeled data. This method is particularly effective for scenarios where obtaining large amounts of labeled data is challenging or costly.
Phishing, a method of deceiving internet users into divulging sensitive information, predominantly occurs via deceptive emails and malicious websites. Statistics reveal a troubling trend: 96% of phishing attacks are delivered by email, with a smaller share arriving through malicious websites. The severity and frequency of phishing and ransomware attacks have been on the rise:
In 2020, nearly 7 million new phishing and scam pages emerged. The same year saw a dramatic increase in the average ransom payment, up by 171% from 2019, reaching an average of $312,493.
Average ransom payments peaked at $233,817 in September 2020, and phishing attempts surged by 510% between January and February 2020.
Concurrently, the pandemic fueled a 20.7% increase in online transactions in 2020, which unfortunately provided opportunities for fraudsters to mask their activities. This period saw ransomware attacks grow by over 40% and email malware attacks increase by 600% compared to 2019.
The escalating success of social engineering attacks, which are the primary cause of security breaches in corporate networks, underscores the urgency of improving phishing detection. Effective detection can mitigate not just the direct costs of ransomware attacks but also the extensive legal and reputational damages arising from data breaches.
The aim of this research is to apply semi-supervised learning techniques to significantly enhance the detection and reporting of phishing attempts. By doing so, it seeks to address a critical need in cybersecurity, contributing to the reduction of successful social engineering attacks.
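To make the labeled-plus-unlabeled setup concrete, semi-supervised learning can be sketched as self-training, where a base classifier iteratively pseudo-labels its most confident unlabeled examples. This is a minimal illustration only: the synthetic dataset and the choice of scikit-learn's SelfTrainingClassifier over logistic regression are assumptions for the sketch, not the specific algorithm or features used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for phishing features: 1000 pages, 20 numeric features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Keep true labels for roughly 10% of samples; mark the rest unlabeled (-1),
# mirroring the small-labeled / large-unlabeled split described above.
rng = np.random.default_rng(0)
y_train = y.copy()
unlabeled = rng.random(len(y)) > 0.10
y_train[unlabeled] = -1

# Self-training: repeatedly fit on the labeled pool, then adopt
# pseudo-labels for unlabeled samples predicted with high confidence.
base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X, y_train)

print(f"labeled fraction: {(~unlabeled).mean():.2f}")
print(f"accuracy on all data: {model.score(X, y):.2f}")
```

The `threshold` parameter controls how confident a prediction must be before the corresponding unlabeled sample is promoted into the training set on the next iteration.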
The primary dataset for this project comprises webpages sourced from PhishTank. These webpages have undergone extensive preprocessing to facilitate efficient analysis and model training. Due to constraints in time and computing resources, the model is trained on a selected subset of this extensive dataset.
The preprocessing of the webpages involves several critical steps:
Retrieval of HTML Content: Using BeautifulSoup, the raw HTML content of each webpage is retrieved.
Cleaning HTML Content: The HTML content is then cleaned, using regular expressions, to remove unneeded elements such as HTML tags and <script> and <style> blocks.
Tokenization: The cleaned HTML content is tokenized into words or phrases utilizing the Natural Language Toolkit (NLTK).
Normalization: The tokens are normalized by converting them to lowercase, removing punctuation, and excluding stop words to ensure consistency and relevance in the analysis.
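The four steps above can be sketched as follows. This is a minimal illustration: a regex tokenizer and a small inline stop-word list stand in for NLTK's word_tokenize and stopwords corpus (so the example needs no corpus downloads), and the sample HTML is invented.

```python
import re
from bs4 import BeautifulSoup

# Small inline stop-word list; an assumption standing in for NLTK's
# stopwords corpus to keep the sketch self-contained.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

def preprocess(html: str) -> list[str]:
    # Step 1: parse the raw HTML with BeautifulSoup.
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: clean — drop <script> and <style> subtrees, then strip
    # any residual markup with a regular expression.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    text = re.sub(r"<[^>]+>", " ", text)
    # Steps 3 and 4: tokenize, lowercase, drop punctuation-only tokens
    # and stop words.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

html = """<html><head><title>Verify your account</title>
<style>body{color:red}</style></head>
<body><p>Please verify the account now!</p>
<script>var x = 1;</script></body></html>"""
print(preprocess(html))
```

Removing the <script> and <style> subtrees before extracting text matters: otherwise JavaScript source and CSS rules would leak into the token stream as spurious words.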
In addition to standard preprocessing, the following features are extracted to aid in the identification of phishing websites:
title_clean. The contents of the <title> element.
is_english. Uses the langdetect module to assess whether the webpage is in English.
img_count. The number of <img> elements.
has_form. Whether the webpage contains one or more <form> elements.
has_login_form. Whether the webpage contains one or more <form> elements with an <input> of type password.
has_js. Whether the webpage contains one or more <script> elements.
js_include_b64. Whether the <script> elements contain base64-encoded strings.
nb_tokens. The number of tokens remaining after the cleaning steps of the parsing phase.
classification. The binary classification of the website (malicious or benign).
nb_title_entities. The number of words in the title.
nb_text_entities. The number of words in the body.
jpmorgan_chase. The number of references to JP Morgan Chase in the body.
bank_of_america. The number of references to Bank of America in the body.
wells_fargo. The number of references to Wells Fargo in the body.
hsbc. The number of references to HSBC in the body.
deutsche_bank. The number of references to Deutsche Bank in the body.
mitsubishi_ufj. The number of references to Mitsubishi UFJ in the body.
citibank. The number of references to Citibank in the body.
rbc. The number of references to RBC in the body.
paypal. The number of references to PayPal in the body.
scotiabank. The number of references to Scotiabank in the body.
apple. The number of references to Apple in the body.
microsoft. The number of references to Microsoft in the body.
amazon. The number of references to Amazon in the body.
google. The number of references to Google in the body.
samsung. The number of references to Samsung in the body.
facebook. The number of references to Facebook in the body.
steam. The number of references to Steam in the body.
netflix. The number of references to Netflix in the body.
ups. The number of references to UPS in the body.
fedex. The number of references to FedEx in the body.
dhl. The number of references to DHL in the body.
tnt. The number of references to TNT in the body.
usps. The number of references to USPS in the body.
royal_mail. The number of references to Royal Mail in the body.
purolator. The number of references to Purolator in the body.
canada_post. The number of references to Canada Post in the body.
youtube. The number of references to YouTube in the body.
whatsapp. The number of references to WhatsApp in the body.
facebook_messenger. The number of references to Facebook Messenger in the body.
wechat. The number of references to WeChat in the body.
instagram. The number of references to Instagram in the body.
tiktok. The number of references to TikTok in the body.
qq. The number of references to QQ in the body.
weibo. The number of references to Weibo in the body.
linkedin. The number of references to LinkedIn in the body.
twitter. The number of references to Twitter in the body.
Each of these features is stored as a column in the resulting CSV file, along with the tokens derived from the text.
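A few of the features above could be computed from a page's HTML along the following lines. This is a hedged sketch, not the project's actual extraction code: the helper name extract_features, the brand subset, and the sample HTML are illustrative assumptions.

```python
import re
from bs4 import BeautifulSoup

# Illustrative subset of the brands whose body-text mentions are counted.
BRANDS = ["paypal", "apple", "amazon", "wells fargo"]

def extract_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    body_text = soup.get_text(separator=" ").lower()
    features = {
        # title_clean: contents of the <title> element, if present.
        "title_clean": soup.title.get_text(strip=True) if soup.title else "",
        # img_count: number of <img> elements.
        "img_count": len(soup.find_all("img")),
        # has_form: page contains at least one <form>.
        "has_form": len(soup.find_all("form")) > 0,
        # has_login_form: some <form> holds an <input type="password">.
        "has_login_form": any(
            f.find("input", attrs={"type": "password"})
            for f in soup.find_all("form")
        ),
        # has_js: page contains at least one <script> element.
        "has_js": len(soup.find_all("script")) > 0,
    }
    # Brand-mention counts over the lowercased page text.
    for brand in BRANDS:
        features[brand.replace(" ", "_")] = len(
            re.findall(re.escape(brand), body_text)
        )
    return features

html = """<html><head><title>PayPal login</title></head>
<body>Sign in to PayPal <form><input type="password"></form></body></html>"""
print(extract_features(html))
```

Such a dictionary maps directly onto one row of the resulting CSV, one key per column.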