Search engines miss infected sites, putting businesses at risk


We have all come across warnings when visiting suspicious websites. Your browser or search engine may even prevent you from entering, displaying a message that this site may harm your device. But what if the site you’re trying to visit isn’t flagged as malicious?

According to SiteLock’s 2022 Security Report, 92% of infected websites are not blacklisted by search engines. This means that businesses and individuals are vulnerable to attack when visiting these sites.

There are a number of reasons why search engines miss infected sites. First, it can take weeks or even months for a website to be identified as malicious, because attackers constantly change tactics to evade detection. Second, many businesses don’t realize their site has been hacked until it’s too late. And third, even if a website is flagged, there is no guarantee that users will avoid it.

So what can be done to protect businesses and users from these threats? Just as cybercriminals use AI to automate their attacks, we can use AI to defend businesses. It’s not just theory: an IEEE analysis of AI-based malware detection techniques concluded that they “provide significant advantages”, particularly in terms of accuracy, speed and scalability.

For example, SafeDNS uses “continuous machine learning”, achieving 98% accuracy in detecting malware. They use a “malware database” to feed machine learning models that analyze the data to look for new patterns of behavior that could indicate a threat. This allows them to identify threats quickly and effectively, before they can cause damage.

If we want to stay ahead of cybercriminals, we need to use AI to defend our businesses. Recent research is a wake-up call: it’s time to act and invest in AI-based solutions.

There are many ways to detect and protect against malware. In this section, we’ll look at one such method: using Python to detect malware based on a data set of executable files. See the full related code here.

We will use the “Malware Executable Detection” dataset from Kaggle. It consists of 373 samples of executable files, of which 301 are malicious and 72 are non-malicious. As you can see, the dataset is unbalanced, with malicious files outnumbering non-malicious files roughly four to one.
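
To make the imbalance concrete, here is a minimal sketch that rebuilds the label column from the counts quoted above rather than loading the real CSV. The `labels` series and its string values are stand-ins chosen for illustration, not the actual file contents.

```python
import pandas as pd

# Stand-in for the real label column, built from the counts quoted in the text
# (301 malicious, 72 non-malicious). The actual CSV is not loaded here.
labels = pd.Series(["malicious"] * 301 + ["non-malicious"] * 72, name="Label")

counts = labels.value_counts()                            # absolute class counts
shares = labels.value_counts(normalize=True).round(3)     # class proportions

print(counts)
print(shares)

# Share of the majority class: the accuracy a model would get
# by always predicting that class.
majority_share = shares.max()
print("Majority-class baseline accuracy:", majority_share)
```

A model that always answers “malicious” would already be right about 81% of the time on data like this, which is why accuracy alone says little here.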

There are 531 features represented in the dataset, from F1 to F531, and a label column indicating whether the file is malicious or non-malicious. To keep the demo simple, the code below passes every feature column to the model without any feature selection.

We will start by importing the necessary libraries for our demo. We will use the pandas, numpy and scikit-learn libraries:

import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, accuracy_score, confusion_matrix, recall_score, precision_score, f1_score, auc, roc_auc_score

Next, we’ll load the dataset:

df = pd.read_csv('uci_malware_detection.csv')

Now that we’ve loaded the dataset, let’s map the string labels to numbers, remove duplicates, and split the data into training and testing sets. Note that drop_duplicates(keep=False) discards every copy of a duplicated row, not just the extras:

df['Label'] = df['Label'].map({'malicious': 0, 'non-malicious': 1})
df = df.drop_duplicates(keep=False)

X, y = df.drop("Label", axis=1), df["Label"]
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.2, random_state=42)
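
With classes this unbalanced, a purely random split can leave the test set with a different class mix than the training set. One common option, not used in the code above, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data with the same 301/72 ratio. The names `X_demo` and `y_demo` and the five random features are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 373 rows, 5 random features, labels in the 301/72 ratio.
# This is NOT the real dataset, only an illustration of stratified splitting.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(373, 5))
y_demo = np.array([0] * 301 + [1] * 72)

# stratify=y_demo keeps the class proportions (almost) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("overall minority share:", y_demo.mean())
print("train minority share:  ", y_tr.mean())
print("test minority share:   ", y_te.mean())
```

Both subsets end up with close to the same minority share as the full data, so test metrics are less sensitive to the luck of the split.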

We are now ready to build our models. We will use a simple logistic regression model:

lr_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

We can now evaluate the performance of our model on the test set:

lr_model.score(X_test, y_test)
y_pred = lr_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print('ROC-AUC score', roc_auc_score(y_test,y_pred))
print('Confusion matrix:\n ', confusion_matrix(y_test, y_pred))

Running this code gives us the following output:

ROC-AUC score 0.9705882352941176
Confusion matrix:
 [[57  0]
 [ 1 16]]

In the end, we managed to create an accurate model with both high precision and high recall. Not bad! Of course, this is only a proof of concept, as the real situation is an order of magnitude more complex. At scale, AI systems trained on large datasets can make a real difference in the fight against malware.
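
The precision and recall figures can be checked by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and assumes scikit-learn’s convention (rows are true classes, columns are predicted classes) together with the label mapping used earlier (0 = malicious, 1 = non-malicious). The helper `per_class_metrics` is a name introduced here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class (0 = malicious, 1 = non-malicious).
cm = np.array([[57, 0],
               [1, 16]])

def per_class_metrics(cm, cls):
    """Return (precision, recall) when `cls` is treated as the positive class."""
    tp = cm[cls, cls]
    fp = cm[:, cls].sum() - tp   # predicted cls, but actually another class
    fn = cm[cls, :].sum() - tp   # actually cls, but predicted another class
    return tp / (tp + fp), tp / (tp + fn)

accuracy = np.trace(cm) / cm.sum()
mal_precision, mal_recall = per_class_metrics(cm, 0)
ben_precision, ben_recall = per_class_metrics(cm, 1)

# With hard 0/1 predictions, ROC-AUC reduces to the mean of the two recalls.
mean_recall = (mal_recall + ben_recall) / 2

print(f"accuracy:      {accuracy:.4f}")
print(f"malicious:     precision={mal_precision:.4f} recall={mal_recall:.4f}")
print(f"non-malicious: precision={ben_precision:.4f} recall={ben_recall:.4f}")
print(f"mean recall:   {mean_recall:.4f}")
```

The single error is a non-malicious file classified as malicious, so every malicious file in the test set was caught. The mean recall matches the ROC-AUC score printed above because `roc_auc_score` was given hard predictions rather than probabilities. Passing `lr_model.predict_proba(X_test)[:, 1]` instead would score the model’s ranking of samples.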

