Let's see a simple example using the isolation forest model to detect anomalies in time-series data. Below, we have imported a sales data set that contains the day of an order, information about the product, geographical information about the customer, and the amount of the sale. To keep this example simple, let's just look at one feature (sales) over time.
See data here: https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting (GPL 2.0)
#packages for data manipulation
import pandas as pd
from datetime import datetime
#packages for modeling
from sklearn.ensemble import IsolationForest
#packages for data visualization
import matplotlib.pyplot as plt
#import sales data
sales = pd.read_excel("Data/Sales Data.xlsx")
#subset to date and sales
revenue = sales[['Order Date', 'Sales']]
revenue.head()
As you can see above, we have the total sale amount for every order on a particular day. Since we have a sufficient amount of data (4 years' worth), let's try to detect months where the total sales are either noticeably higher or lower than the expected total sales.
First, we need to conduct some preprocessing, and sum the sales for every month. Then, visualize monthly sales.
#format the order date to datetime month and year
revenue['Order Date'] = pd.to_datetime(revenue['Order Date']).dt.to_period('M')
#sum sales by month and year
revenue = revenue.groupby(revenue['Order Date']).sum()
#set date as index
revenue.index = revenue.index.strftime('%m-%Y')
#set the fig size
plt.figure(figsize=(8, 5))
#create the line chart
plt.plot(revenue.index,
revenue['Sales'])
#add labels and a title
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales')
#rotate x-axis labels by 90 degrees for better visibility
plt.xticks(rotation = 90)
#display the chart
plt.show()
Using the line chart above, we can see that while sales fluctuate from month to month, total sales trend upward over time. Ideally, our model will identify months where total sales fluctuate more than expected and heavily influence the overall trend.
Now we need to initialize and fit our model. The model below uses the default parameters. I have highlighted these parameters as they are the most important to the model’s performance.
- n_estimators: The number of base estimators in the ensemble.
- max_samples: The number of samples to draw from X to train each base estimator (if "auto", then max_samples = min(256, n_samples)).
- contamination: The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.
- max_features: The number of features to draw from X to train each base estimator.
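The contamination parameter deserves a closer look, since it directly controls how many points get flagged. A minimal sketch on synthetic data (the values below are made up, not from the sales data set) shows that the fraction of training points labeled as outliers roughly matches the contamination value:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic stand-in: 48 "months" of roughly normal sales totals
rng = np.random.default_rng(0)
X = rng.normal(loc=50000, scale=8000, size=(48, 1))

# contamination sets the score threshold so that roughly this
# fraction of the training data is labeled as an outlier
model = IsolationForest(contamination=0.1, random_state=0).fit(X)
labels = model.predict(X)  # -1 = outlier, 1 = inlier
print((labels == -1).sum())  # about 10% of the 48 months
```

Raising or lowering contamination shifts the threshold accordingly, so it is worth tuning with domain knowledge of how rare anomalies actually are.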
#set isolation forest model and fit to the sales
model = IsolationForest(n_estimators = 100, max_samples = 'auto', contamination = float(0.1), max_features = 1.0)
model.fit(revenue[['Sales']])
Next, let's use the model to display the anomalies and their anomaly score. The anomaly score is the mean measure of normality of an observation among the base estimators. The lower the score, the more abnormal the observation. Negative scores represent outliers, positive scores represent inliers.
#add anomaly scores and prediction
revenue['scores'] = model.decision_function(revenue[['Sales']])
revenue['anomaly'] = model.predict(revenue[['Sales']])
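To actually display the flagged months, we can filter on the prediction column. A minimal sketch (using small synthetic monthly totals as a stand-in for the real revenue frame, so the numbers here are illustrative only):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# synthetic stand-in for the monthly revenue frame built above
revenue = pd.DataFrame(
    {'Sales': [14000, 4500, 26000, 28000, 23000,
               34000, 33000, 27000, 82000, 31000]},
    index=['01-2015', '02-2015', '03-2015', '04-2015', '05-2015',
           '06-2015', '07-2015', '08-2015', '09-2015', '10-2015'])

model = IsolationForest(contamination=0.2, random_state=0)
model.fit(revenue[['Sales']])
revenue['scores'] = model.decision_function(revenue[['Sales']])
revenue['anomaly'] = model.predict(revenue[['Sales']])

# keep only the months the model flagged as outliers (predict == -1)
anomalies = revenue[revenue['anomaly'] == -1]
print(anomalies[['Sales', 'scores']])
```

Here the unusually low and unusually high months receive the lowest (most negative) decision-function scores and are the ones flagged.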
Lastly, let's bring up the same line chart from before, but highlighting the anomalies with plt.scatter.
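One way to do this is to overlay the flagged months as points on top of the line chart. A minimal sketch, again using synthetic monthly totals as a stand-in for the real data (the Agg backend line is only there so the example also runs headless; drop it when running interactively):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed locally
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# synthetic stand-in for the monthly revenue frame scored above
revenue = pd.DataFrame(
    {'Sales': [14000, 4500, 26000, 28000, 23000,
               34000, 33000, 27000, 82000, 31000]},
    index=['01-2015', '02-2015', '03-2015', '04-2015', '05-2015',
           '06-2015', '07-2015', '08-2015', '09-2015', '10-2015'])
model = IsolationForest(contamination=0.2, random_state=0).fit(revenue[['Sales']])
revenue['anomaly'] = model.predict(revenue[['Sales']])
outliers = revenue[revenue['anomaly'] == -1]

#create the line chart as before
plt.figure(figsize=(8, 5))
plt.plot(revenue.index, revenue['Sales'])
#overlay the flagged months as red points
plt.scatter(outliers.index, outliers['Sales'], color='red', label='anomaly')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales')
plt.xticks(rotation=90)
plt.legend()
plt.show()
```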
The model appears to do well. Since the data fluctuates so much month-to-month, a worry could be that inliers would get marked as anomalies, but this is not the case, thanks to the model's random subsampling of the data. The anomalies appear to be the larger fluctuations, where sales deviated from the trend by a 'significant' amount.
However, knowing the data is important here as some of the anomalies should come with a caveat. Let’s look at the first (February 2015) and last (November 2018) anomaly detected. At first, we see that they both are large fluctuations from the mean.
However, the first anomaly (February 2015) falls in only our second month of recorded sales, when the business may have just started operating. Sales are definitely low, and we see a large spike the next month. But is it fair to mark the second month of business as an anomaly because sales were low? Or is this the norm for a new business?
For our last anomaly (November 2018), we see a huge spike in sales that appears to deviate from the overall trend. However, we have run out of data. As more data is recorded, this month may turn out not to be an anomaly at all, but rather the start of a steeper upward trend.