Outlier detection is a machine learning task that aims to identify rare items, events, or observations that deviate from the “norm” or general distribution of the given data.

An anomaly is something that arouses suspicion that it was generated by a different data-generating mechanism.

**In the outlier detection task, the goal is to train an unsupervised model to find anomalies**, subject to two constraints:

- Minimize false negatives (i.e., catch as many anomalies as possible).
- Minimize false positives (i.e., when an anomaly is flagged, don’t be wrong).
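The two constraints above map directly onto recall and precision. A minimal sketch (labels and data are hypothetical) of how you might measure both against ground-truth flags:

```python
# Recall (few false negatives) and precision (few false positives)
# for an anomaly detector, given 0/1 ground truth and 0/1 predictions.

def precision_recall(y_true, y_pred):
    """y_true, y_pred: lists of 0/1 where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 3 true anomalies, the model catches 2 and raises 1 false alarm.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)  # p = 2/3, r = 2/3
```

In practice the two constraints trade off against each other through the score threshold you pick.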

In many applications, there is a third constraint: **the “ground truth” of what are true…**

This is an exciting development! This will certainly be useful for many practitioners.

`PyOD` is a Python library with a comprehensive set of scalable, state-of-the-art (SOTA) algorithms for detecting outlying data points in multivariate data. This task is commonly referred to as Outlier Detection or Anomaly Detection.

The outlier detection task aims to identify rare items, events, or observations that deviate from the “norm” or general distribution of the given data.

My favorite definition: **An anomaly is something that arouses suspicion that it was generated by a different data-generating mechanism.**

Common applications of outlier detection include fraud detection, data error detection, intrusion detection in network security, and fault detection in mechanics.
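To make the task concrete, here is a pure-Python sketch of a k-nearest-neighbor outlier score — the idea behind distance-based detectors like the one PyOD ships as `KNN` (this is my own minimal illustration, not PyOD's API): a point far from its nearest neighbors gets a high score.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]  # (8, 8) is the outlier
scores = knn_outlier_scores(data, k=2)
# (8, 8) sits far from every neighbor, so it gets the largest score.
```

Thresholding these scores turns them into the 0/1 anomaly flags the task asks for.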

Practically speaking…

Outliers, or anomalies, are data points that deviate from the norm of a dataset. They arouse suspicion that they were generated by a different mechanism.

Anomaly detection is (usually) an unsupervised learning task where the objective is to identify suspicious observations in data. The task is constrained by the cost of incorrectly flagging normal points as anomalous and failing to flag actual anomalous points.

Applications of anomaly detection include network intrusion detection, data quality monitoring, and price arbitrage in financial markets.

Copula-Based Outlier Detection — COPOD — is a new algorithm for anomaly detection. …
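COPOD scores points by how deep they sit in the tails of the data's empirical distributions. The full algorithm works with empirical copulas and a skewness correction; the sketch below is a heavily simplified per-column version of just the tail-probability idea (my own simplification, not the published method):

```python
import math

def tail_scores(column):
    """Score each value by -log of its smaller empirical tail probability.
    Values deep in either tail are rare, so they get large scores."""
    n = len(column)
    scores = []
    for x in column:
        left = sum(1 for v in column if v <= x) / n   # empirical P(X <= x)
        right = sum(1 for v in column if v >= x) / n  # empirical P(X >= x)
        scores.append(-math.log(min(left, right)))
    return scores

col = [1, 1, 2, 3, 2, 50]  # 50 is far out in the right tail
s = tail_scores(col)
```

COPOD sums such per-dimension tail scores, which is why it needs no distance computations or hyperparameters.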

Most state-of-the-art (SOTA) time series classification methods are limited by high computational complexity. This makes them slow to train on smaller datasets and effectively unusable on large datasets.

Recently, ROCKET (RandOm Convolutional KErnel Transform) has achieved SOTA accuracy in a fraction of the time required by other SOTA time series classifiers. ROCKET transforms time series into features using random convolutional kernels and passes the features to a linear classifier.
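The transform itself is simple at heart. Below is a stripped-down sketch of the idea (no dilation, padding, or bias sampling, which the real ROCKET uses): convolve the series with random kernels and keep two summary features per kernel — the maximum activation and the proportion of positive values (PPV).

```python
import math
import random

def convolve(series, kernel):
    """Valid (no-padding) 1-D convolution."""
    k = len(kernel)
    return [
        sum(series[i + j] * kernel[j] for j in range(k))
        for i in range(len(series) - k + 1)
    ]

def rocket_features(series, num_kernels=100, kernel_len=9, seed=0):
    rng = random.Random(seed)
    features = []
    for _ in range(num_kernels):
        kernel = [rng.gauss(0, 1) for _ in range(kernel_len)]
        out = convolve(series, kernel)
        ppv = sum(1 for v in out if v > 0) / len(out)  # proportion positive
        features.extend([max(out), ppv])
    return features  # ROCKET feeds these to a linear classifier

series = [math.sin(t / 5) for t in range(100)]
feats = rocket_features(series)  # 2 features per kernel
```

Because the kernels are random and never trained, all the learning happens in the cheap linear classifier on top.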

**MiniRocket is even faster!**

**MiniRocket** (MINImally RandOm Convolutional KErnel Transform) is a (nearly) deterministic reformulation of ROCKET that is **75 times faster** on larger datasets and boasts roughly equivalent accuracy.

…

The world is inherently dynamic and nonstationary — constantly changing.

It is common for the performance of machine learning models to decline over time. This occurs as data distributions and target labels (“ground truth”) evolve. This is especially true for models related to people.

Thus, an essential component of machine learning systems is monitoring and adapting to such changes.
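The simplest form such monitoring can take is a statistical check of recent data against a reference window. A minimal sketch (the function name and the 3-sigma threshold are my own choices, not a standard):

```python
from statistics import mean, stdev

def drift_detected(reference, recent, threshold=3.0):
    """Flag drift when the mean of recent observations is more than
    `threshold` reference standard deviations from the reference mean."""
    shift = abs(mean(recent) - mean(reference))
    return shift > threshold * stdev(reference)

reference = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3]  # training-time data
assert not drift_detected(reference, [0.0, 0.1, -0.1])   # still in regime
assert drift_detected(reference, [5.0, 5.2, 4.9])        # regime change
```

Real monitoring tools apply the same pattern with stronger tests (e.g., distribution-level comparisons) and to model outputs as well as inputs.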

In this article, I will introduce this idea of *concept drift* or *regime change* and then discuss three ways to handle it and what you should consider.

New tools for model monitoring are emerging, but it is still important to understand…

Clustering is an unsupervised learning task where an algorithm groups similar data points without any “ground truth” labels. Clustering time series into similar groups is challenging because each data point is an ordered sequence.

In a previous article, I explained how the k-means clustering algorithm can be adapted to time series by using Dynamic Time Warping, which measures the similarity between two sequences, in place of standard measures like Euclidean distance.
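For reference, Dynamic Time Warping itself is a short dynamic program. A minimal implementation of the classic O(n·m) recurrence:

```python
def dtw(a, b):
    """Dynamic Time Warping distance between sequences a and b,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = DTW distance between a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# The same shape shifted in time still matches perfectly under DTW,
# where Euclidean distance would penalize the misalignment.
assert dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]) == 0.0
```

That quadratic cost per pair of series is exactly why DTW-based k-means gets slow.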

Unfortunately, the k-means clustering algorithm for time series can be very slow!

Hierarchical clustering is faster than k-means because it operates on a matrix of pairwise distances…
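The key point is that the expensive pairwise distances are computed once up front, and every merge step only reads them. A small single-linkage sketch in plain Python (the distance function is pluggable — Euclidean here as a stand-in, a DTW implementation for time series):

```python
def pairwise(series_list, dist):
    """All pairwise distances, computed once."""
    n = len(series_list)
    return {(i, j): dist(series_list[i], series_list[j])
            for i in range(n) for j in range(i + 1, n)}

def single_linkage(series_list, dist, num_clusters):
    d = pairwise(series_list, dist)  # the only distance computations
    clusters = [{i} for i in range(len(series_list))]
    while len(clusters) > num_clusters:
        # merge the two clusters with the smallest inter-point distance
        a, b = min(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda p: min(d[tuple(sorted((i, j)))]
                              for i in clusters[p[0]]
                              for j in clusters[p[1]]),
        )
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

def euclid(a, b):  # stand-in; swap in DTW for time series
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

series = [[0, 0, 0], [0, 1, 0], [9, 9, 9], [9, 8, 9]]
groups = single_linkage(series, euclid, num_clusters=2)
```

By contrast, each k-means iteration recomputes distances from every series to every (changing) centroid, so the distance work is never amortized.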

“The task of time series classification can be thought of as involving learning or detecting signals or patterns within time series associated with relevant classes.” — Dempster et al., 2020, authors of the ROCKET paper

Most time series classification methods with state-of-the-art (SOTA) accuracy have high computational complexity and scale poorly. This means they are slow to train on smaller datasets and effectively unusable on large datasets.

ROCKET (RandOm Convolutional KErnel Transform) can achieve the same level of accuracy in a fraction of the time required by competing SOTA algorithms, including convolutional neural networks. …

A common task for time series machine learning is **classification**. Given a set of time series with class labels, can we train a model to accurately predict the class of new time series?

In machine learning with time series, using features extracted from series is more powerful than simply treating a time series in tabular form, with each date/timestamp in a separate column. Such features can capture the characteristics of a series, such as trend and autocorrelation.
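To make this concrete, here are minimal versions of three such features — level, linear trend, and lag-1 autocorrelation (my own small selection; feature-extraction libraries compute hundreds of these):

```python
from statistics import mean

def extract_features(series):
    n = len(series)
    mu = mean(series)
    # Least-squares slope of the series against the time index 0..n-1.
    t_mu = (n - 1) / 2
    slope = (sum((t - t_mu) * (x - mu) for t, x in enumerate(series))
             / sum((t - t_mu) ** 2 for t in range(n)))
    # Lag-1 autocorrelation: how much each value resembles the previous one.
    var = sum((x - mu) ** 2 for x in series)
    ac1 = (sum((series[i] - mu) * (series[i + 1] - mu) for i in range(n - 1))
           / var if var else 0.0)
    return {"mean": mu, "trend_slope": slope, "autocorr_lag1": ac1}

feats = extract_features([1, 2, 3, 4, 5, 6])  # a perfect upward line
```

A classifier then sees one fixed-length feature vector per series, regardless of the series length.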

But… what sorts of features can you extract and how do you select among them?

In this article, I discuss the findings of two papers that analyze feature-based representations of time series. …

Data scientist working in the financial services industry