Interpreting Isolation Forest’s predictions, and more


The problem: how to interpret Isolation Forest’s predictions

More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research I ended up with three possible solutions:

1) Train on the same dataset another similar algorithm that has feature importance implemented and is more easily interpretable, like Random Forest.

2) Reconstruct the trees, as a graph for example. The most important features should be the ones on the shortest paths of the trees, since short paths are what isolates anomalies. …
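The first option can be sketched with scikit-learn: fit an Isolation Forest, then train a Random Forest on the labels it produces and read the surrogate’s feature importances. This is an illustration on synthetic data, not the original setup; the dataset and all values below are placeholders.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.RandomState(42)
# Synthetic data: feature 0 carries the anomalies, feature 1 is pure noise.
X_normal = rng.normal(0, 1, size=(500, 2))
X_anomal = np.column_stack([rng.normal(8, 1, size=25),
                            rng.normal(0, 1, size=25)])
X = np.vstack([X_normal, X_anomal])

# Fit Isolation Forest and get its labels (-1 = anomaly, 1 = normal).
iso = IsolationForest(random_state=42).fit(X)
labels = iso.predict(X)

# Train an interpretable surrogate on the same data, using the
# Isolation Forest's predictions as the target.
surrogate = RandomForestClassifier(random_state=42).fit(X, labels)
print(surrogate.feature_importances_)  # feature 0 should dominate
```

The surrogate only approximates the Isolation Forest’s decision boundary, so treat its importances as a rough guide to which features drive the predictions, not an exact attribution.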

Leverage Machine Learning to defend against DDoS


Manually identifying and mitigating Distributed Denial of Service (DDoS) attacks on websites is a difficult and time-consuming task.

This is where Baskerville comes in.

Baskerville is an open-source Security Analytics Engine: a system that identifies attacks directed (currently) at Deflect-protected websites as they happen, giving the infrastructure time to respond properly. It uses Machine Learning, more specifically Anomaly Detection, to distinguish between normal and abnormal traffic.

Its main advantage is that it does not need a labeled dataset to operate: it is trained on mostly normal web traffic.

Currently, we’re working on transforming Baskerville into…

Thank you for your kind words! And thank you so much for sharing such an amazing project! Watching... :)

Very happy to know the article and code were useful to you. Of course you can use the code, thanks for asking!

First of all, I really enjoyed your thorough analysis, excellent article, thanks!

For the highlighted part, I of course agree about the NoSQL case, but the `no relationship` part is not exactly true, right? I mean there is the UserID that links the two tables, it is just going to be handled differently.

Time to check your Spark clusters


It is well known, or should be, that Spark is not secured by default. It is right there in the docs:

Security in Spark is OFF by default

So you should be well aware that you’ll need to put in the effort to secure your cluster. There are many things to consider: the application UI, the master UI, the workers’ UI, data encryption, SSL for the communication between nodes, and so on. I’ll probably write another post covering these at some point.
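As a starting point, here is a hedged sketch of a `spark-defaults.conf` fragment enabling some of these protections (property names come from the Spark security documentation; the secret, paths, and passwords are placeholders you must replace):

```properties
# Illustrative only -- adapt to your deployment
spark.authenticate              true
spark.authenticate.secret       changeme-shared-secret   # standalone-mode shared secret
spark.network.crypto.enabled    true    # AES-based encryption for RPC between nodes
spark.io.encryption.enabled     true    # encrypt local disk I/O (shuffle spills, etc.)
spark.ssl.enabled               true
spark.ssl.keyStore              /path/to/keystore.jks
spark.ssl.keyStorePassword      changeme
```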

One thing you probably haven’t considered is that Spark has a…

Research and awareness needed



It seems that more and more people agree that G6PD deficiency (G6PDd) can be a risk factor for COVID-19, not only in terms of the medication used to combat the virus, but also regarding one’s susceptibility to it and the severity of its effects. There is an urgent need to verify this through numbers and research. The focus of this article is to raise awareness of the matter.


Watch Weighing the Benefits & Risks of the Covid-19 Vaccine for the G6PD Deficiency Pop. 4–8–21

From the first few days of the lockdown, as…

Lessons learned

Debugging PySpark and Isolation Forest — Image by author

So, after a few runs with the PySpark ML implementation of Isolation Forest presented here, I stumbled upon a couple of things, and I thought I’d write about them so that you don’t waste the time I wasted troubleshooting.

Only Dense Vectors

In the previous article, I used VectorAssembler to gather the feature vectors. It so happened that my test data produced only DenseVectors, but when I tried the example on a different dataset, I realized that:

  • VectorAssembler can create both Dense and Sparse vectors in the same dataframe (which is smart, and other Spark ML algorithms can leverage it and…

How to leverage spark to read in parallel from a database

Spark Parallelization

A usual way to read from a database, e.g. Postgres, using Spark would be something like the following:

However, by running this, you will notice that the Spark application has only one active task, which means only one core is being used, and this one task will try to fetch all the data at once. To make this more efficient, if our data permits it, we can use:

  • numPartitions: the number of data splits
  • column: the column to partition by, e.g. id,
  • lowerBound: the minimum value for the column — inclusive,
  • upperBound: the maximum value of the column —be…
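Under the hood, Spark turns these four options into one WHERE clause per partition on the chosen column; note that lowerBound and upperBound only shape the stride and do not filter out rows. A plain-Python sketch of that splitting logic (a simplification of what the JDBC reader does, not Spark’s exact code):

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Roughly how Spark's JDBC reader splits a numeric column into
    per-partition WHERE clauses. Rows below lowerBound or above
    upperBound still land in the first/last partition."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound + stride
    for i in range(num_partitions):
        if i == 0:
            # First partition also catches NULLs and anything below the range.
            predicates.append(f"{column} < {current} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended above.
            predicates.append(f"{column} >= {current - stride}")
        else:
            predicates.append(f"{column} >= {current - stride} AND {column} < {current}")
        current += stride
    return predicates

preds = jdbc_partition_predicates("id", 0, 1000, 4)
for p in preds:
    print(p)
```

For `id` between 0 and 1000 over 4 partitions this yields four ranges of width 250, so each of the four tasks issues its own bounded query and the reads proceed in parallel.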

Maria Karanasou

A mom and a Software Engineer who loves to learn new things & all about ML & Big Data. Buy me a coffee to help me keep going
