Explaining the predictions — Shapley Values with PySpark

Interpreting Isolation Forest’s predictions — and not only its

Maria Karanasou
10 min read · Mar 20, 2021


The problem: how to interpret Isolation Forest’s predictions

More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research I ended up with three possible solutions, each illustrated with a short sketch after the list:

1) Train, on the same dataset, another similar algorithm that has feature importance implemented and is more easily interpretable, like Random Forest.

2) Reconstruct the trees, as a graph for example. The most important features should be the ones on the shortest paths of the trees. This follows from how Isolation Forest works: anomalies are few and distinct, so they should be easier to single out in fewer steps — hence the shorter paths.

3) Estimate the Shapley values: the marginal contribution of each feature, which is a more standard way of establishing a feature importance ranking.
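
To make option 1 concrete, here is a minimal, hedged sketch (not the approach this article ends up using): fit a surrogate Random Forest on the Isolation Forest’s own inlier/outlier labels and read its built-in importances. The toy data and parameters below are made up purely for illustration.

```python
# Option 1 (illustrative sketch): train a surrogate Random Forest on the
# Isolation Forest's own labels and use its feature_importances_ as a proxy.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 5))                    # toy data with 5 features
iso = IsolationForest(random_state=42).fit(X)
labels = iso.predict(X)                           # +1 = inlier, -1 = anomaly

surrogate = RandomForestClassifier(random_state=42).fit(X, labels)
print(surrogate.feature_importances_)             # proxy importance ranking
```

Note that the surrogate explains its own decision boundary rather than the Isolation Forest’s, which is part of why this option is set aside below.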
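Option 2 could be sketched along these lines: walk each isolation tree and record the depth at which every feature is used for a split, on the reasoning above that features appearing close to the root isolate points in fewer steps. Again, this is a toy scikit-learn sketch under the default settings (e.g. max_features=1.0), not the article’s implementation.

```python
# Option 2 (illustrative sketch): record the split depth of each feature across
# the isolation trees; features splitting near the root sit on short paths.
from collections import defaultdict
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
iso = IsolationForest(random_state=0).fit(X)

split_depths = defaultdict(list)

def walk(tree, node=0, depth=0):
    """Visit split nodes recursively, recording (feature, depth)."""
    feature = tree.feature[node]
    if feature < 0:                        # leaves are marked with feature = -2
        return
    split_depths[feature].append(depth)
    walk(tree, tree.children_left[node], depth + 1)
    walk(tree, tree.children_right[node], depth + 1)

for estimator in iso.estimators_:          # assumes the default max_features=1.0
    walk(estimator.tree_)

# Lower mean split depth ~ the feature tends to isolate points in fewer steps.
for f in sorted(split_depths, key=lambda f: np.mean(split_depths[f])):
    print(f"feature {f}: mean split depth {np.mean(split_depths[f]):.2f}")
```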
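And option 3, the direction this article takes: the Shapley value of a feature for a single prediction can be approximated by Monte Carlo sampling of its marginal contribution over random feature orderings (in the spirit of Štrumbelj and Kononenko). The sketch below uses plain scikit-learn/NumPy on toy data; the function name and parameters are hypothetical, and the article’s actual goal is doing this at scale with PySpark.

```python
# Option 3 (illustrative sketch): Monte Carlo approximation of the Shapley value
# (marginal contribution) of one feature for one instance's anomaly score.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 5))
iso = IsolationForest(random_state=1).fit(X)

def shapley_estimate(model, X_background, x, feature, n_iter=200):
    """Average the change in the model's score when `feature` comes from `x`
    versus from a random background row, over random feature orderings."""
    n_features = X_background.shape[1]
    total = 0.0
    for _ in range(n_iter):
        z = X_background[rng.randint(len(X_background))]  # random background row
        order = rng.permutation(n_features)               # random feature ordering
        pos = int(np.where(order == feature)[0][0])
        x_plus, x_minus = z.copy(), z.copy()
        x_plus[order[:pos + 1]] = x[order[:pos + 1]]      # `feature` taken from x
        x_minus[order[:pos]] = x[order[:pos]]             # `feature` stays from z
        total += (model.decision_function(x_plus.reshape(1, -1))[0]
                  - model.decision_function(x_minus.reshape(1, -1))[0])
    return total / n_iter

x = X[0]                                                  # explain one instance
print([round(shapley_estimate(iso, X, x, j), 4) for j in range(X.shape[1])])
```

Up to sampling error, these per-instance contributions sum to the difference between the instance’s score and the average score over the background data, which is what makes Shapley values a principled attribution.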

Options 1 and 2 were not deemed the best solutions to the problem, mainly because of the difficulty of picking a good algorithm — Random Forest, for example, operates differently than Isolation…

