Explaining the predictions — Shapley Values with PySpark

Interpreting Isolation Forest’s predictions — and not only its

Maria Karanasou
10 min read · Mar 20, 2021


The problem: how to interpret Isolation Forest’s predictions

More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research I ended up with three possible solutions, each illustrated with a short sketch after the list:

1) Train, on the same dataset, another similar algorithm that has feature importance implemented and is more easily interpretable, like Random Forest.

2) Reconstruct the trees, as a graph for example. The most important features should be the ones on the shortest paths of the trees. This follows from how Isolation Forest works: anomalies are few and distinct, so they should be easier to single out in fewer steps — hence the shorter paths.

3) Estimate the Shapley values: the marginal contribution of each feature, which is a more standard way of establishing a feature importance ranking.
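
To make option 1 concrete, here is a minimal, hedged sketch (not the approach this article ends up using): fit a surrogate Random Forest on the Isolation Forest’s own inlier/outlier labels and read its built-in importances. The toy data and parameters below are made up purely for illustration.

```python
# Option 1 (illustrative sketch): train a surrogate Random Forest on the
# Isolation Forest's own labels and use its feature_importances_ as a proxy.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 5))                    # toy data with 5 features
iso = IsolationForest(random_state=42).fit(X)
labels = iso.predict(X)                           # +1 = inlier, -1 = anomaly

surrogate = RandomForestClassifier(random_state=42).fit(X, labels)
print(surrogate.feature_importances_)             # proxy importance ranking
```

Note that the surrogate explains its own decision boundary rather than the Isolation Forest’s, which is part of why this option is set aside below.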
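Option 2 could be sketched along these lines: walk each isolation tree and record the depth at which every feature is used for a split, on the reasoning above that features appearing close to the root isolate points in fewer steps. Again, this is a toy scikit-learn sketch under the default settings (e.g. max_features=1.0), not the article’s implementation.

```python
# Option 2 (illustrative sketch): record the split depth of each feature across
# the isolation trees; features splitting near the root sit on short paths.
from collections import defaultdict
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
iso = IsolationForest(random_state=0).fit(X)

split_depths = defaultdict(list)

def walk(tree, node=0, depth=0):
    """Visit split nodes recursively, recording (feature, depth)."""
    feature = tree.feature[node]
    if feature < 0:                        # leaves are marked with feature = -2
        return
    split_depths[feature].append(depth)
    walk(tree, tree.children_left[node], depth + 1)
    walk(tree, tree.children_right[node], depth + 1)

for estimator in iso.estimators_:          # assumes the default max_features=1.0
    walk(estimator.tree_)

# Lower mean split depth ~ the feature tends to isolate points in fewer steps.
for f in sorted(split_depths, key=lambda f: np.mean(split_depths[f])):
    print(f"feature {f}: mean split depth {np.mean(split_depths[f]):.2f}")
```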
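And option 3, the direction this article takes: the Shapley value of a feature for a single prediction can be approximated by Monte Carlo sampling of its marginal contribution over random feature orderings (in the spirit of Štrumbelj and Kononenko). The sketch below uses plain scikit-learn/NumPy on toy data; the function name and parameters are hypothetical, and the article’s actual goal is doing this at scale with PySpark.

```python
# Option 3 (illustrative sketch): Monte Carlo approximation of the Shapley value
# (marginal contribution) of one feature for one instance's anomaly score.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 5))
iso = IsolationForest(random_state=1).fit(X)

def shapley_estimate(model, X_background, x, feature, n_iter=200):
    """Average the change in the model's score when `feature` comes from `x`
    versus from a random background row, over random feature orderings."""
    n_features = X_background.shape[1]
    total = 0.0
    for _ in range(n_iter):
        z = X_background[rng.randint(len(X_background))]  # random background row
        order = rng.permutation(n_features)               # random feature ordering
        pos = int(np.where(order == feature)[0][0])
        x_plus, x_minus = z.copy(), z.copy()
        x_plus[order[:pos + 1]] = x[order[:pos + 1]]      # `feature` taken from x
        x_minus[order[:pos]] = x[order[:pos]]             # `feature` stays from z
        total += (model.decision_function(x_plus.reshape(1, -1))[0]
                  - model.decision_function(x_minus.reshape(1, -1))[0])
    return total / n_iter

x = X[0]                                                  # explain one instance
print([round(shapley_estimate(iso, X, x, j), 4) for j in range(X.shape[1])])
```

Up to sampling error, these per-instance contributions sum to the difference between the instance’s score and the average score over the background data, which is what makes Shapley values a principled attribution.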

Options 1 and 2 were not deemed the best solutions to the problem, mainly because of the difficulty of picking a good algorithm — Random Forest, for example, operates differently than Isolation…

