Explaining the predictions: Shapley Values with PySpark
Interpreting Isolation Forest’s predictions (and not only those)
The problem: how to interpret Isolation Forest’s predictions
More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research I ended up with three possible solutions:
1) Train another, similar algorithm on the same dataset, one that has feature importance implemented and is more easily interpretable, such as Random Forest.
2) Reconstruct the trees, as a graph for example. The most important features should be the ones on the shortest paths of the trees. This follows from how Isolation Forest works: anomalies are few and different, so they are isolated in fewer splits, which means shorter paths.
3) Estimate the Shapley values, i.e. each feature’s marginal contribution to the prediction, which is a more standard way of ranking feature importance (see the sketch after this list).
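To make option 3 concrete, here is a minimal, single-machine sketch of estimating Shapley values for an Isolation Forest’s anomaly scores. The use of scikit-learn’s IsolationForest, the shap package, and the toy data below are assumptions for illustration only; they are not the PySpark implementation discussed in this article.

# Minimal sketch (not the article's PySpark code): Shapley values for an
# Isolation Forest's anomaly scores, using scikit-learn and shap on toy data.
import numpy as np
import shap
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 4))      # mostly "normal" points
X[:10] += 6                        # a handful of obvious anomalies

iso = IsolationForest(random_state=42).fit(X)

# Explain the continuous anomaly score (decision_function) rather than the
# hard -1/1 label, so the Shapley values show how much each feature pushes a
# point towards "anomalous" or "normal".
explainer = shap.Explainer(iso.decision_function, X)
shap_values = explainer(X[:10])    # marginal contributions for the anomalies

print(shap_values.values.shape)    # (10, 4): one contribution per feature

The same idea scales out once the scoring function and the Shapley estimation are expressed as distributed operations, which is where PySpark comes in.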
Options 1 and 2 were not deemed the best solutions to the problem, mainly because of the difficulty of picking a suitable algorithm: Random Forest, for example, operates differently than Isolation…