More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research, I ended up with three possible solutions:
1) Train another, similar algorithm on the same dataset, one that has feature importance implemented and is more easily interpretable, like Random Forest.
2) Reconstruct the trees, e.g. as a graph. The most important features should be the ones appearing on the shortest paths of the trees. …
Great reasoning, just a small question about the other allowed characters, like `-`, `?`, `_`, `%`, `=`?
Also, would you do anything differently if there were a requirement to support URLs in other languages?
… a unique hash (e.g. MD5, SHA256, etc.) of the given URL. The hash can then be encoded for display. This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]). If we add '+' and '/', we can use Base64 encoding. A reasonable question would be, “What should be the length of the short key? 6, 8, or 10 characters…
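The hash-then-encode idea can be sketched in plain Python; the alphabet and the `short_key` helper below are illustrative assumptions, not the article's implementation:

```python
import hashlib

# Base62 alphabet: digits, lowercase, uppercase.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def short_key(url: str, length: int = 6) -> str:
    # Hash the URL, interpret the digest as a big integer,
    # and re-encode that integer in base62, keeping `length` characters.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))[:length]

print(short_key("https://example.com/some/long/path"))
```

Truncating the encoded hash keeps keys short but makes collisions possible, which is why the key length question matters.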
The Educative Team
First of all, I really enjoyed your thorough analysis, excellent article, thanks!
For the highlighted part, I of course agree about the NoSQL case, but the `no relationships` part is not exactly true, right? I mean, there is the UserID that links the two tables; it is just going to be handled differently.
The best type of database to use would be a NoSQL database store like DynamoDB or Cassandra since we are storing billions of rows with no relationships between the objects.
The Educative Team
It is well known (or should be) that Spark is not secured by default. It is right there in the docs:
Security in Spark is OFF by default
So you should be well aware that you will need to put in the effort to secure your cluster. And there are many things to consider: the application UI, the master UI, the workers' UI, data encryption, SSL for the communication between nodes, and so on. I'll probably write another post covering the above at some point.
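As a starting point, some of the relevant knobs live in `spark-defaults.conf`. The values below are illustrative, not a complete hardening guide:

```
# Require shared-secret authentication between Spark processes
spark.authenticate              true
# Encrypt RPC traffic between nodes (requires spark.authenticate)
spark.network.crypto.enabled    true
# Encrypt data spilled to local disk
spark.io.encryption.enabled     true
# Enable SSL for the web UIs
spark.ssl.enabled               true
```

Each of these has further sub-options (keystores, secrets, per-UI filters) that need to be configured for a real deployment.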
One thing you probably don't have in mind is that Spark has a…
Research and awareness needed
It seems that more and more people agree that G6PDd can be a risk factor for COVID-19, not only in terms of the medication that is used to combat the virus, but also regarding one's susceptibility to the virus and the severity of its side effects. There is an urgent need to verify this through numbers and research. The focus of this article is to raise awareness of the matter.
Watch Weighing the Benefits & Risks of the Covid-19 Vaccine for the G6PD Deficiency Pop. 4–8–21
From the first few days of the lock-down, as…
So, after a few runs with the PySpark ml implementation of Isolation Forest presented here, I stumbled upon a couple of things and I thought I’d write about them so that you don’t waste the time I wasted troubleshooting.
In the previous article, I used VectorAssembler to gather the feature vectors. It so happened that the test data I had created only DenseVectors, but when I tried the example on a different dataset, I realized that:
VectorAssembler can create both Dense and Sparse vectors in the same dataframe (which is smart, and other Spark ML algorithms can leverage it and…
A usual way to read from a database, e.g. Postgres, using Spark would be something like the following:
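A minimal sketch of such a read, with placeholder connection details (URL, table, and credentials are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# All connection details below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.events")
    .option("user", "spark_user")
    .option("password", "secret")
    .load()
)
```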
However, by running this, you will notice that the Spark application has only one active task, which means only one core is being used, and this one task will try to get the data all at once. To make this more efficient, if our data permits it, we can use:
numPartitions: the number of data splits
column: the column to partition by, e.g.
lowerBound: the minimum value for the column — inclusive,
upperBound: the maximum value of the column — be…
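Putting the four options together (again with placeholder connection details, and assuming a numeric `id` column):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

# Placeholders throughout; `id` is assumed to be a numeric column.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.events")
    .option("user", "spark_user")
    .option("password", "secret")
    .option("numPartitions", 8)        # 8 parallel read tasks
    .option("partitionColumn", "id")   # must be numeric, date, or timestamp
    .option("lowerBound", 1)           # together with upperBound, defines
    .option("upperBound", 1_000_000)   # the stride of each partition's query
    .load()
)
```

Note that lowerBound and upperBound only control the partition stride; rows outside that range are not filtered out, they just all land in the first or last partition.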
As a first-time mom of a now two-month-old beautiful baby, I find myself most of the time confined to awkward and tiring positions to feed her and to help her relax and sleep. My most wanted superpower at those times would definitely be telekinesis, because once you are in place for feeding, for example, if you don't have everything you need with you, or someone to help you, then depending on the baby's mood you are very much doomed to staring at the thing you are trying to reach, which is right out of your grasp, the…
Isolation Forest is an algorithm for anomaly / outlier detection, basically a way to spot the odd one out. We go through the main characteristics and explore two ways to use Isolation Forest with PySpark.
Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. […] [Isolation Forest] explicitly isolates anomalies instead of profiles normal points
Isolation means separating an instance from the rest of the instances
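As a quick, standalone illustration of that idea, here is scikit-learn's IsolationForest on a made-up dataset (a tight cluster plus a few far-away points):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Mostly a tight cluster around the origin, plus three far-away outliers.
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([normal, outliers])

clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal
```

The far-away points get isolated in very few random splits, which is exactly what the algorithm exploits.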