Time to check your Spark clusters

Image by Author

Security in Spark is OFF by default


source: author’s crappy quick drawing

TL;DR

It seems that more and more people agree that G6PDd can be a risk factor for COVID-19, not only in terms of the medication used to combat the virus, but also regarding one’s susceptibility to the virus and the severity of its side-effects. There is an urgent need to verify this through numbers and research. The focus of this article is to raise awareness of the matter.


Lessons learned

Debugging PySpark and Isolation Forest — Image by author

Only Dense Vectors

  • VectorAssembler can create both Dense and Sparse vectors in the same dataframe (which is smart, and other Spark ML algorithms can leverage it and work with…


How to leverage Spark to read in parallel from a database

Spark Parallelization
  • numPartitions: the number of data splits,
  • column: the column to partition by, e.g. id,
  • lowerBound: the minimum value of the column, inclusive,
  • upperBound: the maximum value of the column; be careful, it is…
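
To make the role of these four options more concrete, here is a minimal pure-Python sketch of how a reader could turn them into one range predicate per partition. It is an illustration under assumptions, not Spark's actual code: the function name `partition_predicates` is hypothetical, the column is assumed to hold integers, and the exact stride arithmetic in Spark differs between versions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition (simplified sketch)."""
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")

    # Never create more partitions than there are distinct integer values.
    n = max(1, min(num_partitions, upper_bound - lower_bound))
    if n == 1:
        return ["1 = 1"]  # a single, unfiltered read

    # Assumption: a plain integer stride; Spark's own rounding may differ.
    stride = (upper_bound - lower_bound) // n

    predicates = []
    current = lower_bound
    for i in range(n):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < n - 1 else None

        if lower is None:
            # NULL values have to land somewhere; here, the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Hypothetical example: split ids 0..100 into 4 parallel reads.
example = partition_predicates("id", 0, 100, 4)
for clause in example:
    print(clause)
```

In this sketch the first and last ranges are open-ended, which matches how the Spark documentation describes lowerBound and upperBound: they decide the partition stride and are not used to filter rows. In PySpark the same four values map onto the `column`, `lowerBound`, `upperBound` and `numPartitions` arguments of `spark.read.jdbc`.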


Superpower I wish I had: Telekinesis

Photo by Lacie Slezak on Unsplash


Main characteristics and ways to use Isolation Forest in PySpark


Isolation Forest is an algorithm for anomaly / outlier detection, basically a way to spot the odd one out. We go through the main characteristics and explore two ways to use Isolation Forest with PySpark.

Isolation Forest for Outlier Detection

Isolation means separating an instance from the rest of the instances.

Basic Characteristics of Isolation Forest

  • It uses normal samples as the training set and tolerates a few abnormal samples (configurable). You feed the algorithm your normal data, and it does not mind if your dataset is not that well curated, provided you tune the contamination parameter. In other words, it learns what normal looks like to be able to distinguish the…
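
The idea of isolating an instance can be sketched in a few lines of plain Python. This is a deliberately simplified, one-dimensional illustration, not the PySpark or scikit-learn implementation: the helper names `isolation_depth` and `average_depth` and the toy data are assumptions made for this example. It repeatedly cuts the value range at a random point and counts how many cuts it takes until the target value is alone.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count random cuts needed until `target` is the only value left."""
    subset = list(values)
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        lo, hi = min(subset), max(subset)
        if lo == hi:
            break  # only duplicates left, nothing more to separate
        split = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains the target.
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth


def average_depth(values, target, trials=200, seed=0):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials


# Toy data (assumption): a tight cluster of normal values plus one outlier.
normal = [10.0 + 0.1 * i for i in range(20)]
data = normal + [50.0]

outlier_depth = average_depth(data, 50.0)
inlier_depth = average_depth(data, normal[10])
print("outlier:", outlier_depth, "inlier:", inlier_depth)
```

On data like this, the outlier tends to be separated after far fewer cuts than a value inside the cluster, and that difference in path length is the signal Isolation Forest turns into an anomaly score. A real forest additionally picks a random feature for each split and averages path lengths over many trees built on subsamples.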


Photo by Ozgu Ozden on Unsplash




A PySpark case



On G6PD Deficiency

Photo by Michael Longmire on Unsplash

What is Glucose-6-phosphate dehydrogenase (G6PD) Deficiency

About

Maria Karanasou

A Software Engineer who loves to learn new things and is fascinated by ML and Big Data. Writing to better understand what I know and to get to know even more.
