Time to check your spark clusters

It is well known, or should be, that Spark is not secured by default. It is right there in the docs:

Security in Spark is OFF by default

So you should be well aware that you’ll need to put in the effort to secure your cluster. There are many things to consider, such as the application UI, the master UI, the workers’ UI, data encryption, SSL for the communication between nodes, and so on. I’ll probably write another post covering the above at some point.

One thing you probably don’t have in mind is that spark has a…


Research and awareness needed

TL;DR

It seems that more and more people agree that G6PDd can be a risk factor for COVID-19, not only in terms of the medication that is used to combat the virus, but also regarding one’s susceptibility to the virus and the severity of its side-effects. There is an urgent need to verify this through numbers and research. The focus of this article is to raise awareness of the matter.

From the first few days of the lock-down, as I watched the news with the number of patients affected and the casualties, I wondered if there…


Lessons learned

So, after a few runs with the PySpark ml implementation of Isolation Forest presented here, I stumbled upon a couple of things and I thought I’d write about them so that you don’t waste the time I wasted troubleshooting.

Only Dense Vectors

In the previous article, I used VectorAssembler to gather the feature vectors. It so happened that the test data I had produced only DenseVectors, but when I tried the example on a different dataset, I realized that:

  • VectorAssembler can create both Dense and Sparse vectors in the same dataframe (which is smart and other Spark ML algorithms can leverage it and…


How to leverage Spark to read in parallel from a database

A usual way to read from a database, e.g. Postgres, using spark would be something like the following:

However, when you run this, you will notice that the Spark application has only one active task, which means that only one core is being used and that this single task will try to fetch all the data at once. To make this more efficient, if our data permits it, we can use:

  • numPartitions: the number of data splits
  • column: the column to partition by, e.g. id,
  • lowerBound: the minimum value for the column — inclusive,
  • upperBound: the maximum value of the column —be…
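As a minimal sketch of how these four settings fit together, the helper below builds the keyword arguments for PySpark's `DataFrameReader.jdbc`. The helper names, the connection URL, the table name, and the credentials are hypothetical placeholders that do not come from the article, and the bound values are made up for illustration.

```python
def partitioned_jdbc_kwargs(url, table, column, lower_bound, upper_bound,
                            num_partitions, properties=None):
    """Build keyword arguments for spark.read.jdbc with range partitioning."""
    # Sanity checks of this sketch, so mistakes surface before Spark is involved.
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return {
        "url": url,
        "table": table,
        "column": column,                 # the column to partition by
        "lowerBound": lower_bound,
        "upperBound": upper_bound,
        "numPartitions": num_partitions,  # the number of data splits
        "properties": properties or {},
    }


def read_in_parallel(spark, **kwargs):
    """Read the table through an existing SparkSession, one task per partition."""
    return spark.read.jdbc(**kwargs)


# Hypothetical connection details, for illustration only.
kwargs = partitioned_jdbc_kwargs(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="public.events",
    column="id",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=8,
    properties={"user": "spark_reader", "driver": "org.postgresql.Driver"},
)
```

With a live session, `read_in_parallel(spark, **kwargs)` should return a DataFrame whose partitions are fetched by separate tasks, each issuing its own query for one range of `id`.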


Superpower I wish I had : Telekinesis

As a first-time mom of a now two-month-old beautiful baby, I find myself confined most of the time in awkward and tiring positions to feed her and to help her relax and sleep. My most wanted superpower at those times would definitely be telekinesis. Once you are in place for feeding, for example, if you don’t have all the things you need with you and someone to help you, then, depending on the baby’s mood, you are very much doomed to staring at the thing you are trying to reach for, which is right out of your grasp, the…


Main characteristics and ways to use Isolation Forest in PySpark

Isolation Forest is an algorithm for anomaly / outlier detection, basically a way to spot the odd one out. We go through the main characteristics and explore two ways to use Isolation Forest with PySpark.

Isolation Forest for Outlier Detection

Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. […] [Isolation Forest] explicitly isolates anomalies instead of profiles normal points

source: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

Isolation means separating an instance from the rest of the instances
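To make the idea of isolation concrete, here is a minimal one-dimensional sketch in plain Python. It is not the PySpark implementation discussed in the article, nor the full algorithm from the paper (which builds many trees over random features and subsamples). It only counts how many random splits are needed before a point is alone in its partition. The function names and the sample data are made up for illustration.

```python
import random


def isolation_path_length(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # the remaining points are identical and cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def average_path_length(values, target, trials=500, seed=0):
    """Average the path length over many random runs to smooth out the noise."""
    rng = random.Random(seed)
    total = sum(isolation_path_length(values, target, rng) for _ in range(trials))
    return total / trials


data = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 50.0]  # 50.0 is the odd one out
outlier_depth = average_path_length(data, 50.0)
inlier_depth = average_path_length(data, 10.0)
print(outlier_depth, inlier_depth)
```

The point far away from the rest is usually separated by the very first split, while a point in the middle of the cluster needs several, which is why a short average path length can serve as a signal that an instance is anomalous.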

Basic Characteristics of Isolation Forest

  • it uses normal samples as the training set and can allow a few instances of…


For those who are familiar with pandas DataFrames, switching to PySpark can be quite confusing. The API is not the same, and in a distributed setting some things are done quite differently because of the restrictions that distribution imposes.

I recently stumbled upon Koalas from a very interesting Databricks presentation about Apache Spark 3.0, Delta Lake and Koalas, and thought that it would be nice to explore it.

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

pandas is the de…


Debugging a Spark application can range from a fun to a very (and I mean very) frustrating experience.

I’ve started gathering the issues I’ve come across from time to time to compile a list of the most common problems and their solutions.

This is the first part of this list. I hope you find it useful and that it saves you some time. Most of these problems are very simple to resolve, but their stack traces can be cryptic and not very helpful.


A PySpark case

Unit testing Spark applications is not that straightforward. In most cases you’ll probably need an active Spark session, which means that your test cases will take a long time to run and that perhaps you’re tiptoeing around the boundaries of what can be called a unit test. But it is definitely worth doing.

So, should I? Well, yes! Testing your software is always a good thing, and it will most likely save you from many headaches, plus, you’ll be forced to have your code implemented in smaller bits and pieces that’ll be easier to test, thus, gain in…


On G6PD Deficiency

This is a different kind of article from the ones I usually write, but I thought it was important to write it. It will have some technical details at the end about a relevant side project, but it is definitely not technical in content. I wrote this in the hope that it will be helpful to someone, because my experience could have been avoided with a little more information.

What is Glucose-6-phosphate dehydrogenase (G6PD) Deficiency

Glucose-6-phosphate dehydrogenase deficiency (G6PDD) is an inborn error of metabolism that predisposes to red blood cell breakdown.[1] Most of the time, those who are affected have no symptoms.[3] Following a…

Maria Karanasou

A mom and a Software Engineer that loves to learn new things & is fascinated by ML & Big Data. Writing to better understand what I know & to get to know more
