PySpark: Parallel reads from a database

How to leverage Spark to read from a database in parallel

Figure: Spark Parallelization
import os

# subquery that returns the bounds of our partition column
q = '(select min(id) as min, max(id) as max from table_name where condition) as bounds'
# JDBC URLs have the form jdbc:postgresql://host:port/database
# (5432 is the Postgres default port; db_name is a placeholder)
db_url = 'jdbc:postgresql://localhost:5432/db_name'
partitions = os.cpu_count() * 2  # a good starting point
conn_properties = {
    'user': 'username',
    'password': 'password',
    'driver': 'org.postgresql.Driver',  # assuming we have Postgres
}
# given that we partition our data by id, get the minimum and the maximum id:
bounds = spark.read.jdbc(
    url=db_url,
    table=q,
    properties=conn_properties,
).collect()[0]
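With the bounds in hand, the parallel read itself is a single `jdbc` call: pass the partition column plus `lowerBound`, `upperBound` and `numPartitions`, and Spark issues one query per partition. The sketch below assumes the `spark` session, `table_name` and the numeric `id` column from the snippet above; the function names and the `id_ranges` helper (which roughly mimics how Spark slices the range into per-partition WHERE clauses) are illustrative, not part of any API.

```python
def read_table_in_parallel(spark, db_url, conn_properties, bounds, partitions):
    """Read table_name with one JDBC query per partition.

    `bounds` is the Row collected above (bounds.min, bounds.max).
    """
    return spark.read.jdbc(
        url=db_url,
        table='table_name',          # placeholder table from the article
        column='id',                 # numeric column Spark splits the range on
        lowerBound=bounds.min,
        upperBound=bounds.max,
        numPartitions=partitions,
        properties=conn_properties,
    )


def id_ranges(min_id, max_id, num_partitions):
    """Roughly how Spark slices [min_id, max_id]: a fixed stride,
    with open-ended first and last slices so no rows are dropped.

    Returns (lower, upper) tuples; None means unbounded on that side.
    """
    stride = (max_id - min_id) // num_partitions or 1
    edges = [min_id + i * stride for i in range(1, num_partitions)]
    lowers = [None] + edges
    uppers = edges + [None]
    return list(zip(lowers, uppers))
```

Note that rows outside [lowerBound, upperBound] are not filtered out: the first and last partitions are open-ended, so those values only control how the work is split, not which rows are read.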

Written by

A mom and a software engineer who loves to learn new things & is fascinated by ML & Big Data. Writing to better understand what I know & to get to know more.
