I work with PySpark in Databricks on a daily basis. As a Data Scientist, I deal with large amounts of data spread across many different tables, and it is often a challenging job.
As much as the Extract, Transform and Load (ETL) process sounds simple, I can tell you it is not always like that. When working with Big Data, a lot of our thinking has to change, for two reasons:
- The amounts of data are much bigger than in regular datasets.
- When computing in parallel on a cluster, the data is split among many worker nodes, each performing part of the job, and then brought back together as a whole. This process can become very time consuming if the query is too complex.
Knowing that, we must learn how to write smart queries for Big Data. In this post, I will show a few of my favorite functions from the pyspark.sql.functions module, aiming to help you during your Data Wrangling in PySpark.
Now let’s move on to the content in this post.
Just like many other languages, PySpark benefits from modules, where you can find many ready-to-use functions for the most varied purposes. Here's the one we will load into our session:
from pyspark.sql import functions as F
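Once the module is imported under the alias F, every function in it is called as F.<function_name>. A minimal sketch of the pattern, using a throwaway DataFrame built just for illustration:
# Hypothetical toy data, only to show how the F alias is used
toy = spark.createDataFrame([("ideal",), ("premium",)], ["cut"])
toy.select(F.upper(F.col("cut")).alias("cut_upper")).show()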
If you want to see how extensive the list of functions inside pyspark.sql.functions is, go to this website, where the API Reference lives. Keep in mind that it documents version 3.5.0; some older versions may not carry all the functions I will show in this post.
Dataset
The dataset used as our example is Diamonds, from ggplot2, shared under the MIT License.
# Point to the file path
path = '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'

# Load the data
df = spark.read.csv(path, header=True, inferSchema=True)
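With the data loaded, a quick sanity check helps confirm that the schema was inferred correctly and the rows look as expected (in Databricks, display(df) works as well):
# Inspect the inferred schema and preview a few rows
df.printSchema()
df.show(5)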