Learn PySpark in Apache Zeppelin Notebook (Part-1)

Learn PySpark in an interactive way

Jeff Zhang
5 min read · Dec 6, 2021

There are many tutorials on the internet about learning PySpark in the Jupyter notebook, but most of them only show you how to run PySpark in local mode. If you want to run PySpark in distributed mode (yarn or k8s), doing that in the Jupyter notebook is a big pain point. In this series of articles, I’d like to introduce the Apache Zeppelin notebook, where you can learn PySpark without an extra complex setup: only a few configurations are needed to run PySpark on yarn or k8s. I will split this series of tutorials into 2 parts:

  • Learn PySpark in Zeppelin Notebook docker (Part-1)
  • Setup production PySpark environment (on yarn) in Zeppelin (Part-2)

The easiest and quickest way to learn PySpark on Zeppelin is via its docker image.

Setup PySpark environment via Zeppelin docker

  • Step 1

In this article, I use Spark 3.1.2. You can download it from https://spark.apache.org/downloads.html. After downloading, untar it as follows:

tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
  • Step 2

Start Zeppelin in docker via the following command. Note that /Users/jzhang/Java/lib/spark-3.1.2-bin-hadoop3.2 is my Spark location; replace it with the path where you untarred Spark in Step 1.

docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm -v /Users/jzhang/Java/lib/spark-3.1.2-bin-hadoop3.2:/opt/spark -e SPARK_HOME=/opt/spark  --name zeppelin apache/zeppelin:0.10.0
  • Step 3

Open http://localhost:8080 in your browser, and you will see the Zeppelin home page.

You can see that many Spark tutorials ship with Zeppelin. Since we are learning PySpark, just open the note: 3. Spark SQL (PySpark).

SparkSession is the entry point of Spark SQL: you use it to create DataFrames/Datasets, register UDFs, query tables, and so on. Zeppelin has already created a SparkSession (the variable spark) for you, so don’t create it yourself.

PySpark API

The PySpark API is very similar to the pandas API, and it has many functions; I don’t suggest you memorize every one of them. Instead, you can put them into the following 3 main categories and learn only the often-used APIs; for the rest, you can search the docs when needed.

  • Create DataFrame
  • Transformation on a single DataFrame
  • Transformation on multiple DataFrames

Create DataFrame

Overall, there are 2 ways to create a DataFrame:

  • Create a DataFrame from Python objects
  • Create a DataFrame from files

Transformation on a single DataFrame

Add column

Remove column

Select subset of columns

Filter rows

Group by

Transformation on Multiple DataFrames

Join on a single column

Join on multiple columns

Use SQL on DataFrame

One advantage of Zeppelin is that you can mix multiple languages; e.g., you can use both Python and SQL together in one notebook, and they share the same SparkContext/SparkSession.

Visualization

Besides Zeppelin’s built-in visualization capability, you can also use Python’s visualization libraries, such as Matplotlib.

Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. The usage of Matplotlib in Zeppelin is the same as in the Jupyter notebook. The key is to put %matplotlib inline before using Matplotlib. Below is one simple example; for more usage of Matplotlib, you can refer to this link.
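A minimal sketch, written as a plain script (the Agg backend and the output file stand in for %matplotlib inline, which is what you would use in a Zeppelin paragraph where the figure renders inline):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; in Zeppelin/Jupyter use %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("sin(x)")

out = os.path.join(tempfile.mkdtemp(), "sin.png")
plt.savefig(out)  # in a notebook the figure would render inline instead
```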

Seaborn

Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its usage in Zeppelin is the same as in Jupyter. For Seaborn usage, please refer to this link.
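A small sketch (the tiny dataset below is made up for illustration; seaborn’s own sns.load_dataset("tips") would work the same way but needs network access):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in Zeppelin use %matplotlib inline
import pandas as pd
import seaborn as sns

# Tiny illustrative dataset, not from the article
df = pd.DataFrame({"total_bill": [10.3, 21.0, 23.7, 24.6, 16.3],
                   "tip": [1.7, 3.3, 3.5, 3.6, 2.0]})
ax = sns.scatterplot(data=df, x="total_bill", y="tip")
ax.figure.savefig("tips_scatter.png")
```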

Plotnine

plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Altair

Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite, and the source is available on GitHub.

Plotly

plotly.py is an interactive, open-source, and browser-based graphing library for Python.

Summary

This is part 1 of the PySpark tutorial series in Zeppelin, covering only the basic features of PySpark in Zeppelin. In the next part, I will cover more topics, such as the configuration of PySpark on a yarn cluster, how to customize the Python runtime environment, and so on.
