Learn PySpark in Apache Zeppelin Notebook (Part-1)
Learn PySpark in an interactive way
There are many tutorials on the internet about learning PySpark in the Jupyter notebook, but most of them only show how to run PySpark in local mode. If you want to run PySpark in distributed mode (YARN or Kubernetes), setting that up in the Jupyter notebook is a big pain point. In this series of articles, I'd like to introduce Apache Zeppelin Notebook, where you can learn PySpark without any extra complex setup; only a few configurations are needed to run PySpark on YARN or Kubernetes. I will split this series of tutorials into 2 parts:
- Learn PySpark in Zeppelin Notebook docker (Part-1)
- Setup production PySpark environment (on yarn) in Zeppelin (Part-2)
The easiest way to learn PySpark on Zeppelin is via its docker image, as described below.
Setup PySpark environment via Zeppelin docker
- Step 1
In this article, I use Spark 3.1.2. You can download it here: https://spark.apache.org/downloads.html. After downloading, untar it like the following:
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
- Step 2
Start Zeppelin in docker via the following command. Note that /Users/jzhang/Java/lib/spark-3.1.2-bin-hadoop3.2
is my Spark location; replace it with the location where you untarred Spark in Step 1.
docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm -v /Users/jzhang/Java/lib/spark-3.1.2-bin-hadoop3.2:/opt/spark -e SPARK_HOME=/opt/spark --name zeppelin apache/zeppelin:0.10.0
- Step 3
Open http://localhost:8080 in your browser and you will see the Zeppelin home page.
Zeppelin ships with many Spark tutorial notes; since we are learning PySpark, just open the note: 3. Spark SQL (PySpark)
SparkSession is the entry point of Spark SQL; you use it to create DataFrames/Datasets, register UDFs, query tables, and so on. Zeppelin has already created a SparkSession for you (the variable spark), so don't create one by yourself.
PySpark API
The PySpark API is very similar to the pandas API, and like pandas it has a great many functions. I don't suggest you memorize every one of them. Instead, group them into the following 3 main categories and learn the often-used functions; for the rarely used ones, you can search the documentation when needed.
- Create DataFrame
- Transformation on a single DataFrame
- Transformation on multiple DataFrames
Create DataFrame
Overall, there are 2 ways to create a DataFrame:
- Create a dataframe from python objects
- Create a dataframe from files
Transformation on single DataFrame
Add column
Remove column
Select subset of columns
Filter rows
Group by
Transformation on Multiple DataFrames
Join on a single column
Join on multiple columns
Use SQL on DataFrame
One advantage of Zeppelin is that you can collaborate across multiple languages: for example, you can use both Python and SQL in one note, and they share the same SparkContext/SparkSession.
Visualization
Besides the above Zeppelin built-in visualization capability, you can also use Python's visualization libraries, such as the following.
Matplotlib
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. The usage of Matplotlib in Zeppelin is the same as in the Jupyter notebook: the key is to put %matplotlib inline
before using Matplotlib. Below is one simple example; for more usage of Matplotlib, you can refer to this link.
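A minimal sketch of such an example. In a Zeppelin paragraph you would start with %matplotlib inline and end with plt.show(); here the script uses the non-interactive Agg backend and saves to a file instead, so it runs anywhere (the filename is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for a plain script;
                       # in Zeppelin, use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("sin(x)")
plt.savefig("sin.png")  # in Zeppelin you would call plt.show()
```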
Seaborn
Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its usage in Zeppelin is the same as in Jupyter; for seaborn usage, please refer to this link.
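A minimal seaborn sketch under the same headless-backend assumption as above (the tiny hand-made dataset is illustrative only; in Zeppelin you would use %matplotlib inline and plt.show()):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in Zeppelin use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny hand-made dataset, purely for illustration
df = pd.DataFrame({"total_bill": [10.0, 21.5, 35.2, 15.8],
                   "tip": [1.5, 3.0, 5.5, 2.0]})
sns.scatterplot(data=df, x="total_bill", y="tip")
plt.savefig("tips.png")  # in Zeppelin you would call plt.show()
```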
Plotnine
plotnine is an implementation of a grammar of graphics in Python, it is based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.
Altair
Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite, and the source is available on GitHub.
Plotly
plotly.py is an interactive, open-source, and browser-based graphing library for Python.
Summary
This is part 1 of the PySpark tutorial series in Zeppelin, covering just the basic features of PySpark in Zeppelin. In the next part, I will cover more topics, such as the configuration of PySpark on a YARN cluster and how to customize the Python runtime environment.