Flink Development Platform: Apache Zeppelin — (Part 3). Streaming
In Flink on Zeppelin (Part 1) and (Part 2), I talked about how to set up Flink on Zeppelin and run batch tasks on it. In this article, I will talk about how to do stream processing in Flink on Zeppelin via Flink SQL + UDF. Actually, most streaming tasks can be done via Flink SQL + UDF.
Overall, just like in the batch case, you can use Flink SQL + UDF for 2 main scenarios in Zeppelin:
- Streaming ETL (Real time data cleaning and transformation)
- Streaming Data Analytics (Build real time dashboard in Zeppelin)
Prerequisites
In this blog, I will use Kafka as the data source and sink. I will read from one Kafka topic, do some processing on the data, then write it back to another Kafka topic. In order to use the Kafka connector, you need to configure flink.execution.packages to include the necessary dependencies:
- org.apache.flink:flink-connector-kafka_2.11:1.10.0,org.apache.flink:flink-connector-kafka-base_2.11:1.10.0,org.apache.flink:flink-json:1.10.0
Zeppelin will download the packages specified in flink.execution.packages and put them on the classpath of the Flink interpreter, so you don’t need to put them into the Flink lib folder manually.
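For example, you can set this property in a %flink.conf paragraph before the Flink interpreter is started (a minimal sketch; you can also set it on the interpreter setting page instead):
%flink.conf
flink.execution.packages org.apache.flink:flink-connector-kafka_2.11:1.10.0,org.apache.flink:flink-connector-kafka-base_2.11:1.10.0,org.apache.flink:flink-json:1.10.0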
Since it is not easy for everyone to set up a Kafka cluster, I will use this docker-compose file to create one. Run the following 2 commands to start the cluster and create the sample data.
docker-compose up -d

curl -X POST http://localhost:8083/connectors \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d @connect.source.datagen.json
Refer to this link for more details. Besides that, you’d better add the following host mapping to /etc/hosts, otherwise the Flink job cannot connect to this Kafka cluster.
127.0.0.1 broker
Streaming ETL
Now I will demonstrate how to do streaming ETL via Flink SQL. You can find all the code in the tutorial note Flink Tutorial/Streaming ETL, which is included in Zeppelin.
Step 1. Create a source table to represent the source data.
We will use %flink.ssql to indicate that the following SQL statements are streaming SQL, which will be executed via StreamTableEnvironment.
You can write multiple SQL statements in one paragraph; just separate them with semicolons.
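For example, the source table DDL might look like the following sketch. The topic name generated.events, the fields status, direction and event_ts, and the connection addresses are taken from the docker-compose datagen example above; adjust them to your own setup.
%flink.ssql
DROP TABLE IF EXISTS source_kafka;
-- read the raw events as JSON from the Kafka topic generated.events
CREATE TABLE source_kafka (
  status STRING,
  direction STRING,
  event_ts BIGINT
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'generated.events',
  'connector.startup-mode' = 'earliest-offset',
  'connector.properties.zookeeper.connect' = 'localhost:2181',
  'connector.properties.bootstrap.servers' = 'broker:9092',
  'connector.properties.group.id' = 'testGroup',
  'format.type' = 'json'
);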
Step 2. Create a sink table to represent the output data.
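The sink table DDL is similar; here is a sketch that writes JSON to a hypothetical output topic generated.events.cleaned. The WATERMARK clause makes event_ts an event-time attribute, which the tumble-window example later in this article relies on.
%flink.ssql
DROP TABLE IF EXISTS sink_kafka;
CREATE TABLE sink_kafka (
  status STRING,
  direction STRING,
  event_ts TIMESTAMP(3),
  -- declare event_ts as an event-time attribute with a 5 second watermark
  WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'generated.events.cleaned',
  'connector.properties.zookeeper.connect' = 'localhost:2181',
  'connector.properties.bootstrap.servers' = 'broker:9092',
  'connector.properties.group.id' = 'testGroup',
  'format.type' = 'json',
  'update-mode' = 'append'
);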
Step 3. After we have created the source and sink tables, we can use an insert into statement to trigger the streaming job, as follows. Here I just filter the data and convert event_ts to a timestamp.
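A sketch of such an insert statement, assuming the raw event_ts is an epoch value in nanoseconds and 'foo' is a status value we want to drop:
%flink.ssql
-- filter out unwanted records and convert the epoch value to a timestamp
INSERT INTO sink_kafka
SELECT status, direction, CAST(event_ts / 1000000000 AS TIMESTAMP(3))
FROM source_kafka
WHERE status <> 'foo';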
Step 4. After the streaming job is started, you can use another SQL statement to query the sink table to verify your streaming job. Here you can see the top 10 records, which will be refreshed every 3 seconds.
You may notice the type=update after %flink.ssql; that means the result is in update mode, which I will talk about later.
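For example, a top-n query like the following produces a continuously updated result, hence type=update:
%flink.ssql(type=update)
-- show the 10 most recent records in the sink table
SELECT * FROM sink_kafka ORDER BY event_ts DESC LIMIT 10;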
Streaming Data Analytics
After the streaming ETL job is started, we can do streaming data analytics on the processed table (sink_kafka) via select statements. The result of a SQL select statement is pushed to the Zeppelin frontend, which means we can use Zeppelin to build real-time dashboards. You can find the examples in the tutorial note Flink Tutorial/Streaming Data Analysis, which is included in Zeppelin.
Zeppelin supports 3 modes of streaming data analytics:
- Single
- Update
- Append
Single Mode
Single mode is for the case where the result of the SQL statement is always a single row, as in the following example. The output format is HTML, and you can specify the paragraph local property template for the final output content template, using {i} as a placeholder for the i-th column of the result.
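A sketch of a single-mode paragraph; the template text here is hypothetical, and {0} and {1} refer to the first and second columns of the one-row result:
%flink.ssql(type=single, parallelism=1, refreshInterval=3000, template=<h1>{1}</h1> records until <h2>{0}</h2>)
SELECT MAX(event_ts), COUNT(1) FROM sink_kafka;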
Update Mode
Update mode is suitable for the case where the output is more than one row and will be updated continuously. Here’s one example where we use group by.
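For example, a group-by aggregation whose counts keep changing as new data arrives (a sketch against the sink_kafka table above):
%flink.ssql(type=update, parallelism=1, refreshInterval=2000)
-- one row per status; each count is continuously updated
SELECT status, COUNT(1) AS pv FROM sink_kafka GROUP BY status;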
Append Mode
Append mode is suitable for the scenario where the output data is always appended, e.g. the following example, which uses a tumble window.
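For example, a tumble-window aggregation: each 5 second window emits final rows that are only ever appended to the result. This sketch relies on event_ts being a time attribute, see the WATERMARK clause in the sink table DDL above.
%flink.ssql(type=append, parallelism=1, refreshInterval=2000)
-- one output row per 5 second window and status, appended when the window fires
SELECT TUMBLE_START(event_ts, INTERVAL '5' SECOND) AS start_time,
       status,
       COUNT(1) AS pv
FROM sink_kafka
GROUP BY TUMBLE(event_ts, INTERVAL '5' SECOND), status;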
Other Features
You can also use the same features as in batch SQL, such as UDFs and connecting to Hive. Refer to Part 2 for more details.
Summary
This is how you can do stream processing in Flink on Zeppelin. Here are several highlights of this article:
- You can do most of the streaming work via SQL + UDF
- Streaming ETL
- Streaming Data Analytics
- You can also use UDFs and connect to Hive in the same way as in batch
The Zeppelin community is still improving and evolving the overall user experience of Flink on Zeppelin. You can join the Zeppelin Slack to discuss with the community: http://zeppelin.apache.org/community.html#slack-channel
Besides this, I have also made a series of videos to show you how to do all of this; you can check them out via this YouTube link.
Flink on Zeppelin Series
- Flink Development Platform: Apache Zeppelin — (Part 1). Get Started
- Flink Development Platform: Apache Zeppelin — (Part 2). Batch
- Flink Development Platform: Apache Zeppelin — (Part 3). Streaming
- Flink Development Platform: Apache Zeppelin — (Part 4). Advanced Usage
References
- http://zeppelin.apache.org/
- Flink on Zeppelin videos
- https://zjffdu.gitbook.io/flink-on-zeppelin/
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/sql/
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/connect.html
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/systemFunctions.html
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html