Flink Development Platform: Apache Zeppelin — (Part 3). Streaming
In Flink on Zeppelin (Part 1) and (Part 2), I talked about how to set up Flink on Zeppelin and run batch tasks on it. In this article, I will talk about how to do stream processing in Flink on Zeppelin via Flink SQL + UDF. Actually, most streaming tasks can be done via Flink SQL + UDF.
Overall, just like in the batch case, you can use Flink SQL + UDF for 2 main scenarios in Zeppelin:
- Streaming ETL (Real time data cleaning and transformation)
- Streaming Data Analytics (Build real time dashboard in Zeppelin)
Prerequisites
In this blog, I will use Kafka as the data source and sink. I will read from one Kafka topic, do some processing on the data, then write it back to another Kafka topic. In order to use the Kafka connector, you need to configure flink.execution.packages to include the necessary dependencies:
- org.apache.flink:flink-connector-kafka_2.11:1.10.0,org.apache.flink:flink-connector-kafka-base_2.11:1.10.0,org.apache.flink:flink-json:1.10.0
Zeppelin will download the packages specified in flink.execution.packages and put them on the classpath of the Flink interpreter, so you don’t need to put them into the Flink lib folder manually.
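For example, you can set this property in a %flink.conf paragraph before the Flink interpreter is started (a minimal sketch; you can also set it on the interpreter setting page instead):
%flink.conf
flink.execution.packages org.apache.flink:flink-connector-kafka_2.11:1.10.0,org.apache.flink:flink-connector-kafka-base_2.11:1.10.0,org.apache.flink:flink-json:1.10.0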
Since it is not easy for everyone to set up a Kafka cluster, I will use this docker-compose file to create one. Run the following 2 commands to start the cluster and create the sample data.
docker-compose up -d

curl -X POST http://localhost:8083/connectors \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d @connect.source.datagen.json
Refer to this link for more details. Besides that, you’d better add the following host mapping to /etc/hosts, otherwise the Flink job cannot connect to this Kafka cluster.
127.0.0.1 broker
Streaming ETL
Now I will demonstrate how to do streaming ETL via Flink SQL. You can find all the code in the tutorial note Flink Tutorial/Streaming ETL, which is included in Zeppelin.
Step 1. Create a source table to represent the source data.
We will use %flink.ssql to indicate that the following SQL statements are streaming SQL, which will be executed via StreamTableEnvironment.
You can write multiple SQL statements in one paragraph; just separate them with semicolons.
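For example, the source table DDL might look like the following sketch. The topic name generated.events, the fields status, direction and event_ts, and the connection addresses are taken from the docker-compose datagen example above; adjust them to your own setup.
%flink.ssql
DROP TABLE IF EXISTS source_kafka;
-- read the raw events as JSON from the Kafka topic generated.events
CREATE TABLE source_kafka (
  status STRING,
  direction STRING,
  event_ts BIGINT
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'generated.events',
  'connector.startup-mode' = 'earliest-offset',
  'connector.properties.zookeeper.connect' = 'localhost:2181',
  'connector.properties.bootstrap.servers' = 'broker:9092',
  'connector.properties.group.id' = 'testGroup',
  'format.type' = 'json'
);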
Step 2. Create a sink table to represent the output data.
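The sink table DDL is similar; here is a sketch that writes JSON to a hypothetical output topic generated.events.cleaned. The WATERMARK clause makes event_ts an event-time attribute, which the tumble-window example later in this article relies on.
%flink.ssql
DROP TABLE IF EXISTS sink_kafka;
CREATE TABLE sink_kafka (
  status STRING,
  direction STRING,
  event_ts TIMESTAMP(3),
  -- declare event_ts as an event-time attribute with a 5 second watermark
  WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'generated.events.cleaned',
  'connector.properties.zookeeper.connect' = 'localhost:2181',
  'connector.properties.bootstrap.servers' = 'broker:9092',
  'connector.properties.group.id' = 'testGroup',
  'format.type' = 'json',
  'update-mode' = 'append'
);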
Step 3. After we have created the source and sink tables, we can use an insert into statement to trigger the streaming job, as follows. Here I just filter the data and convert event_ts to a timestamp.
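A sketch of such an insert statement, assuming the raw event_ts is an epoch value in nanoseconds and 'foo' is a status value we want to drop:
%flink.ssql
-- filter out unwanted records and convert the epoch value to a timestamp
INSERT INTO sink_kafka
SELECT status, direction, CAST(event_ts / 1000000000 AS TIMESTAMP(3))
FROM source_kafka
WHERE status <> 'foo';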
Step 4. After the streaming job is started, you can use another SQL statement to query the sink table to verify your streaming job. Here you can see the top 10 records, which will be refreshed every 3 seconds.
You may notice the type=update after %flink.ssql; that means the result is in update mode, which I will talk about later.
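For example, a top-n query like the following produces a continuously updated result, hence type=update:
%flink.ssql(type=update)
-- show the 10 most recent records in the sink table
SELECT * FROM sink_kafka ORDER BY event_ts DESC LIMIT 10;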
Streaming Data Analytics
After the streaming ETL job is started, we can do streaming data analytics on the processed table (sink_kafka) via select statements. The result of a SQL select statement is pushed to the Zeppelin frontend, which means we can use Zeppelin to build real-time dashboards. You can find the examples in the tutorial note Flink Tutorial/Streaming Data Analysis, which is included in Zeppelin.
Zeppelin supports 3 modes of streaming data analytics:
- Single
- Update
- Append
Single Mode
Single mode is for the case where the result of the SQL statement is always a single row, as in the following example. The output format is HTML, and you can specify the paragraph local property template for the final output content template, using {i} as a placeholder for the i-th column of the result.
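A sketch of a single-mode paragraph; the template text here is hypothetical, and {0} and {1} refer to the first and second columns of the one-row result:
%flink.ssql(type=single, parallelism=1, refreshInterval=3000, template=<h1>{1}</h1> records until <h2>{0}</h2>)
SELECT MAX(event_ts), COUNT(1) FROM sink_kafka;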
Update Mode
Update mode is suitable for the case where the output is more than one row and will be updated continuously. Here’s one example where we use group by.
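For example, a group-by aggregation whose counts keep changing as new data arrives (a sketch against the sink_kafka table above):
%flink.ssql(type=update, parallelism=1, refreshInterval=2000)
-- one row per status; each count is continuously updated
SELECT status, COUNT(1) AS pv FROM sink_kafka GROUP BY status;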
Append Mode
Append mode is suitable for the scenario where the output data is always appended, e.g. the following example, which uses a tumble window.
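For example, a tumble-window aggregation: each 5 second window emits final rows that are only ever appended to the result. This sketch relies on event_ts being a time attribute, see the WATERMARK clause in the sink table DDL above.
%flink.ssql(type=append, parallelism=1, refreshInterval=2000)
-- one output row per 5 second window and status, appended when the window fires
SELECT TUMBLE_START(event_ts, INTERVAL '5' SECOND) AS start_time,
       status,
       COUNT(1) AS pv
FROM sink_kafka
GROUP BY TUMBLE(event_ts, INTERVAL '5' SECOND), status;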
Other Features
You can also use the same features as in batch SQL, such as UDFs and connecting to Hive. Refer to Part 2 for more details.
Summary
This is how you can do stream processing in Flink on Zeppelin. Here are several highlights of this article:
- You can do most of the streaming work via SQL + UDF
- Streaming ETL
- Streaming Data Analytics
- You can also use UDFs and connect to Hive in the same way as in batch
The Zeppelin community is still improving and evolving the overall user experience of Flink on Zeppelin. You can join the Zeppelin Slack to discuss with the community: http://zeppelin.apache.org/community.html#slack-channel
Besides this, I have also made a series of videos to show you how to do all of this; you can check them out via this YouTube link.
Flink on Zeppelin Series
- Flink Development Platform: Apache Zeppelin — (Part 1). Get Started
- Flink Development Platform: Apache Zeppelin — (Part 2). Batch
- Flink Development Platform: Apache Zeppelin — (Part 3). Streaming
- Flink Development Platform: Apache Zeppelin — (Part 4). Advanced Usage
References
- http://zeppelin.apache.org/
- Flink on Zeppelin videos
- https://zjffdu.gitbook.io/flink-on-zeppelin/
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/sql/
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/connect.html
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/systemFunctions.html
- https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html