The Streaming Mindset … what, why, how? Marta Paes (@morsapaes) Developer Advocate © 2020 Ververica
A presentation at Bristech Meetup in January 2021 by Marta Paes
About Ververica ● Original Creators of Apache Flink® ● Enterprise Stream Processing With Ververica Platform ● Part of Alibaba Group
Working in DevRel J. Doe ● 00:00 3 @morsapaes
Working in DevRel Me ● 00:01 4 @morsapaes
Where do you start? @morsapaes
1 Go Headfirst ● @morsapaes Stream Processing 101
Analytics…Not that Long Ago OLTP Database(s) ETL … Data Warehouse (DWH) FTP Servers 10 @morsapaes
Analytics…Not that Long Ago
The quest for data… OLTP Databases and FTP Servers → ETL (long, nightly jobs) → Data Warehouse (DWH) → Results
When a nightly job breaks (x): someone complaining, someone waking up, re-run long, nightly job
But in the end…
• Most source data is continuously produced
• Not everyone can wait for yesterday’s data
• Most logic is not changing that frequently
Everything is a Stream @morsapaes
Everything is a Stream
Your static data records become events that are continuously produced and should be continuously processed.
● Event Sources: Applications, Sensors, Databases, Devices, …
● Log / Stream Storage: Kafka, Kinesis, Pulsar, …
● Stream Processing
● Sinks: K/V Store, Database, Log, Application, …
● Long-term Storage: S3, HDFS, …
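The source, processing, and sink stages on this slide can be sketched in plain Python. This is only an illustration of the idea, not Flink code: `event_source`, `running_total_per_key`, `sink`, and the `(key, value)` event shape are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, value): a hypothetical event shape


def event_source() -> Iterator[Event]:
    """Stand-in for a log such as a Kafka topic: yields events one at a time."""
    for event in [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4)]:
        yield event


def running_total_per_key(events: Iterable[Event]) -> Iterator[Event]:
    """Process each event as it arrives and emit an updated total right away."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]


def sink(results: Iterable[Event]) -> List[Event]:
    """Stand-in for a database or K/V store: collects the emitted updates."""
    return list(results)


updates = sink(running_total_per_key(event_source()))
print(updates)
```

Each incoming event produces an updated result immediately, instead of waiting for a nightly job to recompute everything.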
Stream Processing 101
Batch Processing: data changes slowly, query/logic changes fast. E.g. ad-hoc queries, data exploration, ML model training
Continuous Streaming: data changes fast, query/logic changes slowly. E.g. most business logic nowadays
A good starter: Streaming 101: the World Beyond Batch
Stream Processing 101: from more batch-like to more real-time
more batch-like ● Offline ML Model Training ● Data Warehousing ● OLAP / BI / Reporting
in between ● Unified Offline/Online Analytics ● Online ML Model Training/Evaluation ● Continuous ETL
more real-time ● Real-time Behavior Modeling (e.g. recommenders, pricing) ● Continuous Monitoring (e.g. position, risk) ● Real-time Alerting (e.g. fraud, security) ● Distributed OLTP-style Apps
Stream Processing Use Cases: Examples ● Large-scale Data Pipelines ● ML-Based Fraud Detection ● Service Monitoring & Anomaly Detection ● Unified Online/Offline Model Training ● E2E Streaming Analytics Pipelines ● ML Feature Generation
2 Bridge Concepts @morsapaes ● Bounded vs. Unbounded data ● Event time vs. Processing time ● Fault tolerance
Bounded vs. Unbounded Data
Batch Processing: • Data “at rest” • Hard boundaries (e.g. process 1 day of data)
Continuous Streaming: • Data “on the fly” • Ever-growing, infinite data set
Windows split the stream into buckets of finite size, over which you can apply computations.
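To make the window idea concrete, here is a minimal plain-Python sketch of fixed-size, non-overlapping (tumbling) windows. It is not Flink's windowing API: the function name `tumbling_window_counts`, the `(timestamp_s, payload)` event shape, and the 10-second window size are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], size_s: int
) -> Dict[int, int]:
    """Count events per fixed-size window, keyed by the window's start time."""
    counts: Dict[int, int] = defaultdict(int)
    for timestamp_s, _payload in events:
        # Every timestamp falls into exactly one bucket [start, start + size_s).
        window_start = timestamp_s - (timestamp_s % size_s)
        counts[window_start] += 1
    return dict(counts)


events = [(1, "a"), (4, "b"), (11, "c"), (12, "d"), (25, "e")]
window_counts = tumbling_window_counts(events, size_s=10)
print(window_counts)  # {0: 2, 10: 2, 20: 1}
```

The sketch only shows how events are assigned to buckets. A real stream processor also has to decide when a window is complete and its result can be emitted, since the input never ends.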
Event Time vs. Processing Time
Event time ● Deterministic results ● Handle out-of-order or late events ● Trade-off between result completeness/correctness and latency
Processing time ● Non-deterministic results ● Best performance and lowest latency ● Speed > completeness/correctness
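The determinism point can be illustrated with a small plain-Python sketch. It replays the same events twice with different, made-up arrival delays and counts them per 10-second window, once by event time and once by arrival (processing) time. The helper `count_per_window` and the `(event_time_s, arrival_time_s)` tuples are hypothetical, and watermarks and lateness handling are deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (event_time_s, arrival_time_s): when the event happened vs. when it was processed
Event = Tuple[int, int]


def count_per_window(timestamps: List[int], size_s: int) -> Dict[int, int]:
    """Count timestamps per fixed-size window, keyed by window start."""
    counts: Dict[int, int] = defaultdict(int)
    for ts in timestamps:
        counts[ts - ts % size_s] += 1
    return dict(counts)


# The same four events in two runs, with different (hypothetical) delays.
run_1: List[Event] = [(2, 3), (8, 9), (9, 14), (12, 13)]
run_2: List[Event] = [(2, 3), (8, 12), (9, 10), (12, 21)]

event_time_1 = count_per_window([e for e, _ in run_1], 10)
event_time_2 = count_per_window([e for e, _ in run_2], 10)
processing_time_1 = count_per_window([a for _, a in run_1], 10)
processing_time_2 = count_per_window([a for _, a in run_2], 10)

print(event_time_1, event_time_2)            # identical across runs
print(processing_time_1, processing_time_2)  # depends on arrival delays
```

Grouping by event time gives the same counts in both runs, while grouping by arrival time does not. The cost, not modeled here, is that an event-time system must wait some amount of time for delayed events before it can finalize a window.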
Fault Tolerance
Batch Processing: pipelines run on a fixed schedule
Continuous Streaming: long-running pipelines that hold State
Fault Tolerance: Checkpoint
Continuous Streaming: the State of long-running pipelines is written to Persistent Storage as checkpointed state
Fault Tolerance: Failure ❌
Continuous Streaming: a long-running pipeline holding State fails
Fault Tolerance: Restore
Continuous Streaming: reset position in input stream and recover all embedded state from the checkpointed state in Persistent Storage
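The checkpoint, failure, and restore sequence on these slides can be sketched as a toy plain-Python job. This is not how Flink implements checkpointing: the class `CountingJob`, its methods, the in-memory `checkpoint` attribute standing in for persistent storage, and the `fail_at` parameter are all hypothetical and exist only to illustrate the idea of storing the input position together with the state.

```python
import copy
from typing import Dict, List, Optional, Tuple


class CountingJob:
    """Toy stateful job that counts words from a replayable input."""

    def __init__(self, source: List[str]) -> None:
        self.source = source
        self.offset = 0                   # position in the input stream
        self.state: Dict[str, int] = {}   # embedded state
        # Stand-in for persistent storage: (offset, state) of the last checkpoint.
        self.checkpoint: Tuple[int, Dict[str, int]] = (0, {})

    def take_checkpoint(self) -> None:
        self.checkpoint = (self.offset, copy.deepcopy(self.state))

    def restore(self) -> None:
        # Reset the input position and recover the state saved with it.
        self.offset, saved = self.checkpoint
        self.state = copy.deepcopy(saved)

    def run(self, checkpoint_every: int, fail_at: Optional[int] = None) -> None:
        while self.offset < len(self.source):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated failure")
            word = self.source[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.take_checkpoint()


words = ["a", "b", "a", "c", "a", "b"]
job = CountingJob(words)
try:
    job.run(checkpoint_every=2, fail_at=3)  # fails after processing 3 words
except RuntimeError:
    job.restore()                           # back to offset 2 and its state
job.run(checkpoint_every=2)
print(job.state)  # {'a': 3, 'b': 2, 'c': 1}
```

Because the offset and the state are saved together, the word processed after the last checkpoint is replayed exactly once after the restore, and the final counts match a failure-free run.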
3 Pick a Flavour & Build @morsapaes
The Flink API Stack
Layered, with different tradeoffs for expressiveness and ease of use. You can mix and match all the APIs!
From most ease of use (top) to most expressiveness (bottom):
● Flink SQL, Table API (dynamic tables), PyFlink: Streaming Analytics & ML
● DataStream API (streams, windows): Stateful Stream Processing
● Building Blocks (events, state, (event) time)
How to Get Hands-On?
Start with whatever language and/or abstractions are most familiar to you: Java/Scala, SQL, Python
● Self-paced Training Course ● Flink SQL Cookbook ● PyFlink Walkthrough ● DataStream API Walkthrough ● Table API Walkthrough ● Zeppelin Notebooks
Starting from the beginning @morsapaes
From being dumbfounded… J. Doe ● 00:00 Me ● 00:01 31 @morsapaes
…to actually having a plan! J. Doe ● 00:00 Me ● 00:01 ✅ Invest in learning the Stream Processing 101 ✅ Take the time to understand how it differs from Batch Processing ✅ Start with something familiar and increase complexity gradually ✅ Ask questions! 32 @morsapaes
Thank you, Bristech! Follow me on Twitter: @morsapaes Learn more about Flink: https://flink.apache.org/ @morsapaes