The Streaming Mindset … what, why, how? Marta Paes (@morsapaes) Developer Advocate © 2020 Ververica

About Ververica
Original Creators of Apache Flink®
Enterprise Stream Processing With Ververica Platform
Part of Alibaba Group

Working in DevRel
J. Doe ● 00:00
Me ● 00:01

Where do you start?

1 Go Headfirst ● Stream Processing 101

Analytics… Not that Long Ago
OLTP Database(s), FTP Servers, … → ETL → Data Warehouse (DWH)

Analytics… Not that Long Ago
The quest for data… OLTP Databases, FTP Servers, … → ETL (long, nightly jobs) → Data Warehouse (DWH) → Results
When a job fails (x): someone waking up, re-run long, nightly job, someone complaining
But in the end…
• Most source data is continuously produced
• Not everyone can wait for yesterday’s data
• Most logic is not changing that frequently

Everything is a Stream

Everything is a Stream
Your static data records become events that are continuously produced and should be continuously processed.
Event Sources (Applications, Sensors, Databases, Devices, …) → Log / Stream Storage (Kafka, Kinesis, Pulsar, …) → Stream Processing → Sinks (K/V Store, Database, Log, Application, …)
Long-term Storage: S3, HDFS, …
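To make the "records become events" idea concrete, here is a minimal sketch in plain Python (no Flink involved). The `(key, amount)` record shape, the sensor names, and the function names are assumptions made up for illustration. The batch function waits for the whole data set; the streaming function keeps a running result and updates it on every event.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, amount): a hypothetical record shape


def batch_totals(records: List[Event]) -> Dict[str, int]:
    """Batch style: the complete data set is available, compute once."""
    totals: Counter = Counter()
    for key, amount in records:
        totals[key] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming style: update the result as each event arrives."""
    totals: Counter = Counter()
    for key, amount in events:
        totals[key] += amount
        yield dict(totals)  # an up-to-date result after every event


records = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4)]
batch_result = batch_totals(records)
running_results = list(streaming_totals(iter(records)))
```

Once the input happens to end, the last streaming result matches the batch result; the difference is that the streaming version had a usable answer after every single event.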

Stream Processing 101
Batch Processing: query/logic changes fast, data changes slowly. E.g.: Ad-hoc queries, data exploration, ML model training
Continuous Streaming: data changes fast, query/logic changes slowly. E.g.: Most business logic nowadays
A good starter: Streaming 101: the World Beyond Batch

Stream Processing 101
more batch-like: Offline ML Model Training, Data Warehousing, OLAP / BI / Reporting
more real-time: Real-time Behavior Modeling (e.g. recommenders, pricing), Unified Offline/Online Analytics, Online ML Model Training/Evaluation, Continuous Monitoring (e.g. position, risk), Continuous ETL, Real-time Alerting (e.g. fraud, security), Distributed OLTP-style Apps

Stream Processing Use Cases: Examples
• Large-scale Data Pipelines
• ML-Based Fraud Detection
• Service Monitoring & Anomaly Detection
• Unified Online/Offline Model Training
• E2E Streaming Analytics Pipelines
• ML Feature Generation

2 Bridge Concepts
● Bounded vs. Unbounded data
● Event time vs. Processing time
● Fault tolerance

Bounded vs. Unbounded Data
Batch Processing: • Data “at rest” • Hard boundaries (e.g. process 1 day of data)
Continuous Streaming: • Data “on the fly” • Ever-growing, infinite data set
Window: Windows split the stream into buckets of finite size, over which you can apply computations
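A rough sketch of the window idea in plain Python, assuming fixed-size, non-overlapping ("tumbling") windows keyed by their start timestamp. The function name, the `(timestamp, value)` event shape, and the sample data are hypothetical. For simplicity the sketch runs over a finite list; a real stream processor would emit each window's result once that window is considered complete.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_sums(events: Iterable[Tuple[int, int]], size: int) -> Dict[int, int]:
    """Assign each (timestamp, value) event to a fixed-size bucket and sum per bucket.

    Returns a mapping of window start -> sum of values in [start, start + size).
    """
    sums: Dict[int, int] = defaultdict(int)
    for timestamp, value in events:
        window_start = timestamp - (timestamp % size)  # bucket the event falls into
        sums[window_start] += value
    return dict(sums)


# Hypothetical events: (timestamp in seconds, value)
events = [(1, 10), (4, 20), (11, 5), (14, 5), (23, 1)]
window_sums = tumbling_window_sums(events, size=10)
```

With a window size of 10, the events fall into the buckets [0, 10), [10, 20) and [20, 30), so an unbounded input turns into a series of small, finite computations.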

Event Time vs. Processing Time
Event time: ● Deterministic results ● Handle out-of-order or late events ● Trade-off between result completeness/correctness and latency
Processing time: ● Non-deterministic results ● Best performance and lowest latency ● Speed > completeness/correctness
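The difference between the two notions of time can be illustrated with a small plain-Python sketch. The `Event` type, its field names, and the timestamps are assumptions chosen for the example. The same events are counted per 10-second window twice: once by the time embedded in the event, once by the time the processor happened to see it. The sketch does not model how a real engine decides when an event-time window is complete.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    event_time: int    # when the event happened (carried in the record)
    arrival_time: int  # when the processor saw it (hypothetical wall-clock value)


def window_counts(timestamps: List[int], size: int) -> Dict[int, int]:
    """Count timestamps per fixed-size window, keyed by window start."""
    return dict(Counter(ts - ts % size for ts in timestamps))


events = [
    Event(event_time=2, arrival_time=3),
    Event(event_time=8, arrival_time=12),   # delayed across a window boundary
    Event(event_time=11, arrival_time=13),
    Event(event_time=9, arrival_time=21),   # late and out of order
]

by_event_time = window_counts([e.event_time for e in events], size=10)
by_processing_time = window_counts([e.arrival_time for e in events], size=10)
```

Grouping by event time gives the same counts no matter how delayed the events were, which is the "deterministic results" bullet. Grouping by arrival time changes whenever network or processing delays change, but needs no waiting for stragglers.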

Fault Tolerance
Batch Processing: pipelines run on a fixed schedule
Continuous Streaming: long-running pipelines

Fault Tolerance: Checkpoint
State → checkpointed state → Persistent Storage

Fault Tolerance
State ❌

Fault Tolerance: Restore
Reset position in input stream
Recover all embedded state (Persistent Storage → checkpointed state → State)
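The checkpoint, failure, and restore sequence can be sketched as a toy, single-process Python example. This is not how Flink implements checkpointing; the `CountingJob` class, its parameters, and the in-memory "persistent storage" are all hypothetical stand-ins. The job counts events per key, snapshots its state together with its input position every few events, and after a simulated failure resets the position and recovers the state from the last snapshot.

```python
import copy
from typing import Dict, List, Optional, Tuple


class CountingJob:
    """Toy stateful pipeline: counts events per key and checkpoints periodically."""

    def __init__(self, checkpoint_every: int) -> None:
        self.checkpoint_every = checkpoint_every
        self.state: Dict[str, int] = {}
        self.offset = 0  # position in the input stream
        # Stand-in for persistent storage: (offset, snapshot of state)
        self.checkpoint: Tuple[int, Dict[str, int]] = (0, {})

    def run(self, stream: List[str], fail_at: Optional[int] = None) -> None:
        while self.offset < len(stream):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated failure")
            key = stream[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Snapshot state and input position together
                self.checkpoint = (self.offset, copy.deepcopy(self.state))

    def restore(self) -> None:
        """Reset the input position and recover state from the last checkpoint."""
        self.offset, snapshot = self.checkpoint
        self.state = copy.deepcopy(snapshot)


stream = ["a", "b", "a", "c", "a", "b", "c"]
job = CountingJob(checkpoint_every=3)
try:
    job.run(stream, fail_at=5)  # fails after processing five events
except RuntimeError:
    job.restore()               # back to the checkpoint taken at offset 3
restored_offset = job.offset
job.run(stream)                 # replay from the restored position
final_state = job.state
```

Because the state snapshot and the input position are saved together, the two events processed after the last checkpoint are replayed exactly once, and the final counts match a run without any failure.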

3 Pick a Flavour & Build

The Flink API Stack
Layered, with different tradeoffs for expressiveness and ease of use. You can mix and match all the APIs!
Layers, from Ease of Use to Expressiveness:
• Flink SQL, Table API (dynamic tables), PyFlink: Streaming Analytics & ML
• DataStream API (streams, windows): Stateful Stream Processing
• Building Blocks (events, state, (event) time)

How to Get Hands-On?
Start with whatever language and/or abstractions are more familiar to you! (Java/Scala, SQL, Python)
● Self-paced Training Course
● Flink SQL Cookbook
● PyFlink Walkthrough
● DataStream API Walkthrough
● Table API Walkthrough
● Zeppelin Notebooks

Starting from the beginning

From being dumbfounded…
J. Doe ● 00:00
Me ● 00:01

…to actually having a plan!
J. Doe ● 00:00
Me ● 00:01
✅ Invest in learning Stream Processing 101
✅ Take the time to understand how it differs from Batch Processing
✅ Start with something familiar and increase complexity gradually
✅ Ask questions!

  • Where to ask questions: How do I get help from the Apache Flink community?

Thank you, Bristech!
Follow me on Twitter: @morsapaes
Learn more about Flink: https://flink.apache.org/