One Does Not Simply Query a Stream

Slide 1

One Does Not Simply Query a Stream!
Viktor Gamov, Confluent (@gamussa)
Iceberg Summit, April 8, 2025
@gamussa | @confluentinc | @apacheiceberg

Slide 2


Slide 3

Viktor Gamov
Principal Developer Advocate | Confluent
X and Bluesky: @gamussa

Slide 4

Simpler times
Monolith
gamov.dev/rel

Slide 5

Simpler analytics
ETL and CDC

Slide 6

Data Pipelines
Streaming data pipelines and microservices

Slide 7

LOG

Slide 8

OLTP vs. OLAP in Streams
OLTP streams vs. OLAP streams

Slide 9

Our Options
• Connect/Relational DB
• Kafka Streams
• Streaming SQL
• Real-Time OLAP
• Data Warehouse/Data Lake
• Tableflow

Slide 10

Kafka Connect

Slide 11

Connect/RDBMS
• Suitable for smaller data
• Transactional
• Familiar to users

Slide 12

Connect/RDBMS
[Diagram: Data Source -> Kafka Connect -> Cluster (Broker, Broker, Broker) -> Kafka Connect -> Data Sink]

Slide 13


Slide 14

Kafka Streams

Slide 15

Kafka Streams (transactional)
• Ingests directly from a topic
• KTable
• Forms an in-memory key/value store suitable for querying by topic key
• Scalable across members of a consumer group
• Readable through Interactive Queries

Slide 16

Kafka Streams (transactional)
    // Read the input topic as a stream of String key/value records.
    KStream<String, String> stream =
        builder.stream(inputTopic, Consumed.with(stringSerde, stringSerde));
    // Materialize the stream as a table backed by a named state store.
    KTable<String, String> convertedTable =
        stream.toTable(Materialized.as("stream-converted-to-table"));
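To make the stream-to-table conversion concrete, here is a minimal sketch in plain Java with no Kafka dependency. It models a table as the latest value per key, folded from a sequence of records, with a null value treated as a delete (a tombstone), which is how a KTable interprets its input. The class and method names (`TableSketch`, `apply`, `get`, `size`) are hypothetical and are not part of the Kafka Streams API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a table view over a stream; not the Kafka Streams API.
class TableSketch {
    private final Map<String, String> store = new LinkedHashMap<>();

    // Apply one stream record: upsert on a value, delete on null (tombstone).
    void apply(String key, String value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    // Point lookup by key, comparable in spirit to an Interactive Query.
    Optional<String> get(String key) {
        return Optional.ofNullable(store.get(key));
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        table.apply("alice", "Berlin");
        table.apply("bob", "Oslo");
        table.apply("alice", "Lisbon"); // later record replaces the earlier one
        table.apply("bob", null);       // tombstone removes the key
        System.out.println(table.get("alice")); // Optional[Lisbon]
        System.out.println(table.get("bob"));   // Optional.empty
    }
}
```

The sketch only answers lookups by the record key, which matches the "querying by topic key" point on the previous slide.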

Slide 17

Kafka Streams (analytical)
• Full-featured Java stream processing API
• Arbitrary streaming computation
• Can emit new streams (not this talk)
• KTables queryable by key
• Every read pattern requires its own topology
• Interactive Queries again

Slide 18

Kafka Streams (analytical)
    // Split each line into lowercase words, group by word, and count occurrences.
    KTable<String, Long> wordCounts = textLines
        .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
        .groupBy((key, word) -> word)
        .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
    // Write the running counts to an output topic.
    wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));
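One way to read the "every read pattern requires its own topology" point: a store keyed by word answers "how often did this word occur?", but not "which words occurred N times?" without a second store keyed differently. The following plain-Java sketch illustrates that idea under this assumption. It does not use Kafka Streams, and `ReadPatternSketch`, `onWord`, `countByWord`, and `wordsByCount` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: two key/value stores derived from one input; not the Kafka Streams API.
class ReadPatternSketch {
    final Map<String, Long> countByWord = new HashMap<>();
    final Map<Long, Set<String>> wordsByCount = new HashMap<>();

    void onWord(String word) {
        long previous = countByWord.getOrDefault(word, 0L);
        long updated = previous + 1;
        countByWord.put(word, updated);
        // Maintain the second store, keyed by count, for the other read pattern.
        if (previous > 0) {
            Set<String> old = wordsByCount.get(previous);
            old.remove(word);
            if (old.isEmpty()) {
                wordsByCount.remove(previous);
            }
        }
        wordsByCount.computeIfAbsent(updated, c -> new TreeSet<>()).add(word);
    }

    public static void main(String[] args) {
        ReadPatternSketch s = new ReadPatternSketch();
        for (String w : List.of("kafka", "iceberg", "kafka")) {
            s.onWord(w);
        }
        System.out.println(s.countByWord.get("kafka")); // 2
        System.out.println(s.wordsByCount.get(1L));     // [iceberg]
        System.out.println(s.wordsByCount.get(2L));     // [kafka]
    }
}
```

Each additional access pattern adds another structure that has to be kept in sync with the input, which is the cost the slide hints at.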

Slide 19


Slide 20

Streaming SQLs

Slide 21

Streaming Database
• SQL for queries
• Streaming source is a first-class citizen
• Persistence/storage

Slide 22

Streaming SQL
• ksqlDB
• Materialize
• RisingWave
• Timeplus

Slide 23

But Viktor, Flink has SQL
Why not Flink?

Slide 24

ksqlDB
• «Streaming Database»
• Provides persistent TABLE abstraction
• Pull and Push queries
• Like Kafka Streams, but in SQL
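The difference between pull and push queries can be sketched in plain Java: a pull query reads the current value for a key and returns, while a push query subscribes once and keeps receiving changes as new events arrive. This is a conceptual model only. It is not the ksqlDB client API, and `PullPushSketch`, `subscribe`, `pull`, and `onEvent` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical model of a continuously updated table; not the ksqlDB API.
class PullPushSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private final List<BiConsumer<String, Long>> subscribers = new ArrayList<>();

    // Push query: register once, then receive every subsequent change.
    void subscribe(BiConsumer<String, Long> listener) {
        subscribers.add(listener);
    }

    // Pull query: return the current value for a key and finish.
    long pull(String key) {
        return counts.getOrDefault(key, 0L);
    }

    // A new event arrives on the stream and updates the table.
    void onEvent(String key) {
        long updated = counts.merge(key, 1L, Long::sum);
        for (BiConsumer<String, Long> s : subscribers) {
            s.accept(key, updated);
        }
    }

    public static void main(String[] args) {
        PullPushSketch table = new PullPushSketch();
        List<String> pushed = new ArrayList<>();
        table.subscribe((k, v) -> pushed.add(k + "=" + v));
        table.onEvent("clicks");
        table.onEvent("clicks");
        System.out.println(table.pull("clicks")); // 2
        System.out.println(pushed);               // [clicks=1, clicks=2]
    }
}
```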

Slide 25

Materialize
• Replacement data warehouse
• Integrates with Kafka, Postgres, dbt
• The Materialized View is the central abstraction
• Views are persistent and queryable
• Postgres wire-compatible
• Positioned as an analytics solution

Slide 26

RisingWave
• Distributed SQL streaming database
• Cloud and OSS versions
• Implementation of Flink in Rust
• Kafka, Pulsar, Kinesis integrations
• Flink + persistent views
• Postgres wire-compatible

Slide 27


Slide 28

Real-Time Analytics Database

Slide 29

Real-Time OLAP
• Designed for high-concurrency, low-latency queries
• Ingests from streaming and batch sources
• Intimate integration with Kafka
• Conventional tables and SQL

Slide 30

Real-Time OLAP
• Analytics shaped like real-time data
• Analytics when users are decision makers

Slide 31

Cloud Data Warehouses

Slide 32

Cloud Data Warehouses

Slide 33

Cloud Data Warehouses
• The cloud-based heir of legacy DWH
• Ingest from batch and streaming sources
• Biased towards structured data and batch access

Slide 34

Data Lake

Slide 35

Data Lake
Anything else
We’ll figure this out

Slide 36

Data Lakes
• Started as the HDFS cluster
• Became S3
• That didn’t help…
• ELT vs. ETL
• Iceberg/Hudi/Delta Lake

Slide 37

Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult