Kafka Streams (transactional)
• Ingests directly from a topic
• KTable
• Forms an in-memory key/value store suitable for querying by topic key
• Scalable across members of a consumer group
• Readable through Interactive Queries
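A minimal sketch of the table idea on this slide, in plain Python rather than the Kafka Streams API: a keyed changelog is folded into a latest-value-per-key store that can then be read by key. The `materialize_table` helper, the sample records, and the use of `None` as a delete marker are illustrative assumptions, not code from the talk.

```python
from typing import Dict, Iterable, Optional, Tuple

# One record from a keyed topic: (key, value). A value of None is treated
# as a tombstone (delete marker) in this sketch.
Record = Tuple[str, Optional[str]]


def materialize_table(changelog: Iterable[Record]) -> Dict[str, str]:
    """Fold a keyed changelog into a latest-value-per-key table."""
    table: Dict[str, str] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone: forget the key if present
        else:
            table[key] = value    # upsert: the newest value wins
    return table


# Hypothetical topic contents, oldest record first.
topic = [
    ("alice", "berlin"),
    ("bob", "paris"),
    ("alice", "london"),  # overwrites alice's earlier value
    ("bob", None),        # deletes bob
]

table = materialize_table(topic)
print(table.get("alice"))  # point lookup by topic key
```

The lookup at the end stands in for what a query by key would do against the materialized state.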
@gamussa | @confluentinc | @apacheiceberg
Kafka Streams (analytical)
• Full-featured Java stream processing API
• Arbitrary streaming computation
• Can emit new streams (not this talk)
• KTables queryable by key
• Every read pattern requires its own topology
• Interactive Queries again
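To illustrate why every read pattern needs its own topology, here is a small stdlib-only Python sketch (not Kafka Streams code): the same stream of orders has to be aggregated twice, once per lookup key, because a table keyed by customer cannot answer a question keyed by region. The `Order` type, `aggregate_by`, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Order(NamedTuple):
    customer: str
    region: str
    amount: int


def aggregate_by(orders: Iterable[Order],
                 key_of: Callable[[Order], str]) -> Dict[str, int]:
    """Build one keyed view: total amount per key chosen by key_of."""
    totals: Dict[str, int] = defaultdict(int)
    for order in orders:
        totals[key_of(order)] += order.amount
    return dict(totals)


orders = [
    Order("alice", "eu", 30),
    Order("bob", "us", 20),
    Order("alice", "us", 5),
]

# Two read patterns, two separate passes over the same input.
by_customer = aggregate_by(orders, lambda o: o.customer)
by_region = aggregate_by(orders, lambda o: o.region)

print(by_customer["alice"], by_region["us"])
```

Each call plays the role of one topology: adding a third read pattern means adding a third view.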
But Viktor, Flink has SQL
Why not Flink?
@gamussa | @confluentinc | @apacheiceberg | gamov.dev/rel
Slide 24
ksqlDB
• «Streaming Database»
• Provides persistent TABLE abstraction
• Pull and Push queries
• Like Kafka Streams, but in SQL
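The difference between pull and push queries can be sketched in plain Python. This is a toy model, not ksqlDB's API or implementation: a pull query returns the current value for a key and finishes, while a push query stays subscribed and receives each later change. `ToyTable` and its method names are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Change = Tuple[str, int]  # (key, new value)


class ToyTable:
    """A table that supports point lookups and change subscriptions."""

    def __init__(self) -> None:
        self._rows: Dict[str, int] = {}
        self._subscribers: List[Callable[[Change], None]] = []

    def upsert(self, key: str, value: int) -> None:
        """Apply one change and notify every active push subscriber."""
        self._rows[key] = value
        for notify in self._subscribers:
            notify((key, value))

    def pull(self, key: str) -> Optional[int]:
        """Pull query: return the current value for key, then finish."""
        return self._rows.get(key)

    def push(self, on_change: Callable[[Change], None]) -> None:
        """Push query: register a callback for all subsequent changes."""
        self._subscribers.append(on_change)


table = ToyTable()
seen: List[Change] = []

table.upsert("sensor-1", 20)  # happens before anyone subscribes
table.push(seen.append)       # subscriber only sees changes from here on
table.upsert("sensor-1", 21)
table.upsert("sensor-2", 7)

print(table.pull("sensor-1"))
print(seen)
```

In this sketch the subscriber misses the first update because it registered afterwards, while the pull lookup always reflects the latest state.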
Slide 25
Materialize
• Replacement data warehouse
• Integrates with Kafka, Postgres, dbt
• The Materialized View is the central abstraction
• Views are persistent and queryable
• Postgres wire-compatible
• Positioned as an analytics solution
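One way to picture a persistent, queryable materialized view is as an aggregate that is updated per change instead of being recomputed per query. The Python sketch below maintains a sum per region from inserts (`+1`) and retractions (`-1`). It is a toy under stated assumptions, not how Materialize is implemented; `SumByRegionView` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

Row = Tuple[str, int]  # (region, amount)


class SumByRegionView:
    """Keeps SUM(amount) GROUP BY region up to date, one change at a time."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)
        self._counts: Dict[str, int] = defaultdict(int)

    def apply(self, row: Row, diff: int) -> None:
        """Apply an insert (diff=+1) or a retraction (diff=-1) of row."""
        region, amount = row
        self._totals[region] += diff * amount
        self._counts[region] += diff
        if self._counts[region] == 0:
            # No rows left in this group, so it leaves the view.
            del self._totals[region]
            del self._counts[region]

    def query(self) -> Dict[str, int]:
        """Read the current view contents without rescanning the input."""
        return dict(self._totals)


view = SumByRegionView()
view.apply(("eu", 30), +1)
view.apply(("us", 20), +1)
view.apply(("eu", 12), +1)
view.apply(("eu", 30), -1)  # the first eu row is deleted upstream
view.apply(("us", 20), -1)  # the only us row is deleted upstream

print(view.query())
```

Reads stay cheap because all the work happens when a change arrives, which is the trade the slide's "persistent and queryable" views rely on.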
Real-Time OLAP
• Designed for high concurrency, low latency queries
• Ingests from streaming and batch sources
• Intimate integration with Kafka
• Conventional tables and SQL
Slide 30
Real-Time OLAP
• Analytics shaped like real-time data
• Analytics when users are decision makers
Slide 31
Cloud Data Warehouses
Slide 32
Cloud Data Warehouses
Slide 33
Cloud Data Warehouses
• The cloud-based heir of legacy DWH
• Ingest from batch and streaming sources
• Biased towards structured data and batch access
Slide 34
Data Lake
Slide 35
Data Lake
Anything else
We’ll figure this out
Slide 36
Data Lakes
• Started as the HDFS cluster
• Became S3
• That didn’t help…
• ELT vs. ETL
• Iceberg/Hudi/Delta Lake
Slide 37
Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult