Data lake: how Red Hat maintains data quality across multiple Drupal sites
Florida DrupalCamp 2023
Melissa Bent, Senior Software Engineer
April Sides, Senior Software Engineer
● Problem
● Discovery
● Solution
● Integration
Discovery
Requirements
▸ Share data in a scalable and maintainable way
▸ Connect with different tech stacks
▸ Serve as a single source of truth
▸ Provide flexible data model/schema
Discovery
Data Repository Types
▸ Relational database
▸ Data warehouse
  ・ Data mart
  ・ Operational data store
▸ Data lake
Discovery
Relational database
- Structured transactional data
- Main features:
  - Data normalization
  - Compliance
Example: Drupal database
Discovery
Data Lake Challenges
▸ Data rot
▸ Data governance
▸ Data compliance
▸ Data security
▸ Data availability
Discovery
Data Lake Advantages
▸ Multiple data sources, consistent access
▸ Protection against downtime
▸ Provides an additional query-level caching layer for production
Solution
Tech Stack
▸ Database Backend: MongoDB
▸ Indexing: Search API + Custom module
▸ Retrieval
  ・ GraphQL
  ・ Direct query
Solution
Indexing
▸ Custom Search API Backend
▸ Search API’s index management
▸ Flexible schema for each data source
▸ Drupal provides access control, the data model, and the editorial experience
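
The indexing idea on this slide, where each data source keeps its own schema while writing to one shared store, can be sketched as follows. This is only an illustration in Python rather than Drupal/PHP, with an in-memory dictionary standing in for MongoDB. `DocumentStore`, `index_item`, the ID format, the site names, and every field name are hypothetical and do not come from the talk.

```python
from typing import Any, Dict, List


class DocumentStore:
    """Hypothetical in-memory stand-in for a document database collection."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def upsert(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # Re-indexing the same item replaces the old copy instead of duplicating it.
        self._docs[doc_id] = doc

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        # Return every document whose fields equal all of the given criteria.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


def index_item(store: DocumentStore, source: str, item_type: str,
               item_id: int, fields: Dict[str, Any]) -> str:
    """Write one item to the store; only the envelope keys are shared across sources."""
    doc_id = f"{source}:{item_type}:{item_id}"  # assumed ID scheme, unique across sites
    store.upsert(doc_id, {"_id": doc_id, "source": source, "type": item_type, **fields})
    return doc_id


store = DocumentStore()
# Two sources with different field sets share one store.
index_item(store, "site_a", "product", 1, {"name": "Example Product", "version": "1.0"})
index_item(store, "site_b", "learning_path", 1,
           {"title": "Example Path", "resources": ["article-1"]})
# Indexing the same product again updates it in place.
index_item(store, "site_a", "product", 1, {"name": "Example Product", "version": "2.0"})
products = store.find(type="product")
```

Keying each document by source, type, and ID is one way to let repeated indexing runs stay idempotent; the real backend may identify items differently.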
Solution
Retrieval
▸ Single-page applications: GraphQL
▸ Drupal: PHP MongoDB Driver
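
To make the two retrieval paths concrete, the sketch below builds the same lookup twice: once as a GraphQL request body (GraphQL servers commonly accept a JSON payload with `query` and `variables` keys over HTTP), and once as a MongoDB-style equality filter of the kind a direct query would pass to the driver. It is written in Python for illustration only. The `product` field, its `slug` argument, and the `name`/`summary` fields are assumed names, not the actual Data Lake schema.

```python
import json
from typing import Any, Dict

# Hypothetical GraphQL query; the real schema is not shown in the talk.
PRODUCT_QUERY = """
query Product($slug: String!) {
  product(slug: $slug) {
    name
    summary
  }
}
"""


def build_graphql_body(query: str, variables: Dict[str, Any]) -> str:
    """Serialize a GraphQL request as the JSON body a client would POST."""
    return json.dumps({"query": query, "variables": variables})


def build_direct_filter(slug: str) -> Dict[str, Any]:
    """Build a MongoDB-style filter document matching one product by slug."""
    return {"type": "product", "slug": slug}


body = build_graphql_body(PRODUCT_QUERY, {"slug": "example-product"})
direct_filter = build_direct_filter("example-product")
```

The GraphQL path suits browser clients that should not hold database credentials, while the direct filter suits server-side code that already has a driver connection.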
Integration: Products
Product Experience
access.redhat.com/products
Integration: Products
Customer Portal Products
▸ Customer Portal helps our customers get the most out of their subscriptions
▸ Product information is core to our data organization
▸ Multiple teams and sites combine to make up the Customer Portal
Integration: Products
Strategy
▸ Product data managed in Drupal
▸ Indexed to the Data Lake
▸ Queried via GraphQL
▸ Page built via GitLab pipeline (statically generated)
  ・ Refreshed every 30 minutes
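
The last step of the strategy above, turning a query result into a static page, might look roughly like the following. This is a sketch under assumptions, not the pipeline's actual build code: `render_product_page` and the `name`/`summary` keys are hypothetical, and the 30-minute schedule itself would live in the pipeline configuration rather than in this function.

```python
from html import escape
from typing import Dict


def render_product_page(product: Dict[str, str]) -> str:
    """Render one product record (e.g. a GraphQL result) as a static HTML page."""
    # Escape values so editorial content cannot inject markup into the page.
    name = escape(product["name"])
    summary = escape(product["summary"])
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{name}</title></head>\n"
        f"<body><h1>{name}</h1><p>{summary}</p></body></html>\n"
    )


page = render_product_page({"name": "Example <Product>", "summary": "Fast & simple."})
```

Because the page is generated ahead of time, visitors are served a plain file and the Data Lake is only queried when the pipeline runs.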
Integration: Learning Paths
Learning Path
A curated collection of content, directing users to learn more about a particular topic or product.
Integration: Learning Paths
developers.redhat.com
Integration: Learning Paths
developers.redhat.com
Article
Cheat Sheet
Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path
Integration: Learning Paths
Learning Path content type
Integration: Learning Paths
Resource content type
Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path
Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path
Integration: Learning Paths
Resource displays
hook_preprocess_node()
Integration: Learning Paths
Shared module
Learning Paths shared module
- Data Lake Learning Paths schema
- Data Lake query service
- Reusable code:
  - Services
  - Controller
  - EventSubscriber
  - Blocks
  - Constraints/Validators
Integration
Content Syndicated Patterns
▸ “Shared patterns” with embedded content for banners, footers, marketing content, etc.
Integration
Customer Portal
▸ Auto-building product pages from the Data Lake (currently Drupal nodes)
▸ Standardizing Product taxonomy across Customer Portal microsites
▸ Integrating with external systems (product life cycle, case management, developers.redhat.com, etc.)
Thank you
Questions?
Melissa Bent
linkedin.com/in/melissabent
twitter.com/merauluka
April Sides
linkedin.com/in/aprilsides
twitter.com/weekbeforenext
Red Hat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat