How Netlify Migrated to a Multicloud Architecture (And No One Noticed)

ryan@ rybit

Who am I? @ry_boflavin

Who am I?

Dog Dad

Engineer

Fire Spinner

Engineer of things

Tech Passions

Distributed Systems

Streaming Data Systems

Infrastructure Automation

System Design

Worked

Raytheon

Palantir Middle East

Yelp

Netlify

full CI/CD

prerendering

content delivery

lambda deployment

routing layer

split testing

identity provider

dns provider

...

What is Netlify?

Netlify is the simplest way to build, deploy, and manage web projects on the JAMstack. We're changing the way the web is built by collapsing the modern front-end development process into a single, simplified workflow.


Over

  • 5 million sites
  • 4,000 requests/sec
  • 1,200 deploys/hour


What am I going to talk about?

1. Intro to the system
2. Why we did all this work
3. How we accomplished it
4. The actual migration
5. Next steps

[Diagram: Content Delivery. DB Cluster, Origin Cluster (six origin nodes), CDN (six cdn nodes), Cloud Files, DNS]

Plan for failure

Redundancy is a priority

Everything is horizontally scalable

Everything runs in a cluster

Health checking for everything

Getting Data into the system

[Diagram: data flowing into the system. DB Cluster, Origin Cluster (six origin nodes), CDN (six cdn nodes), Cloud Files, DNS]

Getting Data out of the system

[Diagram: data flowing out of the system. DB Cluster, Origin Cluster (six origin nodes), CDN (six cdn nodes), Cloud Files, DNS]

Cool, but where are the actual servers?

But where does it live?

[Diagram: DB Cluster, Origin Cluster (six origin nodes), CDN]

And when it fails?

[Diagram: DB Cluster, Origin Cluster (six origin nodes), CDN, DNS]

What happens when things go wrong?

CDN stays up

Keep serving cached content

Higher traffic sites are going to be happier

Multicloud Setup

Providers fail with no notice

Degraded perf > outage

Same cloud is fastest

RAX

Cloud Servers

Cloud Files

AWS

Elastic Compute Cloud

S3

GCP

Compute Engine

Cloud Storage

Why do all of this?

Because clouds fail

But how do we build around that?


Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover


Assumption Checking

https://github.com/rybit/cloud-bench


Replicate it all

[Diagram: origin nodes, primary DB, CF; steps 1, 2, 3]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1
}

Upload mask

RAX = 1
AWS = 2
GCP = 4

Example:

m = 6 → AWS & GCP

m = 3 → AWS & RAX

m = 1 → RAX only
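
The upload mask above is a plain bit field, so it can be decoded with bitwise operations. The following is a minimal sketch, not Netlify's code: the bit values come from the slide, while the `Cloud` class and the helper names `clouds_in_mask`, `mark_uploaded`, and `is_fully_replicated` are hypothetical names chosen for illustration.

```python
from enum import IntFlag


class Cloud(IntFlag):
    """Bit values as given on the slide: RAX = 1, AWS = 2, GCP = 4."""
    RAX = 1
    AWS = 2
    GCP = 4


# Mask with every cloud's bit set (1 | 2 | 4 == 7).
ALL_CLOUDS = int(Cloud.RAX | Cloud.AWS | Cloud.GCP)


def clouds_in_mask(m: int) -> list[str]:
    """Return the names of the clouds whose bit is set in mask m."""
    return [c.name for c in (Cloud.RAX, Cloud.AWS, Cloud.GCP) if m & c]


def mark_uploaded(m: int, cloud: Cloud) -> int:
    """Return the mask with the given cloud's bit switched on."""
    return int(m | cloud)


def is_fully_replicated(m: int) -> bool:
    """True when the blob is recorded as present in all three clouds."""
    return m == ALL_CLOUDS


print(clouds_in_mask(6))  # ['AWS', 'GCP']
print(clouds_in_mask(3))  # ['RAX', 'AWS']
print(clouds_in_mask(1))  # ['RAX']
```

A single integer field keeps the blob record small and lets one equality check (`m == 7`) answer "is this object everywhere yet?".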

BlobSync

Done out of band from the request cycle

Constantly queries for unreplicated blobs

Pulls object down, pushes to the other clouds

Records progress and errors

[Diagram: BlobSync reads a blob record with "m": 1, copies the object from CF to S3 and GCS, and the record becomes "m": 7]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1
}

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 7
}
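
One pass of the BlobSync behaviour described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the real service: the object stores are stand-in Python dicts rather than Cloud Files, S3, or GCS clients, the records are plain dicts shaped like the blob document on the slide, and `sync_once` is a hypothetical name.

```python
RAX, AWS, GCP = 1, 2, 4          # bit values from the upload mask slide
ALL_CLOUDS = RAX | AWS | GCP     # 7: present in every cloud


def sync_once(records, stores):
    """Copy every blob that is missing from at least one cloud.

    records: list of dicts with at least "sha" and "m" (the upload mask).
    stores:  dict mapping a cloud bit to a dict of sha -> bytes, standing
             in for a real object-store client (an assumption).
    Returns a list of (sha, message) tuples for blobs that hit an error.
    """
    errors = []
    for rec in records:
        if rec["m"] == ALL_CLOUDS:
            continue  # already replicated everywhere
        sha = rec["sha"]
        # Pull the object down from the first cloud whose bit is set.
        source = next((bit for bit in stores if rec["m"] & bit), None)
        data = stores[source].get(sha) if source is not None else None
        if data is None:
            errors.append((sha, "no readable source copy"))
            continue
        # Push to each cloud that does not have it yet.
        for bit, store in stores.items():
            if rec["m"] & bit:
                continue
            try:
                store[sha] = data
                rec["m"] |= bit  # record progress one cloud at a time
            except Exception as exc:
                errors.append((sha, f"push to cloud bit {bit} failed: {exc}"))
    return errors


sha = "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7"
stores = {RAX: {sha: b"site file"}, AWS: {}, GCP: {}}
records = [{"sha": sha, "size": 9, "m": 1}]
errors = sync_once(records, stores)
print(records[0]["m"], errors)  # 7 []
```

Running `sync_once` in a loop with a short sleep would approximate "constantly queries for unreplicated blobs". Updating the mask after each successful push means a crash mid-copy loses at most one cloud's progress.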

Replicate it all

[Diagram: origin nodes, primary DB, CF; steps 1, 2, 3]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1,
"r" : true
}

Replication Flag

Sparse index in mongo

State of the world

[Diagram: origin nodes, primary DB, CF, GCS, S3, BlobSync, CDN]


State of the world

[Diagram: origin nodes, CF, GCS, S3, BlobSync, CDN, primary DB]


Cloud Agnostic Origin Services

Generic cloud storage interface

Automatic failover

Prefer staying in cloud

Forceable overrides
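
A generic storage interface with automatic failover, as listed above, could be sketched as follows. This is a hypothetical illustration: `BlobStore`, `MemoryStore`, and `fetch` are invented names, the in-memory store stands in for real Cloud Files, S3, and GCS clients, and the order in which stores are tried is left to the caller.

```python
from typing import Protocol


class BlobStore(Protocol):
    """The one interface the origin code talks to, whatever the cloud."""
    name: str

    def get(self, sha: str) -> bytes: ...


class MemoryStore:
    """In-memory stand-in for a real object-store client (an assumption)."""

    def __init__(self, name, blobs=None, healthy=True):
        self.name = name
        self.blobs = dict(blobs or {})
        self.healthy = healthy

    def get(self, sha):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.blobs[sha]  # KeyError if the blob is not here


def fetch(sha, stores):
    """Try each store in order; fall through to the next one on failure."""
    last_error = None
    for store in stores:
        try:
            return store.name, store.get(sha)
        except (ConnectionError, KeyError) as exc:
            last_error = exc
    raise LookupError(f"{sha} not available from any cloud") from last_error


stores = [
    MemoryStore("aws", {"abc": b"page"}, healthy=False),  # simulated outage
    MemoryStore("gcp", {"abc": b"page"}),
]
print(fetch("abc", stores))  # ('gcp', b'page')
```

Because callers only see `get`, adding or removing a provider does not touch the request path; only the list of stores changes.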

Smart Resolution

[Diagram: origin, CDN, primary DB]

{
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"m" : 7
}

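
The Smart Resolution slides above pair a blob's upload mask with the idea of preferring the origin's own cloud. One way to turn that into a lookup order is sketched below. The ordering rule (override first, then the local cloud, then the remaining clouds) is an assumption drawn from the "Prefer staying in cloud" and "Forceable overrides" bullets, and `resolution_order` is a hypothetical name.

```python
RAX, AWS, GCP = 1, 2, 4  # bit values from the upload mask slide
NAMES = {RAX: "RAX", AWS: "AWS", GCP: "GCP"}


def resolution_order(m, local, override=None):
    """Return the clouds to try, in order, for a blob with upload mask m.

    m:        upload mask from the blob record
    local:    bit of the cloud the origin is running in
    override: optional bit of a cloud to force to the front
    Clouds whose bit is not set in m are skipped entirely.
    """
    available = [bit for bit in (RAX, AWS, GCP) if m & bit]

    def rank(bit):
        if bit == override:
            return 0  # forced override wins
        if bit == local:
            return 1  # otherwise stay in the same cloud
        return 2      # fall back to the other clouds

    # sorted() is stable, so ties keep the RAX, AWS, GCP order.
    return [NAMES[bit] for bit in sorted(available, key=rank)]


print(resolution_order(7, GCP))                # ['GCP', 'RAX', 'AWS']
print(resolution_order(6, RAX))                # ['AWS', 'GCP']
print(resolution_order(7, AWS, override=RAX))  # ['RAX', 'AWS', 'GCP']
```

With "m": 7 every cloud is a candidate, so an origin in GCP reads from GCP first and only crosses clouds if that read fails.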

State of the world

[Diagram: a group of origin nodes alongside each of CF, GCS, and S3; BlobSync, CDN, primary DB]


Pulling the trigger

1. Spin up enough origin services
2. Fail over the DB
3. Update the consul entry
4. Aggressively stare at monitors

State of the world

[Diagram: a group of origin nodes alongside each of CF, GCS, and S3; BlobSync, CDN; the primary DB changes position after the cutover]

Summary

Redundant everything

Cloud agnostic origin and CDN

Programmable infrastructure

Out of band replication

Smart routing

Automated failover

So now what?

Set up a trickle of traffic to the live standby

Automate the traffic switch

Speed up network scale-up

More monitoring

WE ARE HIRING

ryan@ rybit

Find me to talk! @ry_boflavin