Migrate 3 million websites without anybody noticing Vincent Cassé & Horacio Gonzalez 2020-10-13

Who are we? Introducing ourselves and introducing OVHcloud

Vincent Cassé @vcasse Host for millions of websites. Breakfast and HTTPS included. Engineering Manager at

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

OVHcloud: A Global Leader ● 200k Private cloud VMs running ● 1 Dedicated IaaS Europe ● 30 Datacenters ● Own 20Tbps Network with 35 PoPs ● 1.3M Customers in 138 Countries ● Hosting capacity: 1.3M Physical Servers ● 360k Servers already deployed

OVHcloud: 4 Universes of Products ● WebCloud ● Baremetal Cloud ● Public Cloud ● Hosted Private Cloud [Diagram: product map of the four universes, covering domain names and email, web hosting, virtual and dedicated servers, storage, databases, network, security, big data and AI, plus managed services and partner offerings]

“OVH: We Host You” (On vous héberge)

Webhosting at OVHcloud ● Biggest web hoster in Europe ● 6 million websites ● 60 Gb/s ● 6 billion HTTP requests (excluding CDN caches) ● 15 000 web servers

Webhosting at OVHcloud: a short history ● Hosting in P19 (Paris) since 2003 ● The web has changed since 1999 ● New datacenter opening: Gravelines in 2016

What’s the hoster’s job?

apt-get install apache2 php7 mysql-server? ● Store data ● Run source code

Why did we want to leave Paris? ● Hardware end of life ● Natural decline too slow

Why was it difficult?


Risk management Probability by magnitude ● 0.1% for 1 website: 1 in 1 000 chance ● 0.1% for 100 websites: about 1 in 10 chance ● 0.1% for 3 million websites: about 3 000 occurrences Risk = Impact * Probability
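The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes each website fails independently with the same probability, which is a simplification; the function names are illustrative only.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent sites hits the problem."""
    return 1.0 - (1.0 - p) ** n


def expected_failures(p: float, n: int) -> float:
    """Expected number of affected sites among n."""
    return p * n


if __name__ == "__main__":
    p = 0.001  # the 0.1% per-site probability used on the slide
    for n in (1, 100, 3_000_000):
        print(f"{n:>9} sites: P(at least one) = {p_at_least_one(p, n):.4f}, "
              f"expected = {expected_failures(p, n):.1f}")
```

For 100 websites the exact value is slightly under 10%, so "1 in 10" is a fair rounding. At 3 million websites a failure is practically certain and the useful number becomes the expected count.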

Split brain definition Split-brain is a computer term, based on an analogy with the medical Split-brain syndrome. It indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and synchronizing their data to each other. https://en.wikipedia.org/wiki/Split-brain_(computing)

Hosting architecture: view for one website

Load balancing and fault tolerance

Fault domain

Difference between P19 & Gravelines

Files constraint ● Customer dependencies: source code / images / JavaScript… ● Rsync limitations ● Block copy implies migrating all the customers of a filerz at once

Clusters constraint ● High-cost infrastructure is shared per cluster (load balancer, IP…) ● DNS zone relies on customer configurations ● IP migration implies migrating all the customers of a cluster

Database constraint ● Database linked to one hosting account but… ● Exhaustive knowledge = comprehensive mastery of source code ● Breaking zero websites implies migrating all websites at the same time

Database constraint² ● Database naming uses a subdomain of mysql.db ● But it is a “recent” feature (5 years) ● Old usages are incompatible with the Gravelines datacenter

So, how do we migrate?

Be punk! Break the rules If we respect all the constraints: ● Either migrate the sites one by one, knowing each website in detail ● Or migrate them all at the same time (TCP over trucks?)

Database naming

Database naming: ProxySQL

Database naming not known ● Network tunnel between the two datacenters ● Impact: +10 ms latency for each request ● “Best effort”
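To get a feel for what +10 ms per request means for a page, here is a minimal worked example. It assumes that "request" means one SQL query and that a page issues its queries sequentially; both are assumptions, and the query counts are hypothetical.

```python
TUNNEL_PENALTY_MS = 10  # extra latency per request through the tunnel, from the slide


def added_page_latency_ms(sequential_queries: int,
                          penalty_ms: float = TUNNEL_PENALTY_MS) -> float:
    """Extra time a page spends waiting when every SQL query crosses the tunnel."""
    return sequential_queries * penalty_ms


if __name__ == "__main__":
    for queries in (1, 30, 200):  # hypothetical query counts per page
        print(f"{queries:>3} queries: +{added_page_latency_ms(queries):.0f} ms")
```

The penalty grows linearly with the number of queries, which fits with the tunnel being offered only as best effort.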

ProxySQL and latency

SQL proxy and latency [Diagram: P19 and Gravelines, each with database servers (db 1, db 2 … db 2500), reached through names such as mysql55-XXX.plan and dbXXX.plan, with +10 ms between the two datacenters]

Shared IP constraint 127.0.0.1 ::1 ● To migrate: P19 ● Already migrated: GRA

Shared IP constraint 127.0.0.1 ::1 [Diagram: the shared IP still to migrate, with web servers and filerz spread across P19 and GRA]

File constraints Are we able to migrate all the customers of a filerz at the same time?

Party time! Let’s migrate!

IP switch ● Information system adaptation ● Load balancer patch ● Network tunnel ● Tools & monitoring

ProxySQL ● Configuration automation ● Risk-managed deployment: 1 / 10 / 100 / 1000 ● SQL proxying at scale: some surprises ○ MySQL and password storage format… ○ ARP table ○ Old database management
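The 1 / 10 / 100 / 1000 progression can be sketched as a staged rollout that stops at the first failure. This is only an illustration: it reads the four numbers as batch sizes, and `deploy` and the target names are hypothetical stand-ins, not the actual tooling.

```python
from typing import Callable, Iterator, List, Sequence

STAGES = (1, 10, 100, 1000)  # read here as batch sizes (an assumption)


def staged_batches(targets: Sequence[str],
                   stages: Sequence[int] = STAGES) -> Iterator[List[str]]:
    """Yield batches of growing size; reuse the last size once stages run out."""
    start = 0
    stage = 0
    while start < len(targets):
        size = stages[min(stage, len(stages) - 1)]
        yield list(targets[start:start + size])
        start += size
        stage += 1


def roll_out(targets: Sequence[str],
             deploy: Callable[[str], bool],
             stages: Sequence[int] = STAGES) -> List[str]:
    """Deploy batch by batch and stop at the first failure.

    Returns the targets that were deployed successfully.
    """
    done: List[str] = []
    for batch in staged_batches(targets, stages):
        for target in batch:
            if not deploy(target):  # hypothetical deploy callback
                return done
            done.append(target)
    return done
```

A mistake in the configuration then affects one proxy first, and at most the current batch later on, instead of the whole fleet.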

Migration plan ● Migrate filerz by filerz ● Databases related to the migrated hostings are migrated at the same time ● 1 IP switch at a time, so 1 cluster at a time ● Clusters are migrated in order of risk level, from least risky to most risky
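Ordering clusters by risk can reuse the earlier definition, Risk = Impact * Probability. The sketch below is purely illustrative: the cluster names, impact values and probabilities are invented, and the real scoring criteria are not given in the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Cluster:
    name: str
    impact: float       # hypothetical unit, e.g. number of websites affected
    probability: float  # hypothetical chance of a problem during migration

    @property
    def risk(self) -> float:
        return self.impact * self.probability


def migration_order(clusters: Iterable[Cluster]) -> List[Cluster]:
    """Return clusters from least risky to most risky."""
    return sorted(clusters, key=lambda c: c.risk)


if __name__ == "__main__":
    plan = migration_order([
        Cluster("cluster-a", impact=5000, probability=0.02),
        Cluster("cluster-b", impact=200, probability=0.01),
        Cluster("cluster-c", impact=1000, probability=0.05),
    ])
    for cluster in plan:
        print(cluster.name, cluster.risk)
```

Starting with the least risky clusters lets the team refine tooling and procedures before touching the ones where a mistake would hurt most.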

Chronological timeline ● Hardware order ● D-60: Setup new cluster ● D-40: Setup filerz ● D-30: Setup databases hosts ● D-30: Setup filerz ● D-30: Communication ● D-15: Communication ● D-10: Cluster tests ● D-7: IP Switch ● D-1: Accelerate filerz incremental copies ● Night 1: Migration X filerz ● Night N: Migration last filerz ● D+1: Close P19 cluster infra ● Decommissioning

IP Switch (D-7) 1. Destination cluster and network tunnel tests 2. Send communication to support and customers 3. Redirect SSL jobs 4. Set up all SSL on the destination load balancer 5. For each IPv4 / IPv6 address: • Route the IP to the new load balancer • Test websites in Paris and Gravelines 6. Route the CDN to the new infrastructure

Filerz migration: during the night 1. Test the cluster's websites 2. Cut monitoring of the cluster 3. Launch an incremental copy 4. Close websites (maintenance mode) 5. Wait for the PHP timeout 6. Close file access from the filerz 7. Launch the last incremental copy 8. When the data is in Gravelines: launch the database migration 9. Update the configuration of migrated hostings (IS, infrastructure…) 10. Reopen hosting accounts 11. Wait for the end of the database migration 12. Test websites again and check that all is OK 13. Enable cluster monitoring 14. Notify customers of the end of operations 15. Go to bed!
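The reason for running an incremental copy before and after closing the websites can be shown with a toy model. This is an in-memory sketch with dictionaries standing in for the filerz, not the tooling used for the migration; the file names are made up.

```python
from typing import Dict


def incremental_copy(source: Dict[str, str], dest: Dict[str, str]) -> int:
    """Bring dest in line with source; return how many entries changed."""
    changed = 0
    for path, content in source.items():
        if dest.get(path) != content:
            dest[path] = content
            changed += 1
    for path in list(dest):
        if path not in source:  # file deleted on the source since the last pass
            del dest[path]
            changed += 1
    return changed


if __name__ == "__main__":
    p19 = {"index.php": "v1", "logo.png": "img"}  # hypothetical files
    gra: Dict[str, str] = {}

    print(incremental_copy(p19, gra))  # first pass copies everything
    p19["index.php"] = "v2"            # the site is still open and changes
    # ... close websites, wait for the PHP timeout, close file access ...
    print(incremental_copy(p19, gra))  # last pass only moves what changed
    print(gra == p19)
```

Because writes are stopped before the last pass, that pass is short and the two copies end up identical, which keeps the maintenance window small.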

Migration: and databases? For each database: 1. Put the database in read-only mode 2. Dump the database 3. Import the database on the destination cluster and put the new one in read-write mode 4. Redirect the DNS name to the new server 5. Set up the SQL proxy to the new server 6. Close the old database in Paris
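The six steps can be modelled in a few lines to show why the order matters: once the source is read-only, nothing can be written to it between the dump and the switch, so the two copies cannot diverge. This is a toy in-memory sketch; the `Database` class and the `dns` and `proxy` dictionaries are hypothetical stand-ins for the real systems.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Database:
    rows: Dict[str, int] = field(default_factory=dict)
    read_only: bool = False
    closed: bool = False

    def write(self, key: str, value: int) -> None:
        if self.read_only or self.closed:
            raise PermissionError("database does not accept writes")
        self.rows[key] = value


def migrate_database(name: str, source: Database,
                     dns: Dict[str, Database],
                     proxy: Dict[str, Database]) -> Database:
    source.read_only = True               # 1. read-only mode
    dump = dict(source.rows)              # 2. dump
    destination = Database(rows=dump)     # 3. import, new copy is read-write
    dns[name] = destination               # 4. redirect the DNS name
    proxy[name] = destination             # 5. point the SQL proxy at the new server
    source.closed = True                  # 6. close the old database
    return destination
```

During the window between steps 1 and 4, applications can still read but writes fail, which is the price paid for never having two writable copies at once.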


Migration: and databases? ● Distribution of operations on all servers ● Orchestration ○ storing information inside… a database! ● Record: 13 502 databases migrated in 1 hour 13 minutes
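Keeping orchestration state in a database can be sketched as a small task table from which workers claim the next pending migration. The schema, the function names and the use of SQLite are all assumptions made for illustration; the slides do not describe the actual orchestrator.

```python
import sqlite3
from typing import Iterable, Optional


def create_queue(conn: sqlite3.Connection, database_names: Iterable[str]) -> None:
    """Create the task table and register one pending task per database."""
    conn.execute(
        "CREATE TABLE migration_tasks ("
        " name TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO migration_tasks (name) VALUES (?)",
        [(name,) for name in database_names],
    )
    conn.commit()


def claim_next(conn: sqlite3.Connection, worker: str) -> Optional[str]:
    """Mark the next pending task as running for this worker and return its name."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT name FROM migration_tasks"
            " WHERE status = 'pending' ORDER BY name LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cursor = conn.execute(
            "UPDATE migration_tasks SET status = 'running', worker = ?"
            " WHERE name = ? AND status = 'pending'",
            (worker, row[0]),
        )
        return row[0] if cursor.rowcount == 1 else None


def mark_done(conn: sqlite3.Connection, name: str) -> None:
    """Record that a database has been migrated."""
    with conn:
        conn.execute(
            "UPDATE migration_tasks SET status = 'done' WHERE name = ?", (name,)
        )
```

With the state in one table, any server can pick up work and progress can be read with a single query. A multi-worker deployment would need stronger locking than this single-connection sketch provides.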

Organisation

Challenges

  • Technical, but that was this presentation up to this slide - Infrastructure work split across specialized teams (database, web servers, storage servers, datacenters, server factory, support, load balancers, CDN, network…) - Legacy - Loooong migration

Continuous improvement organisation

● Build the migration plan ● Implement and test the plan ● Migrate, then improve the migration after each week

Change management

--verbose? • Why we decided to migrate three million websites? https://www.ovh.com/blog/web-hosting-why-we-decided-to-migrate-three-million-websites/ • How to host 3 million websites? https://www.ovh.com/blog/web-hosting-how-to-host-3-million-websites/ • How to migrate 3 Million web sites? https://www.ovh.com/blog/web-hosting-how-to-migrate-3-million-web-sites/ • How do our databases work? https://www.ovh.com/blog/web-hosting-how-do-our-databases-work/ • How to win at the massive database migration game https://www.ovh.com/blog/how-to-win-at-the-massive-database-migration-game/ • migrate-datacentre --quiet: How do we seamlessly migrate a datacentre? https://www.ovh.com/blog/migrate-datacentre-quiet-how-do-we-seamlessly-migrate-a-datacentre/ • A day in the life of a ProxySQL at OVHcloud https://www.ovh.com/blog/a-day-in-the-life-of-a-proxysql-at-ovhcloud/ • Another day in ProxySQL life: sharing is caring https://www.ovh.com/blog/another-day-in-proxysql-life-sharing-is-caring/ More soon on https://ovh.com/blog