At Better, we try our best to give our users the greatest possible experience. In order to achieve this, we employ continuous deployment of our website resulting in multiple deploys per day. The overwhelming majority of the time this has no effect on our stability. On the rare occasions where there is an issue - rather than presenting users with scary, ugly, and opaque error pages devoid of any helpful messaging - we’d like to present them with something that reassures them of the work we’re doing to resolve whatever the issue may be. There are also occasions when we wish to completely redeploy our cluster resulting in changes to the public IPs clients should be making requests to.
As our infrastructure runs on AWS we had previously solved these problems using DNS. These solutions were:
- A Route53 health check that updated the DNS record for better.com on outages to point to a static error page.
- Manual DNS changes on cluster redeploys.
As DNS caching occurs at name servers and clients this can result in clients making requests to IPs that are no longer accurate. This can persist for hours in the case of clients with aggressive caches and even when everything works as expected can take many minutes.
Our current solution is to use CloudFront as a reverse proxy with custom error responses for HTTP status codes of 502, 503, or 504 served from S3. This has the benefit of providing user-friendly error pages without the annoyances of DNS record updates and allowing us to completely update the cluster in one fell swoop. As we run all on Kuberenetes having a reverse proxy which exists outside of that cluster makes upgrading the cluster much easier. CloudFront performs this task and has been incredibly easy to manage.
How to enable customized error pages for CloudFront custom origins
Create a CloudFront distribution that forwards requests to a custom origin. Specify an alternate domain, SSL cert, and any caching behavior you would like here.
Add a secondary origin which will forward requests to an S3 bucket, in which you will place your custom error page/s.
Decide upon the path you would like to serve your custom error page/s from. This should not conflict with any path you would like to be served from the custom origin. For this demonstration, we will use/error-pages/*.
Add a behavior that forwards all requests to your S3 bucket for the path you specified in the previous step.
Add any content you would like to display to users on errors to the S3 bucket you are forwarding requests to at the subpath which matches the forwarding path. In our case this would bes3://<bucket_name>/error-pages/.
The final step is to create custom error responses. This is as simple as deciding upon the HTTP status codes you would like to intercept errors for, creating an error response, and customizing it to serve content you have added to your S3 bucket.
- By default CloudFront requires clients that support SNI. In order to add custom IPs to support legacy clients, AWS charges $600/month.
- The maximum response time CloudFront supports before responding to the client with a 504 is 60 seconds. In order to support this, we had to tweak a request made in staging environments.
- Upon testing, we found that changes to the origin could take up to 5 minutes to propagate to edge nodes.
- CloudFront does not currently support WebSockets.
Deploying a few EC2 instances running HAProxy or Nginx could have solved all of the mentioned issues but would have left us managing an entirely new cluster. We decided the operational overhead of this was not worthwhile.
During a recent cluster upgrade, there were no reports of broken IP addresses. In the past, we would have had to update user-facing DNS records which resulted in clients seeing error pages when a DNS cache, either on their machine or somewhere between our nameservers and their machine, became stale. Flushing Google’s DNS cache helped resolve some of these issues, but obviously not all.
Read Scaling Websocket TrafficUsing database replication to scale the systemThu Nov 05 2020—by Timothy Harrington4 min read
Wizard — our ML tool for interpretable, causal conversion predictionsBuilding an end-to-end ML system to automatically flag drivers of conversionFri Dec 27 2019—by Kenny Ning6 min read
Modeling conversion rates and saving millions of dollars using Kaplan-Meier and gamma distributionsAt Better, we have spent a lot of effort modeling conversion rates using Kaplan-Meier and gamma distributions. We recently released convoys, a Python package to fit these models.Mon Jul 29 2019—by Erik Bernhardsson8 min read