> Almost all updates to Google’s services proceed gradually, according to a defined process, with appropriate verification steps interspersed. [..] The first stages of a rollout are usually called "canaries” — an allusion to canaries carried by miners into a coal mine to detect dangerous gases. Our canary servers detect dangerous effects from the behavior of the new software under real user traffic.
> Canary testing is a concept embedded into many of Google’s internal tools used to make automated changes, as well as for systems that change configuration files. [..] If the change doesn’t pass the validation period, it’s automatically rolled back.
Any Googlers around to explain why canaries didn't catch the configuration change causing this side effect and stop the deployment before the issue cascaded globally?
The outage here reminds me of issues CloudFront faced wrt configurations in their earlier years. Here's master presenter Harvo Jones going into a bit more detail: https://youtube.com/watch?v=n8qQGLJeUYA
Kind of vindicates AWS' decision (after major outages to DynamoDB and S3) to not just isolate regions or zones but resources within those regions/zones further into what they call "cells" https://youtube.com/watch?v=swQbA4zub20
Tumblr spoke about "cells" too, in the context of achieving isolation at scale, but in a quite different way: http://highscalability.com/blog/2012/2/13/tumblr-architectur...
As these known-knowns (and the Google sre-book is teeming with these) get designed around, it'd be interesting to see why/how future outages would happen.
Disclaimer: I don't work on the thing that broke here, but I am in SRE. Speaking in broad generalities.
I'm not (any longer) a Googler, but according to the postmortem this was a load-related incident. Canary changes only have a small amount of traffic directed at them (by definition) and the problem probably didn't become apparent until after the change was promoted from canary.
Google's 'canary' system is sub-optimal and risky.
It picks a subset of the system (say 5%), and deploys the change there. It then observes this 5% of the system, and if it's okay, it rolls out to the rest of the system.
Compare that to a continual-canarying system:
* Start deploying the new code gradually (say 1 container/pod/task every second).
* While the above is in progress, continually collect stats from upgraded and non-upgraded systems.
* If the stats differ to any significant level, halt the rollout.
The continual-canarying system is much better because the sample size needed to see a 'significant' change depends on the size of the change and the amount of natural variation in the sample. The size of the change obviously can't be known beforehand. The natural variation differs widely depending on the metric of interest (CPU, memory use, request latency, etc.).
It is therefore impossible in Google's canarying system to choose a suitable subset size that detects small but important changes in a metric (with sufficiently high statistical power) while still stopping the rollout early enough for catastrophic changes in a metric.
Continual-canarying effectively solves this issue by gradually increasing the sample size as the rollout goes on.
In reality, neither continual-canarying nor Google-canarying works well for detecting anomalous metrics, because the rollout process itself causes applications to restart, which in itself changes their performance characteristics (cold caches, empty queues, first-use delay, JIT optimization, etc.).
The solution to that is to do two simultaneous continuous-canary rollouts - one of the existing version (effectively just a restart) and one of the new version, and use those two groups as input to the logic to decide if any metrics of interest have statistically significant changes.
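To make that concrete, here's a rough sketch of the decision step, assuming metric samples are being collected from both groups (the function name, thresholds, and the use of a Welch's t-test via scipy are my own choices for illustration, not anything Google-specific):

    # Rough sketch of the halt/continue decision in a continual-canary rollout.
    from scipy import stats

    def should_halt_rollout(control_samples, canary_samples, alpha=0.01):
        """Compare a metric (e.g. request latency) between the restarted
        old-version control group and the new-version canary group."""
        if len(control_samples) < 10 or len(canary_samples) < 10:
            return False  # not enough data yet; keep rolling out slowly
        _, p_value = stats.ttest_ind(control_samples, canary_samples,
                                     equal_var=False)  # Welch's t-test
        return p_value < alpha

    # Called on every tick of the rollout loop: as more tasks are upgraded the
    # sample sizes grow, so progressively smaller regressions become detectable,
    # while large regressions trip the check almost immediately.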
Please go implement this stuff Google! It won't be hard to do, and will really increase the safety of your rollouts. (Even though I know it wouldn't have helped this time)
Consider the types of issues you run into. There are:
- Things that are painful, but not destructive (minor performance regressions)
- Things that are highly destructive (data gets deleted, major performance regressions, etc.)
The first kind you can tolerate being deployed to many people, so whether you detect it at 7.8% of your users or at 10% doesn't much matter; you could run it at 10% forever without issue.
The second you'll detect on a smaller population because the changes are catastrophic.
> In reality, neither continual-canarying nor Google-canarying works well for detecting anomalous metrics, because the rollout process itself causes applications to restart, which in itself changes their performance characteristics (cold caches, empty queues, first-use delay, JIT optimization, etc.).
The problem here is that most of the time, you don't care about startup behavior, but steady-state behavior. Imagine that a new version introduces a memory leak, so the old behavior was to linearly increase to 1GB of memory over 1 hour and then level off, and the new behavior is to increase unbounded until eventually the system OOMs and restarts or whatever.
A "google style" canary would roll out to 1% of tasks or something, wait a few hours, and notice the difference and roll back. You might even experience 1% of tasks restating, but it's likely that the system can sustain that.
With the continuous canary, you'll release to 1 task per minute or whatever, and only be able to notice any change after you've pushed out to 60 tasks, and the change will likely only be detectable with any confidence once you have a notable difference in 120 or so. At that point, you've released to more than 1% of your tasks (either that, or it takes you 8 days to release a new version).
You can fix that by slowing the rate of releases, but now it takes you 8 weeks to release a new version.
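For what it's worth, the arithmetic behind those figures seems to assume a fleet of roughly 11,500 tasks; that's my inference, not something stated above:

    # Back-of-the-envelope numbers (fleet size is an assumption).
    rate_per_min = 1                          # tasks upgraded per minute
    fleet_size = rate_per_min * 60 * 24 * 8   # ~11,500 tasks if a full rollout takes 8 days
    one_percent = fleet_size / 100            # ~115 tasks
    detection_threshold = 120                 # tasks needed for a confident signal
    # 120 > 115, so by the time the difference is detectable with confidence
    # you're already past 1% of the fleet. Slow the rollout ~7x and the full
    # release stretches to ~8 weeks:
    slow_rollout_weeks = fleet_size / (rate_per_min / 7) / 60 / 24 / 7   # ~8 weeks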
Plus, even worse, you're now in a much more fail-open environment. With a 1% release, you can sustain being in that kind of bad state even if all of your qualification tooling fails. If you're continuously canarying, though, you have to be much more careful to make sure that your tooling won't continue to push if the tooling itself is broken or getting unusual results. It's a riskier setup.
A) Your argument is optimised for a certain type of failure, which leaves you open to the others (issues that are only detected after a high-ratio deployment)
B) You can monitor just the jobs impacted by the canary deployment
C) The progressions you discuss seem strawman-like. You can have a doubling progression starting at 1 or 5, a tripling progression, etc.
D) On the point about detecting at 7.5 and living at 10: what if it's only detectable at 20 and you've already rolled out fully by then?
If you detect a problem with FooBar frontend at 1pm, the first question anyone's going to ask is, "What changed in FooBar frontend or any of its dependencies at or shortly before 1pm?"
If the answer is, "Bazz backend deployed at 12:50pm" then someone with no understanding of the innards of FooBar or Bazz can roll back Bazz and there's a pretty darn good chance they'll have fixed the problem quickly. This is a scalable approach to fixing things that works well in practice.
If the answer is, "Nothing just changed. Every layer of the system has been gradually changing all day" then your oncall person will need an in-depth understanding of every service involved in the system as well as creativity in guessing where to start looking. It can work, but it's going to be slower and more difficult and fewer people will have the system-specific knowledge and debugging skills do it well and quickly.
B) Absolutely, but if a large subset of the jobs recently restarted, you get noise and nonsense
C) Indeed you can do exponential continuous rollouts, but then you're getting pretty close to a 3- or 4-stage rollout where each stage is a 5x increase in size or whatever.
People need to gradually reduce the load after the breaking point to check the system recovers.
Load tests are also pretty hard to do in distributed systems. If you test the application alone, you probably won't find most of the issues. You'll need to test the application complete with all its dependencies (databases, load balancers, failover mechanisms, external servers, etc.). You'll also probably want to test it with representative user requests, and with all databases filled with representative data. That in turn typically means your loadtest system will need to be as big and expensive as your production system. Have fun explaining to the boss why you need to double the infrastructure costs. If you do it in the cloud, you can just run a loadtest for 10 mins per day and save a bunch of $$$, but you still need tooling to be able to deploy a complete replica of your system, fill it with realistic data and send it realistic requests, all automatically.
Using real user data, logs of user interactions, and real user requests is best for loadtesting, but comes with its own risks. You need to make sure the loadtest system doesn't send out emails to users, or accidentally communicate with the real production systems in any way. It also means you have to secure your loadtest infrastructure as well as your production infrastructure. GDPR data deletion requests need to apply there too, etc.
One of the tricky things about canaried releases and staggered rollouts is that to implement them, you need another layer of configuration to manage that. And then that's another potential source of outages - and scary ones too, because your configuration is so closely tied to prod itself. It's kind of the whole "my test needs a test, and the test of my test needs a test, and ..." issue.
Plus, you can only spend so much time testing out disaster scenarios before other priorities take over - if your customers are asking for features A, B, C, well, at some point it becomes a judgment call whether you spend time on the new features or testing the resiliency of your service. Even when you do test the hell out it, it's hard to guarantee you've tested literally every edge case.
The other thing is that, as you point out, as you add more and more mitigations for failure scenarios that you have experience with, the fraction of the failures that you've never seen before starts to increase. It's somewhat obvious, because by definition a regression test safeguards against failures that you've experienced, but it's an important thing to keep in mind when thinking about exactly what the tests you do write will keep you safe from.
Google literally invented this concept, and it has been spread to other companies by Xooglers.
Here's a summary of James Hamilton's paper (then a Microsoft employee) on building large-scale internet applications published in 2007, 8 years (?) before Borg was made public knowledge: https://blog.acolyer.org/2016/09/12/on-designing-and-deployi...
> Partition the system in such a way that partitions are infinitely adjustable and fine-grained, and not bounded by any real world entity (person, collection, [..]
https://twitter.com/bcantrill/status/1092849514229059585 I guess the notion that Google is an all-superior tech giant that faces unique problems and solves them in the best way possible is a widespread but misplaced notion among Googlers?
My rule of thumb was: if I have a change that will shrink tablets by 10% through e.g. compression and I apply it wholesale with a manual compaction of the entire table, I should assume that in the worst case the table will temporarily take up 190% of current space, until the next GC run or two. If all or most of your data is in one table, this can be a problem. Organic compactions are friendlier because they occur over the course of (typically) days, leaving the GC with plenty of time to clean up.
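Spelled out, the 190% worst case is just both copies of the table coexisting until GC catches up:

    # Worst case for a wholesale compaction that shrinks data by 10%:
    old_copy = 1.00                    # existing, uncompressed tablets
    new_copy = 0.90                    # rewritten, compressed tablets
    peak_usage = old_copy + new_copy   # 1.90 -> ~190% of current space
    # Both copies coexist until the next GC run or two reclaims the old one;
    # organic compactions spread this overhead out over days instead.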
Now, to increase isolation between services, everyone is off in a partition, and each individual partition is much smaller and therefore can't withstand resource spikes without falling over.
And I bet the "emergency loan" functionality for various resources still isn't automated and still has arcane requirements to meet and a delay before it kicks in. Yay - a 15 minute delay. That's exactly what I need when my entire service is down! /s
I remember one incident where a team got an alert that they were at 90% of quota or so. They compacted data too aggressively and they actually hit 100% before the GC could do its job. That's why I mentioned that you should compact a fraction of the table, then let GC run. Google being Google, I'm sure such a tool has been written, in several variants, by N different teams.
My view is that, at a high level, AWS cells are isolated, fuzzy-partitioned, migratable, heterogeneous instances of a complete service stack that are managed such that they always fail independently of each other; contact points among these cell instances, if required, are established via highly available intermediate services (like ReplicatedStateMachinesOrStorage); the clients are routed to assigned cells by a highly available routing layer (like NetworkTrafficShapers/DNS), which can also help facilitate migration of clients from one or more cells to another.
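As a toy illustration of that routing-layer idea (every name below is invented, and a real implementation would keep the assignment map in a highly available store rather than in memory):

    import hashlib

    CELLS = ["cell-1", "cell-2", "cell-3"]
    assignments = {}   # customer_id -> cell; persist this somewhere highly available

    def cell_for(customer_id):
        """Thin routing layer: look up (or deterministically create) a sticky
        customer -> cell assignment, so cells keep failing independently."""
        if customer_id not in assignments:
            h = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
            assignments[customer_id] = CELLS[h % len(CELLS)]
        return assignments[customer_id]

    def migrate(customer_id, target_cell):
        """Migration as a first-class operation: drain, copy state, then flip
        the pointer; never an implicit rehash across all customers."""
        assignments[customer_id] = target_cell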
I encourage you to view the re:invent presentation on it  if resiliency in distributed systems is something that interests you.
I find them similar too in what they're trying to achieve [blast-radius minimization], but I think AWS' concept of cells [going by re:invent presentation] is more of an architectural pattern with strong inclination towards certain non-negotiable reqs ["thinnest possible routing layer", "workload migration among cells is a first-class citizen", "avoiding critical cross-cell tasks/dependencies other than migration", "creating/removing cells are zero-downtime events"] for different flavours of services [stateful vs stateless (dynamodb vs lambda?), zonal versus regional versus global (ebs vs kinesis vs route53), for example] than a concrete implementation like Borg.
I might be wrong on all counts, though.
Any idea what this means? What kind of configuration change would reduce storage but overload the system? I imagine turning on compression could do that, but I'd expect it would be easy to reverse. (Edit: Maybe it was the cascading failure that made it more serious?)
Maybe the increased storage usage was due to the wrong kind of load rebalancing? I can easily see that a change to reduce the _object_ hit rate in any given location might just increase the _metadata_ hit rate. If storage is balanced by spreading the load to workers that keep their data nearby (for reduced latency), then an unexpected increase in these workers could also cause an increase in required storage.
If you want to balance the load on these properly, you'll have to do more location data lookups. Oops. Now your metadata's metadata is the chokepoint. Massive read contention on blob store address lookups --> cascading failure.
Edit: given the mention of a job that got stopped, perhaps they had a tool that manually forced compaction of data in order to adopt new compression settings or purge deleted rows. Normally, compactions would happen on the tablet servers on a cycle of N days.
Another possible explanation is that they tweaked the garbage collector, which runs on the master.
Facebook down March 13
Apple (iCloud) down March 14
Just some guesses (not affiliated):
I wonder if there was something related to Tombstones and deletion of extra data? That can often be a failure of replicated sharded big data systems using sstables and tombstones for deletion.
Another guess might be that some sort of internal shard count was increased resulting in some lookup process needing to check n more shards per call.
The real question is why we systems designers don't have better tests for cascading failure. Even Google has very few tests which deliberately overload a system (to cause it to fail) and then verify that it can recover on its own promptly under typical load.
E.g. consider a database which serves a web application. If the application times out querying the database, it retries up to 5 times. That is a system with an unrecoverable cascading failure: if one day we get a small spike in users and the database becomes slow, a few timeouts occur, some requests get retried, putting more load on the database, making it slower, causing more timeouts, and the problem gets rapidly worse until a few minutes later all requests are failing. Even when the spike in users is over, the database will remain effectively dead, because it is still failing under load from all those retries.
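That retry pattern, roughly (an illustrative sketch, not any particular client library):

    def handle_request(query_db):
        # Naive retry loop: every timeout immediately triggers another attempt.
        for attempt in range(5):
            try:
                return query_db(timeout=1.0)   # starts timing out once the DB slows down
            except TimeoutError:
                continue                       # retry immediately, no backoff
        raise TimeoutError("database unavailable")

    # Once the database slows past the timeout, each user request becomes up to
    # 5 database queries, so offered load multiplies roughly 5x exactly when the
    # database can least absorb it, and stays there even after the user spike ends.
    # Bounded retry budgets, exponential backoff with jitter, or a circuit breaker
    # are the usual ways to make this failure mode recoverable.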
A load test that didn't push the system over the edge and then test for recovery wouldn't be able to detect that.
Too many people loadtest things to the point of failure and then stop. All it takes is to continue the test for a few minutes longer with a slightly lower load to verify recovery!
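Something like this, for example, where the assertion is on recovery rather than on the breaking point itself (load_generator and error_rate are stand-ins for whatever load-testing tooling you already have):

    def test_recovery_after_overload(load_generator, error_rate):
        # 1. Push well past the known breaking point until the system falls over.
        load_generator.set_qps(10_000)
        load_generator.hold(minutes=5)

        # 2. Drop back to a load the system handled comfortably before the test.
        load_generator.set_qps(6_000)
        load_generator.hold(minutes=10)

        # 3. The assertion that usually gets skipped: did it come back on its own,
        #    or did retries and full queues keep it pinned in the failed state?
        assert error_rate(last_minutes=2) < 0.01, \
            "system did not recover once the overload was removed"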
Also, it's difficult to get good coverage by just pushing things to their limits. Let's say you have a system with this RPC chain: A -> B -> C -> D. Not all requests cost the same, and not all requests make it to D. In practice, different compositions of traffic may overload any of them. For example, some queries may hit expensive ML models while other queries have a large number of candidates. In this case the former hits the predictor service hard while the latter may hit the index serving systems. I've actually read that an outage happened during the Super Bowl because sports-related queries were particularly expensive for some part of the system.
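A toy version of that, with made-up costs and capacities, where total QPS stays flat but a shift in the query mix pushes one backend in the chain past its capacity:

    # Invented per-query costs for each backend in the A -> B -> C -> D chain.
    COST = {
        "ml_heavy":   {"B": 1, "C": 9, "D": 1},   # hits the predictor (C) hard
        "many_cands": {"B": 1, "C": 1, "D": 9},   # hits index serving (D) hard
        "plain":      {"B": 1, "C": 1, "D": 1},
    }
    CAPACITY = {"B": 5_000, "C": 20_000, "D": 20_000}

    def backend_load(mix, qps):
        """mix maps query type -> fraction of traffic; returns load per backend."""
        return {svc: sum(qps * frac * COST[q][svc] for q, frac in mix.items())
                for svc in CAPACITY}

    normal   = backend_load({"ml_heavy": 0.1, "many_cands": 0.1, "plain": 0.8}, 4_000)
    game_day = backend_load({"ml_heavy": 0.1, "many_cands": 0.6, "plain": 0.3}, 4_000)
    # Same 4,000 QPS overall, but D goes from ~7,200 to ~23,200 cost units,
    # well past its capacity, while B and C still look perfectly healthy.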
If it was possible to know these a priori, why allow them in the first place?
Or are they saying they will be writing software to roll back configuration changes that cause disruption? ("disallowed")
It's getting increasingly obvious that despite being highly distributed, cloud systems run by Google, Amazon and other big cloud companies are inherently centralized. If nothing else, they have centralized controls.
A decentralized system can, of course, fail, but it would never have "configuration change" as the root cause.
Beyond those bucket mutations, everything else is purely regionally scoped.
I'm _really_ surprised to see that GCP has made the same mistakes, though, and had things tied together. I'm not convinced there's a good excuse. AWS/S3 is understandable. No one knew what things would look like. Not one engineer would have made the same decisions then if they'd known what the future held for the platform. Google came to the cloud market much later. Their engineers should have seen the issues with AWS and the problems that interdependence causes, and specifically architected isolated infrastructure (which, from this post-mortem, it seems like they didn't). That's both remarkable and disappointing to see.
Could you elaborate on those choices? Also what do you mean by they're "stuck having to tie regions together"?
Decentralization often makes these scenarios harder to avoid, not easier. There are simply more ways that things can interact (and therefore break).
To avoid these failures in any setting, you need to have provable containment of failures while assuming part of your network is malicious. That's asking for a lot more than decentralization.
Let's take your car example.
If you update your car's firmware and get in an accident, it's not a system-level failure.
If the manufacturer sends a wireless update and all their autonomous cars start to bump into one another, it's not really a decentralized system anymore - updates are centralized.
If you update your car, it crashes, sends a wireless signal that affects another car, and it crashes and so on... This is an example of a cascading failure in a fully decentralized system, but the real root cause would be cars interfering with one another, not your firmware update.
The point of me making up a stupidly improbable story is that stupidly improbable events keep happening despite all the clever people in the world. And of course the people on hacker news who know even better.
(I believe it was the most upvoted response too, although now I of course can't find the thread)
I don't want to sound negative (that's not the purpose of this message), but my belief is that this incident is a heart palpitation on the way to a major coronary, stroke, or heart attack... Let me explain to you why.
You see, as your company has grown, so has the complexity of your codebase and so has the number of people who maintain that code.
The complexity of your code has increased, perhaps asymptotically (since you seem to like Big-O notation and related subject matter, from what I hear, at your interviews), yet I'd be willing to bet that the average worker's overall knowledge of that codebase has actually diminished (although perhaps not asymptotically, perhaps linearly) over time.
In other words, you've got too many chefs, and a soup with too many ingredients where every chef knows some of them, but not all of them.
Maybe Jeff Dean is the exception to that, but that's what you've got going on, basically.
When something like this happens, your responsibility is to a) Educate your rank-and-file, even at the expense of new product releases, b) Re-delegate, ensuring that there aren't too many chefs for the soup, c) Reduce the ingredient count (refactor), and perhaps most important d) Make sure there's someone, anyone, who understands ALL of the codebase from top to bottom (difficult when you've got multiple source-code contributors making changes, and multiple layers of people only responsible for some aspects of the system...).
(Oh, and if it were me, I'd go as far as to buffer the entire incoming datastream so as to be able to replicate the problem, and if you can't do that (run a buffered datastream from a point in time against a point-in-time snapshot of all data at that point in time, and see exactly where in code the problem occurs), then your first goal is to be able to do exactly that. But I'm a little extreme in what I feel qualifies as acceptable testing...)
Either you seriously overestimate the limits of one's understanding, or you severely underestimate the size of Google's codebase/monorepo.
At this point, I'm pretty sure it's impossible to even check it out on a single developer's workstation due to its size.
...Also, compare that to the Bible's "Tower Of Babel" story:
The Bible's story is about human language, specifically about how human languages encompass greater degrees of abstraction over time (languages built on top of languages, abstractions built on top of abstractions).
But, when you get to a certain level of abstractions, the abstractions start to "leak". That is why there are communication difficulties in societies (people who are more educated typically speak a different language than people who are less educated; academia and lawyers speak their own language).
This is also when/where/why/how software starts to fail.
Too many levels of abstraction, and those abstractions will "leak".
Don't take my word for it, use your own logic, your own common sense and think about it...
Software problem A exists.
Management Team: "We have a problem, can you fix it?"
Programming Team: "Sure." (fixes it, but this causes problem B, which is not recognized until several weeks later)
Management Team: "Now we have this different problem, can you fix it?"
Programming Team: "Sure." (fixes it, but this causes problem A to resurface, which is not detected for several weeks...)
And the cycle repeats...
Was it the programming team's fault?
No, it was the nature of the beast called "complexity". The higher you go in abstractions (and this is especially true for AI), the more you must exercise ENGINEERING DISCIPLINE. You must be able to go BACKWARDS IN TIME (this is accomplished in programming by being able to go back to simpler programs and fully understand / test them before advancing to more complex ones, and PRESERVING THAT CHAIN OF UNDERSTANDING...).
This is probably why advanced societies destroy themselves, too much abstraction, too much "magic" (without understanding of all of the sublevels), and no one has the exact knowledge necessary to pinpoint exactly when/where/why the exact problem occurred...
But again, don't take my word for it. Use your own common sense to determine if what I'm saying has any merit...
It's like a strange simulacrum of some bazel-like build tool: not as mature as bazel, but also unable to be swapped for real bazel due to tech-debt issues. And the same goes for canary deployments, monorepo layouts, etc.
It’s become something that I actively seek out information about when I interview for new jobs so I can actively avoid all these Google-but-not-quite-with-enough-resources-to-pull-it-off antipatterns.
Worse yet is that in many of these shops, those early employees who came from Google just stayed long enough to mutate the whole monorepo / homebrewed bazel clone / canary deploys mess into a pile of unmaintainable crap that couldn't be migrated away from without huge business risks, and then they mostly all quit and a lot of them even went back to Google!
I call it the Google Borg Syndrome. Same crap monorepo ideas, same crap canary ideas, same crap bazel-like build system ideas just spread like kudzu. It really shows how bad the ideas themselves are, and how if it wasn’t for a giant money and labor-hours faucet that Google can shoot at these systems, they would be unmasked as deeply poor ways of solving engineering problems.
The best technical solution to problems can depend a lot on company size. These people are using a big-company solution in a small-company. And it nearly works, but not quite.
Big companies reinvent pretty much everything internally - You won't find Chef, Jenkins, Slack, Apache, Nginx, or any of that stuff at Google. Yet a small startup would be stupid not to use tooling developed and maintained by someone else almost everywhere.
In turn, the best solution to a problem depends on which tools you have at your disposal.
- using bazel instead of make
- using a monorepo ever
- using Google-style canary deployment ever
I think they are just nearly totally objectively bad ways to solve problems whether at Google scale or otherwise, with obvious alternatives that are strictly dominating in the sense of being unilaterally better in all use cases.
These techniques have to be kept afloat with the money faucet. I guess I can agree that if you have enough money that you can offer pay packages that cause people to be willing to endure the hardship of propping bad systems up, then you don’t have to care about what systems are effective.
For example, you could shoot lots of holes in a boat and then hire a full time staff on $500k / year to scoop water out fast enough that the boat seems to operate just fine.
Then get the ACM to write a big publicity piece about how hole-filled boats are the best design and anyone not shooting holes in their boat is Doing It Wrong.
This is not even exaggeration to offer as an analogy to Google’s combo use of monorepos + bazel.
I like monorepos - and in my experience you don't need special tooling for them until you have thousands of engineers. I just don't really see the point in splitting all my business logic into 50 different repos all of which have 25 branches and then I'm having versioning, dependency, and incompatibility hell between them. One repo, one 'master' branch, lots of tests, and as soon as all tests pass, auto-deploy!
The 'canary deployment server', complete with web UI, is I think a mistake, even for Google. If they were using Kubernetes, it would have a 'Deployment' object, and canarying should be a feature of that, just like rolling updates are. How that actually works is an implementation detail of Kubernetes and is abstracted away.
"bazel" is more about type checking the build artifacts, caching and guarantees about hermetic and reproducible builds. The distributed compilation is just supported but not a requirement IIUC. idk if it's the best tool but definitely an upgrade over "make".
Whereas "make" looks like an old pseudo scripting utility which has too many degrees of freedom with not much guarantee and is fragile as a result. I think those who learned its quirks love it but I don't.
I've also liked monorepos at work, as long as their operations are not sluggish. They immensely help global refactoring and static analysis, and reduce the number of merge conflicts.
- You don't need the added complexity of having to manage tooling to track dependencies across multiple repos. A single repo is just fine. You'll obviously want to split off repositories when you want to contribute stuff as open source, and in a few other cases: http://danluu.com/monorepo/
- what you describe as "Google-style canary" is not actually Google-style or recommended by Google
See the SRE workbook: https://landing.google.com/sre/workbook/chapters/canarying-r...
(obviously, Google is big, so different teams might do things differently)
As a result, sometimes you can even require _more_ tooling to adequately handle dependencies in a monorepo vs. multiple repos.
When you consider that a default natural constraint of all software development is that it must adapt and cope with changing tools, usage demands, etc., then it quickly becomes important to allow any given project to use arbitrarily unique build tooling, deployment tooling, libraries, languages, databases, etc. Trying to shoehorn them all into a mandated set of monorepo tooling just doesn’t work, once again meaning you often need much more tooling to reach minimum required levels of flexibility with monorepos as compared with polyrepos.
This is what bazel does.
In practice your second complaint about tooling isn't particularly true. Android apps, webservers, and language runtimes live in the same repo without issue.
I see this cargo-culting all the time, it spreads via breathless blog posts as well. People applying patterns from FAANG totally out of context because it sounds cool.
Google uses Kubernetes, therefore I should hand-roll my own K8s infrastructure to host my 1000 user CRUD app. Oooh and let's make it a SPA with 100 microservices because Facebook does that so it must be a good idea.