Hacker News new | comments | show | ask | jobs | submitlogin
Post-mortem of March 12 Google outage (status.cloud.google.com)
204 points by antoncohen 2 months ago | hide | past | web | 107 comments | favorite

> https://landing.google.com/sre/sre-book/chapters/reliable-pr...

> Almost all updates to Google’s services proceed gradually, according to a defined process, with appropriate verification steps interspersed. [..] The first stages of a rollout are usually called "canaries” — an allusion to canaries carried by miners into a coal mine to detect dangerous gases. Our canary servers detect dangerous effects from the behavior of the new software under real user traffic.

> Canary testing is a concept embedded into many of Google’s internal tools used to make automated changes, as well as for systems that change configuration files. [..] If the change doesn’t pass the validation period, it’s automatically rolled back.

Any Googlers around to explain why Canaries didn't catch configuration causing this side effect and stopping the deployment before the issue cascaded globally?

The outage here reminds me of issues CloudFront faced wrt configurations in their earlier years. Here's master presenter Harvo Jones going into a bit more detail: https://youtube.com/watch?v=n8qQGLJeUYA

Kind of vindicates AWS' decision (after major outages to DynamoDB and S3) to not just isolate regions or zones but resources within those regions/zones further into what they call "cells" https://youtube.com/watch?v=swQbA4zub20

Tumblr spoke abt "cells" too in the context of achieving isolation at scale, but in a quite different way: http://highscalability.com/blog/2012/2/13/tumblr-architectur...

As these known-knowns (and the Google sre-book is teeming with these) get designed around, it'd be interesting to see why/how future outages would happen.

Obviously not authorized to release more details than have already been made public, but when the book hits the real world sometimes you find new failure modes, software has bugs, or humans find creative mistakes. It's also very hard to build global scale systems with zero possibly of global failure. But every time a crack is found, you learn something and do what you can to eliminate the whole class of related failure modes.

Disclaimer. I don't work on the thing that broke here but I am in SRE. Speaking in broad generalities.

> Any Googlers around to explain why Canaries didn't catch configuration causing this side effect and stopping the deployment before the issue cascaded globally?

I'm not (any longer) a Googler, but according to the postmortem this was a load-related incident. Canary changes only have a small amount of traffic directed at them (by definition) and the problem probably didn't become apparent until after the change was promoted from canary.

Going further...

Googles 'Canary' system is sub-optimal and risky.

It picks a subset of the system (say 5%), and deploys the change there. It then observes this 5% of the system, and if it's okay, it rolls out to the rest of the system.

Compare that to a continual-canarying system:

* Start deploying the new code gradually (say 1 container/pod/task every second).

* While the above is in progress, continually collect stats from upgraded and non-upgraded systems.

* If the stats differ to any significant level, halt the rollout.

The continual-canarying system is much better because the sample size to see a 'significant' change depends on the size of the change and the amount of natural variation in the sample. The size of the change obviously can't be known beforehand. The variation varies widely depending on the metric of interest (cpu, memory use, request latency, etc.).

It is therefore impossible in googles canarying system to choose a suitable subset size to detect (with a sufficiently high statistical power) small but important changes in a metric while still stopping the rollout early enough for catastrophic changes in a metric.

Continual-canarying effectively solves this issue by gradually increasing the sample size as the rollout goes on.

In reality, neither continual-canarying or google-canarying work well for detecting anomalous metrics because the rollout process itself causes applications to restart, which in itself changes their performance characteristics (cold caches, empty queues, first-use delay, jit optimization, etc.).

The solution to that is to do two simultaneous continuous-canary rollouts - one of the existing version (effectively just a restart) and one of the new version, and use those two groups as input to the logic to decide if any metrics of interest have statistically significant changes.

Please go implement this stuff Google! It won't be hard to do, and will really increase the safety of your rollouts. (Even though I know it wouldn't have helped this time)

So you'd think this would be the case. I certainly did when I first joined. In fact, I actually proposed something even more extreme than your suggestion (use thompson sampling to control the rate of task restarts and rollbacks). But in practice such a continual canarying process isn't actually any better than a staged canary (like .1% -> 1% -> 10%).

Consider the types of issues you run into, there are

- Things that are painful, but not destructive (minor performance regressions)

- Things that are highly destructive (data gets deleted, major performance regressions, etc.)

The first you can handle being deployed to many people, so if you detect at 7.8% of your users instead of 10% doesn't much matter, you can run it at 10% forever without issue.

The second you'll detect on a smaller population because the changes are catastrophic.

>In reality, neither continual-canarying or google-canarying work well for detecting anomalous metrics because the rollout process itself causes applications to restart, which in itself changes their performance characteristics (cold caches, empty queues, first-use delay, jit optimization, etc.).

The problem here is that most of the time, you don't care about startup behavior, but steady-state behavior. Imagine that a new version introduces a memory leak, so the old behavior was to linearly increase to 1GB of memory over 1 hour and then level off, and the new behavior is to increase unbounded until eventually the system OOMs and restarts or whatever.

A "google style" canary would roll out to 1% of tasks or something, wait a few hours, and notice the difference and roll back. You might even experience 1% of tasks restating, but it's likely that the system can sustain that.

With the continuous canary, you'll release to 1 task per minute or whatever, and only be able to notice any change after you've pushed out to 60 tasks, and the change will likely only be detectable with any confidence once you have a notable difference in 120 or so. At that point, you've released to more than 1% of your tasks (either that, or it takes you 8 days to release a new version).

You can fix that by slowing the rate of releases, but now it takes you 8 weeks to release a new version.

Plus, even worse, you're now in a much more fail-open environment. With a 1% release, you can sustain in the kind of bad state if all of your qualification tooling fails. If you're continuously canarying though, you have to be much more careful to make sure that your tooling won't continue to push if the tooling itself is broken or getting unusual results. It's a more risky set up.

You seem to have thought about it in depth and my comments maybe missing some insight

A) Your argument is optimised for a certain type of failure which leaves you open to the others ( issues that are detected after high ratio deployment ) B) You can monitor just the jobs impacted by the canary deployment C) the progressions you discuss seem strawman like. You can have a doubling progression starting at 1 or 5, a tripling progression

D) the point about detecting at 7.5 and living at 10, what if it’s detectable at 20 and you’ve gone all out by then

Another thing to keep in mind is that you don't just want problems to be detectable: you want them to be diagnosable and fixable.

If you detect a problem with FooBar frontend at 1pm, the first question anyone's going to ask is, "What changed in FooBar frontend or any of its dependencies at or shortly before 1pm?"

If the answer is, "Bazz backend deployed at 12:50pm" then someone with no understanding of the innards of FooBar or Bazz can roll back Bazz and there's a pretty darn good chance they'll have fixed the problem quickly. This is a scalable approach to fixing things that works well in practice.

If the answer is, "Nothing just changed. Every layer of the system has been gradually changing all day" then your oncall person will need an in-depth understanding of every service involved in the system as well as creativity in guessing where to start looking. It can work, but it's going to be slower and more difficult and fewer people will have the system-specific knowledge and debugging skills do it well and quickly.

A) like I say, in practice issues that are only detected after large deployment are small. If they weren't small, you'd have been able to detect them earlier.

B) absolutely, but if a large subset of the jobs recently restarted you get noise and nonsense

C) indeed you can do exponential continuous rollouts, but then you are getting pretty close to a 3 or 4 stage where each stage is a 5x increase in size or whatever.

All excellent points. I'll just add that often times step functions observed in obscure dashboards have been major signposts in identifying arcane root causes... (hey, look what ELSE happened suddenly at about 4.34am!)

Those gradual rollouts exist, but sometimes you have changes that aren't so easy to deploy a fraction of a percent at a time.

So Google don't test for load? They should ;)

Most peoples load tests involve taking a system to breaking point, then turning the load off and going to lunch.

People need to gradually reduce the load after the breaking point to check the system recovers.

Load Tests are also pretty hard to do in distributed systems. If you test the application alone, you probably won't find most of the issues. You'll need to test the application complete with all it's dependencies (databases, load balancers, failover mechanisms, external servers, etc.). You'll also probably want to test it with representative user requests, and all databases filled with representative data. That in turn typically means your loadtest system will need to be as big and expensive as your production system. Have fun explaining to the boss why you need to double the infrastructure costs. If you do it in the cloud, you can just do a loadtest for 10 mins per day and save a bunch of $$$, but you still need tooling to be able to deploy a complete replica of your system, fill it with realistic data and send it realistic requests, all automatically.

Using real user data and logs of user interactions and real user requests is best for loadtesting, but comes with it's own risks. You need to make sure the loadtest systems doesn't send out emails to users, or accidentally communicate with the real production systems in any way. It also means you have to secure your loadtest infrastructure as well as your production infrastructure. GDPR data deletion requests need to apply there too, etc.

Googler here, but haven't looked at any of the details of this incident.

One of the tricky things about canaried releases and staggered rollouts is that to implement them, you need another layer of configuration to manage that. And then that's another potential source of outages - and scary ones too, because your configuration is so closely tied to prod itself, . It's kind of the whole "my test needs a test, and the test of my test needs a test, and ..." issue.

Plus, you can only spend so much time testing out disaster scenarios before other priorities take over - if your customers are asking for features A, B, C, well, at some point it becomes a judgment call whether you spend time on the new features or testing the resiliency of your service. Even when you do test the hell out it, it's hard to guarantee you've tested literally every edge case.

The other thing is that, as you point out, as you add more and more mitigations for failure scenarios that you have experience with, the fraction of the failures that you've never seen before starts to increase. It's somewhat obvious, because by definition a regression test safeguards against failures that you've experienced, but it's an important thing to keep in mind when thinking about exactly what the tests you do write will keep you safe from.

> into what they call "cells"

Google literally invented this concept, and it has been spread to other companies by Xooglers.


> Google literally invented this concept

Did they?

Here's a summary of James Hamilton's paper (then a Microsoft employee) on building large-scale internet applications published in 2007, 8 years (?) before Borg was made public knowledge: https://blog.acolyer.org/2016/09/12/on-designing-and-deployi...

   Partition the system in such a way that
   partitions are infinitely adjustable
   and fine-grained, and not bounded by any
   real world entity (person, collection,
   customer, etc.).
> it has been spread to other companies by Xooglers.

https://twitter.com/bcantrill/status/1092849514229059585 I guess the notion that Google is an all-superior tech giant that faces unique problems and solves them in the best way possible is wide spread misplaced notion among Googlers?

I don't know about this case, it wouldn't surprise me either way. Borg is certainly old, probably pre 2008. The same story as Mr/Hadoop wouldn't surprise me.

It sounds like this public report may just be using the word "configuration" to mean anything other than the software itself. I imagine that what really happened is they got alerted to high disk usage and the operator issued a major compaction which took out the bigtable. That's caused numerous Google outages and is consistent with the report.

Compactions do NOT reduce disk usage. On the contrary, they rewrite the data, so they generate new and presumably smaller SSTables. Which means more disk usage. It's the garbage collector on the master that removes files. You can combine the two, but that requires some tooling that compacts a section of the table, then triggers (or waits for, I can't recall) the garbage collector.

Well: in many cases, compactions can make more files eligible to be garbage collected. So if you want to accelerate the freeing up of space, running a compaction may serve that goal. It will temporarily increase usage, but once the garbage collector catches up the net effect is to reduce disk usage. Unless I'm mis-remembering how things work. (Caveat: I left Google almost 9 years ago.) And so "run a compaction to free up space" is a reasonable shorthand most of the time... unless you're so close to the edge that you don't have room for the temporary increase.

You're not misremembering. It's just that timing matters. And GC always got less attention than compactions.

My rule of thumb was: if I have a change that will shrink tablets by 10% through e.g. compression and I apply it wholesale with a manual compaction of the entire table, I should assume that in the worst case the table will temporarily take up 190% of current space, until the next GC run or two. If all or most of your data is in one table, this can be a problem. Organic compactions are friendlier because they occur over the course of (typically) days, leaving the GC with plenty of time to clean up.

The real issue is that Bigtable's used to be in one big shared cell which could handle temporary 190% increases in resources.

Now, to increase isolation between services, everyone is off in a partition, and each individual partition is much smaller and therefore can't withstand resource spikes without falling over.

And I bet the "emergency loan" functionality for various resources still isn't automated and still has arcane requirements to meet and a delay before it kicks in. Yay - a 15 minute delay. Thats exactly what I need when my entire service is down! /s

This will not be of general interest to HN, but I disagree. If compaction is behind then commanding a higher compaction rate can free space, and it can also bring down your service. You can also call for higher rates of splits and merges with the same effect. I have no idea of those actions contributed to this outage.

Compactions happen on the tablet server. They only write data. By definition, they increase disk usage, right away. Only the garbage collector, which needs global state and thus runs on the master, can actually delete files in Colossus and bring down disk usage. It doesn't always run as frequently as one would think or hope. It definitely doesn't run immediately after one tablet has been compacted.

I remember one incident where a team got an alert that they were at 90% of quota or so. They compacted data too aggressively and they actually hit 100% before the GC could do its job. That's why I mentioned that you should compact a fraction of the table, then let GC run. Google being Google, I'm sure such a tool has been written, in several variants, by N different teams.

As long as we're splitting hairs, only the Colossus curator can really delete the file and if the quota manager is behind you can still run out of space.

I dunno what AWS cells are and if the terminology matches, but Google's Borg already has a concept of cells internally, a cluster can hold cells, and each cell can represents a kind of job.

From what you describe the 'AWS cells' are very different than Borg's.

My view is that at a high level aws cells are isolated, fuzzy-partitioned, migratable, heterogeneous instances of a complete service stack that are managed such that they always fail independently of each other; contact points among these cell instances, if required, are established via highly available intermediate services (like ReplicatedStateMachinesOrStorage); the clients are routed to assigned cells by a highly available routing layer (like NetworkTrafficShapers/DNS), which can also help facilitate migration of clients from one/more cell to another.

I encourage you to view the re:invent presentation on it [0] if resiliency in distributed systems is something that interests you.

[0] https://youtube.com/watch?v=swQbA4zub20

That description matches how I think of borg cells as well. (Googler)

Thanks. I read the paper now and I kind of get why you guys think they're similar.

I find them similar too in what they're trying to achieve [blast-radius minimization], but I think AWS' concept of cells [going by re:invent presentation] is more of an architectural pattern with strong inclination towards certain non-negotiable reqs ["thinnest possible routing layer", "workload migration among cells is a first-class citizen", "avoiding critical cross-cell tasks/dependencies other than migration", "creating/removing cells are zero-downtime events"] for different flavours of services [stateful vs stateless (dynamodb vs lambda?), zonal versus regional versus global (ebs vs kinesis vs route53), for example] than a concrete implementation like Borg.

I might be wrong on all accounts, though.

In short; QA forget to ask the bartender where the bathroom was.

> On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually lead to a cascading failure.

Any idea what this means? What kind of configuration change would reduce storage but overload the system? I imagine turning on compression could do that, but I'd expect it would be easy to reverse. (Edit: Maybe it was the cascading failure that made it more serious?)

> What kind of configuration change would reduce storage but overload the system?

Maybe the increased storage usage was due to wrong kind of load rebalancing? I can easily see that a change to reduce _object_ hit rate in any given location, might just increase the _metadata_ hit rate. If storage is balanced by spreading the load to workers that keep their data nearby (for reduced latency), then an unexpected increase in these workers could also cause an increase in required storage.

If you want to balance the load on these properly, you'll have to do more location data lookups. Oops. Now your metadata's metadata is the chokepoint. Massive read contention on blob store address lookups --> cascading failure.

Two straightforward changes in Bigtable or Spanner are changing tablet compression settings or tuning compaction more aggressively.

Edit: given the mention of a job that got stopped, perhaps they had a tool that manually forced compaction of data in order to adopt new compression settings or purge deleted rows. Normally, compactions would happen on the tablet servers on a cycle of N days.

Another possible explanation is that they tweaked the garbage collector, which runs on the master.

These seem really likely. Being a bigtable operator is exactly like being Mickey Mouse in the Sorcerer's Apprentice. You want buckets of water? Okay.

At a high level it sounds to me like they're paying Complexity Debt.

Nice post-mortem. Being in a smaller engineering org. these write ups feels really premium and something to strive for to provide to our customers when things hit the fan.

Google down March 12

Facebook down March 13

Apple(iCloud) down March 14

The ides of March are (almost) come. :)

[1] https://en.m.wikipedia.org/wiki/Ides_of_March

IDEs of March - this, folks, is why we stick to compiling from the command line ;)

Elizabeth Warren got her wish, she "busted" big tech and she didn't even need legislation or executive authority to do it. :)

If I were a betting man, I'd say Microsoft (Azure) down March 15 and AWS down March 16.

And Pizza Hut's online ordering system is going to fail on April 20th.


your hypothetical bet missed; AWS happened today.

I am putting my money on "configuration change" as the root cause of iCloud being down.

Is iCloud a prediction or is there an actual outage today?

It looks like it’s resolved now, but according to the iCloud status page some services were down or degraded earlier today. https://www.apple.com/support/systemstatus/

Missing an N for fang

> On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually lead to a cascading failure.

Just some guesses (not affiliated):

I wonder if there was something related to Tombstones and deletion of extra data? That can often be a failure of replicated sharded big data systems using sstables and tombstones for deletion.

Another guess might be that some sort of internal shard count was increased resulting in some lookup process needing to check n more shards per call.

The root cause of a cascading failure can be really minor - just something tiny that tips the system over the point of no return.

The real question is why don't us systems-designers have better tests for cascading failure? Even Google has very few tests which deliberately overload a system (to cause it to fail), and then verify that it can recover on it's own promptly under typical loads.

Eg. consider a database which serves a web application. If the application times out querying the database, it retries up to 5 times. That is a system with an unrecoverable cascading failure, since if one day we get a small spike in users, the database becomes slow, and a few timeouts occur, some requests will get retried, putting more load on the database, making it slower, causing more timeouts, and the problem gets rapidly worse until a few minutes later all requests are failing. Even when the spike in users is over, the database will remain effectively dead, because it is still failing under load from all those retries.

A load test that didn't push the system over the edge and then test for recovery wouldn't be able to detect that.

Too many people loadtest things to the point of failure and then stop. All it takes is to continue the test for a few minutes longer with a slightly lower load to verify recovery!

Because there are unlimited number of failure modes for cascading failures. Going through all postmortems, you'll find it hard to find same kind of failures. Every time it's something unique.

Also it's difficult to get a good coverage by just pushing things to its limit. Let's say you have a system which has this RPC chain: A -> B -> C -> D. And not all requests cost the same and not all requests make way to D. In reality it's interesting that different composition of traffic may overload any of them. For example, some queries may hit expensive ML models while some other queries have a large number of candidates. In this case the former hits predictor service hard while the latter may hit index serving systems. I've actually read an outage happened during superbowl because sports-related queries are particularly expensive for some part of the system.

Why are all these high profile services going down in such a short window? Anyone else thing this is odd? Rarely do we see such incidents occur, let alone to pretty much all these companies.

Because load increased across all of them at the same time such that they all hit some (different) submerged iceberg scaling wall?

I think it's very odd. They say odd things come in threes.

Apple does use Google's storage (among others) for iCloud, but it could be unrelated.

> We will make software measures to prevent any configuration changes that cause overloading of key parts of the system.

If it was possible to know these a priori, why allow them in the first place?

Or are they saying they will be writing software to roll back configuration changes that cause disruption? ("disallowd")

I read that as either rate/resource consumption limits, or a hard time-limit for a task to complete before it's aborted (rolled back).

99 cases out of a 100 it's a "configuration change" in a very large system that's so large and complex nobody can comprehend it anymore. Decentralize, folks.

Not sure why this gets downvoted. This is an entirely valid comments.

It's getting increasingly obvious that despite being highly distributed, cloud systems run by Google, Amazon and other big cloud companies are inherently centralized. If nothing else, they have centralized controls.

A decentralized system can, of course, fail, but it would never have "configuration change" as the root cause.

S3 is (for want of a better term) a legacy system in AWS, having been around since pretty much the beginning. A series of choices made when the idea that there would ever be more than two regions, leave them unfortunately stuck having to tie regions together to a limited degree (bucket names are global).

Beyond those bucket mutations, everything else is purely regional scoped.

I'm _really_ surprised to see that GCP has made the same mistakes though and had things tied together. I'm not convinced there's a good excuse. AWS/S3 is understandable. No one knew what things would look like. Not one engineer would have made the same decisions then if they'd known what the future was for the platform. Google came to the cloud market much later. Their engineers should have seen the issues with AWS and the problems that interdependence cause, and specifically architected isolated infrastructure (which from this post-mortem it seems like they didn't). That's both remarkable and disappointing to see.

GCP may have come later than AWS but googles internal blob storage could of predated AWS for all we know.

>"A series of choices made when the idea that there would ever be more than two regions, leave them unfortunately stuck having to tie regions together to a limited degree (bucket names are global)."

Could you elaborate on those choices? Also what do you mean by they're "stuck having to tie regions together"?

The main thing is that bucket namespace is global. There is only one bucket "foo" worldwide. That means that every bucket mutation event (e.g. bucket deletion or creation) requires co-ordination between regions. When that big us-east-1 region outage occurred a year or so back every other region worked fine except for bucket mutations.

The worlds infrastructure is inherently fragile. It should be difficult to take down these large distributed systems, but they are over-coupled.

It's very easy, apparently, to take them down by a configuration change. Happens several times a year and takes hours (and untold millions of dollars in productivity loss) to recover from.

Sure it can. Let's back up a bit to something physical as an example. If every car on the road were self driving, that would form a decentralized and distributed system. Each car independently monitors and reacts to its surroundings. That doesn't prevent accidents between two cars, nor does it prevent huge cascading accidents that can cause highway closures.

Decentralization often makes these scenarios harder to avoid, not easier. There are simply more ways that things can interact (and therefore break).

To avoid these failures in any setting, you need to have provable containment of failures while assuming part of your network is malicious. That's asking for a lot more than decentralization.

Like I said, a decentralized system can fail, but the root cause of such a failure cannot be "configuration change". If a configuration change is sufficient to break your system then it's not truly decentralized.

Let's take your car example.

If you update your car's firmware and get in an accident, it's not a system-level failure.

If manufacturer sends a wireless update and all their autonomous cars start to bump into one another, it's not really a decentralized system anymore - updates are centralized.

If you update your car, it crashes, sends a wireless signal that affects another car, and it crashes and so on... This is an example of a cascading failure in a fully decentralized system, but the real root cause would be cars interfering with one another, not your firmware update.

If a manufacturer with 10 percent market share sends out an update that causes their vehicles to suddenly turn right, that would ruin conditions for everyone.

If that's the case, then the conclusion based on the upthread reasoning is that this system isn't decentralized enough when any single manufacturer has a 10 percent market share.

It prevents all cars failing at the same time worldwide.

Unless you push out a bug that is triggered by some combination of events so distant from the release that you could never have caught it in testing. Perhaps a bizarre interaction between multiple unrelated systems from different vendors that all work tested in isolation and only fall over when a series of 1 bits in a certain pattern in the real time clock chip creates a blip on a bus that gets corrected in a way that was interpreted differently by two different teams and the fallback system tries to take over but the primary system comes back online at just the wrong time and trips a bus contention bug that was never seen because testing never hit this perfect storm of race conditions and, well, the front falls off.

The point of me making up a stupidly improbable story is that stupidly improbable events keep happening despite all the clever people in the world. And of course the people on hacker news who know even better.

Decentralized portions of an architecture must have pathways between them. These pathways then contribute their own points of failure, both in isolation and in combination.

A lot of these kinds of systems are decentralized. It's not a magic bullet.

It's not that simple. I'm not an expert at gurangutan scaling and I'd bet you're not either.

I am. I used to work at Google and on Cloud in fact. Fact of the matter is, 99.99% of Cloud users don't need "gargantuan" (I'm assuming that's what you meant) scaling. Most customers don't need terabyte-scale anything, let alone petabyte-scale anything. Their stuff could run just as well (and more reliably) on an isolated instance that Google SREs can't mess up. And don't get me wrong, this is not a dig at Google SREs -- they're far and away the best in the industry, but configuration of large systems is a de-facto SPOF.

Whoever had "Configuration Change" in the betting pool in the other thread wins all the upvotes!

(I believe it was the most upvoted response too, although now I of course can't find the thread)

"configuration" is what we call code that has the greatest potential impact and the least amount of static checking.

IIRC that commenter was pushing a line about quarterly bonuses which struck me as odd.

In Firefox I get a blank page with 2 "PARSER.process false" messages in the javascript console...

The FB outage yesterday was also apparently caused by a configuration change: https://twitter.com/facebook/status/1106229690069442560

LOL at the first reply: "I was hoping that @FBI was raiding your campus and arresting your leadership."

Keeping my fingers crossed.

Configuration changes are often riskier than code changes. They're faster and often have outsized effects on the characteristics of your service and downstream services (even assuming they don't turn on new features.)

"destroy the skynet" john connor

When SREs are the reliably problem...

Dear Google:

I don't want to sound negative (that's not the purpose of this message), but my belief is that this incident is a heart palpitation on the way to a major coronary, stroke, or heart attack... Let me explain to you why.

You see, as your company has grown, so has the complexity of your codebase and so has the number of people who maintain that code.

The complexity of your code has increased, perhaps asymptotically (since you seem to like Big-O notation and related subject matter, from what I hear, at your interviews), yet I'd be willing to bet that the overall knowledge of the codebase that the average worker there has, has actually diminished (although perhaps not asymptotically, perhaps linearly) over time.

In other words, you've got too many chefs, and a soup with too many ingredients where every chef knows some of them, but not all of them.

Maybe Jeff Dean is the exception to that, but that's what you've got going on, basically.

When something like this happens, your responsibility is to a) Educate your rank-and-file, even at the expense of new product releases, b) Re-delegate, insuring that there aren't too many chefs to soup, c) Reduce the ingredient count (refactor), and perhaps most important d) Make sure there's someone, anyone, who understands ALL of the codebase from top to bottom (difficult when you've got multiple source-code contributors making changes, and multiple layers of people only responsible for some aspects of the system...).

(Oh, and if it were me, I'd go as far as to buffer the entire incoming datastream so as to be able to replicate the problem, and if you can't do that (run a buffered datastream from a point in time against a point-in-time snapshot of all data at that point in time, and see exactly where in code the problem occurs), then your first goal is to be able to do exactly that. But I'm a little extreme in what I feel qualifies as acceptable testing...)

>Make sure there's someone, anyone, who understands ALL of the codebase from top to bottom

Either you seriously over estimate the limits of ones understanding, or you severely under estimate the size of Google’s codebase/monorepo.

At this point, I’m pretty sure it’s impossible to even check it out on a single developers workstation due to its size.


...Also, compare that to the Bible's "Tower Of Babel" story:


The Bible's story is about human language, specifically about how human languages encompass greater degrees of abstraction over time (languages built on top of languages, abstractions built on top of abstractions).

But, when you get to a certain level of abstractions, the abstractions start to "leak". That is why there are communication difficulties in societies (people who are more educated typically speak a different language than people who are less educated; academia and lawyers speak their own language).

This is also when/where/why/how software starts to fail.

Too many levels of abstraction, and those abstractions will "leak".

Don't take my word for it, use your own logic, your own common sense and think about it...

Software problem A exists.

Management Team: "We have a problem, can you fix it?"

Programming Team: "Sure." (fixes it, but this causes problem B, which is not recognized until several weeks later)

Management Team: "Now we have this different problem, can you fix it?"

Programming Team: "Sure." (fixes it, but this cause problem A to resurface, which is not detected for several weeks...)

Management Team: "We have a problem, can you fix it?"

And the cycle repeats...

Was it the programming team's fault?

No, it was the nature of the beast called "complexity". The higher you go in abstractions (and this is especially true for AI), the more you must exercise ENGINEERING DISCIPLINE. You must be able to go BACKWARDS IN TIME (this is accomplished in programming by being able to go back to simpler programs and fully understand / test them before advancing to more complex ones, and PRESERVING THAT CHAIN OF UNDERSTANDING...).

This is probably why advanced societies destroy themselves, too much abstraction, too much "magic" (without understanding of all of the sublevels), and no one has the exact knowledge necessary to pinpoint exactly when/where/why the exact problem occurred...

But again, don't take my word for it. Use your own common sense to determine if what I'm saying has any merit...

I’ve worked in three separate companies that had each hired a ton of ex Google engineers in their early years, and the systems they have are all identical to each other and seem like a poorly recalled-from-memory version of these types of patterns.

It’s like a strange simulacrum of some bazel-like build tools but not as mature as bazel but also can’t be swapped with real bazel due to tech debt issues, and same with canary deployments, monorepo layouts, etc.

It’s become something that I actively seek out information about when I interview for new jobs so I can actively avoid all these Google-but-not-quite-with-enough-resources-to-pull-it-off antipatterns.

Worse yet is that in many of these shops, those early employees who came from Google just stayed long enough to mutate the whole monorepo / homebrewed bazel clone / canary deploys mess into a pile of unmaintainable crap that couldn’t be migrated away from without huge business risks, and thenthey mostly all quit and a lot of them even went back to Google!

I call it the Google Borg Syndrome. Same crap monorepo ideas, same crap canary ideas, same crap bazel-like build system ideas just spread like kudzu. It really shows how bad the ideas themselves are, and how if it wasn’t for a giant money and labor-hours faucet that Google can shoot at these systems, they would be unmasked as deeply poor ways of solving engineering problems.

>they would be unmasked as deeply poor ways of solving engineering problems.

The best technical solution to problems can depend a lot on company size. These people are using a big-company solution in a small-company. And it nearly works, but not quite.

Big companies reinvent pretty much everything internally - You won't find Chef, Jenkins, Slack, Apache, Nginx, or any of that stuff at Google. Yet a small startup would be stupid not to use tooling developed and maintained by someone else almost everywhere.

In turn, the best solution to a problem depends on which tools you have at your disposal.

I agree to a large extent. But specifically in the case of

- using bazel instead of make

- using a monorepo ever

- using Google-style canary deployment ever

I think they are just nearly totally objectively bad ways to solve problems whether at Google scale or otherwise, with obvious alternatives that are strictly dominating in the sense of being unilaterally better in all use cases.

These techniques have to be kept afloat with the money faucet. I guess I can agree that if you have enough money that you can offer pay packages that cause people to be willing to endure the hardship of propping bad systems up, then you don’t have to care about what systems are effective.

For example, you could shoot lots of holes in a boat and then hire a full time staff on $500k / year to scoop water out fast enough that the boat seems to operate just fine.

Then get the ACM to write a big publicity piece about how hole-filled boats are the best design and anyone not shooting holes in their boat is Doing It Wrong.

This is not even exaggeration to offer as an analogy to Google’s combo use of monorepos + bazel.

Bazel is designed for big projects and distributed compilation. Unless your company has >500 engineers, it isn't going to make sense.

I like monorepos - and in my experience you don't need special tooling for them until you have thousands of engineers. I just don't really see the point in splitting all my business logic into 50 different repos all of which have 25 branches and then I'm having versioning, dependency, and incompatibility hell between them. One repo, one 'master' branch, lots of tests, and as soon as all tests pass, auto-deploy!

The 'canary deployment server', complete with webUI I think is a mistake, even for Google. If they were using kubernetes, it would have a 'Deployment' object, and canarying should be a feature of that, just like rolling updates are. Actually how that works is an implementation detail of kubernetes and is abstracted away.

I've used Unix make for personal projects (during and after school) and bazel at work.

"bazel" is more about type checking the build artifacts, caching and guarantees about hermetic and reproducible builds. The distributed compilation is just supported but not a requirement IIUC. idk if it's the best tool but definitely an upgrade over "make".

Whereas "make" looks like an old pseudo scripting utility which has too many degrees of freedom with not much guarantee and is fragile as a result. I think those who learned its quirks love it but I don't.

I've also liked mono repos at work as long as their operations are not sluggish. They immensely help global refactoring, static analysis and reduce the amount of merge conflicts.

- I'm not sold on Bazel (I'm more of a Nix person), but hermetic builds are obviously a good idea

- You don't need the added complexity of having to manage tooling to track dependencies across multiple repos. A single repo is just fine. You'll obviously want to split off repositories when you want to contribute stuff as open source, and in other cases http://danluu.com/monorepo/

- what you describe as "Google-style canary" is not actually Google-style or recommended by Google

See the SRE workbook: https://landing.google.com/sre/workbook/chapters/canarying-r...

(obviously, Google is big, so different teams might do things differently)

Whether using a monorepo or not, you should be creating versioned artifacts that are stored in a separate artifact repository, and different sections of your codebase (whether sharing the same repo or not) should express dependency via consumption of versioned artifacts from the artifact repository.

As a result, sometimes you can even require _more_ tooling to adequately handle dependency in a monorepo vs multiple repos.

When you consider that a default natural constraint of all software development is that it must adapt and cope with changing tools, usage demands, etc., then it quickly becomes important to allow any given project to use arbitrarily unique build tooling, deployment tooling, libraries, languages, databases, etc. Trying to shoehorn them all into a mandated set of monorepo tooling just doesn’t work, once again meaning you often need much more tooling to reach minimum required levels of flexibility with monorepos as compared with polyrepos.

There's no reason to rely on versioned artifacts as dependencies if each binary bundles all deps.

This is what bazel does.

In practice your second complaint about tooling isn't particularly true. Android apps, webservers, and language runtimes live in the same repo without issue.

> spread like kudzu.

I see this cargo-culting all the time, it spreads via breathless blog posts as well. People applying patterns from FAANG totally out of context because it sounds cool.

Google uses Kubernetes, therefore I should hand-roll my own K8s infrastructure to host my 1000 user CRUD app. Oooh and let's make it a SPA with 100 microservices because Facebook does that so it must be a good idea.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact