Serious question, have you been part of an org that had to scale orders of magnitude very quickly?
Anyone who has been part of that journey knows how painful it really is. A lot of times the systems fail at all levels, and you have to redesign them from first principles.
> Serious question, have you been part of an org that had to scale orders of magnitude very quickly?
I have, but it depends what you mean.
Scenario 1: e-commerce SaaS (think: Amazon but whitelabel, and before CPUs even had AES instructions); Christmas was "fun".
Scenario 2: Video Games. The first day is the worst day when it comes to scale. Everything has to be flawless from day 0 and you get no warning as to what can go wrong.
Yet, somehow, I managed to make highly reliable systems.
In scenario 1, I had an existing system that had to scale up and down with load. This was before the cloud existed and hardware had a 3-4 month lead time, so most of the effort went into optimising existing code, increasing job timeouts and "quenching" sources that were expensive. We also used to do some 'magic' when it came to serving requests that had a session token or shopping cart cookie.
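By "quenching" I mean roughly per-source throttling. A minimal sketch of the idea (the names and rates are illustrative, not what we actually ran):

    import time

    # Token bucket per traffic source: expensive sources get a smaller
    # budget, so under load they are "quenched" before starving the rest.
    RATE = {"cheap": 100.0, "expensive": 5.0}    # tokens per second (illustrative)
    BURST = {"cheap": 200.0, "expensive": 10.0}  # bucket capacity (illustrative)

    tokens, last = {}, {}

    def allow(source: str) -> bool:
        now = time.monotonic()
        rate, cap = RATE.get(source, 10.0), BURST.get(source, 20.0)
        if source not in tokens:
            tokens[source], last[source] = cap, now
        # Refill the bucket in proportion to elapsed time, up to its cap.
        tokens[source] = min(cap, tokens[source] + (now - last[source]) * rate)
        last[source] = now
        if tokens[source] >= 1.0:
            tokens[source] -= 1.0
            return True
        return False  # quench: reject or defer this request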
In scenario 2, we had a clean-room implementation and no legacy, which is a blessing but also a curse: there's no possibility of sampling real usage, but you also don't need to worry about making breaking changes that are for the better. With legacy you have to figure out how to migrate to the new behaviour gradually.
So, pros and cons... but it's not like handling huge load hasn't been done before. Computers are faster than they have ever been, and while my personal opinion is that operational knowledge is dying (due to a general disdain for people who actually used to run systems at scale, not just write hopeful "eventually consistent" YAML that they call deterministic), the systems that exist today hold your hand much better than they did for me 20 years ago.
And I ran 1% of web traffic with an ops team of 5 back then. So, idk what's going on here.
EDIT: Likely people are flagging me because I sound arrogant (or I hurt their feelings by talking bad about YAML-ops), but all I am doing is answering the question presented based on my experience.
It really, really depends on what you mean. Specifically, it depends on the application and its various compute, I/O and access patterns. Scaling ecommerce and games is well-known by now (e.g. Amazon and Blizzard have been dealing with insane scale for two decades now.) However, anything outside a well-known pattern can be very tricky to scale.
I once worked on a team that had to 100x scale a system whose downstream dependencies were various 3rd party APIs and data sources, most of which had no real SLAs to speak of and had extremely high variance in latencies and data transfer patterns. This basically required rearchitecting everything, including our clients, because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call-tree and caused outages 3+ services removed from ours. In some cases, the re-architecting went all the way to the UI.
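The shape of the fix, very roughly: put a queue and workers between the client-facing path and the flaky third party, so a slow API stalls a worker rather than the whole call tree. A toy sketch of that idea (hypothetical names; a real system would persist results instead of keeping them in a dict):

    import queue, threading, time, uuid

    jobs = queue.Queue()
    results = {}  # job_id -> result; stand-in for a real datastore

    def worker():
        while True:
            job_id, fetch = jobs.get()
            try:
                # Hard deadline on the third-party call: one slow
                # dependency ties up this worker, not the whole call tree.
                results[job_id] = fetch(timeout=5.0)
            except Exception as exc:
                results[job_id] = f"failed: {exc}"
            jobs.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def submit(fetch) -> str:
        # Client-facing path: enqueue and return immediately; the client
        # polls (or is notified) instead of blocking on the dependency.
        job_id = str(uuid.uuid4())
        jobs.put((job_id, fetch))
        return job_id

    job = submit(lambda timeout: "payload from a slow API")
    time.sleep(0.1)
    print(results.get(job, "still pending"))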
Years later, I led a company-wide effort to optimize our entire user-facing application infrastructure so it wouldn't fall over from sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was no single common recommendation (like "tune your caches") we could give that would help all the teams, because each one had very different resource usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into, some common issues to look for, and some common solutions; get back to us if you need help." We spent a lot of time on the help.
I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB, and it has a dependency (Actions) with extremely high variance in latencies and resource usage.
I wrote this in response to the below comment, which is now edited and unfortunately dead, so posting here:
I understand, that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!
Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.
It is really hard to predict how things will break until they do.
(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish those themselves. There were many people who researched their products and published their findings. For instance, someone analyzed Wireshark traces and published a very detailed report about how they tuned their server-side networking stack. One thing that stood out was Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)
We used TCP for The Division, this was a major mistake and I don't think it was something people should repeat.
For example, if you have TCP_NODELAY and a few thousand players, you'll be swimming in about 1.2M packets per second pretty quickly.
This is enough to completely crush any stateful firewalls (UDP would pass through because there's no state to check), so we had to do ACLs in network hardware instead, and append a magic number so that we could prevent flooding.
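To make the packet math concrete (illustrative numbers, not our actual tick rates): 4,000 players at a 30 Hz tick with ~10 small writes per player per tick is already 1.2M packets/second once Nagle's algorithm is off. And turning it off is one line:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm: small writes are no longer coalesced,
    # which lowers latency but lets every write become its own packet.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)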
Another thing we found was that Windows networking activity only happens on Core0 (Windows Server 2012 R2), and that at 1.2M PPS the driver crashes.
Logging in to a Windows machine which is AD connected when its network interface is dead is not ideal.
Makes sense, and that was the surprising thing about WoW using TCP. I wonder if Blizzard chose to put in all that extra effort to make TCP work because they encountered enough crappy home routers out there that mangled any non-TCP traffic...
The root comment asked if I'd been part of an org scaling orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4 month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So, with the credentialing question out of the way...
GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
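And load shedding in its minimal form is barely a screen of code; the discipline is in deciding to do it. A sketch (illustrative capacity, nothing GitHub-specific):

    import queue

    # Bounded backlog as admission control: when it fills, fail fast with
    # a retriable error instead of queueing unboundedly and letting
    # latency cascade into every dependent service.
    backlog = queue.Queue(maxsize=1000)  # capacity is illustrative

    def admit(request) -> int:
        try:
            backlog.put_nowait(request)
        except queue.Full:
            return 503  # shed: tell the (possibly misbehaving) client to back off
        return 202  # accepted for asynchronous processing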
Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.
Your 100x rearchitecture story actually argues my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework.
So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.
Agreed, the techniques in general (caching, backpressure, exponential backoffs, etc.) are well-known, but a couple of things:
1) The general cause of issues in these cases is that certain assumptions no longer hold, and above a certain level of complexity, there are too many assumptions to keep track of, and so things fail in surprising ways. Like, the need for auto-scaling was well-known and Amazon did have that solution in place. But I recall the 2018 Prime Day was record-breaking, so it is likely the very same auto-scaling service that was supposed to save them fell over because they forecast too conservatively! (As an aside, I follow a senior AMZN engineer who's made his career out of load-testing their services, and he has many fun war stories.)
2) The resiliency work is not done upfront because it is additional complexity that may not be needed. "You're not Google" and YAGNI are sound advice most of the time. So the system is designed with some "reasonable" assumptions (which... see above!) At larger companies, resiliency mechanisms (load-shedding etc.) are built into standard components, but then...
3) Different performance profiles require different resiliency mechanisms, and it's not always clear what they would be.
Going back to the example of the 3rd party API service: when we inherited it around 2012, it was built on standard infrastructure components with built-in resiliency mechanisms... but those were designed for internal services with latencies expected in milliseconds, whereas our downstream calls could go into seconds or even minutes. Still, at the traffic levels of the time, a little tuning made it work fine and it served the company well... until we (or the 3rd party APIs!) hit a certain scale and started seeing issues. At that point we extrapolated the trends, benchmarked heavily, and re-architected. And then we hit new scales and new use-cases that surfaced new issues, so we had to re-architect again!
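To give a flavour of the re-tuning: a retry helper built for internal millisecond services and one built for minute-long third-party calls are the same algorithm with very different parameters. A generic sketch of capped exponential backoff with full jitter (parameters illustrative):

    import random, time

    def call_with_backoff(fn, attempts=5, base=1.0, cap=60.0):
        # base/cap must match the dependency: fractions of a second for
        # internal services, seconds to minutes for high-variance 3rd
        # party APIs. Full jitter avoids synchronized retry storms.
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))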
The point is, the system's performance profile was very different from typical web services (the primary culprits being extremely high variance in downstream characteristics and very non-linear growth) and it was non-obvious to scale with conventional wisdom. I do not know what's happening at GitHub, but I suspect they have some similarly unique performance aspects.
I think you meant "greenfield" and not "clean room"? Clean room refers to reverse-engineering an existing program to create specifications, then having another team implement the specifications without legal risk from involving the original.
They say it is at least one order of magnitude[1]; "our plan to increase GitHub’s capacity by 10X in October 2025 ... By February 2026, it was clear that we needed to design for a future that requires 30X today’s scale."
I wouldn't be surprised. Have you not noticed the sheer volume of slop being posted everywhere these days? Almost all of that is hosted on GitHub. And some of those repos have insane commit frequencies.
I believe it to be possible to vote for someone and also not be 100% aligned with every outcome they bring about.
In fact, it may even be possible for this to be more in line with the vision of people who did not vote for Trump than the people who voted for him.
If you look at presidents historically, there are some occasions where the vision they described during the campaign to get votes is not in fact the vision they bring about while governing.
Also - what is your presumption of what my vision of the USA might be?
Nobody wanted a war in Iran; recall how Trump wasn't going to start any new wars? We are in this conflict because of Zionists - it had nothing to do with US interests at all.
The best way to get a Trump type to do what you want is to inflate their ego and then tell them they can't actually do the thing you want them to do. If you can set them up to correct you about something stupid at this point, you're already halfway there.
If words alone don't get them dancing to the exact tune you intended, throw a few momentary glances of pity at them; make certain they clearly see you, but in the same second look away - like you are trying to hide that you pity them, and you pity them so much you just can't help but "see them that way". Apologize profusely; say things like "I'm sure when you were younger", "you have to be realistic", "can't have your hand in everything", etc.
When they agree, just go on about how much they don't have to do the thing, how they don't owe you anything and have proven themselves already and don't need to again, blah, blah, blah (they soak this part up, so drag out the praise). On their way out, tell them how ashamed you are for being so incapable compared to them, how much of an inspiration they are to you and how glad you are to know them, because really - only they can save you.
999 times out of 1000, the above BS will work like a charm.
I don't actually think DJT would be as "hard" to manipulate as the above - Bibi likely made several passing comments and then outright begged him, which I think literally happened, because Trump said he "practically had to beg Israel to help" - so I'm pretty sure that's exactly what happened, in reverse. It was likely very theatrical.
This war is Bibi's wet-dream fantasy made real by a narcissist with an immense amount of actual power. Nobody would have to beg him; he has been begging for this since the 90s.
> If China ever feels emboldened enough to go for Taiwan and the US descends into complete chaos, the rest of the world running on AI will be at the mercy of authoritarian regimes.
The alternative being the current reality: a world dominated by the US. Let's ask people in the Middle East/Asia/South America how they feel about that. In this current day and age, how is this statement even relevant?
If you think AI can replace an SRE in April 2026, I've got a bridge to sell you. I'm not saying "don't use AI." I'm saying don't turn off your brain and let AI drop your production database.
That is the near-universal definition, but I don't think it captures the essence of it.
The difference between terrorism and warfare is the degree of top-down control. Warfare is done in uniform, by people in a hierarchy.
The reason for the distinction is that there is somebody taking responsibility. You end a war by agreeing to a treaty with the top level. You can hold the top level responsible for violations of the rules of war.
Terrorism, by contrast, is harder to stop. There is no authority to end it. Even state-sponsored terrorism need not end when the sponsoring state agrees; they can find a different sponsor.
That doesn't make one morally worse or better than the other. It's just a distinction worth drawing, because it governs how you go about bringing an end to it.
The US law for terrorism is about attacks against it, and they combat those differently from how they'd go about fighting a war against a conventional enemy.
What the US is doing to Iran is almost certainly unlawful, but I think that calling it "terrorism" obscures the fact that there is an authority to end it. The attack is at least subject to a body of law, which terrorism is not.
Again, not better. Arguably, much worse. Which is why I find the definition problematic.
You're conflating terrorism with irregular warfare. The Oradour-sur-Glane massacre was terrorism committed by regular forces; the French resistance blowing up a German supply train was non-terrorist action by irregular forces.
The US has constantly been at war for like 250 years. How can you conclude war is easier to stop than terrorism? Can we make the USA stop waging war? Because that would be a nice change.
Just take Iran, they agreed to a treaty with the top level of the USA. But the next top level ripped up the agreement and now is threatening total destruction of their civilization. Should Iran sign a new deal with that guy, and what's to stop him from tearing that up and bombing them again?
The US stopped individual wars. They went on to attack somebody else, but the host of the previous war was happy to see it over. That's why they negotiated a peace treaty, and the US mostly respected that. (Except with the native Americans.)
There is nothing to stop the next guy from changing his mind, but it generally doesn't happen.
It certainly could, and yeah, there's a really strong case that with the current administration the US has gone completely off the rails. My last comment was speaking generally about civilized countries. It doesn't account for rogue states, and the US is increasingly fitting that definition.
Can a rogue state commit terrorism by my definition? Not with a uniformed army. That's another part of my definition of terrorism: it puts civilians in jeopardy by hiding its combatants among them. Uniformed soldiers are legitimate targets, which means it's possible to fight back only against legitimate targets, even if those legitimate targets are committing acts that would otherwise be terrorism.
I don't think targeting civilians is a sufficient definition for terrorism, because militaries have been doing that since forever. It's basically part of war, even if we wish to pretend otherwise.
> The US stopped individual wars. They went on to attack somebody else, but the host of the previous war was happy to see it over.
Right, until they come back and attack again, though. The USA has invaded several countries multiple times, including Iraq, Haiti, the DR, Cuba, Panama, Nicaragua... Seems to me "terrorist" is just something states call the warriors of the people they themselves are terrorizing.
Frankly, the current administration is just recycling the propaganda and playbook of the bloodthirsty Neocons, so I don't see how this current administration is an aberration.
By the same logic, you could also justify bombing the White House, since they're clearly using weapons to destroy civilian infrastructure in other countries, and also murdering civilians. That would be classified as a terror act, though.
I'm sorry if my general discourse is grating in this thread/in general on HN. It really is just kind of who I am and how I figure things out for myself, I'm very off the cuff and unafraid of looking/being stupid if it wins me some knowledge. I think the point of my question is that I predict this is how the conversation is going to go in the next few days (if the bombs start dropping), and I guess I'm trying to predict for myself where this conflict is headed. Big picture, Strait of Hormuz is just one front on the war to get Iran to submit which as stated is to remove their ability to achieve the bomb (if you believe the initial justifications). I don't know if that sheds any light on what my goals were or if it gave you a migraine. Apologies if the second.
> as stated is to remove their ability to achieve the bomb
As stated by the people committing the violence today and threatening massive civilian death and destruction, Iran's ability to achieve the bomb was already destroyed many months ago.
Non proliferation is over and done with. Every country that can afford it and has the capability will have the bomb. There is no way this genie will go back in the bottle. And that means that a future nuclear war - if not today - is all but a certainty.
So I guess we just gotta bomb every country until they call uncle!
If the capability of making a bomb is enough to justify bombing them, and every country has the capability to make a bomb, then what is the rubric again? When do we stop bombing?
You can believe whatever you want but I actually don't believe that Iran would use that bomb offensively even if they have been threatening to do so. In that sense they would be no different than all but two other nuclear powers. Only one country has ever used nuclear weapons and only two countries have threatened to use them in recent memory. Iran was more or less bottled up until Trump decided to cancel the deal that was in place and since then it has been going down further and further. You don't escalate like this unless you are prepared to deal with the consequences and it is pretty clear that Trump had absolutely no plan post the initial bombings and has been winging it. That makes him super dangerous, not just to Iran but to pretty much the whole world.
This conflict has the potential to spiral out of control in ways that make my hair stand on end, and, to not put too fine a point on it, it's the first time in my life that I actually wonder if there will be a tomorrow as we know it. Apparently some idiot thinks he has the power to decide over life and death for hundreds of thousands of people, and for once the person with that power is deranged enough that he might actually use it unprovoked. In a sane world this guy would be behind bars instead of at the top of the food chain.
The Strait of Hormuz was open 6 weeks ago and Iran was not as determined to 'get the bomb' then as they are now.
I don't know how exaggerated this story is, but one of my buddies did his internship at TD. One of his skip-level managers told him that if you know COBOL, there are departments that will give you a blank cheque during salary negotiation.
Yeah it's hard to say but I believe there's at least some truth to that. I took COBOL off my resume over a decade ago just to combat the volume of recruiters trying to drag me away from the cloud back to on-prem land.
A good friend of mine who worked on a CICS-based credit card processing application at that bank doubled his salary twice inside of 4 years. First by quitting the bank and going to a boutique consultancy to build competing software (which they sold to other banks), and then by quitting that job and coming back to the bank to take over the abysmal state the CICS app had lapsed into in his absence.
And that was circa 2010.
One thing that was true of the bank then and I'm sure is true now is that when they see a nail they truly have just the one hammer. When a problem comes along, hit it with a huge sack of cash until it goes away.
I don't think "know COBOL" is enough. I'm pretty sure I can learn COBOL in a week. It's more about "know COBOL and know all this old stuff like CLIs, etc, and know all these old approaches".
Typically it's not just about knowing COBOL as a language, the bottleneck is having real expertise wrt. highly specific, fiddly proprietary frameworks that are implemented on top of COBOL.