Oct 21, 2024 was the beginning of a rough week for us here at Namespace. Our infrastructure didn’t meet our own reliability bar, and you ended up suffering substantial disruption, whether from increased queue times or from network-related errors that caused jobs to fail.
This is not OK. We care deeply about providing you an experience you can depend on, and we didn’t meet that bar.
We’re sharing the backstory of what happened because we value transparency, but also because we learned from these back-to-back issues. Hopefully others can learn from them too.
Timeline
- Packet drops in fra3 start being observed; SLO starts going down.
- Packet drops in fra3 are leading to job failures.
- fra3 rebuild as an attachment to zrh1 starts.
- The testing of the new network setup using Wireguard to mesh the sites is completed.
- With fra3 unavailable, global capacity is short, with an emphasis on linux/arm64 first.
- Overall global capacity is low, queuing times visible across the fleet. Some remote builds fail due to lack of capacity.
- 70% of the rebuild is complete, and queue times start to recover.
- Most of the rebuild is complete.
- zrh1 starts to degrade in performance.
- zrh1 core fails; workloads to zrh1 fail to start.
- Workloads are redirected away from zrh1, some recovery is observed.
- zrh1 is fully restored.
Background
Not everyone may know that Namespace manages its own infrastructure: we deploy workloads to bare metal machines we manage and orchestrate. We do so because offering the best performance and at the best price point is paramount to us.
We started by working with various well-known providers that offer bare metal and using their management and networking capabilities. Through using them, we experienced various pain points. Namespace manages many, many machines. Having access to high-performance hardware, being able to automate it, and having reliable networking are all crucial. And no single provider hit all the marks.
By the end of 2023 it became clear that running our own hardware would be important. And that’s where Namespace’s journey to deploy and run our own hardware started.
We first built out our North American presence, and then expanded into Europe. And after a few initial snags, we found the trifecta: the best hardware, with the automation we required, and with reliable networking.
Our story didn’t stop there, though, and we’ve been lucky enough to see continued growth through 2024.
While we’ve been adding new capacity to our “native” sites, we retained capacity in the previous compute providers as well. Our original goal was to be completely “native” by the end of 2024.
What happened leading up to the outage
Towards the end of Q3 we started running a bit hotter. Our growth outpaced our planning.
To manage that growth, we ordered a substantial increase in compute capacity, to be deployed to one of our European regions by Oct 15.
As the date approached, we heard from our supplier that a mistake had been made. The wrong DIMMs had been ordered, and because of the size of our order, no immediate supply was available. They’d have to order them from Asia, which would add 3 to 4 weeks to the delivery of our new hardware.
Not great – we’re running hot, and our planned capacity increase is not here to close the gap.
To close that gap, we ordered additional capacity from one of our existing bare metal providers – which would typically take 3 to 5 days to be delivered.
But we started observing something was off.
Early signs
To connect all of the bare metal machines to the same L2 segment, we use a feature this provider calls vSwitch. This feature provides functionality similar to having machines connected to an additional VLAN. To maintain deployment flexibility, we route ingress and egress traffic from different IP addresses to this vSwitch.
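For readers less familiar with this kind of setup, here is a minimal sketch of what attaching a machine to a VLAN-like segment and steering a specific source address through it can look like on Linux. The interface name, VLAN ID, addresses, and gateway are made-up placeholders, not our actual configuration, and the provider's vSwitch automation differs in the details.

```python
# Illustration only: attaching a machine to a VLAN-like segment (similar in
# spirit to the provider's vSwitch) with source-based routing, so traffic from
# a dedicated address egresses via that segment. All values are hypothetical.
import subprocess

PARENT_IFACE = "enp1s0"         # physical uplink (hypothetical)
VLAN_ID = 4000                  # VLAN tag for the vSwitch-like segment (hypothetical)
VSWITCH_ADDR = "10.20.0.5/24"   # address on the segment (hypothetical)
VSWITCH_GW = "10.20.0.1"        # gateway on the segment (hypothetical)
ROUTE_TABLE = "100"             # dedicated routing table for this traffic

COMMANDS = [
    # Create a tagged sub-interface on top of the physical NIC.
    f"ip link add link {PARENT_IFACE} name {PARENT_IFACE}.{VLAN_ID} type vlan id {VLAN_ID}",
    f"ip addr add {VSWITCH_ADDR} dev {PARENT_IFACE}.{VLAN_ID}",
    f"ip link set {PARENT_IFACE}.{VLAN_ID} up",
    # Policy routing: packets sourced from the segment address use their own
    # table, so that traffic flows through the VLAN rather than the default uplink.
    f"ip rule add from {VSWITCH_ADDR.split('/')[0]} table {ROUTE_TABLE}",
    f"ip route add default via {VSWITCH_GW} dev {PARENT_IFACE}.{VLAN_ID} table {ROUTE_TABLE}",
]

def apply(dry_run: bool = True) -> None:
    for cmd in COMMANDS:
        print(cmd)
        if not dry_run:
            subprocess.run(cmd.split(), check=True)

if __name__ == "__main__":
    apply(dry_run=True)  # print the commands without touching the host
```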
Packets going through the vSwitch started being dropped randomly and sporadically. At first, through Oct 17 and 18, these drops led to some flakiness in the system, but for the most part they were absorbed automatically by retries, whether TCP and QUIC retransmissions or retries issued by our control plane.
The impact of these types of failures is hard to observe, as they’re not deterministic.
To understand the impact of errors and retries over a longer time period, we rely on SLO-based monitoring. We’ve established latency targets at a given percentile for some key interactions in our system. And that’s where we started seeing the impact: our SLO error budget started burning down.
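As a rough sketch of what that kind of signal looks like in practice (the percentile, latency target, and budget below are illustrative numbers, not our actual SLOs):

```python
# Illustrative sketch of percentile-based SLO monitoring and error budget burn.
# The latency target, percentile, and target ratio are made-up examples.
from dataclasses import dataclass

@dataclass
class SLO:
    percentile: float        # e.g. 0.99 for p99
    latency_target_s: float  # interactions slower than this count against the SLO
    target_ratio: float      # fraction of interactions that must meet the target

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, int(p * len(ordered)) - 1))
    return ordered[rank]

def budget_burned(samples: list[float], slo: SLO) -> float:
    """Fraction of this window's error budget consumed by slow interactions."""
    bad = sum(1 for s in samples if s > slo.latency_target_s)
    allowed_bad = (1.0 - slo.target_ratio) * len(samples)  # budget for this window
    return bad / allowed_bad if allowed_bad else float("inf")

if __name__ == "__main__":
    slo = SLO(percentile=0.99, latency_target_s=2.0, target_ratio=0.999)
    # Mostly fast interactions, plus a handful made slow by retried packets.
    window = [0.4] * 985 + [3.5] * 15
    print(f"p{int(slo.percentile * 100)} latency: {percentile(window, slo.percentile):.2f}s")
    print(f"error budget burned this window: {budget_burned(window, slo):.0%}")
```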
We notified our provider of the problem, and they confirmed there was an issue.
The sporadic drops increased heading into the weekend, but the weekend dip in traffic made the problem less impactful.
Difficult decision
Monday arrived with a major drop in our SLO.
Unfortunately, our provider did not promptly tackle the problem. But that’s on us – we selected this partner (this is also one of the reasons we moved to our own hardware: we need the tools to address problems within our core infrastructure).
Without support, we made the difficult decision to move to a completely different network configuration that didn’t rely on vSwitches: we configured each individual physical machine at the provider to connect to a nearby Namespace-managed “native” cluster, as if that machine were physically connected, using Wireguard.
This was a big and difficult decision, but it was necessary. We had to choose between (1) letting some jobs keep failing irregularly because of network packet drops at runtime, without being able to control the situation, or (2) changing the setup in a major way so that we were in control. However, option (2) meant a short-term impact on our customers, since it was clear we relied on these machines to serve peak capacity.
We knew we had to do something to regain control and we got to work.
To ensure customers could still use our product, we diverted all customer traffic from the affected region to other regions, while we worked on a short-term solution to connect those machines from our provider to our own cluster.
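To make the new topology a bit more concrete, below is a minimal sketch of the kind of WireGuard peer configuration involved in attaching a single provider machine to a native cluster. The keys, addresses, and endpoint are placeholders, and our actual automation generates and rotates this material rather than hard-coding it.

```python
# Sketch: rendering a wg-quick style configuration that attaches one bare metal
# machine to a nearby "native" cluster over an encrypted tunnel.
# All keys, addresses, and endpoints below are placeholders, not real values.
from textwrap import dedent

def render_wg_config(machine_tunnel_ip: str, machine_private_key: str,
                     cluster_public_key: str, cluster_endpoint: str,
                     cluster_subnet: str) -> str:
    """Render a WireGuard config for one attached machine."""
    return dedent(f"""\
        [Interface]
        # Tunnel address this machine uses inside the cluster network.
        Address = {machine_tunnel_ip}
        PrivateKey = {machine_private_key}

        [Peer]
        # The native cluster's WireGuard gateway.
        PublicKey = {cluster_public_key}
        Endpoint = {cluster_endpoint}
        # Route cluster-internal traffic through the tunnel, as if the machine
        # were plugged into the cluster's network directly.
        AllowedIPs = {cluster_subnet}
        PersistentKeepalive = 25
        """)

if __name__ == "__main__":
    print(render_wg_config(
        machine_tunnel_ip="172.31.5.17/32",
        machine_private_key="<machine-private-key>",
        cluster_public_key="<cluster-public-key>",
        cluster_endpoint="zrh1-gw.example.net:51820",  # placeholder endpoint
        cluster_subnet="172.31.0.0/16",
    ))
```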
Working towards a stable solution that led to the partial outage
Diverting the traffic from a whole region did have a broader impact. All customers running jobs during the outage were affected by increased queue times. We had to work fast to keep the impact as small as possible.
Because we use an “immutable infrastructure” strategy and some key aspects of each physical machine had changed, we had to “repave” each machine so it would obtain a new identity, new certificates, new encryption at rest keys, etc.
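As a toy illustration of what “a new identity” means here, a repave regenerates roughly this kind of material per machine. The helper and field names are hypothetical; the real process also reinstalls the machine and re-enrolls it with our control plane.

```python
# Toy sketch of the identity material that changes when a machine is "repaved":
# a fresh machine identity, new key material backing new certificates, and a
# new encryption-at-rest key for local volumes. Names are hypothetical.
import secrets
import uuid
from dataclasses import dataclass

@dataclass
class MachineIdentity:
    machine_id: str         # fresh identity; the old one is never reused
    cert_private_key: bytes # stand-in for the key behind a new certificate request
    disk_key: bytes         # new encryption-at-rest key

def repave_identity() -> MachineIdentity:
    return MachineIdentity(
        machine_id=str(uuid.uuid4()),
        cert_private_key=secrets.token_bytes(32),  # placeholder for real key generation
        disk_key=secrets.token_bytes(32),
    )

if __name__ == "__main__":
    ident = repave_identity()
    print(f"new machine id: {ident.machine_id}")
```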
The urgency of the change required us to decide whether to delay recovery to retain caches by redistributing them to a different region (cache volumes are regionalized), or whether to drop caches and accelerate recovery. We chose the latter.
Each repave also requires data distribution: to start jobs quickly, each physical machine gets an assigned set of customer images to run. Those now had to be distributed to a large set of machines – this distribution relies on a few distributor machines which are not set up to distribute to many machines concurrently (usually repaving is gradual). That led to an additional slowdown.
To accelerate our automation, we organized a subset of our team to expedite the repaving process by cloning existing datasets in a more distributed manner, which got us done faster.
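The intuition behind that workaround: with a fixed set of distributors, seeding the fleet is bounded by those few machines, while letting freshly seeded machines act as additional sources fans out much faster. A simplified model of the effect (the fleet and distributor counts are hypothetical, and this is not our actual distribution tooling):

```python
# Simplified model of the image distribution bottleneck and the workaround.
# With a fixed set of distributors, each "round" seeds at most len(distributors)
# machines; if freshly seeded machines can also act as sources, the pool of
# sources grows each round and the fleet is seeded in far fewer rounds.
# This models timing only; it is not our actual distribution tooling.

def rounds_to_seed(fleet_size: int, initial_sources: int, peer_assisted: bool) -> int:
    seeded = 0
    sources = initial_sources
    rounds = 0
    while seeded < fleet_size:
        seeded += sources        # each source seeds one machine per round
        if peer_assisted:
            sources += sources   # newly seeded machines become sources too
        rounds += 1
    return rounds

if __name__ == "__main__":
    FLEET = 200        # hypothetical number of machines to repave
    DISTRIBUTORS = 4   # hypothetical number of dedicated distributor machines
    print("distributor-only rounds:", rounds_to_seed(FLEET, DISTRIBUTORS, peer_assisted=False))
    print("peer-assisted rounds:   ", rounds_to_seed(FLEET, DISTRIBUTORS, peer_assisted=True))
```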
As batches of machines became ready, they were added to the new region. And when critical mass was reached, we added the region back to the pool of compute.
Recovery – or so we thought
As the evening arrived, we were finishing up the rebuild, and most of the capacity had been restored.
The tail of capacity that was not immediately ready was dominated by a few machines with a slightly different SKU (we otherwise run a uniform fleet), and by image distribution.
A few hours later we were done, and the majority of capacity had been restored.
Second outage
We thought we were out of the woods. But on Wednesday another rare event occurred: one of our control plane nodes started failing partially (backed up I/O, caused by slow NVMe writes).
Namespace relies on Kubernetes to handle per-fleet machine management and part of its data plane. We run various Kubernetes clusters ourselves, in a configuration that is optimized for performance: fast pod creation, and fast API server calls.
With the increase in usage, the rate of change in Kubernetes tipped past the point where the I/O on that control plane node could keep up, which led to the API server showing very high CPU usage, caused by an increase in retries.
Unlike other regions, this particular control plane node ran on an older machine type, which exacerbated the utilization problems.
With Kubernetes struggling to make progress, various parts of our system started to suffer as well.
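Looking back, a simple storage probe would have pointed at the culprit sooner. Here is a minimal sketch of the kind of check that surfaces backed-up writes on the disk behind a control plane node; the path and threshold are illustrative, not the exact check we run.

```python
# Minimal sketch: measure synchronous write + fsync latency on the disk that
# backs the control plane's state. Sustained high fsync latency is exactly the
# kind of "backed up I/O" that starves etcd and, in turn, the API server.
import os
import statistics
import time

def fsync_latencies(path: str, samples: int = 50, size: int = 8 * 1024) -> list[float]:
    """Write a small block and fsync it `samples` times; return latencies in ms."""
    latencies = []
    payload = os.urandom(size)
    with open(path, "wb") as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            latencies.append((time.perf_counter() - start) * 1000.0)
    os.remove(path)
    return latencies

if __name__ == "__main__":
    # Point the probe at a writable path on the device under suspicion.
    lats = fsync_latencies("/var/lib/fsync-probe.tmp")
    p99 = sorted(lats)[int(0.99 * len(lats)) - 1]
    print(f"median fsync: {statistics.median(lats):.2f} ms, p99: {p99:.2f} ms")
    if p99 > 10.0:  # illustrative threshold; etcd expects fsync well under ~10 ms
        print("warning: storage is likely too slow for a healthy control plane")
```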
The resolution
To minimize downtime, we ended up cloning the key state maintained on that physical control plane node to another node with higher capacity, swapping it in place – the new node took over the role of the old one, with the same IP address, etc. After crunching through a set of pending changes, the Kubernetes cluster was back to normal; each kubelet simply reconnected as if nothing had happened.
The second outage's total downtime was roughly 50 minutes, which was higher than we’d want for this type of problem.
In hindsight, having just come out of a major outage rooted in networking, we became anchored on it: the team first started looking for classes of problems related to the previous outage, while this was a different class of problem altogether.
What we take away from this
1. Customer character matters
While we recognize the impact these outages had on our customers, we also truly value the support and encouragement we received during this difficult time. Many customers showed understanding and patiently waited for us to work through it. This is not something we consider a given, and we would like to thank you for that.
And to make sure you only pay for real value, we have credited all instances that ran during the outages back to our customers.
2. Preparation matters
As a SOC2-certified company, we regularly run tabletop exercises and crisis simulations, planning for scenarios just like this. When the alert came in, our response wasn’t only immediate – it was orchestrated. A real-life crisis still feels slightly different, but everyone was ready to get into action immediately.
We have also taken the week after the outage to focus on building up backup regions and solutions so that if we run into a similar scenario again, we do have options that don’t impact our customers.
Curious to read more about our approach? Head over to our Trust center (incl. SOC2 report).
3. The team and culture matter
Here at Namespace, we've built a culture where every team member understands that behind each job running on our platform, there's a customer counting on us. When put to the test, this foundation of trust and preparation proved its worth.
While our team spans multiple locations, we mix distributed flexibility with strategic in-person collaboration to create something special. Every month, we have what we call our "get-together" week — a time when our entire team works from the office. These outages coincided with our monthly get-together. The office vibe quickly shifted into a perfect blend of hackathon intensity and mission control focus.
But here's the thing: while having everyone on-site certainly supported a smooth collaboration, it's our everyday culture of collaboration and readiness that truly made the difference.
Was it all pink and fluffy? No, of course not – everyone was exhausted after this high-stress week of intense, high-stakes work and needed the weekend to recover.
We'd be happy to hear from you! Your feedback and questions are invaluable to us as we continue evolving our platform and services.