If you want to know more about our current technical scaling issues: Rest assured that we are working hard on them. Read on to learn why this is not as fast as you might expect.
When people like me talk about scaling, it is pretty obvious what we are referring to. It's about increasing computing power, distributed storage, replicated databases and so on. There are all kinds of technology available to solve scaling issues. So why, damn, is Codeberg still having performance issues from time to time?
I recently explained in a chat that we face the "worst" kind of scaling issue in my perception: the kind you don't see coming. There is no warning sign, such as the software getting slower day by day or a storage pool you can watch fill up. Instead, it appears out of the blue.
This happened to Codeberg in early January. It might have been because we didn't notice the growth, since the last days of December (when most people are on holiday) were much calmer than usual (also in terms of external traffic to Codeberg like CI pulls or release downloads).
We see a lot of traffic during European evenings. This makes us happy, on the one hand. To our knowledge, counting more than 50,000 users, we are by far the largest public Forgejo/Gitea instance. And we're proud to offer so many additional services like Codeberg Pages, Weblate and CI.
But it's also a problem to make sure the system handles these intense hours gracefully, while it would mostly idle for the rest of the day. Such a scaling issue keeps you busy, and you need to consider carefully how much headroom to add in order to handle the spikes without wasting precious donations and energy. Still, it is not the hardest kind to fix.
In theory, you could just tune some config, add more computing power, or storage, or apply a patch – once you have identified the cause.
A different kind of scaling issue
In the context of Codeberg, I'm seeing a different kind of scaling issue. I drafted a text in early November, but didn't share it anywhere, because my tone was filled with anger and frustration. I don't want to blame anyone, though. I want to improve and motivate, and finally to share our experience so that the next Forgejo instances don't run into this in the first place.
Speaking personally, I invest a lot of time into the project, and the last year at Codeberg was intense: periods of increased server instability, overflowing mailboxes, and social media users angry about their broken Codeberg Pages.
We are still here, and we manage to dig through the backlog. But the active team is small, and the work is increasing.
The hardest scaling issue is: scaling human power.
Don't forget the people
Many awesome solutions exist for technical scaling issues. But the human bottleneck is hard to overcome.
Take a look at our Ceph cluster, for example: It was added to solve a technical scaling issue. However, it consumes a lot more human power to get things right than traditional file systems do.
Consider Forgejo/Gitea, the software Codeberg runs on. It has a few flaws that go unnoticed in small instances (for your private server, high disk I/O might not be a critical problem; for Codeberg it is). It takes awesome developers to investigate these problems and write patches.
Configuration, investigation, maintenance, user support, communication – all of these require effort, and none of them is easy to automate. In many cases, setting up the automation would consume more human resources than we have.
At Codeberg, we might have made a mistake recently: When we migrated to our own hardware in a hurry, because of the increased demand for code hosting, we wanted to do "everything right and future-proof".
We transitioned from a basic stack (ext4 filesystem, Postfix, not much more) to one including so many shiny new things: Starting with the duty of maintaining our own hardware (which is a journey of its own), we introduced LXC containers, Ansible, BTRFS, ZFS, Ceph and more.
In the past months, I found myself reading a lot of documentation, mostly for Ceph and HAProxy, two tools which were tuned a little to mitigate the ongoing performance issues. I learned a lot about LXC containers and BTRFS. Still, there was no time left to dig deep into Ansible and ZFS, and by now both tools have even been mostly dropped, to reduce the stress on our team members.
Not because they are bad, but because they would consume more human resources than we can afford.
A historical problem
This kind of scaling issue is new to Codeberg, but not to the world. Most projects have likely gone through this at some point, or will experience it in the future. Availability issues for some platforms even had a nickname. Others had major outages which were hard to fix because communication between the teams was interrupted.
However, most websites we use nowadays had their early years long ago. Their growing pains are forgotten, their networks are mostly stable and have plenty of headroom.
How can a growing non-profit in 2023 compete with the changed expectations of today? Our "24/7 monitoring" is based on the free time of our core contributors. There are no paid night shifts; in fact, there is no payment at all. Still, people have become used to always-available guarantees, and demand the same from us: Occasional slowness in the evening of the CET timezone? Unbearable!
I do understand the demand. We definitely aim for a better service than we sometimes provide. However, sometimes the frustration of angry social media users carries me away, and I'm even tempted to reply: "If you cannot live without the extreme guarantees of huge proprietary cloud platforms, you cannot have the freedom Codeberg provides".
The reason for this is simple: In order to maximize availability, the cost (for hardware, energy, and human time!) increases exponentially. Big platforms do this because of competition. For donation-powered non-profits, it uses up donations that could have been put to better use, and leads to burn-out in the team.
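To put a rough number on the availability side of this, here is a minimal sketch of the arithmetic. The availability targets below are illustrative assumptions, not Codeberg's actual goals or measurements, and the function name is made up for this example. It only shows how little downtime each additional "nine" leaves per year; it says nothing about actual costs.

```python
# Illustrative arithmetic only: yearly downtime budget for an availability target.
# The targets used below are assumptions for this example, not Codeberg figures.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability -> {hours:.2f} hours of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten. At 99.99 %, less than one hour per year remains, which is hard to meet when every incident depends on a volunteer noticing it in their free time.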
How can non-profits keep up?
In the context of Codeberg, there are two primary blockers that prevent scaling human resources. The first one is trust. Because we can't yet afford to hire employees who work on tasks for a defined amount of time, work naturally has to be distributed over many volunteers with limited time commitment. This would require granting more people access to precious user data. The downside is that, unlike companies, we don't have employment contracts at hand for everyone that already cover the legal part of this.
And especially with a distributed team, building trust without meeting in person is a problem. Currently, only members with elected roles within Codeberg e.V. receive access privileges.
The second problem is in part technical. Unlike major players, which have nearly unlimited resources available to meet high demand, scaling Codeberg's systems involves:
- Ordering new hardware. Due to our democratic structure, this is discussed in the presidium first (for a good reason: Members should know that we think twice before investing about 3000 € in a potential storage upgrade).
- Scheduling an operation in the DC with another non-profit which houses our platform.
- Actually going there, touching and working on the server.
Sounds slow? Yes. A popular "solution" is outsourcing the actual server hosting, and we did so too until the end of last year. It's easy: Adding more resources means you can scale your cloud deployment within seconds. You don't need to worry too much about IOPS or hardware failures.
Still, is this the Internet we are striving for? No matter which instance you join or whether you host yourself, the data will reside on the systems of one of the few big cloud providers.
I don't yet have an answer to this. Maybe extend non-profit cloud hosting? Or have Codeberg grow large enough so that we can finally afford to pay more team members? I appreciate all kinds of discussions about this.
Some positive final words
A scaling issue is an issue, and it needs to be addressed. I'm proud to share that, while we still have a steep path ahead of us, we have started to do more work in public than a year ago, which has greater potential to onboard and involve more people. Thanks to those helping us diagnose issues in the "Contributing to Codeberg" Matrix channel (feel free to join!), and I'm glad we have even started to distribute the boring office work.
I can name a few optimizations that have saved a lot of human time recently. But scaling issues that come from growth aren't solved once – they return with ongoing growth. It will be a persistent effort to keep up with the increasing demand, and I invite everyone to help us stay ahead of this special scaling issue.
Are you happy to host some software yourself? Feel free to demonstrate and extend your knowledge by taking over responsibility for a service at Codeberg. Also consider reading the Improving Codeberg guide in the docs, joining the non-profit Codeberg e.V., and playing an active part in running the platform.
If contributions to Codeberg grow at a similar pace as usage of Codeberg does, I'm confident that we can benefit from some efficiency gains and provide a better service for everyone.
Happy to see you around! And a huge thanks to everyone involved in Codeberg, Forgejo, Gitea and all the other upstreams we rely on, for making our project possible. Thanks a lot!