Speculating on the Future of Git and GitHub
What language/framework would you use to build a service like GitHub from scratch today?
Thinking about what comes after GitHub is quite fun, if we forget for a moment how difficult it would be to actually achieve. Let's indulge in that fantasy here.
If we break down the challenges GitHub faces, they divide into issues with Git itself and issues with the features GitHub provides. Personally, I think it is now quite difficult to beat GitHub while remaining Git-based. Given the Microsoft acquisition and the capital actually being poured in, you cannot gain an advantage on the same Git foundation without a very radical angle.
Therefore, to truly beat GitHub, I think it may be best to rethink the underlying VCS from the ground up. Fortunately(?), while Git's excellence is widely recognized, it is also known for its steep learning curve and for not working well in certain use cases. So let's analyze what Git / GitHub lack, measured against the features the modern era demands.
Good Points of Git / GitHub
- A command system optimized for the CLI
- Distributed file system / High-efficiency compression (git pack)
- The existence of GitHub to compensate for Git's lack of collaboration features
Challenges of Git
- Absence of a definitive GUI tool
- Inability to perform partial checkouts
- Lack of collaboration features within Git itself
- Poor handling of binaries (not covered this time)
Lack of Collaboration Features in Git Itself
For example, no equivalent of GitHub's pull request exists in Git core.
I've been thinking that it might be better to incorporate collaboration features like GitHub's Pull Requests, Reviews, and Issues into the CLI command system itself.
There might be disagreement on this. In terms of the Unix philosophy of "doing one thing well," there's no other tool that does it as well as Git. However, I believe even the command structure for things like merge and push would change if they were designed with PRs as a prerequisite.
Absence of a Definitive Git GUI Tool
Especially in programming tools, GUIs tend to be a mapping of the CLI. Git's command system is intuitive in the CLI, but its operations are cumbersome in a GUI, and few tools represent them well. The concept of staging, and the act of committing from a partial add of a file, are difficult to translate into GUI operations. That said, I personally don't know a tool more convenient than git add -p.
Fundamentally, in the world of Git, the CLI is Tier 1 and any GUI is subordinate to it, so Git training is conducted via the CLI. And once you have learned one GUI tool's way of working, switching tools for any reason usually means relearning most of it.
Learning the CLI also means understanding the file system and picking up a set of Unix commands along the way. For anyone who wants version control to spread universally, beyond just Git users, this feels like it has become a blocker to adoption.
Partial Checkout / Sparse Checkout
I have heard that Google and Facebook each keep a single monorepo internally and use in-house tools to partially check out that tree, but those tools are not public. Monorepo development for ordinary developers like us seems to come with real pain due to this lack of tooling.
Mercurial has Partial Clone. As an experimental project at Facebook, there is a project called Eden that rewrites Mercurial from Python to Rust.
- facebookexperimental/eden: EdenSCM is a cross-platform, highly scalable source control management system.
- PartialClone - Mercurial
I have mixed feelings about wanting to use Mercurial at this late stage, though...
Anyway, Git does ship a sparse checkout mechanism, but using it is a rather underwhelming experience.
If you were to design a new VCS
First, it should be designed with online collaboration as the primary focus. I imagine features like being able to perform partial checkouts and having smooth integration with editors.
To touch briefly on Git's internals: Git's object store is self-certifying, addressing objects as .git/objects/[sha1]. In other words, even across different repositories, identical file content yields the identical hash. This is an excellent mechanism: as long as sha1 cannot be collided, the content cannot be misrepresented. If your changes match what someone else has, the network can determine what is needed and what is redundant; by exchanging the hashes of the files each side holds, the differences become clear, enabling efficient synchronization.
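To make the self-certifying idea concrete, here is a small sketch in Python. The hashing function follows Git's actual blob format (a `blob <length>\0` header prepended before hashing); the two "repositories" below are just stand-in sets to show how exchanging hashes reveals what needs transferring:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Hash a blob the way Git does: the object type and length are
    prepended before hashing, so identical content always yields the
    identical object id, regardless of which repository holds it."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Two "repositories" that happen to hold the same file content produce
# the same hash, so a sync protocol only needs to exchange hashes to
# discover which blobs are missing on each side.
local = {git_blob_hash(b"hello\n"), git_blob_hash(b"README")}
remote = {git_blob_hash(b"hello\n"), git_blob_hash(b"LICENSE")}
missing_locally = remote - local  # only these blobs must be transferred
```

You can confirm the format against a real repository: `git hash-object` on a file containing `hello\n` produces the same digest as `git_blob_hash(b"hello\n")`.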
Recently, experiments by Google and others have shown that it is practical to produce sha1 collisions in realistic computation time. A new design should therefore adopt sha256 or another strong hash algorithm of the kind battle-tested in the blockchain space.
I personally think Git's real excellence lies in git pack, which delta-compresses similar content among objects based on edit distance; each blob's content then becomes an index into the packfile.
If we design a new VCS while inheriting this blob-object mechanism: where current Git periodically syncs local blobs via git fetch/pull, a next-generation version control system could instead be network-first, retrieving blobs on demand at the moment they are referenced.
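A network-first store could look like the following sketch: blobs are fetched lazily on first access, verified against their hash, and cached. `fetch_remote` stands in for a hypothetical network call and is faked here with a dict:

```python
import hashlib

class LazyBlobStore:
    """Sketch of an on-demand, content-addressed store: a blob is
    fetched from the network the first time it is referenced, then
    cached locally. Because the key is a hash of the content, the
    result can be verified the moment it arrives."""

    def __init__(self, fetch_remote):
        self._fetch = fetch_remote  # hash -> bytes (the "network")
        self._cache = {}

    def get(self, blob_hash: str) -> bytes:
        if blob_hash not in self._cache:
            data = self._fetch(blob_hash)
            # Self-certifying: reject tampered or corrupted data.
            if hashlib.sha256(data).hexdigest() != blob_hash:
                raise ValueError("blob content does not match its hash")
            self._cache[blob_hash] = data
        return self._cache[blob_hash]

# Fake "network" keyed by sha256 of the content.
content = b"print('hi')"
h = hashlib.sha256(content).hexdigest()
network = {h: content}
store = LazyBlobStore(network.__getitem__)
```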
In my experience prototyping web-based editors, Git's requirement that initialization walk the reference tree from the root is ill-suited to web payloads. Local checkout takes too long, which makes for a very poor experience. Browsers are also not designed to hold large amounts of data in memory, so loading massive data sets makes them unstable.
Assuming partial checkouts, the concept of a repository changes. The unit of GitHub's user > repository becomes merely a namespace.
Files would be shared over a P2P network sharing blob objects, and each person would create a file tree under a namespace where they have write permissions, and those entities would be references to blob objects.
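Under that model, a "repository" reduces to a file tree under a writable namespace, where every entry is a reference into a shared, deduplicated blob space. A minimal sketch (the `alice/hello` namespace and paths are made up for illustration):

```python
import hashlib

def blob_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Shared, deduplicated blob space (in reality: a P2P network).
blobs = {}

def put_blob(data: bytes) -> str:
    h = blob_hash(data)
    blobs[h] = data  # identical content always lands on the same key
    return h

# A "repository" is now just a file tree under a writable namespace;
# each entry is a reference to a blob object, not the content itself.
tree = {
    "alice/hello/main.py": put_blob(b"print('hello')\n"),
    "alice/hello/README": put_blob(b"# hello\n"),
}

# Another user publishing an identical file adds nothing new to the
# network: the blob already exists under the same hash.
assert put_blob(b"print('hello')\n") == tree["alice/hello/main.py"]
```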
This is close to the idea behind IPFS ("Powers the Distributed Web"). IPFS is itself a self-certifying object tree, and its IPNS mechanism lets you assign aliases to hashes.
What Becomes Possible and What Becomes Difficult
In other words, a new VCS would be expressed as a "file system" with an "append-only P2P network with version control" as the backend. As an interface, there would be an API to manipulate aliases to hashes, upon which the directory structure is layered. It would be possible to create a PR between two online users without an intermediary centralized repository, and to complete a Merge entirely between those two parties.
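The alias layer on top of the append-only network could itself be a tiny append-only log that maps a human-readable name to the hash of the current tree root, similar in spirit to IPNS or a git ref. A sketch with illustrative names:

```python
class AliasLog:
    """Sketch of an append-only alias log: a name (like an IPNS entry
    or a git ref) points at a hash, and updates only ever append new
    entries, so the full history of what a name pointed to survives."""

    def __init__(self):
        self._log = []  # append-only list of (name, hash) entries

    def publish(self, name: str, tree_hash: str):
        self._log.append((name, tree_hash))

    def resolve(self, name: str):
        # The most recent entry for a name wins.
        for n, h in reversed(self._log):
            if n == name:
                return h
        return None

    def history(self, name: str):
        return [h for n, h in self._log if n == name]

refs = AliasLog()
refs.publish("alice/project", "hash-v1")
refs.publish("alice/project", "hash-v2")  # a merge is just a new entry
```

A PR between two online users then amounts to one party publishing a candidate tree hash and the other appending a merged hash under their own namespace, with no central repository in between.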
Currently, editing code from GitHub's GUI is not a great experience, because it forces trade-offs such as opening just one file versus checking out everything. I want to turn this into an experience where you partially check out files and open them in the browser. Similar things already exist: github1s lets you view files in VS Code Online just by changing github.com to github1s.com.
conwnet/github1s: One second to read GitHub code with VS Code.
However, it is known that search becomes difficult in this scenario. Since you don't hold all the files locally, local search won't work; you need someone to collect all the data for search indexing. Also, because network RTT is incurred when fetching blobs on demand, IDE analysis slows down with the depth of references, so related resources need to be preloaded. We would probably need to devise conventions such as specifying preloads in HTTP headers.
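Preloading could reuse existing HTTP conventions. As a sketch, the server could walk a blob's references a level deep and emit a `Link: ...; rel=preload` header so the client warms its cache before the IDE asks. The `refs` map (blob hash to the hashes it references, e.g. its imports) and the `/blob/` URL scheme are hypothetical:

```python
def preload_header(blob_hash: str, refs: dict, depth: int = 1) -> str:
    """Build an HTTP `Link` header listing blobs the client is likely
    to need next. `refs` maps a blob hash to the hashes it references
    (imports etc.); we walk `depth` levels of that graph."""
    seen, frontier = set(), {blob_hash}
    for _ in range(depth):
        frontier = {r for h in frontier for r in refs.get(h, [])} - seen
        seen |= frontier
    return ", ".join(f"</blob/{h}>; rel=preload" for h in sorted(seen))

# Hypothetical reference graph: mod-a imports lib-x and lib-y,
# lib-x in turn imports lib-z.
refs = {"mod-a": ["lib-x", "lib-y"], "lib-x": ["lib-z"]}
header = preload_header("mod-a", refs)
```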
If you try building file IO using IPFS, you'll find the experience isn't that great. This is the fate of P2P: unpopular files are slow and disappear quickly. So, to achieve a stable experience like GitHub's, you'd likely need to deploy servers as permanent Peer Nodes that constantly host the actual files.
Something that maintains all these blob objects and provides a search interface in the process of collecting them would become the next-generation GitHub. Furthermore, depending on the quality of the data on the network, encryption might also be necessary.
In a centralized management case, you would implement blob storage using a high-speed KVS that can handle massive data, like Cloud BigTable, or query a P2P distributed file system.
Also, self-certifying files are a good fit as a backend for languages like Deno that resolve resources from the network. Since the hashes are hard to tamper with, a script once verified as safe cannot later be swapped out by a malicious user. In fact, I have had exactly that concern about Deno for a while. nest.land is a project by people with the same concern: it hosts code on a blockchain in a tamper-proof state.
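This is essentially the integrity-check idea behind lock files for network-resolved modules: pin a module to a content hash and refuse to run anything that no longer matches. A minimal sketch of the check (not Deno's actual implementation):

```python
import hashlib

def verify_module(source: bytes, expected_sha256: str) -> bytes:
    """Refuse to accept network-fetched code whose content hash does
    not match the pinned value: even if the server or URL is hijacked,
    tampered code is rejected before it ever runs."""
    actual = hashlib.sha256(source).hexdigest()
    if actual != expected_sha256:
        raise RuntimeError(f"integrity check failed: {actual}")
    return source

# Pin the module once, at the moment it was audited.
pinned = hashlib.sha256(b"export const x = 1;").hexdigest()

# Later fetches must match the pin exactly.
ok = verify_module(b"export const x = 1;", pinned)
```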
Thought Experiments are Fun
You realize that if you actually tried to build this, a massive amount would have to be built from scratch. It's exciting.
This alone doesn't solve issues like the lack of a definitive Git GUI tool mentioned earlier, but it could serve as the underlying foundation.
I'd love to build it if someone would fund it. I'm tired now, so I'll leave it at that.
Discussion
I believe that (sparse checkout) is part of Git's own commands, though, not GitHub's.