Git Was Never Built for Machines – And Yet, It Became Their Library


When Linus Torvalds created Git in 2005, he solved a very human problem: how do distributed teams of engineers coordinate changes to a shared codebase without stepping on each other’s feet? The design was elegant, the model brilliant – and entirely anthropocentric. Every concept in Git, from commits to branches to pull requests, was designed to reflect human reasoning about software change.

Nobody thought about machines. Nobody had to.

GitHub today: 420 million repositories – the world’s largest unintentional AI training dataset, built by humans for humans, read by machines for everything.

The Unintended Consequence of 20 Years of Open Source

Fast forward to today. GitHub hosts over 420 million repositories. The platform has become, without anyone planning it that way, the single largest structured dataset of human reasoning about software in existence. Not raw text, not unstructured web crawls – but versioned, annotated, context-rich artifacts of how engineers think, communicate, decide, and build.

When OpenAI trained Codex, when DeepMind built AlphaCode, when Anthropic trained Claude – they all fed on this corpus. The commit messages, the PR discussions, the inline comments, the issue threads, the README files. Git history is, quite literally, the autobiography of software development, and AI learned to read it before most organizations realized what they were giving away.

The irony is profound. A version control system designed for human collaboration became the substrate for training non-human intelligence. Git was the library. GitHub was the librarian. And the AI models were the quiet students who read everything, remembered everything, and said nothing.

What Does a Machine Actually Learn from a Repository?

This is where it gets interesting – and where most CTO conversations I have still miss the point. The common assumption is that AI learns “code” from GitHub. That is technically true but deeply incomplete. What AI systems actually absorb is something far richer: the relationship between intent and implementation.

A commit message that reads “fix edge case in authentication flow when user has expired token” paired with a fifteen-line diff teaches a model not just syntax, but causality. The issue thread that preceded it, the review comments that shaped it, the test that was added afterward – together, they represent a chain of engineering reasoning that no textbook ever captured at scale.
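To make that concrete, here is a minimal sketch of how such (intent, implementation) pairs can be lifted straight out of a repository's history. The git commands are standard; the pairing logic and the commit limit are purely illustrative, not how any particular lab built its training pipeline.

```python
import subprocess

def commit_intent_pairs(repo_path: str, max_commits: int = 100):
    """Yield (commit message, diff) pairs from a repository's history."""
    # Recent commit hashes, newest first.
    hashes = subprocess.run(
        ["git", "-C", repo_path, "rev-list", f"--max-count={max_commits}", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for commit in hashes:
        # The full commit message: the stated intent.
        message = subprocess.run(
            ["git", "-C", repo_path, "log", "-1", "--pretty=%B", commit],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # The diff introduced by the commit: the implementation.
        diff = subprocess.run(
            ["git", "-C", repo_path, "show", "--pretty=format:", "--patch", commit],
            capture_output=True, text=True, check=True,
        ).stdout
        yield message, diff
```

Each pair is exactly the unit described above: a statement of intent and the code change that realized it, already versioned and already aligned.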

This is fundamentally different from learning from documentation. Documentation describes what code does. Repositories reveal how engineers think about what code should do, why it changes, and under what conditions decisions get revisited. The difference is the difference between reading a law and watching a courtroom.

The Next Evolution: Repos as AI-Native Artifacts

Here is where we stand at an inflection point that most engineering organizations have not yet grasped. If repositories are already the de facto training substrate for AI, the logical next step is to design repositories with that fact in mind. Not reactively, not accidentally – but intentionally.

In the coming years, we will see repositories evolve from pure version control archives into structured knowledge bases that explicitly address AI agents as consumers. This means several concrete developments that are already beginning to emerge in practice.

The first is the emergence of a dedicated specification layer. Today, most codebases contain implicit intent buried in comments, naming conventions, and tribal knowledge that lives only in the heads of the engineers who wrote it. Tomorrow’s repositories will carry explicit machine-readable specifications – linked directly from the code they describe – that articulate not just what a module does, but what it is trying to achieve and what constraints govern its evolution. Formats like OpenSpec and similar frameworks are early experiments in this direction.
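As a rough illustration, such a specification entry might carry something like the following alongside each module. The field names are hypothetical, not taken from OpenSpec or any existing format; the point is the kind of intent and constraints the layer would make explicit.

```python
from dataclasses import dataclass, field

@dataclass
class ModuleSpec:
    """Hypothetical shape of a machine-readable module specification."""
    module: str                # path of the code this spec describes
    purpose: str               # what the module is trying to achieve
    invariants: list[str] = field(default_factory=list)  # constraints governing its evolution
    non_goals: list[str] = field(default_factory=list)   # explicitly out of scope

# Illustrative example; the module path and statements are invented.
session_spec = ModuleSpec(
    module="src/auth/session.py",
    purpose="Issue and validate short-lived session tokens.",
    invariants=[
        "Expired tokens are never accepted.",
        "Validation must not call external services.",
    ],
    non_goals=["Long-lived API keys"],
)
```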

The second shift is what might be called the Intent Layer. Beyond the specification of individual components, future repositories will carry structured metadata that describes the reasoning behind architectural decisions. Why was this approach chosen over the alternatives? What trade-offs were consciously accepted? What assumptions does this design rely on? This is the kind of context that AI agents need to reason correctly about a codebase – not just to generate code that compiles, but to generate code that fits.
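Sketched in the same spirit, one entry in such an intent layer could look like this. Again, the structure is hypothetical rather than a published standard; what matters is that the rationale, the rejected alternatives, and the assumptions become queryable data rather than tribal memory.

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """Hypothetical intent-layer entry: the reasoning behind one architectural decision."""
    decision: str
    alternatives_considered: list[str]
    trade_offs_accepted: list[str]
    assumptions: list[str]
    affects: list[str]  # modules or specs this decision constrains

# Illustrative example; the decision and modules are invented.
token_store_decision = DecisionRecord(
    decision="Keep session tokens in an in-memory store rather than the primary database.",
    alternatives_considered=["relational table", "in-process cache"],
    trade_offs_accepted=["extra operational dependency", "revocation is eventually consistent"],
    assumptions=["token writes dominate reads at peak load"],
    affects=["src/auth/session.py"],
)
```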

The third development is the rise of agent-aware commit protocols. If AI agents are both reading and writing to repositories – which they already are in many of our development pipelines today – the commit structure itself needs to evolve. Automated commits should carry provenance metadata: which model generated this, from which specification, against which test harness, with what confidence. Human commits will increasingly need corresponding context flags that distinguish deliberate design choices from pragmatic workarounds.
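Git's trailer format (Key: value lines at the end of a commit message) is one natural home for such provenance. The trailer names below are hypothetical examples rather than an existing convention, and the parser is a minimal sketch of how an agent or a CI check could read them back out.

```python
# Hypothetical provenance trailer names; git only standardizes the
# "Key: value" trailer format itself, not these specific keys.
PROVENANCE_KEYS = {"Generated-By", "Spec-Ref", "Test-Harness", "Confidence"}

def parse_provenance_trailers(commit_message: str) -> dict[str, str]:
    """Extract provenance trailers from a commit message."""
    provenance = {}
    for line in commit_message.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in PROVENANCE_KEYS:
            provenance[key.strip()] = value.strip()
    return provenance

example = """Handle expired tokens in session refresh

Generated-By: code-model-x (hypothetical)
Spec-Ref: specs/auth/session.yaml#token-expiry
Test-Harness: tests/auth/test_session.py
Confidence: 0.83
"""
print(parse_provenance_trailers(example))
```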

The Strategic Implication Nobody Is Talking About

There is a competitive dimension here that deserves more executive attention than it currently receives. Organizations that deliberately enrich their repositories with machine-readable intent and specification data are not just improving their own AI-assisted development workflows. They are producing higher-quality training data for the next generation of models. If open-source development continues to feed the pre-training pipelines of frontier AI systems, then the quality of the reasoning encoded in public repositories will shape the quality of the AI that the entire industry relies on.

This creates an asymmetry: companies that treat their internal codebases as structured knowledge assets – not just as source code archives – will build internal AI capabilities that reflect higher-quality reasoning. The gap between organizations that have thought seriously about repository architecture as an AI substrate and those that have not will become a measurable capability difference within this decade.

Git was never built for machines. But machines have made themselves at home, and now the question is whether we redesign the house accordingly.

The answer, for any organization serious about AI-driven development, is yes. And the time to start is now – not when the next model generation arrives, but before it does, while there is still time to shape what it learns from you.


This post is part of a series on the structural shifts in software development brought on by AI integration. Previous entries covered specification-driven development and multi-agent orchestration in enterprise codebases.

Why Standards Matter


Do we really need standards in infrastructure?

Mandelbrot by Frax for iPad

For a couple of years now I have been discussing standards with CIOs and other technology leaders at global companies. The introduction of the SAN as a protocol standard enabled the largest consolidation and optimization wave in the data center, starting in the early 2000s and continuing today. One reason it worked is that the market had matured and there was a shared understanding that there is no benefit in defining a "company" standard.
We saw the same pattern in the telco industry: for years companies defined their own phone infrastructure, and it has now shifted completely to a VoIP-driven model with considerable benefits. The same is happening in the data center with the appearance of converged infrastructure. The approach is not yet mature enough for a single standard, so infrastructure manufacturers and VARs each define their own way. This has major impacts at customer sites, because the next-generation converged model will either be shaped by organizations like SNIA and open-stack initiatives, or, more likely, take a 3rd-platform approach that may simply be called cloud.
Whatever the next years bring, the key is that decisions are driven not only by price but by TCO and the future of the proposed architecture.
It is clear that innovation cycles will accelerate and that the business demand for more flexibility will drive different infrastructure needs.

What happened on the manufacturer side?

"Keep it simple" is the demand set by the creators of the 3rd platform. In global enterprises, however, IT still runs on the 1st or, mostly, the 2nd platform, and this has to be taken into consideration. Associations like SNIA take this as their basis and define standards that bridge the new, innovative requirements and the IT requirements of current business demand. Industry standards take a while to establish, and in the meantime manufacturers develop their own "standards" to keep customers close. In the early stages of a new technology this may be very convenient, but as the approach matures, a move to industry-wide adoption becomes necessary.
The same is happening right now within SNIA. The standards around the traditional SAN are defined, and management capabilities like SMI-S are defined and adopted. Newer areas like Big Data, object storage, analytics, and flash are still in the definition phase, and manufacturers define their own strategies and APIs. SNIA is working closely with vendors to give the industry better shape here. With many startups still appearing and plenty of innovation happening, industry standardization in these areas is only at its starting point, but customers would be wise to ask for certified products to make TCO and technology adoption more efficient, serving not only the cost model but also the new demands of their business.
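The practical payoff of a management standard like SMI-S is that one client can talk to arrays from different vendors. A minimal sketch using the open-source pywbem library might look like the following; the provider address and credentials are placeholders, and the interop namespace name can vary by vendor.

```python
import pywbem

# SMI-S providers expose storage systems over CIM/WBEM. The URL and
# credentials are placeholders for a vendor's SMI-S provider.
conn = pywbem.WBEMConnection(
    "https://smis-provider.example.com:5989",
    creds=("admin", "password"),
    default_namespace="interop",  # some providers use root/interop or similar
)

# Enumerate the registered SMI-S profiles the provider supports,
# regardless of which vendor built the underlying array.
for profile in conn.EnumerateInstances("CIM_RegisteredProfile"):
    print(profile["RegisteredName"], profile["RegisteredVersion"])
```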


Software-Defined Networks: The Last Mile for the Infrastructure


Technologies like storage virtualization and server virtualization have enabled new topologies and cost-effective management of resources. In recent history this has driven a disruptive change in IT operations at most companies. The last missing piece was network virtualization.

In the past, networks were always defined through protocols. These helped drive implementations into silicon, as had happened with processors a decade earlier. The other effect was that innovation was hindered, since everything had to be aligned with the protocol standard.

With the definition of SDN, however, the protocol stack is hidden behind the topology, and new ways of networking can be achieved. As with storage virtualization, innovation came along with it.


In the coming years, arguments over the various concepts of virtualization will arise, and there will be more than one champion, but with the acquisition of Nicira, VMware will lead the way.

EMC and VMware now have all the components to enable enterprise customers and SIs to deploy their own ITaaS stack ((I,P,S,M)aaS), drive the private cloud, and enable the hybrid approach.

With all this technology, the key is to enable the business for more agility and diversification, faster GTM, and more advanced offerings without dependence on IT; we tend to call this Big Data.

SDN opens up a new dimension for better data coverage, RESTful services, and more integration of concepts without dependency on hardware.
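As a rough illustration of that hardware independence, here is a sketch of programming a network segment through a controller's northbound REST API with Python's requests library. The controller URL, endpoint, and payload are hypothetical and do not represent any specific product's API.

```python
import requests

CONTROLLER = "https://sdn-controller.example.com/api/v1"  # hypothetical controller

# Describe the desired network in data; the controller maps it onto
# whatever physical switches happen to sit underneath.
segment = {
    "name": "app-tier",
    "subnet": "10.20.30.0/24",
    "qos_profile": "low-latency",
}

resp = requests.post(
    f"{CONTROLLER}/segments",
    json=segment,
    auth=("admin", "password"),  # placeholder credentials
    timeout=10,
)
resp.raise_for_status()
print("Segment created:", resp.json())
```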

Enterprises can now continue the journey and leverage infrastructure resources in a more cost-efficient and agile way than was possible in the past. For VMware, it is the right step towards a true cloud OS.
