Show Your Work

The AI industry has a fundamental transparency problem that undermines the legitimacy of every model being shipped today, including the ones I use daily.

The basic shape of the problem: companies are scraping the internet -- books, articles, code, art, forum posts, personal blogs, medical records, legal filings, and everything else they can get their hands on -- feeding it into training pipelines, producing models, and then selling access to those models without disclosing what went in. We don't know what's in the training data, whose work was used, what biases were encoded or amplified or laundered through scale. We don't know because the companies doing this work have decided we don't need to know.

For someone whose contributions to open source and open standards are fundamentally gated by intellectual property concerns, this presents... a challenge. Not to mention the fact that it's shady as fuck.

The binary problem

Before open source gained real traction, software was distributed as compiled binaries. You got the executable but not the source. You couldn't inspect it or verify what it actually did. You just had to trust the vendor. The entire open source movement has been, at its core, a rejection of that arrangement, a declaration that users have the right to see how the tools they depend on actually work.

"Open source AI" as currently practiced is a repetition of this binary problem. When a company releases "open" model weights, they're giving you the compiled binary. The weights are the output of a training process -- they are not the process itself. You can run the model and fine-tune it, but you can't meaningfully audit it because you can't see what went in.

This is not open source in any sense that the term has historically meant. Open source means you can read the source. For a model, the "source" isn't the architecture or the weights -- it's the training data, the data curation decisions, the filtering criteria, the augmentation pipeline, the RLHF process, and the evaluation benchmarks.

What's actually in the box

When a model produces output that's biased (which is always), there's currently no way to trace that bias back to its origin. You can measure it in the output and try to patch it with guardrails, but you can't fix the root cause because you can't see it. It's like debugging a production system with no logs and no access to the source.

The intellectual property question is similarly opaque. A model regurgitates passages from copyrighted books or generates code suspiciously similar to a specific open source project, and there's no way to verify whether that material was in the training set. The creators can claim fair use or transformative use, but they can't / won't show their work. The legal arguments are being made in a factual vacuum, and that vacuum exists by design.

The same opacity makes it impossible to answer basic fitness-for-purpose questions. Is this model safe to use in a medical context? A legal one? We can't know, because we don't know what it was trained on. We're deploying systems with unknown provenance into high-stakes environments and calling it innovation.

An AI model spitting out medical advice in a world where RFK exists and the model may have been trained on his statements? Um, yeah, no thank you.

This is not a new problem

The food industry went through a version of this. There was a time when you genuinely didn't know what was in processed food. The response, eventually, was mandatory ingredient labeling, nutrition facts, allergen warnings, and supply chain traceability requirements. Companies didn't volunteer this transparency -- it was imposed because public health demanded it.

The pharmaceutical industry went further. You can't sell a drug without disclosing its composition, documenting how it was tested, publishing the results of clinical trials, and maintaining an auditable chain of evidence from lab to patient. This isn't because pharmaceutical companies are more ethical than tech companies. It's because regulation forced the issue.

The AI industry is currently operating in the pre-regulation window, and in a regulation-hostile political and economic environment. The behavior is predictable. Scrape first, negotiate later. Claim fair use while keeping the evidence that would prove or disprove the claim locked in a vault. The pattern is old and boring. Billionaires deciding what we need to know based on how much it might cost them to tell us.

What an ethical framework actually requires

Start with data provenance. Every piece of training data needs a traceable origin. This is not a vague gesture at "data from the internet" but an actual manifest of what was included, where it came from, under what license or consent mechanism, and when it was collected. This is the equivalent of a software bill of materials. It's a solved problem in other industries. It's not technically impossible; it's commercially and economically inconvenient. Too bad.
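To make that less abstract, here is a rough sketch of what one manifest entry could look like. It is not any existing standard: the field names, the consent values, and the `missing_provenance` helper are all made up for illustration, and a real manifest would need a far richer schema. The point is only that the four questions above (what, where from, under what terms, when) fit in a handful of fields.

```python
from dataclasses import dataclass, asdict
from datetime import date
import hashlib
import json


@dataclass(frozen=True)
class ManifestEntry:
    """One training-data item. Every field name here is hypothetical."""
    content_sha256: str  # what was included (hash of the exact bytes)
    source_url: str      # where it came from
    license: str         # license identifier, or "unknown"
    consent: str         # illustrative values: "opt-in", "licensed", "none"
    collected_on: str    # ISO 8601 date of collection


def make_entry(content: bytes, source_url: str, license: str,
               consent: str, collected_on: date) -> ManifestEntry:
    """Build an entry, hashing the content so it can be verified later."""
    return ManifestEntry(
        content_sha256=hashlib.sha256(content).hexdigest(),
        source_url=source_url,
        license=license,
        consent=consent,
        collected_on=collected_on.isoformat(),
    )


def missing_provenance(entries):
    """Return entries with no known license or no recorded consent."""
    return [e for e in entries
            if e.license == "unknown" or e.consent == "none"]


# Placeholder data; example.com URLs stand in for real sources.
entries = [
    make_entry(b"example essay text", "https://example.com/essay",
               "CC-BY-4.0", "licensed", date(2024, 1, 15)),
    make_entry(b"scraped forum post", "https://example.com/forum/1",
               "unknown", "none", date(2024, 1, 16)),
]

# The manifest itself is just serializable records.
print(json.dumps([asdict(e) for e in entries], indent=2))
print(len(missing_provenance(entries)), "entries lack license or consent")
```

Hashing the content instead of storing it means a creator can check whether their work is in the set without the manifest itself redistributing that work.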

The framework also has to address consent. Most people whose work is being used for training have no idea it's happening, have no mechanism to object, and have no recourse if they do. The current default is nominally opt-out, and even that barely works. Moving toward informed consent doesn't have to mean strict opt-in for every individual data point -- there are reasonable intermediate positions involving licensing frameworks, collective bargaining for creators, and clear fair use boundaries. But "we took it because it was on the internet" isn't a consent framework. It's the absence of one.

(Btw, this is not theoretical. I've had Opus regurgitate my own Node.js performance workshop material from years ago back to me when performing code reviews on Node.js pull requests. I know my material was used in the creation of the training set without my knowledge because it spat out a word-for-word transcription of what I wrote about six years ago and I know I never explicitly gave Anthropic my consent.)

Then there's bias. Training data isn't a neutral sample of human knowledge. It's a sample of what's written in dominant languages, hosted on accessible platforms, created by people with internet access and the time to write. That sample encodes specific cultural assumptions and demographic skews. Any serious framework requires disclosure of dataset composition and independent auditing of how that composition manifests in model behavior.

And compensation. If a model generates commercial value from training on someone's creative or intellectual work, there needs to be a mechanism for payment. AI training is not so special that it deserves an exemption from the basic principle that using someone's work for profit requires their participation in that profit.

Open source training sets

The most important structural need is genuinely open source training sets -- inspectable, auditable, forkable, improvable -- with clear provenance, explicit licensing, and documented composition. This should be non-negotiable. A training set should not be considered a trade secret any more than the list of ingredients in my breakfast burrito.

The model for this already exists. It's how we built Linux and Apache and Node.js: a community maintains a shared resource with clear licensing, contributors opt in, anyone can inspect the contents and propose changes, and the resource improves through collective effort rather than corporate secrecy.

Granted, this is genuinely hard. Training datasets are enormous and curation is expensive. A fully open, ethically sourced training set will almost certainly be smaller than one built by scraping the entire internet without permission, and a model trained on that data may be less capable in some dimensions. That's a real cost. But it's the same kind of cost we accepted when we decided software should be built on licensed code rather than pirated code, or that drugs should be tested before being sold. Capability doesn't justify opacity. Some costs are necessary and I'm quite sure the Billionaire Caste can afford it.

The tensions

Transparency is a competitive disadvantage under current market conditions. Companies that invest in ethical data sourcing will move slower and spend more than companies that scrape without asking. This doesn't resolve itself without regulation, and history suggests regulation is a more reliable mechanism than voluntary disclosure.

The definition of "open source" in AI is genuinely contested, and not always in bad faith. The Open Source Initiative's definition was written for software, and it doesn't map cleanly onto models. That said, you can't meaningfully call something open source if you can't inspect what produced it. Yes there are disagreements around this but those who say otherwise are just Wrong.

Retroactive accountability is messy too. The models that already exist were trained on data that was already scraped. You can't un-train them and you can't retroactively get consent. The framework I'm describing is forward-looking and won't fix what's already been done. You don't stop requiring ingredient labels just because unlabeled food was sold in the past, but it's worth acknowledging the limitation.

I use these models daily, for real work. The practical utility is enormous, and I'm not arguing that we should stop. I'm arguing that we should demand better from the companies building them, and that "better" starts with them showing their work.

Where this goes

The problems are structural and the incentives are misaligned. The regulatory landscape and political climate are years behind the technology. None of that changes the fact that transparency, consent, and accountability are baseline expectations we apply to every other industry that uses other people's work to generate profit. The AI industry isn't special enough to deserve an exemption.

The open source movement proved that transparency and community stewardship can produce software that's more trustworthy than proprietary alternatives. The same principle applies to training data. We need open, auditable, ethically sourced training sets, and we need an industry willing to treat data provenance as a feature rather than a liability.