Many companies stopped publishing their data sets after people published evidence that they amounted to mass copyright infringement, and they dropped the specifics of the pretraining data from their model cards.
Aside from licensing content outright, the fact that content creators don't like redistribution means a lawful model would probably only use Gutenberg's collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements the model trainers might violate.
Yeah, I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR; they even have a dedicated Reddit account for releasing unofficial info.
Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.
No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text: it was written for communication purposes, and the software (defined by its source code) processes that text in a particular way to train a model.
In programming, "source" and "asset" have specific meanings that conflict with how you used them.
Source is the input to some built artifact. It is the source of that artifact, as in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you're calling the "source" is analogous to the source of the compiler in traditional programming.
An asset is an artifact used as input that is preserved verbatim in the output. For example, a logo baked into an application to be rendered in the UI: compiling the program doesn't make a new logo, it just moves the asset into the built artifact.
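A toy sketch of that distinction under these definitions (the file names and build steps are made up purely for illustration): the source gets transformed into the artifact, while the asset is carried into it byte-for-byte.

    # Toy "build" illustrating source vs. asset as used above.
    # main.py, logo.png and the build/ directory are hypothetical.
    import os
    import py_compile
    import shutil

    os.makedirs("build", exist_ok=True)

    # Source: input the build transforms. The resulting bytecode is
    # derived from main.py, not a verbatim copy of it.
    py_compile.compile("main.py", cfile="build/main.pyc")

    # Asset: input carried into the artifact verbatim. The logo isn't
    # transformed, just copied byte-for-byte so the UI can render it.
    shutil.copyfile("logo.png", "build/logo.png")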
Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.
Imagine if the source code were written in a programming language whose basic syntax and semantics were known to no one but the original developers.
Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.
Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.
Raw training datasets similarly have some value: you can analyze them for different characteristics to understand why the trained model is under- or over-representing different concepts (rough sketch below).
But yes, real FOSS should be "open-build" and allow anyone to build a test-passing artifact from the raw source material.
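As a rough sketch of that kind of dataset inspection (the corpus/ path and the term list are assumptions, not any real dataset), you could count how often a few concepts show up in the raw training text to spot under- or over-representation:

    # Rough sketch: count occurrences of a few hypothetical "concepts"
    # across a raw training corpus to see which are under- or
    # over-represented. Paths and terms are illustrative assumptions.
    from collections import Counter
    from pathlib import Path
    import re

    CONCEPTS = ["python", "rust", "copyright", "license"]  # hypothetical terms

    counts = Counter()
    total_tokens = 0
    for path in Path("corpus").glob("*.txt"):  # assumed: one document per file
        tokens = re.findall(r"[a-z']+", path.read_text(errors="ignore").lower())
        total_tokens += len(tokens)
        for term in CONCEPTS:
            counts[term] += tokens.count(term)

    for term, n in counts.most_common():
        print(f"{term}: {n} hits ({n / max(total_tokens, 1):.6%} of tokens)")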
The source of a language model is the text it was trained on. Llama models are not open source (contrary to Meta's claims); they are open weight.