Many companies stopped publishing their data sets after people published evidence that they amounted to mass copyright infringement, and they dropped the specifics of the pretraining data from their model cards.
Aside from licensing content outright, the fact that content creators don't like redistribution means a lawful model would probably only use Gutenberg's collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements the model trainers might violate.
Yeah, I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR; they even have a dedicated Reddit account for releasing unofficial info.
Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.
No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text: it was written for communication purposes, and the software (defined by its source code) processes that text in a particular way to train a model.
In programming, "source" and "asset" have specific meanings that conflict with how you used them.
Source is the input to some built artifact. It is the source of that artifact, as in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you're calling the "source" is analogous to the source of the compiler in traditional programming.
An asset is an artifact used as input that is preserved verbatim in the output. For example, a logo baked into an application to be rendered in the UI: compiling the program doesn't make a new logo, it just moves the asset into the built artifact.
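A toy sketch of that distinction under these definitions (the file names and build steps are made up purely for illustration): the source gets transformed into the artifact, while the asset is carried into it byte-for-byte.

    # Toy "build" illustrating source vs. asset as used above.
    # main.py, logo.png and the build/ directory are hypothetical.
    import os
    import py_compile
    import shutil

    os.makedirs("build", exist_ok=True)

    # Source: input the build transforms. The resulting bytecode is
    # derived from main.py, not a verbatim copy of it.
    py_compile.compile("main.py", cfile="build/main.pyc")

    # Asset: input carried into the artifact verbatim. The logo isn't
    # transformed, just copied byte-for-byte so the UI can render it.
    shutil.copyfile("logo.png", "build/logo.png")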
Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.
Imagine if the source code were written in a programming language whose basic syntax and semantics were known to no one but the original developers.
Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.
Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.
Raw training datasets similarly have some value: you can analyze them for different characteristics to understand why the trained model is under- or over-representing different concepts (rough sketch below).
But yes, real FOSS should be "open-build" and allow anyone to build a test-passing artifact from the raw source material.
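As a rough sketch of that kind of dataset inspection (the corpus/ path and the term list are assumptions, not any real dataset), you could count how often a few concepts show up in the raw training text to spot under- or over-representation:

    # Rough sketch: count occurrences of a few hypothetical "concepts"
    # across a raw training corpus to see which are under- or
    # over-represented. Paths and terms are illustrative assumptions.
    from collections import Counter
    from pathlib import Path
    import re

    CONCEPTS = ["python", "rust", "copyright", "license"]  # hypothetical terms

    counts = Counter()
    total_tokens = 0
    for path in Path("corpus").glob("*.txt"):  # assumed: one document per file
        tokens = re.findall(r"[a-z']+", path.read_text(errors="ignore").lower())
        total_tokens += len(tokens)
        for term in CONCEPTS:
            counts[term] += tokens.count(term)

    for term, n in counts.most_common():
        print(f"{term}: {n} hits ({n / max(total_tokens, 1):.6%} of tokens)")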
The source of a language model is the text it was trained on. Llama models are not open source (contrary to Meta's claims); they are open weight.