"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."
> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available
The M in LLM is for "Model".
The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard, generally available tools not custom-created for that product).
Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.
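To make the harness/model distinction concrete, here's a toy sketch in plain PyTorch (every name here is hypothetical, not anyone's actual release). The `query` function is the harness: it only runs a finished artifact. The `build` function plus its data inputs are the closest analogue to "source" in the OSD sense, because modifying them and rebuilding is how you get a different model.

```python
import torch
import torch.nn as nn

def build(corpus: list[tuple[torch.Tensor, torch.Tensor]]) -> nn.Module:
    """'Source' in the OSD sense: the inputs (data) plus the code that,
    when run, produces the weights. Change these, rebuild, get a new model."""
    model = nn.Linear(4, 1)                       # stand-in for the architecture
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for x, y in corpus:                           # the training data is a build input
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model                                  # the weights are the build *output*

def query(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """The harness: runs the artifact. Having only this plus the weights
    is like having a loader for someone else's compiled binary."""
    with torch.no_grad():
        return model(x)

corpus = [(torch.randn(4), torch.randn(1)) for _ in range(100)]
model = build(corpus)
print(query(model, torch.randn(4)))
```

Releasing only `query` and the trained weights is analogous to shipping a program loader and a binary.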
I still don't quite follow. If Meta were to provide all the code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers, how is that not open source?
> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code; it can't be read or interpreted by a human.
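To illustrate what I mean, here's a minimal sketch (assuming a PyTorch checkpoint; the file name is hypothetical). All you can inspect in a released model is tensors of floats, nothing like the "preferred form for modification" the definition asks for:

```python
import torch

# "model.bin" is a hypothetical checkpoint path; any released weights file looks the same.
state = torch.load("model.bin", map_location="cpu")
for name, tensor in list(state.items())[:3]:
    print(name, tuple(tensor.shape))
    print(tensor.flatten()[:5])  # just floats: no identifiers, control flow,
                                 # or comments to read or edit, unlike a .c file
```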
We may be saying the same thing here; I'm not quite sure whether you're saying the model itself must be available or that what's missing is the code to train your own model.
I'm not the person you replied to directly, so I can't speak for them, but I did start this thread, and I want to clarify what I meant in my OP, because I see a lot of people misinterpreting it.
I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there: we could require or encourage smaller creators to release their training data too, and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.
Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to have everyone hide it and treat it as an open secret. We should either change the law or apply it equally.
But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.
I also don't mean that the license these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, the license alone would prevent them from being open source. (Rather, they would be "source available" models.)
What I mean is "The trained model is functionally obfuscated code; it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet), so there is no source to open.
I think it is okay to acknowledge when something is technically infeasible, and then proceed not to claim to have done that technically infeasible thing. I don't think the best response to that situation is to use it as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source this way is dangerous for two reasons.
The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed under them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the Linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?
The second is that convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source it will push the community to put more effort into distributing and tuning weights, and less into actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become more sophisticated and more deeply integrated into our lives and production processes, the danger of models being optimized toward something other than what we actually want them optimized toward grows.
"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."
> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available
The M in LLM is for "Model".
The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify to inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).
Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.