They could release the code that gathers and curates the data, which would give a reproducible way to rebuild the pre-training set. And presumably they own the post-training RLHF stuff, so they could open that too.
Without those, you're locked in to them for the licensing of future versions.