What kind of hardware does it take to fine-tune such a model? I've been daydreaming about self-play with these models. Generate 100 LeetCode-style problems. One variant writes test cases, the other writes code. Give the tester a high loss when bugs slip past its tests (perhaps measured by running a fuzzer against the program, or by having multiple tester variants generate large test suites). The coder's loss would depend on the number of failed test cases, the number of retries, or something similar.
Then just keep running that in a loop: generate, test, develop.
Is it possible to self-train like this, or at least attempt it, on commodity hardware? Could I make a meaningful change to a 34B-parameter model if I had one or two 4090s (or something at a similar price point of a few thousand dollars)?
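For a sense of scale, here is a back-of-envelope memory estimate. The assumptions are loud ones: it counts only weights and optimizer state and ignores activations and framework overhead, it uses the common rule of thumb of about 16 bytes per parameter for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and two fp32 moment buffers), and it takes 24 GB as the memory of a single 4090. The helper name `weights_gib` is made up for this sketch.

```python
GIB = 2**30
PARAMS = 34e9  # a 34B-parameter model


def weights_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold `params` parameters at the given width, in GiB."""
    return params * bytes_per_param / GIB


fp16_weights = weights_gib(PARAMS, 2)      # weights only, 16-bit
int4_weights = weights_gib(PARAMS, 0.5)    # weights only, 4-bit quantized
full_finetune = weights_gib(PARAMS, 16)    # rule-of-thumb Adam mixed precision

two_4090s_gib = 2 * 24e9 / GIB             # assumed 24 GB per card

for label, need in [
    ("fp16 weights", fp16_weights),
    ("4-bit weights", int4_weights),
    ("full fine-tune (weights + grads + Adam)", full_finetune),
]:
    fits = "fits" if need <= two_4090s_gib else "does not fit"
    print(f"{label}: {need:.1f} GiB, {fits} in two 4090s ({two_4090s_gib:.1f} GiB)")
```

Under these assumptions, full fine-tuning is far out of reach for two cards, while 4-bit weights fit with room to spare. That suggests the question is really about parameter-efficient approaches (for example, training small adapter weights on top of a quantized base model), which this estimate does not cover.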