Are you planning to switch to programming-language-optimized models inside Phind, so that if a user asks for something related to Python, the Python-optimized model gets used?
If so:
Object Pascal is completely out of fashion, probably the least hyped language there is. However, there are hundreds of thousands of active users of Delphi, Free Pascal, and Lazarus. And because the language has been stable for over 20 years, there is also a gigantic amount of high-quality code available. Since most of it is on neither GitHub nor Stack Overflow, Pascal code is dramatically underrepresented in GPT-3.5 and GPT-4, and therefore also in Phind.
I'd like to finally be able to use AI-assisted programming with Pascal.
If you are interested in that, I would be willing to pay internally for the work to prepare a good dataset of high-quality code with comments/context/prompts.
If you are not interested, is there any chance you will release the code and toolchain used to fine-tune CodeLlama, so I could do it myself?
Yes, this is the direction we're heading towards! We're building a mixture of experts of different coding models that we will deploy for precisely this use case.
Why would it? Do you know how much it costs to fine-tune one of these models for such a niche language? I'm not just talking about the cost of training, but also the cost of acquiring data, because there's much less data about niche languages.
96 A100-hours per fine-tune, according to the article.
The cost of the dataset curation for a given language is hard to quantify, as there are many unknowns. However, it seems perfectly crowdsourceable to volunteers.
Where are the troves of Pascal code? Also manuals, books, etc. The quality doesn't have to be great. You can label and generate more data once you have enough to bootstrap the model.
Not much, I guess. It's basically writing some scripts that take the codebases of some of the available high-quality Pascal projects and then, depending on what is available, extract/merge documentation from PDF, PasDoc, RTF, .HLP, or method/function source-code comments.
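The comment-extraction part of such a script can be sketched in a few lines. This is a naive illustration, not one of the scripts mentioned above: it walks a project tree, pulls out Pascal's three comment forms ({ }, (* *), //), and does not handle comment markers that appear inside string literals.

```python
import re
from pathlib import Path

# Pascal comment syntaxes: { ... }, (* ... *), and // to end of line.
# Naive: comment markers inside string literals are not skipped.
COMMENT_RE = re.compile(
    r"\{[^}]*\}"          # { brace comments }
    r"|\(\*.*?\*\)"       # (* parenthesis-star comments *)
    r"|//[^\n]*",         # // line comments
    re.DOTALL,
)

def extract_comments(source: str) -> list[str]:
    """Return all comment bodies found in a chunk of Pascal source."""
    comments = []
    for match in COMMENT_RE.finditer(source):
        text = match.group(0)
        # Strip the comment delimiters themselves.
        if text.startswith("{"):
            text = text[1:-1]
        elif text.startswith("(*"):
            text = text[2:-2]
        else:
            text = text[2:]
        text = text.strip()
        if text:
            comments.append(text)
    return comments

def harvest(project_dir: str) -> dict[str, list[str]]:
    """Map each .pas file under project_dir to its extracted comments."""
    return {
        str(path): extract_comments(path.read_text(errors="ignore"))
        for path in Path(project_dir).rglob("*.pas")
    }
```

Pairing the comments with the declarations they precede, and merging in PDF/PasDoc/RTF documentation, is the larger share of the work.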
I would assume that one of my devs could write the needed scripts in three weeks or so.
So, basically a budget of <$5000.
For me, due to my lack of expertise, the actual challenge would be getting a sample of how the training data should optimally look (for example, the Python training set), and finding someone to do the actual training. For a newbie, getting up to the required level of competence will surely take more than three weeks.
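For what it's worth, Phind has not published its exact CodeLlama fine-tuning format, but common open code-instruct datasets use instruction/response pairs serialized as JSON Lines (one JSON object per line). A hypothetical record, assuming that shape, might look like this:

```python
import json

# Hypothetical record, modeled on common instruction-tuning datasets.
# The field names and the overall shape are assumptions; Phind's actual
# training-data format is not public.
record = {
    "instruction": "Write a Free Pascal function that reverses a string.",
    "response": (
        "function Reverse(const S: string): string;\n"
        "var\n"
        "  I: Integer;\n"
        "begin\n"
        "  SetLength(Result, Length(S));\n"
        "  for I := 1 to Length(S) do\n"
        "    Result[I] := S[Length(S) - I + 1];\n"
        "end;"
    ),
}

# JSONL convention: one record per line, appended to the dataset file.
line = json.dumps(record)
```

The hard part is producing good instruction text for existing code, which is where the extracted comments and documentation would come in.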