ResearchCode on Aug 26, 2023 | on: Beating GPT-4 on HumanEval with a fine-tuned CodeL...

The "intelligence" of large language models needs to be evaluated like the abilities of self-proclaimed psychics: you send your binary to an independent third party, who evaluates it on new problems. It's only a "Human eval" once.