Imagine that. A black box where it isn’t clear that reliability testing is done, yet they still have the gall to charge big bucks for it. You are paying for bad data!
This isn’t a matter of opinion. It is the result of testing the models on a basic mathematical question with a fixed, verifiable answer.
From the article:
Over the course of the study researchers found that in March GPT-4 was able to correctly identify that the number 17077 is a prime number 97.6% of the times it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. Meanwhile, the GPT-3.5 model had virtually the opposite trajectory. The March version got the answer to the same question right just 7.4% of the time — while the June version was consistently right, answering correctly 86.8% of the time. Similarly varying results happened when the researchers asked the models to write code and to do a visual reasoning test that asked the technology to predict the next figure in a pattern.
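What makes this result so damning is that the ground truth never changed. Whether 17077 is prime is trivially checkable with a few lines of trial division (plain Python, nothing to do with either model):

```python
def is_prime(n: int) -> bool:
    """Trial division: test odd divisors up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: no divisor exists up to sqrt(17077) ≈ 130.7
```

A dozen lines of code settle the question permanently, yet GPT-4's accuracy on it swung from 97.6% to 2.4% in three months.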
I’m not impressed. What are we paying for again?