The next generation of AI models is meant to be trained by people paid to have conversations with them, but several of thes…
Workers tasked with generating conversational data for training new AI models are reportedly outsourcing the work to existing large language models (LLMs) like GPT-3.5 or GPT-4.
This practice, if widespread, risks creating a feedback loop in which AI models are trained on data generated by other AI models, potentially amplifying biases and narrowing the diversity of thought and experience that new systems can represent. Such a loop could degrade the quality and originality of future AI capabilities, from more nuanced chatbots to advanced reasoning systems, and it raises questions about the integrity of the training data behind models like Meta's Llama 3 or Google's Gemini.
Future work should focus on verifying that training input is genuinely human and on testing whether AI-generated data actually improves model robustness. Comparing the benchmark performance of models trained on potentially "inbred" data with that of models trained on demonstrably human-curated datasets will be crucial to understanding the long-term consequences.