AI Weekly: The challenges of creating open source AI training datasets

In January, AI research lab OpenAI released Dall-E, a machine learning system capable of creating images to fit any text caption. Given a prompt, Dall-E generates images for a range of concepts, including cats, logos, and glasses.

The results are impressive, but training Dall-E required building a large-scale dataset that OpenAI has so far opted not to make public. Work is ongoing on an open source implementation, but according to Connor Leahy, one of the data scientists behind the effort, development has stalled because of the challenges in compiling a corpus that respects both moral and legal norms.

“There’s plenty of not-legal-to-scrape data floating around that isn’t [fair use] on platforms like social

