Destination

2025-06-06

Researchers build massive AI training dataset using only openly licensed sources

Illustration: Data tiles (diagrams, %, figures) as a mountain, network symbol at the top; symbolizes networked data analysis.


The Common Pile is the first large-scale text dataset built entirely from openly licensed sources, offering an alternative to web data restricted by copyright.


We found items similar to what you are looking for. Check out our suggestions below.

Destination

2025-04-17

Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact that AI crawlers — bots that are scraping text and multimedia from the encyclopedia to train generative artificial intelligence models — have been hav [...]

Match Score: 112.79

venturebeat

2025-10-02

New AI training method creates powerful software agents with just 78 examples

A new study by Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) shows that training large language models (LLMs) for complex, autonomous tasks does not require massive datasets. [...]

Match Score: 112.24

Destination

2025-10-07

TOUCAN is the largest open training dataset for AI agents

A research team from MIT, IBM, and the University of Washington has released TOUCAN, the largest open dataset to date for training AI agents. The dataset contains 1.5 million real tool interactions, a [...]

Match Score: 71.43

venturebeat

2025-09-30

Meta’s new CWM model learns how code works, not just what it looks like

Meta’s AI research team has released a new large language model (LLM) for coding that enhances code understanding by learning not only what code looks like, but also what it does when executed. The [...]

Match Score: 60.35

Destination

2025-07-10

How exactly did Grok go full 'MechaHitler?'

Earlier this week, Grok, X's built-in chatbot, took a hard turn toward antisemitism following a recent update. Amid unprompted, hateful rhetoric against Jews, it even began referring to itself as [...]

Match Score: 60.28

venturebeat

2025-10-01

Thinking Machines' first official product is here: meet Tinker, an API for distributed LLM fine-tuning

Thinking Machines, the AI startup founded earlier this year by former OpenAI CTO Mira Murati, has launched its first product: Tinker, a Python-based API designed to make large language model (LLM) fin [...]

Match Score: 56.83

Destination

2025-06-05

It turns out you can train AI models without copyrighted material

AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model th [...]

Match Score: 49.31

engadget

2025-10-01

Peloton updates its Bike, Tread and Row machines with form-checking cameras, rotating screens and lots of AI

It’s been a rough time for Peloton. Last year was marred by deep staff cuts, a change of CEO and a reckoning of where the home fitness company belonged, post-Pandemic boom. The answer is, unfortunat [...]

Match Score: 49.30

Destination

2025-06-27

NordVPN Review 2025: Innovative features, a few missteps

When we say that NordVPN is a good VPN that's not quite great, it's important to put that in perspective. Building a good VPN is hard, as evidenced by all the shovelware VPNs flooding the ma [...]

Match Score: 47.52