My Blog

    EleutherAI releases massive AI training dataset of licensed and open domain text

    By Y U Raju, June 6, 2025


    EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

    The dataset, called The Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. At 8 terabytes, The Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

    AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

    EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

    “[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

    The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open-source speech-to-text model, to transcribe audio content.

    EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that The Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which have 7 billion parameters and were trained on only a fraction of The Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

    Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
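
As a rough illustration of what a parameter count measures, the sketch below tallies the weights and biases of a small fully connected network. The layer sizes and the helper name `count_dense_parameters` are hypothetical, chosen only for illustration; they do not describe the architecture of the Comma models.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first.
    These sizes are illustrative assumptions, not a real model's.
    """
    total = 0
    # Each pair of adjacent layers contributes one weight matrix and one bias vector.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights connecting the two layers
        total += n_out         # one bias per output unit
    return total


toy_sizes = [4, 8, 2]  # hypothetical toy network
toy_total = count_dense_parameters(toy_sizes)
print(toy_total)  # 4*8 + 8 + 8*2 + 2 = 58
```

A model described as having 7 billion parameters has roughly that many such learned numbers in total, spread across far larger and more varied layers than this toy case.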

    “In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

    The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

    EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.


