The D.C. Open Government Coalition website is among those used to train AI systems, according to a dataset analyzed by experts and reported in The Washington Post. Data scientists analyzed a file called C4 (Colossal Clean Crawled Corpus), created by Google from data gathered in 2019 by Common Crawl, a California nonprofit.
The result is “a massive snapshot of the contents of 15 million websites” that the Post says has been used “to instruct some high-profile English-language AIs, called large language models” including at Google and Facebook.
Researchers at the Allen Institute for AI in Seattle cleaned the raw crawl result by filtering unwanted items such as copyrighted or unlawful materials, then categorized the websites using data from Similarweb, a web analytics company. About 10 million of the C4 web domains could be categorized by source, such as business (16 percent), technology (15 percent), and news and media (13 percent). Law and government made up 4 percent.
In a final step, analysts ranked the 10 million websites “based on how many ‘tokens’ appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.”
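The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the analysts' actual method: it assumes tokens can be approximated by splitting text on whitespace, and the function name `rank_domains_by_tokens` and the sample pages are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_tokens(pages):
    """Return (domain, token_count) pairs, largest count first.

    `pages` is an iterable of (url, text) pairs. Whitespace splitting
    is an assumed stand-in for whatever tokenizer the analysts used.
    """
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc.lower()
        counts[domain] += len(text.split())
    return counts.most_common()


# Hypothetical sample data, made up for illustration only.
sample_pages = [
    ("https://example.org/a", "open records request filed today"),
    ("https://example.com/b", "short page"),
    ("https://example.org/c", "another page with five words"),
]

ranking = rank_domains_by_tokens(sample_pages)
for rank, (domain, tokens) in enumerate(ranking, start=1):
    print(f"#{rank} {domain}: {tokens} tokens")
```

A site's rank here depends only on how much of its text ended up in the data set, not on how often it is visited or cited.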
Content from the worldwide patent site “patents.google.com” appeared most often; Wikipedia was #2. News sites ranked very high, making up half of the top dozen, with the New York Times at #4 and the Washington Post at #11.
In the law and government area, web sources for legal cases showed up highest after numerous patent sites — FindLaw ranked at #23, Justia Supreme Court at #50.
A look-up tool in the article allows a search for any website among the millions analyzed, and entering the web address “dcogc.org” shows the D.C. Open Government Coalition ranked #350,092. That ranking resulted from researchers finding 67,000 snippets (tokens) of Coalition text in the web crawl file (called the “corpus”), or 0.00004 percent of the total tokens in the C4 dataset. Wikipedia had 290 million, or 0.2 percent.
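The percentages quoted above can be cross-checked with simple arithmetic. The sketch below uses only the figures in this post and assumes the Wikipedia numbers are rounded, so the implied corpus size is approximate; the function names are illustrative.

```python
# Figures quoted in this post (rounded, so results are approximate).
WIKIPEDIA_TOKENS = 290_000_000
WIKIPEDIA_SHARE_PERCENT = 0.2
COALITION_TOKENS = 67_000


def implied_total_tokens(tokens, share_percent):
    """Corpus size implied by one site's token count and percent share."""
    return tokens / (share_percent / 100)


def share_of_corpus_percent(tokens, total_tokens):
    """A site's share of the corpus, expressed as a percentage."""
    return tokens / total_tokens * 100


total = implied_total_tokens(WIKIPEDIA_TOKENS, WIKIPEDIA_SHARE_PERCENT)
coalition_share = share_of_corpus_percent(COALITION_TOKENS, total)

print(f"Implied corpus size: {total:.3g} tokens")
print(f"Coalition share: {coalition_share:.6f} percent")
```

Under these assumptions the corpus works out to roughly 145 billion tokens and the Coalition's share to between 0.00004 and 0.00005 percent, in line with the look-up tool's figure.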
The look-up shows the crawl also included content from other open government organizations. The National Freedom of Information Coalition ranked #176,460, Colorado FOI Coalition #236,041, Virginia Coalition for Open Government #202,469, California’s First Amendment Foundation #426,481 and Florida First Amendment Foundation #607,493.
MuckRock, the FOIA site that helps requesters nationwide and publishes interesting results, showed up much more often, with a million tokens, which ranked it #12,500.
The training step in building AI tools is much discussed, since biased or incorrect training materials affect the results and undercut credibility, yet it is hard to study, according to the Post. Some developers, such as Google and Facebook, have used the C4 file analyzed in the article. But OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT. GPT-3 reportedly began several years ago with a dataset 40 times larger than C4. And some developers don’t want to know their sources, again according to the Post, with privacy and other liability concerns leading them to scrape and use web content without documenting details. Social media companies (Facebook, Twitter) prohibit the scraping technique used to collect web texts.
The Coalition welcomes a host of new users of our content via this new route of large language models built to power AI systems.