Contributor: 杜成玉
Download: http://commoncrawl.org/the-data/get-started/

Overview

Source: https://www.zhihu.com/question/63383992/answer/222718972
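Common Crawl contains more than seven years of web crawl data, including raw web page data, extracted metadata, and extracted text. The Common Crawl data is hosted in the Amazon Web Services Public Data Sets and on multiple academic cloud platforms around the world. It is petabyte-scale and is commonly used for learning word embeddings. Recommended applications: text mining, natural language understanding.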
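
The three kinds of data mentioned above are distributed by Common Crawl as WARC (raw pages), WAT (metadata), and WET (extracted text) files, all of which use WARC-style records: a version line, header fields, a blank line, then the payload. The following is a minimal sketch in Python of splitting one such record held in memory. The sample record, the name `parse_record`, and the whitespace tokenization are illustrative assumptions, not part of any official tooling; real files are gzip-compressed and contain many records, so a dedicated WARC library is the better choice in practice.

```python
from typing import Dict, Tuple

# Hypothetical sample in the style of a WET record (assumption for illustration).
SAMPLE_WET = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 12\r\n"
    "\r\n"
    "Hello, world"
    "\r\n\r\n"
)


def parse_record(raw: str) -> Tuple[Dict[str, str], str]:
    """Split one WARC-style record into (headers, body)."""
    # Headers end at the first blank line; the payload follows it.
    head, _, rest = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC-style record")

    headers: Dict[str, str] = {}
    for line in lines[1:]:
        # Split at the first colon only, so URLs in values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length counts bytes, so slice the encoded payload.
    length = int(headers["Content-Length"])
    body = rest.encode("utf-8")[:length].decode("utf-8")
    return headers, body


headers, body = parse_record(SAMPLE_WET)
# Naive whitespace tokenization, e.g. as input to word-embedding training.
tokens = body.lower().split()
```

Here `headers["WARC-Target-URI"]` gives the page URL and `tokens` holds the lowercased words of the extracted text, which is the kind of input a word-embedding pipeline would consume after further cleaning.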

Related papers

[1] Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4.
[2] Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the Common Crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383.
[3] Spiegler S. Statistics of the Common Crawl corpus 2012[R]. Technical report, SwiftKey, 2013.
[4] Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145.
[5] Bizer C, Eckert K, Meusel R, et al. Deployment of RDFa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.