70+ websites to get large data repositories for free

10th Jun '14, 03:10 PM

Do you need gigabytes of data to check the performance of your app? The easiest way is to download samples of data from free data repositories available on the Web. The main disadvantage of this approach is that the data may contain very little unique content, so it may not give the desired results. Below are 70+ websites to get large data repositories for free.
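One way to check for this problem before relying on a downloaded sample is to measure how much of it is actually distinct. The following is a minimal Python sketch under stated assumptions: the function name unique_ratio is hypothetical, and it treats each string (for example, one line of a file) as one record. Real datasets may need a different notion of "record".

```python
import hashlib
from typing import Iterable


def unique_ratio(records: Iterable[str]) -> float:
    """Return the fraction of records that are distinct (1.0 means no duplicates)."""
    seen = set()
    total = 0
    for record in records:
        total += 1
        # Store a fixed-size hash of the trimmed record instead of the record
        # itself, so long records do not inflate memory use.
        digest = hashlib.sha1(record.strip().encode("utf-8")).digest()
        seen.add(digest)
    return len(seen) / total if total else 0.0


# Illustrative sample: "alpha" appears three times once whitespace is trimmed.
sample = ["alpha", "beta", "alpha", "gamma", "alpha "]
print(unique_ratio(sample))
```

A ratio close to 1.0 suggests the sample is mostly unique; a low ratio suggests a performance test on it will mostly exercise repeated content. To apply it to a file, pass an open file object, which yields one line at a time.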

Wikipedia:Database offers free copies of all available content to interested users. Data is available in multiple languages, and content can be downloaded along with images.

Common Crawl builds and maintains an open crawl of the web accessible to everyone. The data is stored in an Amazon S3 bucket, and the requester may have to spend some money to access it.

EDRM File Formats Data Set consists of 381 files covering 200 file formats.

Apache Mahout is an Apache top-level project (TLP) to create scalable machine learning algorithms. Mahout has many links to free and paid corpus data.

EDRM Enron Email Data Set v2 consists of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST.

ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

DMOZ – Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different categories. DMOZ is one of the main sources for Internet search engines.

A site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.

Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, iOS or other portable device.

Million Song Dataset, data related to tracks and artists.

AWS (Amazon Web Services) Public Data Sets, a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.

BigML big list of public data sources.

Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.

Bitly data, anonymized clicks on gov links.

Canada Open Data, pilot project with many government and geospatial datasets.

Causality Workbench data repository.

Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.

Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly (Jan 2011).

Open government data from US, EU, Canada, CKAN, and more.

Publicly available data from UK (also London datastore).

Central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.

DataMarket, visualize the world’s economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.

Datamob, public data put to good use.

A clearinghouse of datasets available from the City & County of San Francisco, CA.

DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.

Delve, Data for Evaluating Learning in Valid Experiments

EconData, thousands of economic time series, produced by a number of US Government agencies.

Enron Email Dataset, data from about 150 users, mostly senior management of Enron.

Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana – the trusted and comprehensive resource for European cultural heritage content.

FEDSTATS, a comprehensive source of US statistics and more

FIMI repository for frequent itemset mining, implementations and datasets.

Financial Data Finder at OSU, a large catalog of financial data sets.

GDELT: The Global Data on Events, Location and Tone, described by the Guardian as “a big data history of life, the universe and everything.”

GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.

GeoDa Center, geographical and spatial data.

Google ngrams datasets, text from millions of books scanned by Google.

Grain Market Research, financial data including stocks, futures, etc.

Hilary Mason research-quality Big Data sets collection – many text and image datasets.

HitCompanies Datasets, comprehensive data on 10,000 random UK companies sampled from HitCompanies, updated automatically using AI/Machine Learning.

ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.

Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.

Investor Links, includes financial data

KDD Cup center, with all data, tasks, and results.

Kevin Chai list of datasets, for text, SNA, and other fields.

KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.

Linking Open Data project, aimed at making data freely available to everyone.

MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.

ML Data, the data repository of the EU Pascal2 networks.

NASDAQ Data Store, provides access to market data.

National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.

National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.

Open Data Census, assesses the state of open data around the world.

OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.

Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.

Peter Skomoroch dataset bookmarks.

PubGene(TM) Gene Database and Tools, genomic-related publications database.

Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.

qunb, a platform to find and visualize quantitative data.

Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.

SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.

Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.

Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management web site.

StatLib, CMU Datasets Archive.

STATOO Datasets part 1 and STATOO Datasets part 2

Time Series Data Library

Visual Analytics Benchmark Repository.

UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.

UCI Machine Learning Repository.

UCR Time Series Data Archive, offering datasets, papers, links, and code.

United States Census Bureau.

Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources.

Wolfram Alpha disease and patient level data.

Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition

Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.

A platform where data scientists can find and use a vast array of high-quality open data, collaborate on data projects, and meet other like-minded data nerds.

