Do you need gigabytes of data to test the performance of your app? The easiest way is to download sample data from free data repositories available on the web. The main disadvantage of this approach is that the data may contain very little unique content, so it may not give the desired results. Below are 70+ websites that offer large data repositories for free.
Wikipedia:Database offers free copies of all available content to interested users. The data is available in multiple languages, and the content can be downloaded along with images.
Common Crawl builds and maintains an open crawl of the web that is accessible to everyone. The data is stored in an Amazon S3 bucket, and the requester may have to spend some money to access it.
Apache Mahout is an Apache top-level project (TLP) for creating scalable machine learning algorithms. Mahout has many links to free and paid corpus data.
EDRM Enron Email Data Set v2 consists of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST.
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
DMOZ – Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different categories. DMOZ is one of the main sources for internet search engines.
theinfo.org – This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.
Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, iOS or other portable device.
GDELT: The Global Data on Events, Location and Tone, described by The Guardian as “a big data history of life, the universe and everything.”
GEO (Gene Expression Omnibus) is a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query and retrieval.
Yelp Academic Dataset provides all the data and reviews of the 250 closest businesses for 30 universities, for students and academics to explore and research.
data.world is a platform where data scientists can find and use a vast array of high-quality open data, collaborate on data projects, and meet other like-minded data nerds.
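Since downloaded samples may contain little unique content, it can help to measure how repetitive a sample is before using it for performance testing. The following is a minimal sketch, assuming the sample is line-oriented text; the helper name `unique_line_ratio` is hypothetical and not part of any of the repositories listed above.

```python
import hashlib
from typing import Iterable


def unique_line_ratio(lines: Iterable[str]) -> float:
    """Return the fraction of non-empty lines that are distinct.

    Hypothetical helper: lines are compared after stripping surrounding
    whitespace, and only a fixed-size hash of each line is kept so that
    memory use stays modest on large samples.
    """
    seen = set()
    total = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines carry no content, so skip them
        total += 1
        seen.add(hashlib.sha1(text.encode("utf-8")).digest())
    return len(seen) / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory sample; a real check would pass an open file instead.
    sample = ["alpha", "beta", "alpha", "", "gamma", "beta"]
    print(f"unique ratio: {unique_line_ratio(sample):.2f}")
```

A ratio close to 1.0 suggests mostly distinct records, while a low ratio suggests the sample repeats itself and may not stress an app the way real data would. Because an open text file is itself an iterable of lines, it can be passed to the function directly.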
Data mining is often a difficult and time-consuming task, and not having a clear idea of how to mine the data will severely affect the project’s focus. Here are 8 things you should remember while mining data: read the blog post here.