12 common problems in Data Mining

3rd Feb '15, 05:09 PM in Data Mining

Guest Contributor

The amount of data generated and stored every day is growing exponentially. A recent study estimated that every minute, Google receives over 2 million queries, e-mail users send over 200 million messages, YouTube users upload 48 hours of video, Facebook users share over 680,000 pieces of content, and Twitter users generate 100,000 tweets. In addition, media-sharing sites, stock-trading sites and news sources continually pile up new data throughout the day. A few years ago, we began to leverage this “Big Data” to find consistent patterns and insights, and almost immediately a new, interrelated research area emerged: Data Mining. In this post, we take a look at 12 common problems in Data Mining.

1. Poor data quality, such as noisy or dirty data, missing values, inexact or incorrect values, inadequate data size and unrepresentative sampling.

2. Integrating conflicting or redundant data from different sources and in different forms: multimedia files (audio, video and images), geographic data, text, social data, numeric data, etc.

3. Proliferation of security and privacy concerns among individuals, organisations and governments.

4. Unavailability of data, or difficulty in accessing it.

5. Efficiency and scalability of data mining algorithms, which must extract information effectively from huge amounts of data in databases.

6. Dealing with huge datasets that require distributed approaches.

7. Dealing with non-static, unbalanced and cost-sensitive data.

8. Mining information from heterogeneous databases and global information systems.

9. Constantly updating models to keep up with data velocity and new incoming data.

10. High cost of buying and maintaining the powerful software, servers and storage hardware needed to handle large amounts of data.

11. Processing of large, complex and unstructured data into a structured format.

12. Sheer quantity of output from many data mining methods.
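
To make the first problem on the list a little more concrete, here is a minimal sketch of a data quality check that counts missing values and exact duplicate records before any mining is attempted. It is only an illustration in plain Python: the function name profile_records, the field names and the sample rows are all hypothetical and do not come from any particular tool or dataset.

```python
from collections import Counter


def profile_records(records, required_fields):
    """Count missing values per field and exact duplicate records.

    Hypothetical helper for illustration; a value counts as missing
    when it is absent, None, or an empty string.
    """
    missing = Counter()
    seen = Counter()
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or value == "":
                missing[field] += 1
        # Two records are treated as duplicates when all required fields match.
        seen[tuple(record.get(f) for f in required_fields)] += 1
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicates": duplicates,
    }


# Made-up sample rows: one empty city, one missing age, one repeated record.
SAMPLE = [
    {"id": 1, "city": "Pune", "age": 34},
    {"id": 2, "city": "", "age": 29},
    {"id": 3, "city": "Delhi", "age": None},
    {"id": 1, "city": "Pune", "age": 34},
]

report = profile_records(SAMPLE, ["id", "city", "age"])
print(report)
```

A report like this will not fix poor data, but it can show early on whether a dataset needs cleaning, and how much, before any time is spent on modelling.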