8 Best Python Data Science Books

03rd Nov `15, 04:15 PM in Data Science

Manu Jeevan Contributor

Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on data sets.

There are so many libraries, applications, and techniques for analyzing data in Python that even experts in the field don’t have it all figured out. But for aspiring data scientists, understanding these libraries may be just a few pages away. These books explain everything from the basics of data analysis to the most advanced Python libraries.

Here is my reading list:

1) Python for Data Analysis

Wes McKinney’s Python for Data Analysis is a tour of pandas, NumPy, and Matplotlib for people looking to crunch data with Python. McKinney is the principal author of pandas, so he mostly talks about pandas and shows you how to apply it effectively to your data set. If you’re looking for a book that will tell you which types of analyses to do, this is not that book: the author assumes that you already know what kind of analyses you need to perform on your data.

If you are new to Python, you should look at the appendix of this book, which covers Python basics. Most importantly, the target audience is not Pythonistas, but rather scientists, educators, statisticians, financial analysts, and other non-programmers who want to perform data analysis effectively in Python.

2) Building Machine Learning Systems With Python

This is one of my favorite books on machine learning and Python. Be aware that it is not intended for beginners: you should have a good grasp of Python and machine learning to understand the code and the techniques used in this book. The book is something more than a summary of machine learning algorithms, because it also shows you how to choose the right algorithm for the problem at hand. It uses scikit-learn to implement these algorithms, so you should know scikit-learn well enough to run machine learning algorithms in Python.

If you want to explore scikit-learn further, there are two other books you should look into: Mastering Machine Learning with scikit-learn and Learning scikit-learn.

3) Think Stats 2

To my knowledge, Think Stats is the only book that teaches statistics and Python together by showing you how to implement statistical concepts in Python. The good thing about this book is that it is available for free online; special thanks to the author, Allen Downey. Even if you are not yet a great programmer, you will find the content accessible and will be able to master it through the examples and exercises, because most of the exercises use short programs to run experiments and help readers develop understanding. You should also have a look at statsmodels, a Python library used to explore data, estimate statistical models, and perform statistical tests.
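To give a flavor of what "short programs that run experiments" can look like, here is a minimal sketch that uses only the Python standard library. It is not taken from the book: the die-rolling setup, the constants, and the function name `sample_mean` are all assumptions chosen for illustration. It checks by simulation that the spread of sample means is close to the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(42)  # fixed seed so the experiment is repeatable

ROLLS_PER_SAMPLE = 100  # size of each simulated sample (illustrative choice)
NUM_SAMPLES = 1000      # how many times the experiment is repeated


def sample_mean(n):
    """Roll a fair six-sided die n times and return the average roll."""
    return statistics.mean([random.randint(1, 6) for _ in range(n)])


# Repeat the experiment and record the mean of each sample.
means = [sample_mean(ROLLS_PER_SAMPLE) for _ in range(NUM_SAMPLES)]

# Observed spread of the sample means (an estimate of the standard error).
observed_se = statistics.stdev(means)

# Theoretical standard error: population std dev / sqrt(sample size).
theoretical_se = statistics.pstdev(list(range(1, 7))) / ROLLS_PER_SAMPLE ** 0.5

print(f"observed:    {observed_se:.3f}")
print(f"theoretical: {theoretical_se:.3f}")
```

The two printed numbers should land close to each other, which is the kind of result that is easier to believe after running it than after reading a formula.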

4) Think Bayes

Think Bayes is an introduction to Bayesian statistics using computational methods. You need to know the basics of probability and Python programming to understand the code and the probability concepts. This is a great book and a good introduction to the application of Bayes’s theorem in a number of scenarios. The theoretical aspects are accessible and the Python code is sufficiently clear. The PDF version of the book is freely available from Green Tea Press.
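As a rough illustration of the computational approach to Bayes’s theorem, here is a minimal sketch built around a two-bowl cookie problem of the kind often used to introduce the theorem. The helper `bayes_update`, the dictionary layout, and the numbers are assumptions made for this example, not code from the book. Bowl 1 holds 30 vanilla and 10 chocolate cookies, bowl 2 holds 20 of each, and we ask how likely it is that a vanilla cookie came from bowl 1.

```python
from fractions import Fraction


def bayes_update(priors, likelihoods):
    """Return posterior probabilities for each hypothesis.

    priors:      hypothesis -> P(hypothesis)
    likelihoods: hypothesis -> P(data | hypothesis)
    """
    # Multiply each prior by the likelihood of the observed data.
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # Divide by the total so the posteriors sum to 1.
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Both bowls are equally likely before we see the cookie.
priors = {"bowl_1": Fraction(1, 2), "bowl_2": Fraction(1, 2)}

# Probability of drawing a vanilla cookie from each bowl.
likelihoods = {"bowl_1": Fraction(30, 40), "bowl_2": Fraction(20, 40)}

posterior = bayes_update(priors, likelihoods)
print(posterior["bowl_1"])  # 3/5
```

Using `Fraction` keeps the arithmetic exact, so the result can be checked by hand: (1/2 × 3/4) / (1/2 × 3/4 + 1/2 × 1/2) = 3/5.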

5) Programming Collective Intelligence

Have you ever wondered how some of those “collective intelligence” sites work? How Flipkart can suggest products that you’ll like based on your browsing history? How Google can rank and filter results? Toby Segaran does a very good job of revealing and teaching these algorithms in this book. He explains complex algorithms and mathematical concepts with clear examples and code that is both easy to read and useful. The examples give real-world grounding to abstract concepts like collaborative filtering and Bayesian classification. The book was written in 2007, but I still find it useful.
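To make collaborative filtering a little more concrete, here is a minimal sketch of user-based filtering in plain Python. It is not the book's code: the ratings, the user and item names, and the functions `cosine_similarity` and `recommend` are all hypothetical, and cosine similarity is just one of several similarity measures that could be used here. The idea is to predict a score for items a user has not rated by averaging other users' ratings, weighted by how similar their tastes are.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score on a 1-5 scale}
ratings = {
    "ana": {"book_a": 5, "book_b": 3, "book_c": 4},
    "ben": {"book_a": 4, "book_b": 3, "book_c": 5, "book_d": 5},
    "cara": {"book_a": 1, "book_b": 5, "book_d": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity of two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Return (predicted_score, item) pairs for unrated items, best first."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for item, score in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not rated yet
            scores[item] = scores.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    # Similarity-weighted average of the other users' ratings.
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)


recs = recommend("ana", ratings)
print(recs)
```

In this toy data, ana's tastes are closer to ben's than to cara's, so the predicted score for `book_d` is pulled toward ben's rating of 5 rather than cara's rating of 2.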

6) Mining the Social Web

This is more than a “book”: it is a course, and a very well thought-out, well-supported course at that. The book introduces the APIs provided by some of the larger social platforms, and also gives a good introduction to data munging and data analysis. The clear, easy-to-follow examples are further enhanced by the book’s accompanying virtual machine, which lets you escape the headache of installing, configuring, and selecting the right versions of all the supporting software and libraries. You can also check out the author’s website, Mining the Social Web, where he writes some really good articles on social media mining.

7) Natural Language Processing with Python

The Natural Language Toolkit (NLTK) is an excellent Python library for processing text and language. It has excellent APIs that can preprocess, classify, and analyze your text. The text analytics market is projected to be worth $4.90 billion by 2019, so there will be a huge demand for professionals who specialize in text analytics. Some of the examples in the book are really amazing, especially the chapter on supervised classification.

I strongly recommend this book to Python lovers who would like to explore the world of natural language understanding, parsing, and processing. It shows off one of the real strengths of the Python programming language.

8) Bayesian Methods for Hackers

Bayesian Methods for Hackers gives a decent introduction to Bayesian inference from both a computational and a mathematical point of view. The book relies heavily on PyMC (a Python statistical package) to implement Bayesian concepts. If you really want to learn Bayesian methods through practical examples, then this book is for you. Make sure that you are good at Python programming and are familiar with libraries such as NumPy, SciPy, and Matplotlib to get the most out of this book. The book is available for free online. The author’s new book, Bayesian Methods for Hackers, is due to launch this May.

I found these 8 books really useful and interesting, but which ones suit you clearly depends on your area of interest. If you are serious about becoming a data scientist, then the first 3 books are must-haves. If your area of interest is web intelligence, social media mining, Bayesian statistics, or natural language processing, then you should definitely get books 5, 6, 7, and 8.

Here are some more Python data science books you may want to read:

Web Scraping With Python

Machine Learning In Python

Python for Data Science For Dummies

Is there any other Python data science book that I haven’t talked about?