How to Scrape Data from the Web Using Python

Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.

In case you are not familiar with web scraping, here is an explanation:

“Web scraping is a computer software technique of extracting information from websites”

“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”

Some websites make your life easier by offering an API, an interface you can use to download data directly. Websites such as Rotten Tomatoes and Twitter provide APIs for accessing their data. But if a website doesn’t provide an API, you can use Python to scrape data from its pages.

I will be using two Python modules for scraping data.

  • urllib2
  • Beautiful Soup (bs4)

So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:

Understanding HTML Basics

Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.

A minimal web page is defined by a handful of HTML tags. The root tag is <html>, and inside it comes the <head> tag, which holds the title of the page and may also hold other meta information such as keywords. The <body> tag holds the actual content of the page. <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are the different heading levels.
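
To see that structure from Python, here is a minimal sketch that feeds a made-up page to the standard library's html.parser module and lists the tags in the order they open. The page content and the TagLister class are my own illustration and do not come from any real site.

```python
from html.parser import HTMLParser

# A made-up minimal page: <html> is the root, <head> holds the title,
# and <body> holds the visible content.
MINIMAL_PAGE = """<html>
<head>
<title>Page title</title>
</head>
<body>
<h1>Main heading</h1>
<p>A paragraph of text.</p>
</body>
</html>"""


class TagLister(HTMLParser):
    """Hypothetical helper that records every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


lister = TagLister()
lister.feed(MINIMAL_PAGE)
print(lister.tags)  # ['html', 'head', 'title', 'body', 'h1', 'p']
```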

These are some useful HTML tags you need to know.

[Image: table of useful HTML tags]

I encourage you to inspect a web page and view its source code to understand more about HTML.

Scraping A Web Page Using Beautiful Soup

I will be scraping data from bigdataexaminer.com. I am importing urllib2, Beautiful Soup (bs4), pandas and NumPy.

import urllib2
import bs4
import pandas as pd
import numpy as np

The call beautiful = urllib2.urlopen(url).read() goes to bigdataexaminer.com and fetches the page’s entire HTML text. I then store it in a variable called beautiful.
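
Note that urllib2 exists only in Python 2; in Python 3 the same urlopen function lives in urllib.request. Here is a minimal sketch of the download step for Python 3. The fetch_html helper, its timeout default, and the http:// scheme in the commented example are my assumptions, not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# beautiful = fetch_html("http://bigdataexaminer.com")
```

Decoding the bytes right away means the rest of the code works with ordinary text, although Beautiful Soup can also accept the raw bytes.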

Now I have to parse and clean the HTML code. BeautifulSoup is a really useful Python module for parsing HTML and XML files. Beautiful Soup gives a BeautifulSoup object, which represents the document as a nested data structure.
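
As a rough sketch of that parsing step, assuming Python 3 with the bs4 package installed: the markup below is a made-up stand-in for the downloaded page, and choosing the built-in "html.parser" backend is my assumption.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML text that was downloaded earlier.
beautiful = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>First paragraph.</p></body></html>"
)

# "html.parser" is Python's built-in parser, so no extra parser library is needed.
soup = BeautifulSoup(beautiful, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.body.h1)         # <h1>Hello</h1>
```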

Prettify

You can use the prettify() method to show the HTML code with each nesting level indented.

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <h1> tag, just say soup.h1, or soup.h1.prettify() to print it with indentation.
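
Here is a small sketch of both ideas on a made-up page (the markup is my own, not taken from bigdataexaminer.com), assuming Python 3 and bs4.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Big Data</h1><p>Some text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with tags and text on separate, indented lines.
print(soup.prettify())

# Naming a tag returns the first match; prettify() works on that subtree too.
print(soup.h1.prettify())
```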

Contents

soup.tag.contents returns the children of a tag as a list.

In[18] : soup.head.contents
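
A minimal sketch of .contents on a made-up <head> section (the markup is my assumption, written without whitespace between tags so that no extra text nodes appear in the list):

```python
from bs4 import BeautifulSoup

html = '<html><head><meta charset="utf-8"/><title>Demo</title></head><body></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .contents holds only the direct children of <head>.
children = soup.head.contents
print(len(children))                # 2
print([c.name for c in children])   # ['meta', 'title']
```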

The following expression returns the title tag inside the head tag.

In[45] : x = soup.head.title

Out [45]: <title></title>

.string returns the string inside the title tag of bigdataexaminer.com. As bigdataexaminer.com doesn’t have a title, the value returned is None.
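
To make the None case concrete, here is a small sketch comparing an empty title with a filled one. Both snippets of markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# An empty <title>, like the one shown above.
empty = BeautifulSoup("<html><head><title></title></head></html>", "html.parser")
print(empty.head.title)          # <title></title>
print(empty.head.title.string)   # None

# A <title> that actually contains text.
named = BeautifulSoup("<html><head><title>Demo page</title></head></html>", "html.parser")
print(named.head.title.string)   # Demo page
```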

Descendants

.descendants lets you iterate over all of a tag’s children, recursively.
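
The difference from .contents is easiest to see side by side. This is a sketch on made-up markup, assuming Python 3 and bs4.

```python
from bs4 import BeautifulSoup, Tag

html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .contents only lists the direct children of <html> ...
print([child.name for child in soup.html.contents])  # ['head', 'body']

# ... while .descendants walks the whole subtree, text nodes included.
# Filtering on Tag keeps only the element nodes.
tag_names = [node.name for node in soup.html.descendants if isinstance(node, Tag)]
print(tag_names)  # ['head', 'title', 'body', 'p', 'b']
```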

You can also iterate over just the strings using the .strings generator.

In[56]: soup.get_text()

This extracts all the text from bigdataexaminer.com.
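
Here is a short sketch of .strings and get_text() on a made-up page. The separator and strip arguments in the last call are optional extras I added to show how to get cleaner output.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>First <b>bold</b> text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node in document order.
print(list(soup.strings))              # ['Title', 'First ', 'bold', ' text.']

# get_text() joins all of those strings into one.
print(soup.get_text())                 # TitleFirst bold text.

# A separator plus strip=True trims each piece and puts a space between them.
print(soup.get_text(" ", strip=True))  # Title First bold text.
```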

Find All

You can use find_all() to find all the ‘a’ tags on the page.

To get only the first four ‘a’ tags, you can use the limit argument.
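
A minimal sketch of find_all() with and without limit, run on five made-up links rather than on the real site:

```python
from bs4 import BeautifulSoup

html = (
    "<body>"
    '<a href="/one">One</a><a href="/two">Two</a><a href="/three">Three</a>'
    '<a href="/four">Four</a><a href="/five">Five</a>'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Without limit, find_all() returns every matching tag.
all_links = soup.find_all("a")
print(len(all_links))  # 5

# With limit=4, the search stops after the first four matches.
first_four = soup.find_all("a", limit=4)
print([a.get_text() for a in first_four])  # ['One', 'Two', 'Three', 'Four']
```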

To find particular text on a web page, you can use the text argument with find_all(). Here I am searching for the term ‘data’ on bigdataexaminer.com.
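
Here is a sketch of that kind of search on made-up markup. Beautiful Soup 4.4.0 and later call this argument string (text is the older name for the same thing), so I use string here; pairing it with a regular expression is my choice, so that the term matches anywhere inside a string.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<body><h2>Big data news</h2><p>Python tips</p>"
    "<a href='/d'>Learn data science</a></body>"
)
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern matches any text node that contains "data".
matches = soup.find_all(string=re.compile("data"))
print(matches)  # ['Big data news', 'Learn data science']

# Each match is a string; .parent steps up to the tag that contains it.
print([m.parent.name for m in matches])  # ['h2', 'a']
```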

You can also get the attributes of the second ‘a’ tag on bigdataexaminer.com.

You can also use a list comprehension to get the attributes of the first four ‘a’ tags on bigdataexaminer.com.
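
Both steps can be sketched like this. The links are made up, and reading .attrs and calling .get("href") is one way to do it, not necessarily the exact code used in the original screenshots.

```python
from bs4 import BeautifulSoup

html = (
    "<body>"
    '<a href="/home" title="Home">Home</a>'
    '<a href="/blog" title="Blog">Blog</a>'
    '<a href="/about">About</a>'
    '<a href="/contact">Contact</a>'
    '<a href="/more">More</a>'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .attrs is a dict of the tag's attributes; index 1 is the second tag.
print(links[1].attrs)  # {'href': '/blog', 'title': 'Blog'}

# A list comprehension over the first four tags collects their href values.
hrefs = [a.get("href") for a in soup.find_all("a", limit=4)]
print(hrefs)  # ['/home', '/blog', '/about', '/contact']
```

Using .get("href") instead of a["href"] returns None for a tag without that attribute rather than raising a KeyError.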

Conclusion

A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library for web scraping called Pattern. I also found a good tutorial on web scraping using Python.

Instead of taking the difficult path of building an in-house web scraping setup from scratch, you could rely on PromptCloud’s web scraping service to take end-to-end ownership of your project.

Web scraping is not just about “coding” per se. You need to be adept in coding, internet protocols, database warehousing, service requests, code cleansing, converting unstructured data to structured data, and even some machine learning nowadays.
