How to scrape data from the web using Python

Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.

In case you are not familiar with web scraping, here is an explanation:

“Web scraping is a computer software technique of extracting information from websites”

“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”

Some websites make your life easier by offering an API, an interface you can use to download data. Websites like Rotten Tomatoes and Twitter provide APIs to access their data. But if a website doesn’t provide an API, you can use Python to scrape data from its pages.

I will be using two Python modules for scraping data.

  • urllib2
  • Beautiful Soup (bs4)

So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:

Understanding HTML Basics

Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.

A minimal web page is defined with a handful of HTML tags. The root tag is <html>, followed by the <head> tag, which holds the title of the page and may also hold other meta information such as keywords. The <body> tag holds the actual content of the page. <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are the different heading levels.
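To make the tag layout described above concrete, here is a minimal sketch. The page content is made up for illustration, and I am assuming Python 3 and its standard-library html.parser module rather than the modules used in the rest of this post.

```python
from html.parser import HTMLParser

# A hypothetical minimal page, made up for illustration.
MINIMAL_PAGE = """<html>
<head>
  <title>My first page</title>
  <meta name="keywords" content="scraping, python">
</head>
<body>
  <h1>Main heading</h1>
  <h2>Sub heading</h2>
  <p>Some content.</p>
</body>
</html>"""


class TagLister(HTMLParser):
    """Record the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


lister = TagLister()
lister.feed(MINIMAL_PAGE)
print(lister.tags)
```

The printed list shows the nesting order: the root tag first, then the head with its title and meta information, then the body with its headings and content.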


These are some useful HTML tags you need to know.

[Image: useful tags]

I encourage you to inspect a web page and view its source code to understand more about HTML.

Scraping A Web Page Using Beautiful Soup

I will be scraping data from Big Data Examiner. I am importing urllib2, Beautiful Soup (bs4), pandas and NumPy.

import urllib2
import bs4
import pandas as pd
import numpy as np

What beautiful = urllib2.urlopen(url).read() does is go to the page at url and fetch the whole HTML text, which I then store in a variable called beautiful.
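A note for Python 3 users: urllib2 exists only in Python 2, and its urlopen function lives in urllib.request in Python 3. The sketch below shows the same fetch step under that assumption. The fetch_html helper is a hypothetical name of mine, and a temporary local file stands in for the real page so the example runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_html(url):
    """Return the body of `url` decoded as text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


# Stand-in for a real web page: a small HTML file on disk,
# addressed through a file:// URL instead of http://.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html><head><title>Demo</title></head></html>", encoding="utf-8")
    beautiful = fetch_html(page.as_uri())

print(beautiful)
```

For a live site you would pass its http:// or https:// address to fetch_html instead of the file URL.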

Now I have to parse and clean the HTML code. BeautifulSoup is a really useful Python module for parsing HTML and XML files. Beautiful Soup gives a BeautifulSoup object, which represents the document as a nested data structure.


You can use the prettify() function to show the different levels of the HTML code.
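Here is a minimal sketch of the parse-then-prettify step. The HTML string is a made-up stand-in for the downloaded page, and I am assuming the bs4 package is installed and using Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML text fetched earlier.
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

# Build the nested BeautifulSoup object from the raw text.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the document as a string with one tag per line,
# indented according to how deeply each tag is nested.
pretty = soup.prettify()
print(pretty)
```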

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <h1> tag, just say soup.h1.prettify().
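As a small sketch of tag-name navigation (the sample HTML is made up, and bs4 with the "html.parser" backend is assumed), note that soup.h1 gives only the first matching tag, and a tag that does not exist gives None:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = "<html><body><h1>First heading</h1><h1>Second heading</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.h1          # only the first <h1> in the document
print(first_h1.prettify())  # that tag alone, nicely indented
print(soup.h2)              # no <h2> on this page, so this is None
```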



soup.tag.contents returns the contents of a tag as a list.

In[18] : soup.head.contents
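A minimal sketch of .contents on a made-up head section (assuming bs4 with the "html.parser" backend). The list holds the direct children only, which can be tags or plain strings:

```python
from bs4 import BeautifulSoup

# Hypothetical head section, made up for illustration.
html = "<head><title>Demo</title><meta charset='utf-8'></head>"
soup = BeautifulSoup(html, "html.parser")

head_contents = soup.head.contents   # direct children of <head>
print(head_contents)

title_contents = soup.title.contents  # the text inside <title>
print(title_contents)
```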

The following expression returns the title present inside the head tag.

In[45] : x = soup.head.title

Out [45]: <title></title>

.string returns the string inside the title tag of Big Data Examiner. As Big Data Examiner doesn’t have a title, the value returned is None.
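The None result is easy to reproduce. This sketch compares a made-up page with an empty title against one with a real title (assuming bs4 with the "html.parser" backend):

```python
from bs4 import BeautifulSoup

# Two hypothetical pages: one with an empty <title>, one with text in it.
empty = BeautifulSoup("<html><head><title></title></head></html>", "html.parser")
named = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

x = empty.head.title
print(x)                        # the empty title tag itself
print(x.string)                 # nothing inside the tag, so None
print(named.head.title.string)  # the text inside the tag
```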



.descendants lets you iterate over all of a tag’s children, recursively.
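To see what "recursively" means here, this sketch compares .children with .descendants on a made-up head section (assuming bs4 with the "html.parser" backend). The text inside the title is a grandchild of head, so it shows up only in .descendants:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical head section, made up for illustration.
html = "<head><title>Demo</title><meta charset='utf-8'></head>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.head.children)        # direct children only
descendants = list(soup.head.descendants)  # children, grandchildren, ...

# Keep only the tags, skipping plain text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants))
print(tag_names)
```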



You can also look at the strings using the .strings generator.

In[56]: soup.get_text()

extracts all the text from Big Data Examiner.
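Here is a short sketch of .strings and get_text() side by side. The HTML is made up, and bs4 with the "html.parser" backend is assumed. The separator and strip arguments to get_text() are optional extras that the examples above do not use:

```python
from bs4 import BeautifulSoup

# Hypothetical page body, made up for illustration.
html = "<body><h1>Big</h1><p>Data <b>Examiner</b></p></body>"
soup = BeautifulSoup(html, "html.parser")

pieces = [str(s) for s in soup.strings]  # every text node, one by one
joined = soup.get_text()                 # the same text glued together
spaced = soup.get_text(" ", strip=True)  # stripped pieces joined by a space

print(pieces)
print(joined)
print(spaced)
```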


You can use find_all() to find all the ‘a’ tags on the page.

To get the first four ‘a’ tags, you can use the limit argument.
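A minimal sketch of find_all() with and without limit, using six made-up links (assuming bs4 with the "html.parser" backend):

```python
from bs4 import BeautifulSoup

# Six hypothetical links, generated for illustration.
html = "".join('<a href="/page{0}">Page {0}</a>'.format(i) for i in range(1, 7))
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")            # every <a> tag on the page
first_four = soup.find_all("a", limit=4)  # stop after four matches

print(len(all_links), len(first_four))
print([a["href"] for a in first_four])
```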


To find a particular text on a web page, you can use the text argument along with find_all(). Here I am searching for the term ‘data’ on Big Data Examiner.
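Here is a sketch of a text search on made-up HTML (assuming bs4 with the "html.parser" backend). One caveat: since Beautiful Soup 4.4.0 this argument is called string, with text being the older name, so the sketch uses string. A plain string matches only a text node that equals it exactly, which is why a regular expression is used to find the term anywhere inside a text node:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment, made up for illustration.
html = '<p>big data</p><p>small talk</p><a href="/x">data mining</a>'
soup = BeautifulSoup(html, "html.parser")

# Regular expression: any text node that contains "data".
containing = [str(s) for s in soup.find_all(string=re.compile("data"))]

# Plain string: only text nodes that are exactly "data".
exact = [str(s) for s in soup.find_all(string="data")]

print(containing)
print(exact)
```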

Get the attributes of the second ‘a’ tag on Big Data Examiner.
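A small sketch of reading attributes from the second ‘a’ tag. The two links are made up, and bs4 with the "html.parser" backend is assumed. Indexing starts at zero, so the second tag is at index 1:

```python
from bs4 import BeautifulSoup

# Two hypothetical links, made up for illustration.
html = '<a href="/home">Home</a><a href="/about" id="about-link">About</a>'
soup = BeautifulSoup(html, "html.parser")

second = soup.find_all("a")[1]  # index 1 is the second <a> tag

print(second.attrs)         # all attributes as a dictionary
print(second["href"])       # one attribute; raises KeyError if missing
print(second.get("title"))  # None instead of an error when missing
```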

You can also use a list comprehension to get the attributes of the first four ‘a’ tags on Big Data Examiner.
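The list comprehension could look like the sketch below. The links are made up, and bs4 with the "html.parser" backend is assumed. It combines find_all() with limit and collects either the whole attribute dictionary or just the href of each tag:

```python
from bs4 import BeautifulSoup

# Five hypothetical links, made up for illustration.
html = (
    '<a href="/home" title="Home">Home</a>'
    '<a href="/blog">Blog</a>'
    '<a href="/about" id="about">About</a>'
    '<a href="/contact">Contact</a>'
    '<a href="/extra">Extra</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionaries of the first four <a> tags.
attrs = [a.attrs for a in soup.find_all("a", limit=4)]

# Just the link targets of the same four tags.
hrefs = [a.get("href") for a in soup.find_all("a", limit=4)]

print(attrs)
print(hrefs)
```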


A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library for web scraping called Pattern. I also found a good tutorial on web scraping using Python.

Instead of taking the difficult path of building an in-house web scraping setup from scratch, you could trust PromptCloud’s web scraping service to take end-to-end ownership of your project.

Web scraping is not all about “coding” per se. You need to be adept in coding, internet protocols, database warehousing, service requests, code cleansing, converting unstructured data to structured data, and even some machine learning nowadays.
