
Web Scraping 101


If you want to pull large amounts of data from a website that doesn’t have an official and accessible API, this tutorial will provide you with all the information that you need.

Anything you can see on a webpage in written form, you should be able to pull off with minimal effort. There are, of course, a number of things developers can do to thwart such efforts, but we'll cover those later. If you're lucky, the page you want will link to a JSON or XML file with the data you need. These are well-formatted, downloadable and easy to parse, with little unwanted information or noise to filter out. Even if not, the HTML source of the page will have everything you need, and with slightly more effort (read: manual perusal of lots of lines of HTML) you'll be able to extract anything you require.
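For instance, if the page does link to a JSON file, a few lines are enough to download and parse it. This is a minimal sketch; the URL is a placeholder for whatever link you actually find:

import json
import urllib

# placeholder URL -- substitute the JSON link you found on the page
data = json.loads(urllib.urlopen("https://.../data.json").read())
for record in data:  # assuming the file holds a list of records
    print(record)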


Tools: a working knowledge of HTML, good [Python/Java/your programming language of choice] skills, and at least a few hours to spare for filtering out noise across iterations and reading through the HTML.


I would highly recommend Python as your go-to language, not just because of its packages (BeautifulSoup, lxml and a few less efficient ones) but also because of its ease of use. If you've never used Python before but know how to program, you will have no problem at all. If you have no programming knowledge, this will be harder, but you can learn. Let's begin web scraping.


We're going to begin with BeautifulSoup because, for the most part, you're not going to easily find XML and JSON files for what you need. BeautifulSoup is a Python package (I would recommend 2.7, by the way; the snippets below use Python 2 syntax) that can pull paragraphs, tables, titles and lists out of a page's HTML. It doesn't actually fetch the web page for you, which is why we use urllib in combination with BeautifulSoup. Here's what you have to do.


import urllib
from bs4 import BeautifulSoup

url_name = "https:// . . ."  # the page you want to scrape

# fetch the raw HTML, then hand it to BeautifulSoup (here with the html5lib parser)
page = urllib.urlopen(url_name).read()
soup_obj = BeautifulSoup(page, "html5lib")


For every package you use, always look at the documentation; you'll discover many additional things you can do, or could have done if you only read it after finishing your implementation. Here's the BeautifulSoup documentation, and here's the urllib documentation.

The soup object just created contains all the HTML of the URL fed into it.
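A quick sanity check that the fetch and the parse both worked is to print a small piece of the soup, for example the page title (assuming the page has a <title> tag):

# print the page's <title> as a quick sanity check
print(soup_obj.title.get_text())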

Now, if you have a working knowledge of HTML, you know that different types of content are placed within different tags. If we're interested only in a certain type of content, and if we're fortunate and it's all contained within a specific HTML tag, all we need to do is grab precisely that information. This is where the find_all method comes in.


items = soup_obj.find_all("div", class_="xyz")


The above statement gives us an iterable collection of objects that contain the information we want. We can run through the collection and extract the text we need at each point.


for item in items:
    print(item.a.get_text())  # .a grabs the first <a> inside the div; get_text() strips the tags


I think the most useful advice from here on is to actually inspect the source code of the web page you're interested in. Identify the elements you want to extract, the tags the information is contained in, and whether those tags vary from item to item. Once you know the tags, all you really have to do is a variant of the find_all call above to get the bs4 objects you need.
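For example, if the text you want sometimes sits inside a <span> and sometimes inside a <p>, find_all accepts a list of tag names, and you can guard against items that are missing the link entirely. The class name is a placeholder:

# grab every <span> or <p> carrying the (placeholder) class "xyz"
items = soup_obj.find_all(["span", "p"], class_="xyz")

for item in items:
    link = item.find("a")
    if link is not None:  # skip items with no <a> inside them
        print(link.get_text())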



Points to Note:


To save time, you should parallelize your program, while making sure your implementation isn't too hard on the server (keep the request rate reasonable). Try to make efficient use of processors and threads; that is, use a package like Python's multiprocessing to send requests in parallel.


import multiprocessing

n_cpus = multiprocessing.cpu_count()
page_list = ['www...', ]  # the URLs you want to scrape

# start one worker process per CPU; perform_extraction is your scraping function
for cpu in range(n_cpus):
    current = multiprocessing.Process(name=str(cpu), target=perform_extraction)
    current.start()
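If you would rather not manage the processes yourself, the same multiprocessing package offers a Pool, which maps a scraping function over your URL list and distributes the work for you. Here is a minimal sketch, where fetch_and_parse is a hypothetical helper that scrapes a single URL:

import urllib
from multiprocessing import Pool
from bs4 import BeautifulSoup

def fetch_and_parse(url):
    # hypothetical helper: fetch one URL, parse it, return the text we care about
    page = urllib.urlopen(url).read()
    soup = BeautifulSoup(page, "html5lib")
    return [item.a.get_text() for item in soup.find_all("div", class_="xyz")]

pool = Pool(processes=n_cpus)                    # n_cpus from multiprocessing.cpu_count()
results = pool.map(fetch_and_parse, page_list)   # one list of extracted strings per URL
pool.close()
pool.join()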



BeautifulSoup is easy to use and ideal for small projects. If you need to build a significantly larger database and want tooling that handles data at a larger scale, consider the following:

Scrapy, a powerful scraping framework in Python, if you want to build a web crawler that does more than extract a few specific elements (a rough sketch of a spider follows at the end of this list).

Selenium, a browser-automation framework that is useful when the data you need is rendered by JavaScript.

MySQL or a similar database backend to keep track of and store data.
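To give a feel for the first of these, here is a rough sketch of a minimal Scrapy spider; the spider name, start URL and CSS selector are placeholders, and DOWNLOAD_DELAY is set so the crawler stays gentle on the server:

import scrapy

class ItemSpider(scrapy.Spider):
    name = "items"                            # placeholder spider name
    start_urls = ["https:// . . ."]           # placeholder start page
    custom_settings = {"DOWNLOAD_DELAY": 1}   # pause between requests

    def parse(self, response):
        # same idea as find_all("div", class_="xyz"), written as a CSS selector
        for item in response.css("div.xyz"):
            yield {"text": item.css("a::text").extract_first()}

A standalone spider like this can be run with scrapy runspider and told to write the yielded items to a file, e.g. scrapy runspider my_spider.py -o items.json.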