
Web Scraping Overview

An overview of web scraping using Python.



Nishi Paul

2 years ago | 3 min read

Web Scraping

Web Scraping is a technique for extracting information from websites in an automated way. It converts unstructured data, or data in HTML format, into structured data that can be stored in a database or spreadsheet and presented in tabular form.
The language most commonly used for scraping, that is extracting information from the web, is Python.

There are two basic things to know before moving to web scraping -

  • Understanding the target data on the internet: the kind of data we need to extract from the web, for example eCommerce or sports data.
  • Listing the websites from which we can get the required data.

Now we are ready to scrape. But before we start, let's look at what a parser is.


Parser -

A parser is a tool used to interpret or render information from a web document. It receives input in the form of markup (the tags you would see in an HTML document) and outputs the web document as objects with their methods and attributes. The parser also validates the information before it is processed.

Parsing is a crucial step. If this step fails during web scraping, then the data cannot be extracted and stored through the subsequent processes.
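To make the idea concrete, here is a minimal sketch of a parser using only Python's built-in html.parser module, with no third-party libraries. The class name and the sample markup are invented for illustration:

```python
# A tiny parser that walks an HTML document and pulls out
# the text inside the <title> tag.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleExtractor()
parser.feed("<html><head><title>Demo Page</title></head><body></body></html>")
print(parser.title)   # Demo Page
```

If the markup were malformed beyond repair, the parser's callbacks would never fire for the tags we want, which is exactly the failure mode described above.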

The steps involved in Web Scraping are -

  • Step One
    Send a request to the target website from which you want to extract information.
  • Step Two
    The website responds with the page content, usually in HTML or XML format.
  • Step Three
    Parsing: the document is read and the required information is extracted. The response goes through the parser appropriate to its format: an HTML parser for HTML documents, an XML parser for XML documents.
  • Step Four
    The parsed data is stored in the required format, for example in a database, or passed on to other web applications.
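The four steps can be sketched in code. In this sketch the network step is replaced by a literal HTML string standing in for the response body, so it runs offline; the product markup is invented for illustration:

```python
# Step one, in a real scraper, would be a request such as:
#   import requests
#   html = requests.get("https://example.com/products").text  # hypothetical URL
# Step two: here we use a literal string as the "received" document.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2 class="product">Blue Kettle</h2>
  <h2 class="product">Red Teapot</h2>
</body></html>
"""

# Step three: parse the received HTML.
soup = BeautifulSoup(html, "html.parser")

# Step four: extract the data into a structured form (a list, ready
# to be written to a database or spreadsheet).
products = [h2.get_text(strip=True)
            for h2 in soup.find_all("h2", class_="product")]
print(products)   # ['Blue Kettle', 'Red Teapot']
```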

Python library used for Web Scraping -

BeautifulSoup

  • It is an easy-to-use, robust Python library widely used for web scraping.
  • It provides efficient tools for dissecting documents and extracting information from web pages.
  • It offers several sets of methods for navigating, searching, and modifying a parse tree (the parsed HTML document).
  • It supports parsers for both HTML and XML documents.
  • It automatically converts incoming documents to Unicode, which makes them easy to read and parse, and converts outgoing documents to UTF-8.

BeautifulSoup supports various parsers -

  • html.parser is Python's built-in parser: decent speed, reasonably lenient, and no extra dependency.
  • lxml's HTML parser is very fast and lenient, but depends on an external C library.
  • lxml's XML parser is the only supported XML parser; it is also very fast and depends on C.
  • html5lib is pure Python and extremely lenient (it parses pages the way a web browser does), but it is very slow.
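The parser is chosen by name when constructing the soup. A small sketch using the built-in parser follows; note that lxml and html5lib are third-party packages and must be installed separately (e.g. with pip) before they can be named here:

```python
from bs4 import BeautifulSoup

# A lenient parser tolerates malformed markup such as an unclosed tag.
doc = "<p>unclosed paragraph"

# "html.parser" ships with Python; "lxml", "lxml-xml", or "html5lib"
# could be passed instead once installed.
soup = BeautifulSoup(doc, "html.parser")
print(soup.p.get_text())   # unclosed paragraph
```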

Parsing data into Python Objects -

After passing through the HTML parser, the HTML document is transformed into a tree of objects: a hierarchical structure in which the html tag is the root node, the head and body tags are its child nodes, and so on. Every tag becomes a node in this tree. These objects are then used to extract data by searching or navigating through parts of the document. The relationships between the objects (which are basically tags) make it efficient to retrieve information.

After parsing, the HTML document becomes a complex tree of Python objects of four types -

  • Tag corresponds to an HTML tag in the document and carries many attributes and methods.
  • NavigableString holds the text contained within a tag.
  • BeautifulSoup represents the entire web document and supports navigating and searching the document tree.
  • Comment is a special type of NavigableString that represents the comments in a document.
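A short sketch can show all four types at once; the sample markup is invented for illustration:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup(
    "<p class='demo'>Hello <!-- a note --></p>", "html.parser"
)

p = soup.p               # a Tag
text = p.contents[0]     # a NavigableString: 'Hello '
note = p.contents[1]     # a Comment: ' a note '

print(type(soup).__name__)   # BeautifulSoup
print(type(p).__name__)      # Tag
print(type(text).__name__)   # NavigableString
print(type(note).__name__)   # Comment
```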

Demo on HTML document Scraping -

Let's create an HTML document, the same as you would while building web pages,

document = """
<html>
<head>
<title> This is about Scraping </title>
</head>
<body>
<em> <!-- A comment line to represent comment object --> </em>
<p title="First page" class="scrape"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body>
</html>
"""

#importing the scraping library

from bs4 import BeautifulSoup

#parsing the document using the html parser

soup_parsed = BeautifulSoup(document, 'html.parser')

soup_parsed

Output —
<html>
<head>
<title> This is about Scraping </title>
</head>
<body>
<em> <!-- A comment line to represent comment object --> </em>
<p class="scrape" title="First page"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body>
</html>

type(soup_parsed)

Output-
bs4.BeautifulSoup

#Let's get the paragraph tag

tag = soup_parsed.p

print(tag)

print(type(tag))

Output -
<p class="scrape" title="First page"> This is about learning scraping </p>
<class 'bs4.element.Tag'>

#Let's get the tag attributes

tag.attrs

Output -
{'title': 'First page', 'class': ['scrape']}

#Let's get the tag value

tag.string

Output -
' This is about learning scraping '

#Let's repeat the same for div

tag_div = soup_parsed.div

print(tag_div)

print(type(tag_div))

Output -
<div> Using <b> many tags </b> and sentences </div>
<class 'bs4.element.Tag'>

tag_div.b.string

Output -
' many tags '
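Beyond grabbing single tags by attribute access, BeautifulSoup can search the whole tree. A short sketch, continuing with the same document as above:

```python
from bs4 import BeautifulSoup

document = """
<html><head><title> This is about Scraping </title></head>
<body>
<p title="First page" class="scrape"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body></html>
"""
soup_parsed = BeautifulSoup(document, 'html.parser')

# find_all returns every matching tag, in document order.
for tag in soup_parsed.find_all(["p", "h3"]):
    print(tag.get_text(strip=True))
# This is about learning scraping
# From Kolkata

# CSS selectors work too: the tag with class "scrape".
print(soup_parsed.select_one("p.scrape")["title"])   # First page
```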

This was an overview of web scraping. In conclusion, we can turn loads of unstructured web data into structured data, and then store it or perform further operations on it at a later stage.



