
Web Scraping Overview

An overview of web scraping using Python.



Nishi Paul

2 years ago | 3 min read

Web Scraping

Web Scraping is a technique for extracting information from websites in an automated way. It converts unstructured data, or data in HTML format, into structured data that can be stored in a database or spreadsheet and presented in tabular form.
The language most commonly used for scraping, that is extracting information from the web, is Python.

There are two basic things to know before moving to web scraping -

  • Understanding the target data on the internet: the kind of data we need to extract from the web, for example eCommerce or sports data.
  • Listing the websites from which we can get the required data.

Now we are ready to scrape. But before we start, let's look at what a parser is.


Parser -

A parser is a tool used to interpret or render information from a web document. It receives input in the form of markup (the tags you would see in an HTML document) and outputs the web document as objects with their methods and attributes. The parser also validates the information before it is processed.

Parsing is a crucial step. If this step fails during web scraping, then the data cannot be extracted and stored through the subsequent processes.
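To make the idea concrete, here is a minimal sketch of a parser using only Python's built-in html.parser module, with no third-party libraries. The class name and the sample markup are invented for illustration:

```python
# A tiny parser that walks an HTML document and pulls out
# the text inside the <title> tag.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleExtractor()
parser.feed("<html><head><title>Demo Page</title></head><body></body></html>")
print(parser.title)   # Demo Page
```

If the markup were malformed beyond repair, the parser's callbacks would never fire for the tags we want, which is exactly the failure mode described above.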

The steps involved in Web Scraping are -

  • Step One
    Send a request to the target website from which you want to extract information.
  • Step Two
    The website responds with the page content, usually in HTML or XML format.
  • Step Three
    Parsing: the document is read and the required information is extracted. The response goes through the parser appropriate to its format: an HTML parser for HTML documents, an XML parser for XML documents.
  • Step Four
    The parsed data is stored in the required format, for example in a database, or passed on to other web applications.
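The four steps can be sketched in code. In this sketch the network step is replaced by a literal HTML string standing in for the response body, so it runs offline; the product markup is invented for illustration:

```python
# Step one, in a real scraper, would be a request such as:
#   import requests
#   html = requests.get("https://example.com/products").text  # hypothetical URL
# Step two: here we use a literal string as the "received" document.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2 class="product">Blue Kettle</h2>
  <h2 class="product">Red Teapot</h2>
</body></html>
"""

# Step three: parse the received HTML.
soup = BeautifulSoup(html, "html.parser")

# Step four: extract the data into a structured form (a list, ready
# to be written to a database or spreadsheet).
products = [h2.get_text(strip=True)
            for h2 in soup.find_all("h2", class_="product")]
print(products)   # ['Blue Kettle', 'Red Teapot']
```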

Python library used for Web Scraping -

BeautifulSoup

  • It is an easy-to-use, robust Python library widely used for web scraping.
  • It provides efficient tools for dissecting documents and extracting information from web pages.
  • It offers several sets of methods for navigating, searching, and modifying a parse tree (the parsed HTML document).
  • It supports parsers for both HTML and XML documents.
  • It automatically converts incoming documents to Unicode, which makes them easy to read and parse, and converts outgoing documents to UTF-8.

BeautifulSoup supports various parsers -

  • html.parser is Python's built-in parser: decent speed, reasonably lenient, and no extra dependency.
  • lxml's HTML parser is very fast and lenient, but depends on an external C library.
  • lxml's XML parser is the only supported XML parser; it is also very fast and depends on C.
  • html5lib is pure Python and extremely lenient (it parses pages the way a web browser does), but it is very slow.
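The parser is chosen by name when constructing the soup. A small sketch using the built-in parser follows; note that lxml and html5lib are third-party packages and must be installed separately (e.g. with pip) before they can be named here:

```python
from bs4 import BeautifulSoup

# A lenient parser tolerates malformed markup such as an unclosed tag.
doc = "<p>unclosed paragraph"

# "html.parser" ships with Python; "lxml", "lxml-xml", or "html5lib"
# could be passed instead once installed.
soup = BeautifulSoup(doc, "html.parser")
print(soup.p.get_text())   # unclosed paragraph
```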

Parsing data into Python Objects -

After passing through the HTML parser, the HTML document is transformed into a tree of objects: a hierarchical structure in which the html tag is the root node, the head and body tags are its child nodes, and so on. Every tag becomes a node in this tree. These objects are then used to extract data by searching or navigating through parts of the document. The relationships between the objects (which are basically tags) make it efficient to retrieve information.

After parsing, the HTML document becomes a complex tree of Python objects of four types -

  • Tag corresponds to an HTML tag in the document and carries many attributes and methods.
  • NavigableString holds the text contained within a tag.
  • BeautifulSoup represents the entire web document and supports navigating and searching the document tree.
  • Comment is a special type of NavigableString that represents the comments in a document.
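A short sketch can show all four types at once; the sample markup is invented for illustration:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup(
    "<p class='demo'>Hello <!-- a note --></p>", "html.parser"
)

p = soup.p               # a Tag
text = p.contents[0]     # a NavigableString: 'Hello '
note = p.contents[1]     # a Comment: ' a note '

print(type(soup).__name__)   # BeautifulSoup
print(type(p).__name__)      # Tag
print(type(text).__name__)   # NavigableString
print(type(note).__name__)   # Comment
```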

Demo on HTML document Scraping -

Let's create an HTML document, the same as you would while building web pages,

document = """
<html>
<head>
<title> This is about Scraping </title>
</head>
<body>
<em> <!-- A comment line to represent comment object --> </em>
<p title="First page" class="scrape"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body>
</html>
"""

#importing the scraping library

from bs4 import BeautifulSoup

#parsing the document using the html parser

soup_parsed = BeautifulSoup(document, 'html.parser')

soup_parsed

Output —
<html>
<head>
<title> This is about Scraping </title>
</head>
<body>
<em> <!-- A comment line to represent comment object --> </em>
<p class="scrape" title="First page"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body>
</html>

type(soup_parsed)

Output-
bs4.BeautifulSoup

#Let's get the paragraph tag

tag = soup_parsed.p

print(tag)

print(type(tag))

Output -
<p class="scrape" title="First page"> This is about learning scraping </p>
<class 'bs4.element.Tag'>

#Let's get the tag attributes

tag.attrs

Output -
{'title': 'First page', 'class': ['scrape']}

#Let's get the tag value

tag.string

Output -
' This is about learning scraping '

#Let's repeat the same for div

tag_div = soup_parsed.div

print(tag_div)

print(type(tag_div))

Output -
<div> Using <b> many tags </b> and sentences </div>
<class 'bs4.element.Tag'>

tag_div.b.string

Output -
' many tags '
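Beyond grabbing single tags by attribute access, BeautifulSoup can search the whole tree. A short sketch, continuing with the same document as above:

```python
from bs4 import BeautifulSoup

document = """
<html><head><title> This is about Scraping </title></head>
<body>
<p title="First page" class="scrape"> This is about learning scraping </p>
<div> Using <b> many tags </b> and sentences </div>
<h3> From Kolkata </h3>
</body></html>
"""
soup_parsed = BeautifulSoup(document, 'html.parser')

# find_all returns every matching tag, in document order.
for tag in soup_parsed.find_all(["p", "h3"]):
    print(tag.get_text(strip=True))
# This is about learning scraping
# From Kolkata

# CSS selectors work too: the tag with class "scrape".
print(soup_parsed.select_one("p.scrape")["title"])   # First page
```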

This was an overview of web scraping. In conclusion, we can turn loads of unstructured web data into structured data, and then store it or perform further operations on it at a later stage.



