How to Retrieve Web Pages with urllib

In this tutorial, we will discuss how to retrieve web pages with urllib. We can send and receive data over HTTP manually using the socket library, but then we must construct the request command by hand, send it, and parse the received data to strip off the header information.
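For contrast, here is a minimal sketch of that manual approach, assuming the data.pr4e.org server used later in this tutorial. Notice that we have to build the GET request ourselves, and the raw response we print still includes the HTTP headers.

import socket

# Open a TCP connection to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Construct the HTTP GET request by hand and send it
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Receive the response in chunks; headers and body arrive together
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()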

There is a simpler way to perform this task in Python by using the urllib library.

urllib treats a web page much like a file: you simply pass it the address of the page you want to retrieve, and it handles the rest, including all of the details of the HTTP protocol and the header information.

The following code fragment shows how to retrieve just the content of a web page (with the headers already stripped off) using the urllib library.

import urllib.request

# urlopen returns a file-like handle for the page
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    # Each line arrives as bytes; decode it and trim the newline
    print(line.decode().strip())

Output

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
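
Since the handle returned by urlopen can be looped over like a file, the example above is easy to extend. The following sketch counts how often each word appears in romeo.txt (the counts dictionary is just an illustrative name):

import urllib.request

counts = dict()
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    # Split each decoded line into words and tally them
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)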

Reading binary files using urllib

Sometimes you want to retrieve a non-text (binary) file such as an image or a video. The data in these files is generally not useful to print out, but you can easily copy the contents of a URL to a local file on your hard disk using urllib.

import urllib.request

# Read the entire image into memory as a bytes object
img = urllib.request.urlopen('http://hsit.ac.in/images/hit1.JPG').read()
# Write the bytes to a local file opened in binary mode
fhand = open('12345.jpg', 'wb')
fhand.write(img)
fhand.close()

Output

The image is saved to the local file 12345.jpg.
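
Reading the whole file with read() works for small files, but it keeps the entire file in memory at once. For larger files, a safer sketch (assuming the same URL) reads and writes the data in fixed-size blocks:

import urllib.request

img = urllib.request.urlopen('http://hsit.ac.in/images/hit1.JPG')
fhand = open('12345.jpg', 'wb')
size = 0
while True:
    # Read up to 100,000 bytes at a time
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'characters copied.')
fhand.close()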

Parsing HTML using regular expressions

Here is a simple web page for a demonstration.

Web page address: http://www.example.com/page1.htm

<h1>The First Page</h1>
<p> If you like, you can switch to the
<a href="http://www.example.com/page2.htm"> Second Page</a>.
<a href="http://www.example.com/page3.htm"> Third Page</a>. </p>
<h1>End of First Page</h1>

The following program reads the page and uses a regular expression to extract the href links:

import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Find every double-quoted http link in the raw bytes
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())

Output

Enter - http://www.example.com/page1.htm

http://www.example.com/page2.htm
http://www.example.com/page3.htm
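
Note that the pattern above only matches double-quoted links that start with http://. As a small variation (the pattern change is our own suggestion, not part of the original example), we can decode the page first and broaden the pattern to accept https links as well:

import urllib.request
import re

url = input('Enter - ')
# Decode the bytes so we can match with an ordinary string pattern
html = urllib.request.urlopen(url).read().decode()
# Match both http and https links inside double quotes
links = re.findall('href="(https?://.*?)"', html)
for link in links:
    print(link)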

Sample program to extract the text inside h1 tags

import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Capture the text between <h1> and </h1> tags
headings = re.findall(b'<h1>(.*?)</h1>', html)
for heading in headings:
    print(heading.decode())

Output

Enter - http://www.example.com/page1.htm

The First Page
End of First Page

Parsing HTML using BeautifulSoup

A number of Python libraries can be used to parse HTML and extract the required data from web pages. Each has its strengths and weaknesses, and you can pick one based on the requirements of your application.

Here we will use the BeautifulSoup library to parse HTML web pages and extract links.

BeautifulSoup tolerates highly flawed HTML web pages and still lets you easily extract the required data from the web page.

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the anchor tags and prints the href attribute of each one.

Output

Enter - http://www.example.com/page1.htm

http://www.example.com/page2.htm
http://www.example.com/page3.htm
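
On some systems, opening an https page this way fails with an SSL certificate error. A common workaround, sketched below, is to pass an unverified SSL context to urlopen; note that this disables certificate checking entirely, so it is only appropriate for testing:

import ssl
import urllib.request
from bs4 import BeautifulSoup

# Ignore SSL certificate errors (for testing only)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
for tag in soup('a'):
    print(tag.get('href', None))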

Program to extract the content of h1 tags – using BeautifulSoup

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every h1 tag; print just the text of each
for tag in soup.find_all('h1'):
    print(tag.text)

The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the h1 tags and prints the text content of each one.

Output

Enter - http://www.example.com/page1.htm

The First Page
End of First Page

Program to display tags, tag contents, and tag attributes – using BeautifulSoup

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the anchor tags and, for each one, prints the whole tag, its href attribute, its contents, and its attribute dictionary.

Output

Enter - http://www.example.com/page1.htm

TAG: <a href="http://www.example.com/page2.htm"> Second Page</a>
URL: http://www.example.com/page2.htm
Contents:  Second Page
Attrs: {'href': 'http://www.example.com/page2.htm'}

TAG: <a href="http://www.example.com/page3.htm"> Third Page</a>
URL: http://www.example.com/page3.htm
Contents:  Third Page
Attrs: {'href': 'http://www.example.com/page3.htm'}
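
BeautifulSoup can also strip out all of the markup at once. As a final sketch, get_text() returns just the visible text of the whole page:

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# get_text() concatenates the text of every tag in the document
print(soup.get_text())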

Summary:

This tutorial discussed how to retrieve web pages with urllib and how to parse them with regular expressions and BeautifulSoup. If you liked the tutorial, share it with your friends, like the Facebook page for regular updates, and subscribe to the YouTube channel for video tutorials.
