How to Retrieve a Web Page over HTTP in Python

 


Here we discuss how to retrieve web pages using the Hypertext Transfer Protocol (HTTP). Once a web page is retrieved, we read through its data and parse it. Programs of this kind are called networked programs in Python.

First, we look at the functions available in Python for writing networked programs over HTTP.


Networked Programs in Python

The socket library is used to retrieve web pages. We use four of its functions here: socket(), connect(), send() and recv().

The socket() function creates a socket. We specify the type of address (IPv4 or IPv6) and the kind of connection (connection-oriented or connectionless) by passing parameters to socket().
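As a rough illustration of these two parameters, the sketch below creates one socket of each kind. The variable names are chosen here for illustration only, and no host is contacted.

```python
import socket

# AF_INET selects IPv4 addressing (AF_INET6 would select IPv6).
# SOCK_STREAM selects a connection-oriented (TCP) socket.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# SOCK_DGRAM selects a connectionless (UDP) socket instead.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Record the settings of each socket before closing it.
tcp_family, tcp_kind = tcp_sock.family, tcp_sock.type
udp_family, udp_kind = udp_sock.family, udp_sock.type
print(tcp_family, tcp_kind)
print(udp_family, udp_kind)

# Creating a socket does not connect to anything; close it when done.
tcp_sock.close()
udp_sock.close()
```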

The connect() function establishes the connection to the host. We pass the hostname and the port number to connect() as a single (hostname, port) tuple.
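The sketch below shows the shape of a connect() call without needing internet access. It assumes a throwaway listening socket on the local machine (127.0.0.1) as a stand-in for a real web server; the listener and the 5-second timeout are illustrative choices, not part of the original example.

```python
import socket

# A throwaway listener on the loopback interface stands in for a web server.
# Port 0 asks the operating system to pick any free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
host, port = listener.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5)           # give up after 5 seconds instead of waiting forever
client.connect((host, port))   # note the single (host, port) tuple argument

# getpeername() reports the address the client is connected to.
peer = client.getpeername()
print("Connected to", peer)

client.close()
listener.close()
```

For a real web server, the tuple would hold the server's hostname and port 80, as in the programs in this tutorial.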

The send() function sends the GET request to the host. The request consists of the web page address and the HTTP version, followed by \r\n\r\n.
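To make the layout of the request concrete, here is a small sketch that assembles it as bytes. The helper name build_get_request is hypothetical and is used only for illustration.

```python
def build_get_request(url, version='HTTP/1.0'):
    """Return the bytes of a minimal GET request for the given URL."""
    request_line = 'GET ' + url + ' ' + version
    # The request ends with \r\n\r\n: the end of the request line
    # followed by an empty line.
    return (request_line + '\r\n\r\n').encode()

cmd = build_get_request('http://data.pr4e.org/romeo.txt')
print(cmd)
```

The resulting bytes can be passed directly to send().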

The recv() function receives the contents of the web page. Its parameter is the maximum number of bytes to receive in a single call.
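The behaviour of recv() can be tried without a network using socket.socketpair(), which returns two sockets that are already connected to each other. This is only a sketch: the 8-byte limit and the sample text are arbitrary choices for illustration.

```python
import socket

# socketpair() returns two already-connected sockets, so no server is needed.
left, right = socket.socketpair()

left.sendall(b'But soft what light')   # 19 bytes in total
left.close()                           # closing signals the end of the data

chunks = []
while True:
    data = right.recv(8)    # receive at most 8 bytes per call
    if len(data) < 1:       # an empty result means the sender has closed
        break
    chunks.append(data)
right.close()

print(chunks)
print(b''.join(chunks))
```

Each chunk holds at most 8 bytes, and a call may return fewer, which is why the data is collected in a loop until recv() returns an empty result.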

Another function, gethostbyname(), returns the IP address of a hostname.

import socket

ip = socket.gethostbyname('www.vtupulse.com')
print(ip)

Output

139.99.23.75

An example script to connect to a web server using socket programming in Python

import socket

try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("Socket successfully created")

    # default port for HTTP
    port = 80

    # connecting to the server
    s.connect(('www.vtupulse.com', port))

    print("The socket has successfully connected")
    s.close()
except OSError as err:
    # raised if the socket cannot be created or the connection fails
    print("socket creation or connection failed with error:", err)

Output

Socket successfully created
The socket has successfully connected

Python program to retrieve the web page over HTTP

First, import the socket library. Then create a socket using the socket() function, passing socket.AF_INET to indicate an IPv4 address and socket.SOCK_STREAM to indicate a connection-oriented service.

Use the connect() function to connect to the host on port 80. In this case, the host is ‘data.pr4e.org’.

Then, use the send() function to make the GET request by passing the full web address and HTTP version.

Use the recv() function to receive the data from the web server. Display the received data and finally close the connection.

import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

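# the request ends with a blank line
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

mysock.send(cmd)
while True:
    # receive at most 20 bytes per call
    data = mysock.recv(20)
    # an empty result means the server has closed the connection
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()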

Output

HTTP/1.1 200 OK
Date: Wed, 15 Apr 2020 07:43:21 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The output starts with the header information, which describes the document the web server has sent.

For example, the Content-Type header indicates that the document is a plain text document (text/plain). Similarly, Content-Length gives the length of the document in bytes (167 here), and so on.

After the header information, the web server sends a blank line that marks the end of the headers and then sends the actual data, in this case the content of the file romeo.txt.
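To see how individual header values could be read, the sketch below parses a shortened copy of the header block from the output above. The function name parse_headers is hypothetical, and the parsing is deliberately minimal (one header per line, names compared in lower case).

```python
# A shortened copy of the header block shown in the output above.
header_text = (
    'HTTP/1.1 200 OK\r\n'
    'Content-Length: 167\r\n'
    'Connection: close\r\n'
    'Content-Type: text/plain'
)

def parse_headers(text):
    """Split a header block into its status line and a dict of fields."""
    lines = text.split('\r\n')
    status_line = lines[0]
    fields = {}
    for line in lines[1:]:
        # Everything before the first colon is the name, the rest is the value.
        name, _, value = line.partition(':')
        fields[name.strip().lower()] = value.strip()
    return status_line, fields

status_line, fields = parse_headers(header_text)
print(status_line)
print(fields['content-type'])
print(int(fields['content-length']))
```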

How to extract only the content ignoring the header information

The first blank line indicates the end of the header information. Hence we need to find the first blank line and remove everything up to and including it.

In the received data, the blank line appears as the sequence \r\n\r\n.

Find the sequence \r\n\r\n and extract the data that follows it, which is the actual content.

The following fragment of code shows how to extract only the content of the web page over the HTTP protocol.

import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
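new_data = b''
mysock.send(cmd)
while True:
    data = mysock.recv(20)
    if len(data) < 1:
        break
    # collect all the chunks before searching for the blank line
    new_data = new_data + data
mysock.close()

# position of the blank line that ends the headers
pos = new_data.find(b'\r\n\r\n')

# skip the 4 separator bytes and print only the content
print(new_data[pos+4:].decode())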

Output

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
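One caveat: find() returns -1 when the sequence is not present, in which case slicing from pos+4 would silently start at index 3. A possible variation, sketched here with a hypothetical helper named split_response and a made-up sample response, uses partition() and reports the missing blank line explicitly.

```python
def split_response(raw):
    """Return (headers, body) from raw response bytes.

    Raises ValueError when the response contains no blank line.
    """
    headers, separator, body = raw.partition(b'\r\n\r\n')
    if not separator:
        raise ValueError('no blank line found in response')
    return headers, body

# A made-up response used only to exercise the function.
sample = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light'
headers, body = split_response(sample)
print(body.decode())
```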

Retrieving an image over HTTP

import socket

HOST = 'hsit.ac.in'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.send('GET http://hsit.ac.in/images/hit1.JPG HTTP/1.0\r\n\r\n'.encode())
# collect the whole response (headers and image bytes)
picture = b""
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()

# skip the headers; the image starts after the first blank line
pos = picture.find(b"\r\n\r\n")
picture = picture[pos+4:]

# write the image bytes in binary mode
with open("1234.jpg", "wb") as fhand:
    fhand.write(picture)

Output

The image is written to the file 1234.jpg.
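If the server answers with an error page instead of the image, the program above would still save those bytes as 1234.jpg. A simple sanity check, sketched here as an optional addition, relies on the fact that JPEG data begins with the two bytes FF D8. The helper name looks_like_jpeg and the sample bytes are made up for illustration.

```python
def looks_like_jpeg(data):
    """Return True if data starts with the JPEG start-of-image marker."""
    return data[:2] == b'\xff\xd8'

good = b'\xff\xd8\xff\xe0' + b'\x00' * 10   # made-up bytes, not a real image
bad = b'<html>404 Not Found</html>'

print(looks_like_jpeg(good))
print(looks_like_jpeg(bad))
```

Such a check could be applied to the extracted bytes before the file is written.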

Summary:

This tutorial discussed how to retrieve a web page over HTTP in Python using the socket library.
