How to Retrieve a Web Page over HTTP in Python – Networked Programs in Python
This tutorial shows how to retrieve web pages using the Hypertext Transfer Protocol (HTTP). Once a web page is retrieved, we read through its data and parse it. Programs of this kind are called networked programs in Python.
First, let us look at the functions Python provides for writing networked programs over HTTP.
The World’s Simplest Web Browser
The socket library is used to retrieve web pages. It provides the four functions we need: socket(), connect(), send(), and recv().
The socket() function creates a socket. We specify the address family (IPv4 or IPv6) and the kind of connection (connection-oriented or connectionless) by passing parameters to socket().
The connect() function establishes the connection to the host. We pass the hostname and the port number to connect() as a single tuple.
The send() function sends the GET request to the host. The request consists of the web page address and the HTTP version, followed by \r\n\r\n.
The recv() function receives the contents of the web page. Its parameter is the maximum number of bytes to receive in a single call.
Another function, gethostbyname(), returns the IP address of a hostname.
    import socket

    ip = socket.gethostbyname('www.vtupulse.com')
    print(ip)
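To see how send() and recv() behave without touching the network, the following sketch uses socket.socketpair(), which returns two sockets that are already connected to each other. The helper name recv_in_chunks and the sample message are illustrative assumptions, not part of the tutorial's programs. It shows that the argument to recv() is an upper limit on the bytes returned by one call, and that an empty result means the other side has closed the connection.

```python
import socket


def recv_in_chunks(message, chunk_size):
    """Send message through a local socket pair and read it back in chunks.

    Hypothetical helper for illustration only.
    """
    # socketpair() gives two already-connected sockets, so no server is needed.
    sender, receiver = socket.socketpair()
    chunks = []
    try:
        sender.sendall(message)
        sender.close()  # closing tells the receiver that no more data will come
        while True:
            data = receiver.recv(chunk_size)  # returns at most chunk_size bytes
            if not data:  # empty bytes object: the peer closed the connection
                break
            chunks.append(data)
    finally:
        sender.close()    # closing twice is harmless
        receiver.close()
    return chunks


if __name__ == "__main__":
    for chunk in recv_in_chunks(b"hello world", 4):
        print(chunk)
```

Joining the returned chunks gives back the original message, and no chunk is longer than the requested size.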
An example script to connect to a web server using socket programming in Python
    import socket

    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        print("Socket successfully created")

        # default port for HTTP
        port = 80

        # connecting to the server
        s.connect(('www.vtupulse.com', port))
        print("The socket has successfully connected")
        s.close()
    except OSError as err:
        # raised when the socket cannot be created or the connection fails
        print("socket creation or connection failed with error", err)
    Socket successfully created
    The socket has successfully connected
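Once the socket is connected, the next step is to send a GET request. As a rough sketch, the request text can be built by a small helper. The function name build_get_request is hypothetical, and unlike the programs in this tutorial, which put the full URL in the request line, this sketch sends only the path and adds a Host header, a form commonly used by HTTP clients.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request.

    Hypothetical helper: sends the path plus a Host header rather than
    the full URL used elsewhere in this tutorial.
    """
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, target, HTTP version
        f"Host: {host}",         # names the site being requested
        "",                      # empty line that ends the request headers
        "",
    ]
    # Every line of an HTTP request ends with \r\n, so joining with \r\n
    # leaves the request terminated by \r\n\r\n.
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_get_request("data.pr4e.org", "/romeo.txt"))
```

The returned bytes can be passed directly to send() on a connected socket.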
Python program to retrieve the web page over HTTP
First, import the socket library, then create a socket using the socket() function, passing socket.AF_INET to indicate an IPv4 address and socket.SOCK_STREAM to indicate a connection-oriented service.
Use the connect() function to connect to the host on port 80. In this case, the host is 'data.pr4e.org'.
Then, use the send() function to make the GET request by passing the full web address and HTTP version.
Use the recv() function in a loop to receive the data from the web server. Display the received data and finally close the connection.
    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    # read the response 20 bytes at a time until the server closes the connection
    while True:
        data = mysock.recv(20)
        if len(data) < 1:
            break
        print(data.decode(), end='')

    mysock.close()
    HTTP/1.1 200 OK
    Date: Wed, 15 Apr 2020 07:43:21 GMT
    Server: Apache/2.4.18 (Ubuntu)
    Last-Modified: Sat, 13 May 2017 11:22:22 GMT
    ETag: "a7-54f6609245537"
    Accept-Ranges: bytes
    Content-Length: 167
    Cache-Control: max-age=0, no-cache, no-store, must-revalidate
    Pragma: no-cache
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Connection: close
    Content-Type: text/plain

    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief
The output starts with header information, which describes the document the web server has sent.
For example, the Content-Type header indicates that the document is plain text (text/plain). Similarly, Content-Length gives the length of the document body in bytes, and so on.
After sending the headers, the web server adds a blank line to mark the end of the header information and then sends the actual data, in this case the contents of the file romeo.txt.
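The header layout described above can be turned into a dictionary with a few lines of code. The following is a minimal sketch, not a full HTTP parser: the function name parse_headers and the shortened sample response are assumptions made for illustration, and header names are lowercased so that lookups do not depend on how the server capitalizes them.

```python
def parse_headers(raw_response):
    """Split a raw HTTP response into (status_line, headers_dict).

    Hypothetical helper for illustration; it ignores the body.
    """
    # Everything before the first blank line (\r\n\r\n) is header information.
    head, _, _ = raw_response.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]  # e.g. "HTTP/1.1 200 OK"
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # skip anything that is not a "Name: value" line
            headers[name.strip().lower()] = value.strip()
    return status_line, headers


# Shortened sample response, modeled on the output shown above.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Length: 167\r\n"
          b"Content-Type: text/plain\r\n"
          b"\r\n"
          b"But soft what light")

status, headers = parse_headers(sample)
print(status)                    # HTTP/1.1 200 OK
print(headers["content-type"])   # text/plain
```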
How to extract only the content ignoring the header information
The first blank line marks the end of the header information. Hence we need to find the first blank line and discard everything up to and including it.
In the raw response, the sequence \r\n\r\n (the end of the last header line followed by an empty line) is the equivalent of that blank line.
Find the sequence \r\n\r\n and extract the data after it, which is the actual content.
The following fragment of code shows how to extract only the content of the web page over the HTTP protocol.
    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    new_data = b''
    mysock.send(cmd)

    # accumulate the whole response before looking for the blank line
    while True:
        data = mysock.recv(20)
        if len(data) < 1:
            break
        new_data = new_data + data

    # the content starts 4 bytes after the start of \r\n\r\n
    pos = new_data.find(b'\r\n\r\n')
    print(new_data[pos+4:].decode())
    mysock.close()
    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief
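One detail of the program above is worth noting: find() returns -1 when the sequence is not present, so pos+4 becomes 3 and the slice silently drops the first three bytes of the response. A slightly more defensive sketch uses bytes.partition() and reports the problem instead. The function name split_response is a hypothetical choice for this example.

```python
def split_response(raw):
    """Return (header_bytes, body_bytes) from a raw HTTP response.

    Hypothetical helper: raises ValueError if no blank line is found.
    """
    # partition() splits at the first occurrence of the separator and
    # returns an empty separator if it does not occur at all.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found; the headers are incomplete")
    return head, body


if __name__ == "__main__":
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light"
    head, body = split_response(raw)
    print(body.decode())   # But soft what light
```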
Retrieving an image over HTTP
    import socket

    HOST = 'hsit.ac.in'
    PORT = 80

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((HOST, PORT))
    mysock.send('GET http://hsit.ac.in/images/hit1.JPG HTTP/1.0\r\n\r\n'.encode())

    # accumulate the whole response (headers followed by the image bytes)
    picture = b""
    while True:
        data = mysock.recv(5120)
        if len(data) < 1:
            break
        picture = picture + data
    mysock.close()

    # skip the headers: the image starts after the first \r\n\r\n
    pos = picture.find(b"\r\n\r\n")
    picture = picture[pos+4:]

    # write the image bytes in binary mode
    fhand = open("1234.jpg", "wb")
    fhand.write(picture)
    fhand.close()
The image will be written to the file 1234.jpg.
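If the server answers with an error page instead of the picture, the program above will still write those bytes to 1234.jpg. As a hedged sketch, a quick sanity check can be made before saving: JPEG files begin with the two bytes FF D8, and the body length can be compared with the Content-Length header when one is available. The function name validate_jpeg_body and its parameters are assumptions for illustration.

```python
JPEG_START = b"\xff\xd8"  # every JPEG file begins with these two bytes


def validate_jpeg_body(body, content_length=None):
    """Return True if body looks like a complete JPEG image.

    Hypothetical helper: content_length is the value of the
    Content-Length header (a string or an int), if the server sent one.
    """
    if content_length is not None and len(body) != int(content_length):
        return False  # the download is truncated or has extra bytes
    return body.startswith(JPEG_START)


if __name__ == "__main__":
    print(validate_jpeg_body(b"\xff\xd8\xff\xe0data", "8"))   # True
    print(validate_jpeg_body(b"<html>Not Found</html>"))      # False
```

Calling the check on the extracted body before opening the output file avoids saving an HTML error page under an image file name.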