free energy -- You and HTTP

Contact wefatherley at gmail dot com.


Hypertext Transfer Protocol (HTTP) is a collection of specifications that facilitates a huge swath of web-based communication. The official documentation for HTTP is laid out by the Internet Engineering Task Force here. While these documents are quite informative as a reference, the author can attest that reading them serially with no prior knowledge of HTTP will just leave you tired and probably a little hungry. To that end, we venture here mainly to

  • introduce several key aspects of HTTP,
  • and implement an HTTP client with Python

in a way that is accessible for most any reader.

What is HTTP?

HTTP is simply a format for sending messages. In terms of the OSI stack it lives in the application layer, directly above the transport layer, which carries the bytes of each HTTP message. HTTP messages always have textual parts that express the protocol's client-server paradigm, and they can also carry message data, such as an HTML file, an image file, and so forth. The client sends a message, the request, which may or may not contain message data, but always contains lines of text that tell the server how to handle that request. Here is the general format of the client request:

METHOD URL VERSION
header1: xxx
header2: xxx
header3: xxx
...
headerN: xxx
[data bytes, or not]

As can be seen, an HTTP message is essentially just a few lines of text, possibly followed by data. The first line always specifies a method (such as GET, POST, or DELETE), the "path" or URL of the resource being requested (such as /index.html, /, etc.), and the HTTP version. The remaining lines in the example above are HTTP headers, which describe the request and place constraints on the response returned by the server. Consider the following dict, a key-value mapping of headers that a client could send:


In [1]:
REQUEST_HEADERS = {
    "accept": "text/html, application/xhtml+xml, "
              "application/xml;q=0.9, */*;q=0.8",
    "accept-language": "en",
    "connection": "keep-alive",
    "keep-alive": "timeout=10, max=10",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/78.0.3904.97 "
                  "Safari/537.36"
}

The accept header is read like: "this request would most prefer HTML/XHTML+XML data to interpret, and with least priority will accept any data at all (the anything MIME type is */*)." accept-language asks for English-language content. connection is set to keep-alive to ask the server to keep the underlying connection open. keep-alive asks the server to reset/close the connection after ten idle seconds or ten requests. user-agent identifies the software issuing the request; the string above is what a Chrome browser on Windows sends (the other browser names in it are legacy compatibility tokens). The server's response carries similar headers, such as content-type, which is usually one of the request's accept choices. The best place to learn more about headers is Mozilla's MDN Web Docs.
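To make the mapping from that dict to on-the-wire text concrete, here is a small sketch. render_request is just an illustrative helper (not part of any library) that assembles the request line and header lines described above:


In [ ]:
def render_request(method, url, headers, version="HTTP/1.1"):
    """Assemble the textual form of an HTTP request that has no body."""
    # request line first, then one "name: value" line per header,
    # then a blank line to terminate the header section
    lines = [f"{method} {url} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"

print(render_request("GET", "/", REQUEST_HEADERS))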

The server's response generally looks like this:

VERSION STATUS REASON
header1: xxx
header2: xxx
header3: xxx
...
headerN: xxx
[data bytes, or not]

whereby on the first line VERSION is the HTTP version in use (e.g. HTTP/1.1), STATUS is an integer code summarizing what happened on the server's end (this integer makes inequality operators like >= useful for rapidly handling responses from one or many servers), and REASON is a short textual description of that status.
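
As a quick illustration of why integer status codes are convenient, here is a hypothetical helper that buckets a code into the broad classes the protocol defines:


In [ ]:
def classify_status(status):
    """Map a numeric status code to its broad protocol class."""
    if 100 <= status < 200:
        return "informational"
    elif 200 <= status < 300:
        return "success"
    elif 300 <= status < 400:
        return "redirection"
    elif 400 <= status < 500:
        return "client error"
    elif 500 <= status < 600:
        return "server error"
    return "unknown"

print(classify_status(200), classify_status(301), classify_status(404))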

At this point, hopefully it's clear that HTTP is simply an architecture of rules and procedures for communicating: a special way to write and parse messages and nothing more. In fact, you and a human acquaintance could communicate with HTTP over paper and pen. Since it isn't all that difficult or mysterious, let's have a look at some examples.
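
To drive home how plain the protocol is, here is a minimal sketch that hand-writes a request over a TLS-wrapped TCP socket using only the standard library's socket and ssl modules (www.python.org is used as the example host, and only the first chunk of the reply is read):


In [ ]:
import socket, ssl

HOST = "www.python.org"

# open a TCP connection and wrap it in TLS, since the server expects HTTPS
context = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        # the request is literally just lines of text separated by CRLF
        request = (
            f"GET / HTTP/1.1\r\n"
            f"host: {HOST}\r\n"
            f"connection: close\r\n"
            f"\r\n"
        )
        tls.sendall(request.encode("ascii"))
        # read and show the beginning of the raw response
        print(tls.recv(4096).decode("latin-1")[:300])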

A simple client

Let's first be a client and fetch the / resource from www.python.org using Python's http package. It provides the client module, which contains a nice connection object that we can make basic requests with. Fetching the HTML of a website can be as simple as:


In [2]:
# import the client module under a short alias
import http.client as httpc

# instantiate the connection object
conn = httpc.HTTPSConnection("www.python.org")

# "ask" server for data
conn.request("GET", "/")

# get the data
response = conn.getresponse()

# print the status line fields, the headers, and the start of the body
if response.status == 200:
    print(response.status, response.reason, response.version, "\n")
    print(response.headers)
    print(response.read().decode("latin-1")[:500])
    print("\n\n...")
200 OK 11 

Connection: keep-alive
Content-Length: 49148
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur
Via: 1.1 varnish
Accept-Ranges: bytes
Date: Thu, 09 Jul 2020 11:04:20 GMT
Via: 1.1 varnish
Age: 1809
X-Served-By: cache-bwi5120-BWI, cache-dca17757-DCA
X-Cache: HIT, HIT
X-Cache-Hits: 1, 2
X-Timer: S1594292661.858829,VS0,VE0
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains


<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu


...


Here we are using the secure version of the HTTP connection object because most servers tend to prefer this (which they let you know with headers such as the strict-transport-security seen above). You can think of instantiating httpc.HTTPSConnection as roughly equivalent to writing out this much of an HTTP message:

METHOD URL HTTP/1.1
host: www.python.org
accept-encoding: identity

Calling conn's request method as above will then send this message to python.org's server:

GET / HTTP/1.1
host: www.python.org
accept-encoding: identity
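
If you'd like to verify that this is really what goes over the wire, http.client connections have a debug switch that echoes the outgoing request line and headers (and the incoming status line and headers) to stdout; a quick sketch:


In [ ]:
# debuglevel=1 makes http.client print "send: ..." for the outgoing
# request and "reply:"/"header:" lines for the incoming response
debug_conn = httpc.HTTPSConnection("www.python.org")
debug_conn.set_debuglevel(1)
debug_conn.request("GET", "/")
debug_conn.getresponse().read()
debug_conn.close()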

Calling conn's getresponse method handles all the HTTP message parsing and returns a response object that contains a headers mapping and status and reason attributes, and is readable like a file to get the response data. Note in the above example that the connection object automatically sent two headers to the server, host and accept-encoding. This behavior can be overridden, and the connection object also has some features that facilitate a more fine-grained request, such as when custom headers need to be included:


In [3]:
# update our request headers dict
REQUEST_HEADERS = {
    **REQUEST_HEADERS,
    **{"host": "www.python.org", "accept-encoding": "identity"}
}

# instantiate the connection object
conn = httpc.HTTPSConnection("www.python.org")

# specify a method and resource, but skip auto-writing headers
conn.putrequest("GET", "/", skip_host=True, skip_accept_encoding=True)

# write the headers ourselves
for header, headerval in REQUEST_HEADERS.items():
    conn.putheader(header, headerval)
conn.endheaders()

# get the data
response = conn.getresponse()

# print the status line fields, the headers, and the start of the body
if response.status == 200:
    print(response.status, response.reason, response.version, "\n")
    print(response.headers)
    print(response.read().decode("latin-1")[:500])
    print("\n\n...")
200 OK 11 

Connection: keep-alive
Content-Length: 49148
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur
Via: 1.1 varnish
Accept-Ranges: bytes
Date: Thu, 09 Jul 2020 11:04:20 GMT
Via: 1.1 varnish
Age: 1809
X-Served-By: cache-bwi5120-BWI, cache-dca17760-DCA
X-Cache: HIT, HIT
X-Cache-Hits: 1, 1
X-Timer: S1594292661.951452,VS0,VE1
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains


<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu


...


That's a lot of control, and probably not necessary for most folks. The requests library boils all of this down to one call:


In [4]:
import requests

response = requests.get("https://www.python.org/")

if response.status_code == 200:
    print(response.status_code, response.reason, response.raw.version, "\n")
    for k,v in response.headers.items(): print(k + ":", v)
    print("\n\n", response.text[:500])
    print("\n\n...")
200 OK 11 

Connection: keep-alive
Content-Length: 49148
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Date: Thu, 09 Jul 2020 11:04:21 GMT
Age: 1809
X-Served-By: cache-bwi5120-BWI, cache-dca17745-DCA
X-Cache: HIT, HIT
X-Cache-Hits: 1, 1
X-Timer: S1594292661.169303,VS0,VE1
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains


 <!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu


...


That's super easy. It's also very sophisticated "under the hood". In the http.client examples above, we make a single request on a single connection, and if the status is 200 (a.k.a. OK, meaning the request was fulfilled) we print the response exactly once. Said differently, there is no support for handling redirection, the process whereby a server responds with a status >= 300 and < 400 plus a location header specifying the URL that the client should try instead. As it turns out, requests.get follows redirects automatically. As an aside, there is a rich collection of synchronous and asynchronous HTTP client libraries for Python, e.g. aiohttp and the built-in urllib.request.
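
To make the difference concrete, here is a rough sketch of the loop a client has to run itself when redirects matter and only http.client is available; follow_redirects and its max_hops limit are our own illustrative choices, not library features:


In [ ]:
from urllib.parse import urlsplit

def follow_redirects(host, path, max_hops=5):
    """Fetch a resource over HTTPS, manually chasing 3xx location headers."""
    for _ in range(max_hops):
        conn = httpc.HTTPSConnection(host)
        conn.request("GET", path)
        response = conn.getresponse()
        if 300 <= response.status < 400:
            # the server points the client somewhere else via location
            target = urlsplit(response.headers["location"])
            host, path = target.netloc or host, target.path or "/"
            response.read()
            conn.close()
            continue
        return response
    raise RuntimeError("too many redirects")

# requests.get runs an equivalent loop for us and records the
# intermediate responses in response.history
print(follow_redirects("www.python.org", "/").status)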