{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [free energy](https://u-ts.net/) -- You and HTTP\n", "\n", "Contact `wefatherley at gmail dot com`.\n", "
\n", "\n", "The Hypertext Transfer Protocol (HTTP) is a collection of specifications that facilitates a huge swath of web-based communication. The official documentation for HTTP is laid out by the Internet Engineering Task Force [here](https://tools.ietf.org/html/rfc7230). While these documents are quite informative as a reference, the author knows that reading them serially with no prior knowledge of HTTP will just leave you tired and probably a little hungry. To that end, we venture here mainly to\n", "\n", " * introduce several key aspects of HTTP,\n", " * and implement an HTTP client with Python\n", " \n", "in a way that is accessible to almost any reader.\n", "\n", "## What is HTTP?\n", "HTTP is simply a format for sending messages, and in terms of the [OSI stack](https://en.wikipedia.org/wiki/OSI_model) it is implemented in the application layer, above the transport layer. The transport layer delivers the bytes that make up an HTTP message. HTTP messages always have textual parts that characterize the protocol's client-server paradigm, but can also contain _message data_, such as an HTML file, an image file, and so forth. The client sends a message, the _request_, which may or may not contain message data, but always contains lines of text that specify how the server should alter its behavior to handle that request. Here is the general format of the client request:\n", "\n", "> __METHOD URL VERSION
header1: xxx
header2: xxx
header3: xxx
...
headerN: xxx
\\[data bytes, or not\\]__\n", "\n", "As can be seen, an HTTP message is essentially just a few lines of text, possibly followed (after a blank line) by a final block of data. The first line always specifies a method (such as `GET`, `POST`, or `DELETE`), the \"path\" or URL of the resource being requested (such as `/index.html` or `/`), and the HTTP version. The remaining lines in the example above are _HTTP headers_, which describe the request and state the client's preferences for the response returned by the server. Consider the following `dict`, a key-value mapping of headers that a client could send:\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "REQUEST_HEADERS = {\n", " \"accept\": \"text/html, application/xhtml+xml, \" \\\n", " + \"application/xml;q=0.9, */*;q=0.8\",\n", " \"accept-langauge\": \"en\",\n", " \"connection\": \"keep-alive\",\n", " \"keep-alive\": \"timeout=10, max=10\",\n", " \"user-ugent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) \" \\\n", " + \"AppleWebKit/537.36 (KHTML, like Gecko) \" \\\n", " + \"Chrome/78.0.3904.97 \" \\\n", " + \"Safari/537.36\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "The `accept` header reads like: \"this request would most like HTML or XHTML data to interpret, then XML (`q=0.9`), and with least priority any data at all (`q=0.8`; the anything MIME type is `*/*`).\" `accept-language` asks for content in English. `connection` is set to `keep-alive` to ask the server to keep the connection open. `keep-alive` asks the server to close the connection once it has sat idle for ten seconds or has served ten requests. `user-agent` identifies the software issuing the request; the value here is the string sent by Chrome 78 on Windows, which also names other browser engines for compatibility. The server's response carries similar headers, such as `content-type`, which is usually one of the request's `accept` choices. A good place to learn more about headers is [Mozilla](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers).\n", "\n", "The server's response generally looks like this:\n", "\n", "> __VERSION STATUS REASON
header1: xxx
header2: xxx
header3: xxx
...
headerN: xxx
\\[data bytes, or not\\]__\n", "\n", "whereby on the first line `VERSION` is the version of HTTP being used (such as `HTTP/1.1`), `STATUS` is a three-digit integer code, and `REASON` is a short description of what happened on the server's end. Because status codes are grouped by range (2xx for success, 3xx for redirection, 4xx for client errors, 5xx for server errors), inequality operators like `>=` are useful for rapidly handling responses from one or many servers.\n", "\n", "At this point, hopefully it's clear that HTTP is simply a set of rules and procedures for communicating: a special way to write and parse messages and nothing more. In fact, you and a human acquaintance could communicate with HTTP using pen and paper. Since none of this is all that difficult or mysterious, let's have a look at some examples.\n", "\n", "## A simple client\n", "Let's first be a client, and fetch the `/` resource from `www.python.org` using Python's `http` library. This package provides the `client` module, which contains a nice _connection object_ that we can make basic requests with. Fetching the HTML of a website can be as simple as:\n", "
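A minimal sketch of such a fetch follows; it is not necessarily the exact code behind the output shown in this notebook. So that it runs without internet access, it talks to a throwaway local server built with `http.server` over plain HTTP. The names `fetch` and `HelloHandler` are made up for illustration, and swapping the connection for `httpc.HTTPSConnection('www.python.org')` would send the same request to the real site.

```python
import http.client as httpc
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch(conn, path='/'):
    # Send a GET request on an open connection object and collect the reply.
    conn.request('GET', path)
    resp = conn.getresponse()
    return resp.status, resp.reason, resp.headers, resp.read()


class HelloHandler(BaseHTTPRequestHandler):
    # Hypothetical stand-in for a web server, so the sketch runs offline.
    def do_GET(self):
        payload = b'<html><body>hello</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# For the real site: conn = httpc.HTTPSConnection('www.python.org')
conn = httpc.HTTPConnection('127.0.0.1', server.server_port)
status, reason, headers, body = fetch(conn)
conn.close()
server.shutdown()
server.server_close()

if 200 <= status < 300:  # any 2xx status counts as success
    print(status, reason)
    print(headers['Content-Type'])
    print(body.decode('utf-8'))
```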
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "200 OK 11 \n", "\n", "Server: nginx\n", "Content-Type: text/html; charset=utf-8\n", "X-Frame-Options: DENY\n", "Via: 1.1 vegur\n", "Via: 1.1 varnish\n", "Content-Length: 48896\n", "Accept-Ranges: bytes\n", "Date: Wed, 26 Feb 2020 18:59:10 GMT\n", "Via: 1.1 varnish\n", "Age: 2594\n", "Connection: keep-alive\n", "X-Served-By: cache-bwi5142-BWI, cache-bos4638-BOS\n", "X-Cache: HIT, HIT\n", "X-Cache-Hits: 3, 5\n", "X-Timer: S1582743550.249671,VS0,VE0\n", "Vary: Cookie\n", "Strict-Transport-Security: max-age=63072000; includeSubDomains\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "Here we are using a secure version of the HTTP connection object because most severs tend to prefer this (which they let you know with headers). You can think of instantiating `httpc.HTTPSConnection` as equivalent to writing out this much of a HTTP message:\n", " \n", "> __METHOD URL http/1.1
host: www.python.org
accept-encoding: identity__\n", "\n", "Calling `conn`'s `request` method as above will then send this message to the `www.python.org` server:\n", "\n", "> __GET / HTTP/1.1
host: www.python.org
accept-encoding: identity__\n", "\n", "Calling `conn`'s `getresponse` method handles all the HTTP message parsing and returns a response object that holds a `dict`-like mapping of headers, has `status` and `reason` attributes, and is readable like a file to get the response data. Note in the example above that the connection object automatically sent two headers to the server, `host` and `accept-encoding`. This behavior can be overridden, and the connection object also has some features that facilitate a more fine-grained request, such as when custom headers need to be included:\n", "
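One way to get that finer control, sketched here under a few assumptions, is to build the request step by step with the connection object's `putrequest`, `putheader`, and `endheaders` methods. This is not necessarily the code behind the output shown in this notebook. To stay runnable offline it uses a made-up local server, `EchoHeadersHandler`, that simply reports back the headers it received, and the `user-agent` value is invented for the demo.

```python
import http.client as httpc
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    # Hypothetical stand-in server: replies with the request headers it saw.
    def do_GET(self):
        seen = {name.lower(): value for name, value in self.headers.items()}
        payload = json.dumps(seen).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), EchoHeadersHandler)  # any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = httpc.HTTPConnection('127.0.0.1', server.server_port)
# The skip_* flags suppress the two headers that are otherwise sent automatically.
conn.putrequest('GET', '/', skip_host=True, skip_accept_encoding=True)
conn.putheader('host', '127.0.0.1')
conn.putheader('accept', 'text/html')
conn.putheader('user-agent', 'free-energy-demo/0.1')  # made-up client name
conn.endheaders()  # finishes the header section; this request has no data

resp = conn.getresponse()
seen = json.loads(resp.read())
conn.close()
server.shutdown()
server.server_close()

for name, value in seen.items():
    print(name + ': ' + value)
```

Only the three headers added with `putheader` reach the server, and `accept-encoding` is absent because of `skip_accept_encoding=True`.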
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "200 OK 11 \n", "\n", "Server: nginx\n", "Content-Type: text/html; charset=utf-8\n", "X-Frame-Options: DENY\n", "Via: 1.1 vegur\n", "Via: 1.1 varnish\n", "Content-Length: 48896\n", "Accept-Ranges: bytes\n", "Date: Wed, 26 Feb 2020 18:59:36 GMT\n", "Via: 1.1 varnish\n", "Age: 2620\n", "Connection: keep-alive\n", "X-Served-By: cache-bwi5142-BWI, cache-bos4647-BOS\n", "X-Cache: HIT, HIT\n", "X-Cache-Hits: 3, 5\n", "X-Timer: S1582743576.180865,VS0,VE0\n", "Vary: Cookie\n", "Strict-Transport-Security: max-age=63072000; includeSubDomains\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "That's a lot of control, and probably not necessary for most folks. The `requests` library boils all of this down a request to one call:\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "200 OK 11 \n", "\n", "Server: nginx\n", "Content-Type: text/html; charset=utf-8\n", "X-Frame-Options: DENY\n", "Via: 1.1 vegur, 1.1 varnish, 1.1 varnish\n", "Content-Length: 48896\n", "Accept-Ranges: bytes\n", "Date: Wed, 26 Feb 2020 19:00:32 GMT\n", "Age: 2677\n", "Connection: keep-alive\n", "X-Served-By: cache-bwi5142-BWI, cache-bos4626-BOS\n", "X-Cache: HIT, HIT\n", "X-Cache-Hits: 3, 3\n", "X-Timer: S1582743632.489480,VS0,VE0\n", "Vary: Cookie\n", "Strict-Transport-Security: max-age=63072000; includeSubDomains\n", "\n", "\n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "That's super easy. It's also very sophisticated \"under the hood\". You see, in the http.client example above, we make a single request on a single connection, and if the status is 200 (a.k.a. OK, meaning the request was fulfilled), then print the response just one time. Said differently, there is no support for handling _redirection_, the process where a server returns with status >= 300 and < 400, and a _location_ HTTP header specifying the URL that the client should try instead. As it turns out, `requests.get` will try following redirects automatically. As an aside, there is a rich collection of concurrent/sequential HTTP client libraries for Python, e.g. aiohttp and builtin urllib.request." ] } ], "metadata": { "free-energy": { "author": "will", "title": "You and HTTP" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }