The WebSocket protocol is a cool and easy way to push updates to a website without excessive polling from clients. I’ve been diving into the tech recently, building an RSS reader that uses sockets instead of regular HTTP calls in order to learn more about how it all works.

At first I intended to just jump in and do it with Flask, but it turned out Flask doesn’t play nicely with WebSockets. Fortunately for me there’s an alternative known as Quart. Quart works well with Python’s async features and, as a bonus, feels very “flasky”.

In this post I’m going to describe a simple prototype server for URL scraping. To start off I make a global asyncio.Queue where scraped URLs are stored for distribution to connected clients. Rather than building a full-fledged client, I’ll just connect to my mini server via curl, which as a side effect also nicely showcases some of the WebSocket protocol’s internals.
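Before the server code, it helps to see the asyncio.Queue mechanics the handler below relies on: get_nowait either returns an item immediately or raises QueueEmpty. A minimal standalone sketch (the queue name matches the post; the payload here is purely illustrative):

```python
import asyncio
import json

# global queue, as in the server code below
queue = asyncio.Queue()

async def demo():
    try:
        queue.get_nowait()              # empty queue raises immediately
    except asyncio.QueueEmpty:
        pass                            # this is the case the server polls for
    await queue.put({"url": "http://example.com", "links": []})
    return json.dumps(queue.get_nowait())   # non-blocking fetch succeeds now

print(asyncio.run(demo()))
```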

To make a simple socket server with Quart, just annotate an async function with @app.websocket(&lt;path&gt;):

@app.websocket('/ws')  # the route path is my choice here
async def notify_scraped_urls():
    try:
        while True:
            try:
                data = queue.get_nowait()
                serialized_data = json.dumps(data)
                await websocket.send(serialized_data)
            except asyncio.QueueEmpty:
                # nothing to send yet; yield control to the event loop
                await asyncio.sleep(1)
    except asyncio.CancelledError:
        logger.info("Stopping processing loop due to client cancellation")
        raise  # let the cancellation propagate

Every time a client connects, it essentially gets its own infinite loop on the server side until either the server gets bored and disconnects or the client does. When the server detects a client disconnection, asyncio.CancelledError is raised and the loop terminates for that particular client.

Data for connected clients is simply fetched from the global queue, serialized and sent with await websocket.send. Care must be taken when using async, as coroutines must explicitly yield control (here via await asyncio.sleep(1)) so other coroutines get a chance to run.
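That “explicitly yield control” point can be demonstrated in isolation: a coroutine only gives the event loop a chance to schedule others at an await point. A small sketch with illustrative names (asyncio.sleep(0) yields without actually pausing):

```python
import asyncio

async def worker(name, results):
    for _ in range(3):
        results.append(name)
        # without this await, the coroutine would hog the event loop
        # and the other worker could not run until it finished
        await asyncio.sleep(0)

async def main():
    results = []
    await asyncio.gather(worker("poll", results),
                         worker("produce", results))
    return results

print(asyncio.run(main()))  # the two workers alternate
```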

In order for notify_scraped_urls to actually send something, I need to push some data to the queue:

@app.route('/urls/<request_id>', methods=['PUT'])
async def process_url(request_id):
    if request_id is None:
        logger.error("request_id is missing")
        abort(400, "missing request_id")

    data = await request.get_json()
    if data is None:
        logger.error("No data in payload: %s", data)
        abort(400, "missing payload")

    url = data['url']
    task = asyncio.create_task(parse_url(request_id, url))
    return ({
        "status": "accepted"
    }, 201)

A sample payload for the PUT HTTP request looks as follows:

{
  "url": ""
}

process_url creates an async task to be processed later and returns a 201 status code. asyncio.create_task is a way to offload some processing to a later time. If I wrote await task after creating the task, process_url would wait for the task to complete before continuing with its own execution.
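The difference between firing a task and awaiting it can be sketched like this (parse here is a stand-in for the real parse_url, and the URL is illustrative):

```python
import asyncio

async def parse(url):
    await asyncio.sleep(0.01)           # simulate slow scraping work
    return f"parsed {url}"

async def handler():
    # fire-and-forget: the handler can return before the work is done
    task = asyncio.create_task(parse("http://example.com"))
    return "accepted", task

async def main():
    status, task = await handler()      # returns immediately
    result = await task                 # awaiting later blocks until done
    return status, result

print(asyncio.run(main()))
```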

Now for the easy part: getting links from a URL can easily be done with requests and BeautifulSoup.

def is_valid_url(url):
    return url is not None and (str(url).startswith("http://") or
                                str(url).startswith("https://"))

async def parse_url(request_id: int, url: str) -> Any:
    try:
        headers = {
            "Accept": "text/html",
            "User-Agent": "Crawler 0.1"
        }
        # note: requests is blocking; fine for a prototype
        result = requests.get(url, headers=headers)
        if result.status_code == 200:
            site_text = result.text
            soup = BeautifulSoup(site_text, 'html.parser')
            links = [tag.get('href') for tag in soup.find_all('a')
                     if is_valid_url(tag.get('href'))]
            await queue.put({
                "request_id": request_id,
                "url": url,
                "links": links
            })
        else:
            logger.error("Failed parsing url content: %s (%d, %s)",
                         url, result.status_code, result)
    except Exception:
        logger.exception("Failed scraping url: %s", url)

parse_url downloads the submitted URL using requests and extracts all a elements from the DOM tree. Since I was interested only in HTTP/S links, I added a simple is_valid_url check to filter out relative and non-HTTP URLs. After parsing, all valid links are packed into a JSON payload and submitted to the global queue.

Note that if multiple clients connected via WebSockets, only one of them would get each message, since they all compete for items from the same queue. If I wanted to send a message to a specific client, or to all clients, I’d need a separate queue per client.
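That per-client-queue idea can be sketched like this (purely illustrative, not part of the actual server): each connection registers its own queue and a broadcast helper copies every message to all of them.

```python
import asyncio

client_queues = set()                   # one queue per connected client

async def broadcast(message):
    # every client gets its own copy of the message
    for q in client_queues:
        await q.put(message)

async def main():
    a, b = asyncio.Queue(), asyncio.Queue()
    client_queues.update({a, b})
    await broadcast({"url": "http://example.com"})
    return a.get_nowait(), b.get_nowait()

print(asyncio.run(main()))
```

Each websocket handler would then drain only its own queue instead of the shared global one.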

Onward to testing! While I could build a full-fledged client to connect to the WebSocket, I decided to try the curl route. To make a WebSocket connection I basically have to make an HTTP request with a few additional headers:

Connection: Upgrade and Upgrade: websocket are a necessary part of the handshake that switches the protocol from HTTP to WebSocket. Sec-WebSocket-Key is a random nonce the server uses to prove it received a valid WebSocket opening handshake (it echoes a derived value back in Sec-WebSocket-Accept), and Sec-WebSocket-Version is largely self-explanatory.

curl --include \
    --no-buffer \
    --header "Connection: Upgrade" \
    --header "Upgrade: websocket" \
    --header "Host: localhost" \
    --header "Origin: http://localhost:8080" \
    --header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
    --header "Sec-WebSocket-Version: 13" \
    http://localhost:8080/ws

This yields a response similar to:

HTTP/1.1 101
sec-websocket-accept: qGEgH3En71di5rrssAZTmtRTyFk=
upgrade: WebSocket
connection: Upgrade
date: Sat, 06 Mar 2021 18:33:06 GMT
server: hypercorn-h11

101 stands for Switching Protocols.
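The accept value is not arbitrary: per RFC 6455 the server appends a fixed GUID to the client’s Sec-WebSocket-Key, SHA-1 hashes the result, and base64-encodes the digest. Recomputing it for the key from the request above reproduces the sec-websocket-accept header:

```python
import base64
import hashlib

# fixed GUID defined by RFC 6455
GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(key: str) -> str:
    digest = hashlib.sha1((key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("SGVsbG8sIHdvcmxkIQ=="))
# -> qGEgH3En71di5rrssAZTmtRTyFk=
```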
Now that I have a working client and server, what remains is to submit a URL:

curl -H "Content-type: application/json" \
    -XPUT localhost:8080/urls/1 \
    -d '{"url":""}'

This should print the raw text (originally JSON) result in the shell where we have a client connected via the WebSocket protocol. I took the liberty of pretty printing the parse_url response.
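The pretty printing is most likely just a matter of passing indent to json.dumps when serializing in notify_scraped_urls — my guess at the change, with an illustrative payload:

```python
import json

payload = {"request_id": "1", "url": "http://example.com", "links": []}
print(json.dumps(payload, indent=2))    # multi-line, human-readable output
```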

{
  "request_id": "1",
  "url": "",
  "links": [
    ...
  ]
}

And with that, we get a nice async response from the server with the scraped links. I will build a proper client for this server in the future to showcase more of the client-server dialogue than a simple curl allows.

Feel free to check out the sockets code on my GitHub.