Pythonic Web Sockets
The WebSocket protocol is a cool and easy way to push updates to a website without excessive polling from clients. I've dived into the tech recently to build an RSS reader using sockets instead of regular HTTP calls, in order to learn more about how it all works.
At first I intended to just jump in and do it with Flask, but it turned out Flask doesn't play nicely with web sockets. Fortunately for me there's an alternative known as Quart. Quart works well with Python async and as a bonus feels very "flasky".
In this post I'm going to describe a simple prototype server for URL scraping.
To start off I am going to make a global asyncio.Queue where scraped URLs will be stored for distribution to connected clients. Rather than making a full-fledged client I'll just connect to my mini server via curl, which as a side effect also nicely showcases some of the WebSocket protocol's internals.
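The snippets below rely on a bit of module-level plumbing: the Quart app, the global queue and a logger. A minimal sketch of that setup (trimmed down, not the full project code) could look roughly like this:

import asyncio
import json
import logging

from quart import Quart, abort, request, websocket

app = Quart(__name__)

# Global queue holding scraped results until a connected client picks them up
queue = asyncio.Queue()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == "__main__":
    # Quart's built-in development server (hypercorn under the hood)
    app.run(host="localhost", port=8080)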
To make a simple socket server with Quart, just annotate an async function with @app.websocket(<path>):
@app.websocket('/scraped_urls')
async def notify_scraped_urls():
    try:
        while True:
            try:
                data = queue.get_nowait()
                serialized_data = json.dumps(data)
                await websocket.send(serialized_data)
            except asyncio.QueueEmpty:
                pass
            await asyncio.sleep(1)
    except asyncio.CancelledError:
        logger.info("Stopping processing loop due to client cancellation")
        raise
Every time a client connects it essentially falls into its own infinite loop on the server side, until either the server gets bored and disconnects or the client disconnects. When the server detects a client disconnection, asyncio.CancelledError gets raised and the loop terminates for that particular client.
Data for connected clients is simply fetched from the global queue, serialized and sent with await websocket.send.
Care must be taken when using async, as coroutines must explicitly yield control so other coroutines can run.
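To make that concrete with a contrived example (not part of the server): a coroutine that never awaits starves everything else on the event loop, while one that awaits hands control back so other handlers can run. That is exactly what the await asyncio.sleep(1) above is for.

import asyncio

async def greedy():
    # Never awaits, so no other coroutine gets a chance to run
    while True:
        pass

async def polite():
    # Each await suspends this coroutine and lets the event loop
    # schedule other handlers in the meantime
    while True:
        await asyncio.sleep(1)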
In order for notify_scraped_urls to actually send something I need to push some data onto the queue:
@app.route('/urls/<request_id>', methods=['PUT'])
async def process_url(request_id):
    if request_id is None:
        logger.info("request_id is missing")
        abort(400, "missing request_id")
    data = await request.get_json()
    if data is None:
        logger.info("No data in payload: %s", data)
        abort(400, "missing payload")
    url = data['url']
    task = asyncio.create_task(parse_url(request_id, url))
    return ({
        "status": "accepted"
    }, 201)
A sample payload for the PUT HTTP request looks as follows:
{
    "url": "https://kofe.si"
}
process_url creates an async task to be processed later and immediately returns an "accepted" response. asyncio.create_task is a way to offload some processing to a later time. If I wrote await task after creating the task, process_url would wait for the task to complete before continuing with execution.
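A contrived illustration of the difference (the names here are made up for the example, they are not part of the server):

import asyncio

async def slow_job():
    await asyncio.sleep(2)

async def fire_and_forget():
    # create_task schedules slow_job on the event loop and returns
    # immediately; the work happens later in the background
    asyncio.create_task(slow_job())

async def wait_for_it():
    # Awaiting the task suspends this coroutine until slow_job finishes
    task = asyncio.create_task(slow_job())
    await task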
Now for the easy part. Getting links from a URL can be easily done with requests and BeautifulSoup.
from typing import Any

import requests
from bs4 import BeautifulSoup

def is_valid_url(url):
    return url is not None and (str(url).startswith("http://") or
                                str(url).startswith("https://"))

async def parse_url(request_id: int, url: str) -> Any:
    try:
        headers = {
            "Accept": "text/html",
            "User-Agent": "Crawler 0.1"
        }
        # requests is synchronous, so this call blocks the event loop
        # while the page downloads
        result = requests.get(url,
                              allow_redirects=True,
                              timeout=3,
                              verify=False,
                              headers=headers)
        if result.status_code == 200:
            site_text = result.text
            soup = BeautifulSoup(site_text, 'html.parser')
            links = [tag.get('href') for tag in soup.find_all('a')
                     if is_valid_url(tag.get('href'))]
            await queue.put({
                "request_id": request_id,
                "url": url,
                "links": links
            })
        else:
            logger.info("Failed parsing url content: %s (%d, %s)",
                        url, result.status_code, result)
    except Exception as e:
        logger.error("Failed scraping url %s: %s", url, e)
parse_url downloads the submitted URL using requests and extracts all a elements from the DOM tree. Since I was interested only in HTTP/S links, I added a simple is_valid_url check to filter out relative and non-HTTP URLs. After parsing, all valid links are packed into a JSON payload and submitted to my global queue.
Note that if multiple clients connected via web sockets, only one of them would get each message. If I wanted to send a message to a specific client or to all clients, I'd have to keep a separate queue per client, as sketched below.
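A rough sketch of that per-client variant could look like this; connected_queues, broadcast and the second endpoint are hypothetical and not part of the actual server:

# Hypothetical broadcast variant: every connected client gets its own queue
connected_queues = set()

async def broadcast(message):
    # Fan the message out to every connected client
    for client_queue in connected_queues:
        await client_queue.put(message)

@app.websocket('/scraped_urls_broadcast')
async def notify_all_clients():
    client_queue = asyncio.Queue()
    connected_queues.add(client_queue)
    try:
        while True:
            data = await client_queue.get()
            await websocket.send(json.dumps(data))
    except asyncio.CancelledError:
        connected_queues.discard(client_queue)
        raise

parse_url would then call broadcast(...) instead of queue.put(...).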
Onward to testing! While I could make a full-fledged client to connect to web sockets, I decided to try and go the curl route. To make a WebSocket connection I basically have to make an HTTP request with a few additional headers: Connection: Upgrade and Upgrade: websocket are a necessary part of the handshake that switches the connection from HTTP to the WebSocket protocol. Sec-WebSocket-Key is a random value from the client which the server uses to prove it received a valid WebSocket opening handshake, and Sec-WebSocket-Version is largely self-explanatory.
curl --include \
--no-buffer \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Host: localhost" \
--header "Origin: http://localhost:8080" \
--header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
--header "Sec-WebSocket-Version: 13" \
"localhost:8080/scraped_urls"
This yields a response similar to:
HTTP/1.1 101
sec-websocket-accept: qGEgH3En71di5rrssAZTmtRTyFk=
upgrade: WebSocket
connection: Upgrade
date: Sat, 06 Mar 2021 18:33:06 GMT
server: hypercorn-h11
101 stands for Switching Protocols.
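The sec-websocket-accept value isn't magic either: per RFC 6455 the server appends a fixed GUID to the client's Sec-WebSocket-Key, hashes the result with SHA-1 and base64-encodes the digest. A few lines of Python should reproduce the value returned above:

import base64
import hashlib

# Fixed GUID defined by RFC 6455
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(key: str) -> str:
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Using the key from the curl request above
print(websocket_accept("SGVsbG8sIHdvcmxkIQ=="))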
Now that I have a working client and server what remains is to submit one URL:
curl -H "Content-type: application/json" \
-XPUT localhost:8080/urls/1 \
-d '{"url":"https://www.kofe.si"}'
This should print the raw text (originally JSON) result in the shell where the client is connected via the WebSocket protocol. I took the liberty of pretty-printing the parse_url response.
{
    "request_id": "1",
    "url": "https://www.kofe.si",
    "links": [
        "https://github.com/crnkofe",
        "https://twitter.com/crnkofe",
        "https://devdocs.io/openjdk~11/java.base/java/lang/autocloseable",
        "https://www.python.org/dev/peps/pep-0343/",
        "https://github.com/crnkofe/blog/tree/master/2021-01-18_context"
    ]
}
And with that we get a nice async response from the server with scraped links. I will make a proper client for this server in the future to showcase more of a client-server dialogue than just a simple curl.
Feel free to check out the sockets code on my GitHub.