
How to send concurrent HTTP requests in Python

Two ways of making your HTTP calls run very fast



Pavle Djuric

2 years ago | 4 min read

According to some sources, the amount of data on the internet hit 40 zettabytes in 2020. A zettabyte is about a trillion gigabytes. That’s quite a bit, you have to admit.

The best way to retrieve this data and do something useful with it is by sending HTTP requests. However, a single request probably won’t get you a lot of data, so the actual best way of retrieving data is by sending a lot of HTTP requests. But sending a large number of requests can take quite a bit of time if you send them synchronously, meaning you wait for one request to complete before you send the next one. The best way to solve this issue is to make use of concurrency.

Python is pretty nice for working with data. There are a ton of useful libraries, it is the industry standard for data science, and it is one of the go-to languages for data engineering.

In this article I will describe two different approaches for sending concurrent requests.

  1. The built-in concurrent.futures library

Python is technically a multi-threaded language; however, due to the GIL (global interpreter lock), only one thread can execute Python bytecode at a time, so in practice it really isn’t. Threads in Python have more to do with concurrency than with parallelism, which is exactly what we need for I/O-bound work like HTTP requests.

The concurrent.futures module has a class called ThreadPoolExecutor, which we will use for sending concurrent requests. For this example I am using the Rick and Morty API. Our goal is to get information on the various characters of the Rick and Morty cartoon, and the API is a good place to start. Let’s see some code, and I will explain it step by step:
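The embedded gist didn’t survive here, so below is a minimal sketch reconstructed from the walkthrough that follows. The variable and function names (characters, base_url, threads, get_character_info, future_to_url) are taken from that walkthrough; the endpoint URL and the 'name' field are assumptions based on the public Rick and Morty REST API.

```python
import concurrent.futures

import requests

# Character ids 1-99; a range is evaluated lazily, unlike a list
characters = range(1, 100)
base_url = 'https://rickandmortyapi.com/api/character/'
threads = 20  # upper bound on worker threads


def get_character_info(char):
    # Fetch a single character record and decode the JSON payload
    response = requests.get(f'{base_url}{char}')
    response.raise_for_status()
    return response.json()


with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
    # Map each Future to the character id it is fetching
    future_to_url = {executor.submit(get_character_info, char): char for char in characters}
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
            print(data['name'])
        except Exception as exc:
            print(f'Character {future_to_url[future]} failed: {exc}')
```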

At the top of the script are the imports we need. We’ll use the requests library for sending HTTP requests to the API, and we’ll use the built-in concurrent.futures module for executing them concurrently.

The characters variable is a range of integers from 1 to 99 (notice I use a range instead of a list, because a range is evaluated lazily, which makes it a bit more memory efficient).
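If you want to see the difference yourself, compare the two in an interpreter (byte counts are from 64-bit CPython and approximate):

```python
import sys

sys.getsizeof(range(1, 100))        # 48 bytes: only start, stop and step are stored
sys.getsizeof(list(range(1, 100)))  # roughly 850 bytes, growing with the number of elements
```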

The base_url is the endpoint we will call, with the character id appended as a suffix, to get our data.

The threads variable tells our ThreadPoolExecutor that we want a maximum of 20 worker threads to be spawned. These are real OS threads, but because of the GIL only one of them runs Python code at any given moment, which is fine here since they spend most of their time waiting on the network. The with block does the actual execution.

The future_to_url variable is a dictionary built with a dict comprehension. Each key is the Future object returned by executor.submit, which takes two arguments: the function to run (get_character_info) and the argument to pass to it. Make sure not to mix this up by putting the char parameter in parentheses as you would when calling get_character_info by itself; that would execute the function immediately instead of scheduling it. Each value is the character id that produced the Future, so we can tell which request a result belongs to. The point here is to iterate over all of the character ids and schedule a function call for each.
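To make that distinction concrete (using the names from the sketch above):

```python
future = executor.submit(get_character_info, char)  # correct: schedules the call on a worker thread
broken = executor.submit(get_character_info(char))  # wrong: runs the request right here, synchronously,
                                                    # and the worker then fails because the result isn't callable
```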

Next, we initiate a for loop that iterates over concurrent.futures.as_completed(future_to_url), which in simple terms means: give me the results of these calls as they finish.

Inside the try/except block, future.result() assigns the result of the HTTP request to the data variable, which hopefully won’t fail. If the request did fail, result() re-raises the exception, and we print a simple error message to see what went wrong.

If you run this code, you will see how fast it executes. We get nearly 100 API results in less than a second. Had we done this one by one, it would probably have taken over a minute.
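For comparison, the one-by-one version (reusing characters and get_character_info from the sketch above) would look like this:

```python
# Synchronous baseline: each request starts only after the previous response arrives
for char in characters:
    data = get_character_info(char)
    print(data['name'])
```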

  2. The asyncio library

This one is also built in, but in order to use it with HTTP calls, we need to install an asynchronous HTTP library called aiohttp. The reason is that the requests library we used previously is blocking, so it won’t be of any use here.

Asyncio works differently than ThreadPoolExecutor and uses something called the event loop. This is similar to how Node.js works, so if you are coming from JavaScript, you may be familiar with this approach.

Here is the code:
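As before, the original gist isn’t embedded, so this is a minimal sketch based on the description below. The tasks list and the asyncio.gather call come from that description; the main() coroutine and passing the shared session into get_character_info are my assumptions.

```python
import asyncio

import aiohttp

characters = range(1, 100)
base_url = 'https://rickandmortyapi.com/api/character/'


async def get_character_info(char, session):
    # await suspends this coroutine until the response arrives,
    # letting the event loop service other requests in the meantime
    async with session.get(f'{base_url}{char}') as response:
        return await response.json()


async def main():
    tasks = []
    # A single session reuses connections across all requests
    async with aiohttp.ClientSession() as session:
        for char in characters:
            # Creating the coroutine does not run it yet
            tasks.append(get_character_info(char, session))
        # Run all coroutines concurrently and collect the results in order
        results = await asyncio.gather(*tasks)
        for data in results:
            print(data['name'])


asyncio.run(main())  # Python 3.7+
```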

The first part of the code is relatively similar to the ThreadPoolExecutor approach, with two main differences. First, we’re importing the aiohttp library instead of requests. Second, in our function definition, we’re using the async keyword in front of everything else. This way we are telling the Python interpreter that this function will be run on an event loop.

The main() coroutine is where things start to differ from the first approach, but as you can probably conclude, the most significant call is tasks.append, which is similar to the executor.submit call from the first approach. The asyncio.gather call on the next line is similar to the futures.as_completed method in the sense that it gathers the results of the concurrent calls into a single collection.

Finally, when working with asyncio we need to call asyncio.run(), which is available only from Python 3.7 and up; on older versions it takes a couple more lines of code. This function accepts one parameter: the asynchronous function we want to run on the event loop.
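For reference, on Python 3.6 and earlier the equivalent of that single call looks roughly like this:

```python
# Equivalent of asyncio.run(main()) before Python 3.7
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
```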

This approach is perhaps a bit more complex, but it is faster and more reliable. I would generally recommend it, especially if you are making hundreds or even thousands of concurrent calls.

Ultimately, either one of the approaches will finish the HTTP calls in a fraction of the time it would take to call them synchronously.

Thanks for reading, I hope you enjoyed it!




