Threading in Python
Threads can provide concurrency, even if they're not truly parallel.
In my last article, I took a short tour through the ways you can add concurrency to your programs. In this article, I focus on one of those forms that has a reputation for being particularly frustrating for many developers: threading. I explore the ways you can use threads in Python and the limitations the language puts upon you when doing so.
The basic idea behind threading is a simple one: just as the computer can run more than one process at a time, so too can your process run more than one thread at a time. When you want your program to do something in the background, you can launch a new thread. The main thread continues to run in the foreground, allowing the program to do two (or more) things at once.
What's the difference between launching a new process and a new thread? A new process is completely independent of your existing process, giving you more stability (in that the processes cannot affect or corrupt one another) but also less flexibility (in that data cannot easily flow from one process to another). Because multiple threads within a process share data, they can work with one another more closely and easily.
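To make the data-sharing point concrete, here is a minimal sketch in which a background thread writes to a dictionary that the main thread then reads. The names shared and record are illustrative and don't appear in the listings in this article:

```python
import threading

# A plain dictionary in the process's memory; every thread in this
# process sees the same object.
shared = {}

def record(key, value):
    # Runs in the background thread and writes into the shared dict.
    shared[key] = value

t = threading.Thread(target=record, args=("greeting", "hello"))
t.start()
t.join()   # wait for the background thread to finish

# The main thread sees what the other thread wrote.
print(shared)   # {'greeting': 'hello'}
```

This is safe here only because a single thread writes and the main thread reads after join() returns. A separate process would get its own copy of shared, so the same write made there would not be visible to the parent.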
For example, let's say you want to retrieve all of the data from a variety of websites. My preferred Python package for retrieving data from the web is the "requests" package, available from PyPI. Thus, I can use a for loop, as follows:
length = {}

for one_url in urls:
    response = requests.get(one_url)
    length[one_url] = len(response.content)

for key, value in length.items():
    print("{0:30}: {1:8,}".format(key, value))
How does this program work? It goes through a list of URLs (as strings), one by one, calculating the length of the content and then storing that length inside a dictionary called length. The keys in length are URLs, and the values are the lengths of the requested URL content.
So far, so good; I've turned this into a complete program (retrieve1.py), which is shown in Listing 1. I put nine URLs into a text file called urls.txt (Listing 2), and then timed how long retrieving each of them took. On my computer, the total time was about 15 seconds, although there was clearly some variation in the timing.
Listing 1. retrieve1.py
#!/usr/bin/env python3

import requests
import time

# Read one URL per line from urls.txt
with open('urls.txt') as url_file:
    urls = [one_line.strip()
            for one_line in url_file]

length = {}
start_time = time.time()

# Retrieve the URLs one at a time, recording each content length
for one_url in urls:
    response = requests.get(one_url)
    length[one_url] = len(response.content)

for key, value in length.items():
    print("{0:30}: {1:8,}".format(key, value))

end_time = time.time()
total_time = end_time - start_time
print("\nTotal time: {0:.3} seconds".format(total_time))
Listing 2. urls.txt
http://lerner.co.il
http://LinuxJournal.com
http://en.wikipedia.org
http://news.ycombinator.com
http://NYTimes.com
http://Facebook.com
http://WashingtonPost.com
http://Haaretz.co.il
http://thetech.com
Improving the Timing with Threads
How can I improve the timing? Well, Python provides threading. Many people think of Python's threads as fatally flawed, because only one thread actually can execute at a time, thanks to the GIL (global interpreter lock). That criticism is fair if you're running a program that performs serious calculations, and in which you really want the system to be using multiple CPUs in parallel.
However, I have a different sort of use case here. I'm interested in retrieving data from different websites. Python knows that I/O can take a long time, and so whenever a Python thread engages in I/O (that is, the screen, disk or network), it gives up control and hands use of the GIL over to a different thread.
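As a rough sketch of why this helps, the following uses time.sleep() as a stand-in for a blocking network call, since it also releases the GIL while it waits. The function names and the delays are assumptions made for illustration, not measurements from the retrieve program:

```python
import threading
import time

def fake_download(seconds):
    # time.sleep() stands in for a blocking network call; like real
    # blocking I/O, it releases the GIL while it waits.
    time.sleep(seconds)

def run_sequentially(delays):
    # One "download" after another; the waits add up.
    start = time.perf_counter()
    for one_delay in delays:
        fake_download(one_delay)
    return time.perf_counter() - start

def run_threaded(delays):
    # One thread per "download"; the waits overlap.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(one_delay,))
               for one_delay in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

delays = [0.2, 0.2, 0.2]
sequential = run_sequentially(delays)
threaded = run_threaded(delays)
print("sequential: {0:.2f}s, threaded: {1:.2f}s".format(sequential, threaded))
```

Because the waits overlap, the threaded time should come out close to the longest single delay rather than the sum of all of them.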
In the case of my "retrieve" program, this is perfect. I can spawn a separate thread to retrieve each of the URLs in the list. I then can wait for the URLs to be retrieved concurrently, checking in with each of the threads one at a time. In this way, I probably can save time.
Let's start with the core of my rewritten program. I'll want to implement the retrieval as a function, and then invoke that function along with one argument: the URL I want to retrieve. I then can invoke that function by creating a new instance of threading.Thread, telling the new instance not only which function I want to run in a new thread, but also which argument(s) I want to pass. This is how that code will look:
for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    t.start()
But wait. How will the get_length function communicate the content length to the rest of the program? In a threaded program, you should be very wary of having individual threads modify shared built-in data structures, such as a list. Such data structures make no general thread-safety guarantees: a single "append" happens to be atomic in CPython, but compound operations, such as reading a value and then updating it, can interleave between threads and cause all sorts of problems.
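To illustrate the kind of problem meant here: an update such as counter += 1 on shared state is really a read, an add and a write, and two threads can interleave between those steps. The sketch below shows one conventional guard, threading.Lock; the names counter, counter_lock and add_many are made up for this example:

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # "counter += 1" is a read, an add and a write; holding the
        # lock keeps another thread from slipping in between them.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # 40000
```

Without the lock, the final count is not guaranteed to be 40000. Locks work, but you have to remember to take them everywhere the shared data is touched.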
However, you can use a "queue" data structure, which is thread-safe, and thus gives the threads a reliable way to communicate their results. The function can put its results on the queue, and then, when all of the threads have completed their run, you can read those results from the queue.
Here, then, is how the function might look:
from queue import Queue

queue = Queue()

def get_length(one_url):
    response = requests.get(one_url)
    queue.put((one_url, len(response.content)))
As you can see, the function retrieves the content of one_url and then places the URL itself, as well as the length of the content, in a tuple. That tuple is then placed in the queue.
It's a nice little program. The main thread spawns new threads, each of which runs get_length. In get_length, the result is put on the queue.
The thing is, now the main thread needs to retrieve those results from the queue. But if you do this just after launching the threads, you run the risk of reading from the queue before the threads have completed. So, you need to "join" the threads, which means to wait until they have finished. Once the threads have all been joined, you can read all of their information from the queue.
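Here is a small sketch of that risk, using a single worker that sleeps briefly (standing in for a slow download) before reporting. The names results and slow_worker are illustrative only:

```python
import threading
import time
from queue import Queue

results = Queue()

def slow_worker():
    time.sleep(0.2)          # pretend to wait on the network
    results.put("done")

t = threading.Thread(target=slow_worker)
t.start()

# Checked too early: the worker is most likely still sleeping,
# so nothing has arrived on the queue yet.
empty_before_join = results.empty()

t.join()                     # block until the worker has finished

# After join() returns, the result is on the queue.
empty_after_join = results.empty()
print(empty_before_join, empty_after_join)
```

On an ordinary run this should print True False: the queue looks empty right after the thread is launched, and holds the result only once the thread has been joined.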
There are a few different ways to join the threads. An easy one is to create a list where you will store the threads and then append each new thread object to that list as you create it:
threads = []

for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    threads.append(t)
    t.start()
You then can iterate over each of the thread objects, joining them:
for one_thread in threads:
    one_thread.join()
Note that when you call one_thread.join() in this way, the call blocks until that thread has finished. Perhaps that's not the most efficient way to do things, but in my experiments, it still took about one second (15 times faster than before) to retrieve all of the URLs.
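If blocking indefinitely is a concern, join() also accepts an optional timeout. The call always returns None, so you check is_alive() afterward to find out whether the thread finished or the timeout expired. A minimal sketch, with slow_task and the durations chosen only for illustration:

```python
import threading
import time

def slow_task():
    time.sleep(0.5)           # stand-in for a slow download

t = threading.Thread(target=slow_task)
t.start()

t.join(timeout=0.1)           # wait at most 0.1 seconds
still_running = t.is_alive()  # True if the timeout expired first

t.join()                      # now wait for as long as it takes
finished = not t.is_alive()
print(still_running, finished)
```

Between the two join() calls, the main thread is free to do other work, such as reporting progress, before it goes back to waiting.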
Python threads are routinely seen as terrible and useless. But in this case, you can see that they allowed me to speed up the program without too much trouble, by having different sections execute concurrently.
Listing 3. retrieve2.py
#!/usr/bin/env python3

import requests
import time
import threading
from queue import Queue

# Read one URL per line from urls.txt
with open('urls.txt') as url_file:
    urls = [one_line.strip()
            for one_line in url_file]

queue = Queue()
start_time = time.time()
threads = []

def get_length(one_url):
    # Runs in a worker thread; reports its result via the queue
    response = requests.get(one_url)
    queue.put((one_url, len(response.content)))

# Launch our function in a thread
print("Launching")
for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    threads.append(t)
    t.start()

# Joining all
print("Joining")
for one_thread in threads:
    one_thread.join()

# Retrieving + printing
print("Retrieving + printing")
while not queue.empty():
    one_url, length = queue.get()
    print("{0:30}: {1:8,}".format(one_url, length))

end_time = time.time()
total_time = end_time - start_time
print("\nTotal time: {0:.3} seconds".format(total_time))
Considerations
The good news is that this demonstrates how using threads can be effective when you're doing numerous, time-intensive I/O actions. This is especially good news if you're writing a server in Python that uses threads; you can open up a new thread for each incoming request and/or allocate each new request to an existing, pre-created thread. Again, if the threads don't really need to execute in a truly parallel fashion, you're fine.
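The idea of handing each request to a thread from a fixed pool can be sketched with ThreadPoolExecutor from the standard library's concurrent.futures module. That module isn't used elsewhere in this article, and handle_request is a hypothetical stand-in for real request handling:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def handle_request(request_id):
    # Stand-in for real request handling; the sleep simulates I/O.
    time.sleep(0.05)
    return "handled {0}".format(request_id)

# A pool of at most four worker threads is reused across ten
# "requests"; map() returns the results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(handle_request, range(10)))

print(responses[0], len(responses))   # handled 0 10
```

Capping the pool size keeps a burst of requests from creating an unbounded number of threads, at the cost of making some requests wait for a free worker.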
But, what if your system receives a very large number of requests? In such a case, your threads might not be able to keep up. This is particularly true if the code being executed in each thread is CPU-intensive.
In such a case, you don't want to use threads. A popular option—indeed, the popular option—is to use processes. In my next article, I plan to look at how such processes can work and interact.