Examining Data Using Pandas
You don't need to be a data scientist to use Pandas for some basic analysis.
Traditionally, people who program in Python use the data types that come with the language, such as integers, strings, lists, tuples and dictionaries. Sure, you can create objects in Python, but those objects typically are built out of those fundamental data structures.
If you're a data scientist working with Python, though, much of your time is spent with NumPy. NumPy might feel like a Python data structure, but it acts differently in many ways. That's not just because all of its operations work via vectors, but also because the underlying data is actually a C-style array. This makes NumPy extremely fast and efficient, consuming far less memory for a given array of numbers than traditional Python objects would.
The thing is, NumPy is designed to be fast, but it's also a bit low level for some people. To get more functionality and a more flexible interface, many people use Pandas, a Python package that provides two basic wrappers around NumPy arrays: one-dimensional Series objects and two-dimensional Data Frame objects.
I often describe Pandas as "Excel within Python", in that you can perform all sorts of calculations as well as sort data, search through it and plot it.
For all of these reasons, it's no surprise that Pandas is a darling of the data science community. But here's the thing: you don't need to be a data scientist to enjoy Pandas. It has a lot of excellent functionality that's good for Python developers who otherwise would spend their time wrestling with lists, tuples and dictionaries.
So in this article, I describe some basic analysis that everyone can do with Pandas, regardless of whether you're a data scientist. If you ever work with CSV files (and you probably do), I definitely recommend thinking about using Pandas to open, read, analyze and even write to them. And although I don't cover it in this article, Pandas handles JSON and Excel very well too.
Creating Data Frames
Although it's possible to create a data frame from scratch using Python data structures or NumPy arrays, it's more common in my experience to do so from a file. Fortunately, Pandas can load data from a variety of file formats.
Before you can do anything with Pandas, you have to load it. In a Jupyter notebook, do:
%pylab inline
import pandas as pd
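As mentioned above, you also can build a data frame from scratch out of Python data structures or NumPy arrays. Here's a minimal sketch, with made-up values, just to show the shape of things:
import numpy as np
# A Series is a one-dimensional, labeled wrapper around a NumPy array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# A DataFrame is two-dimensional; each dict key becomes a column
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'],
                   'score': np.array([95.0, 87.5, 91.0])})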
More often, though, you'll read the data from a file. Python comes with a csv module that knows how to handle files in CSV (comma-separated value) format, but then you need to iterate over the file and do something with each of those lines/rows. I often find it easier to use Pandas to work with such files. For example, here's a CSV file:
a,b,c,d
e,f,g,h
"i,j",k,l,m
n,o,p,q
You can turn this into a data frame with:
df = pd.read_csv('mycsv.csv')
Now you can view the contents of the data frame in Jupyter with:
df
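Notice that the quoted "i,j" in the third line arrives as a single value, rather than being split on its comma. A quick check (assuming the file above was saved as mycsv.csv):
df.loc[1, 'a']   # returns 'i,j' -- the quotes kept it together as one field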
The thing is, there are all sorts of CSV files you probably don't even think of as CSV files. For example, consider the Linux /etc/passwd file. It's not really a CSV file, but if you think about it for a moment, you'll realize that the format is the same. Each record is on a single line, and the fields are separated with ":" characters. So in this case, you'll need to use the sep parameter to indicate the separator:
filename = '/etc/passwd'
df = pd.read_csv(filename, sep=':')
df.head()
The command df.head(), much like the UNIX head utility, shows the first few rows of the data frame. I often use this method to inspect the data and make sure it came through okay.
In this particular case, the data came through just fine, but the first line (the root user's record!) was interpreted as a header row. Fortunately, you can fix that with another parameter:
df = pd.read_csv(filename, sep=':', header=None)
df.head()
If you pass the header parameter, you're telling Pandas which row of the file should be considered the headers. But if you pass None as a value, you're telling Pandas that none of the rows is the header. That's fine, but then you're going to get integers (starting with 0) as your column names. I'd rather use real names, so to specify them, do:
df = pd.read_csv(filename, sep=':', header=None,
names=['username', 'password', 'uid',
'gid', 'name', 'homedir', 'shell'])
df.head()
The thing is, do you really need the password column? Given that on modern computers, it'll always contain "x", I think that you can leave it out. So, you can say:
df = pd.read_csv(filename, sep=':', header=None,
usecols=[0,2,3,4,5,6],
names=['username', 'uid', 'gid',
'name', 'homedir', 'shell'])
df.head()
The usecols parameter indicates which columns you want to read from the file.
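Once the columns have real names, you can filter the data frame much as you would filter rows in a spreadsheet. For example, here's a small sketch that shows only the accounts whose login shell is /bin/bash (the shells on your system may well differ):
df[df['shell'] == '/bin/bash']   # boolean indexing keeps only matching rows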
Now, why would you want to use Pandas on your /etc/passwd file? You probably don't, but as you can imagine, if it works on that file, it'll work on other files too.
For example, many cities now are making data available to the general public. Chicago publishes much of its data online, so let's take a look and see, for example, what you can find out about what cars were towed in the last 90 days in Chicago.
You can go to the Chicago data portal at https://data.cityofchicago.org/browse. I clicked "towed vehicles"→"export"→"CSV" to get a CSV file. Then I took the downloaded file and imported it into Pandas with:
df = pd.read_csv('towed.csv')
Nah, I'm just kidding. You think that I have the time and energy to download the file and then load it? One of my favorite features of Pandas is the fact that most methods that work with files also can work with URLs. So, I can say:
url = 'https://data.cityofchicago.org/api/views/ygr5-vcbg/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url)
(Note that the URL might well have changed by the time you read this, or it might be keyed to my session information. If not, all the better, from my perspective!)
I then take a look at the data frame and discover that it has headers, that the separator is a comma and that it worked just fine. The only adjustment I'm going to make is to parse the "Tow Date" column as a date, so I can play with it a bit. I also indicate that because the date is in US format, the day does not come first:
df = pd.read_csv(url, parse_dates=['Tow Date'], dayfirst=False)
df.head()
You'll notice that the first column now appears a bit differently, in year-month-day format. That's a sign that it was parsed into a real datestamp. You also can see this if you run the df.info method, which tells you about the data frame itself:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5560 entries, 0 to 5559
Data columns (total 10 columns):
Tow Date 5560 non-null datetime64[ns]
Make 5537 non-null object
Style 5538 non-null object
Model 509 non-null object
Color 5536 non-null object
Plate 4811 non-null object
State 5392 non-null object
Towed to Address 5560 non-null object
Tow Facility Phone 5559 non-null object
Inventory Number 5560 non-null int64
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 434.5+ KB
As you can see, the "Tow Date" column is now being stored as a datetime64, a 64-bit field containing a date and time.
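Because the column now contains real datetimes, you can ask date-related questions directly. For example, a quick sanity check of the range of dates covered might look like this:
df['Tow Date'].min()   # earliest tow date in the data
df['Tow Date'].max()   # latest tow date in the data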
So, now that you have this data, what can you do with it? One of my favorite methods is value_counts, which tells you how many times each value appears in a column. To find out how many vehicles were towed from each state, you can say:
df['State'].value_counts()
Now, that's going to be a long list, so it's probably better to limit it by using head on the series you get back from value_counts:
df['State'].value_counts().head(10)
Now you'll see which ten states have the most towed vehicles in Chicago in the last 90 days:
IL 4948
IN 148
TX 48
WI 48
MN 28
IA 27
MI 19
GA 14
FL 14
TN 13
Not surprisingly, most of the vehicles towed in Chicago were from Illinois, with the second-highest number from Indiana, which is nearby. Less expected (to me, at least) was the large number of towed vehicles from Texas, which is not exactly nearby.
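If you'd rather see proportions than raw counts, value_counts can normalize the results for you:
df['State'].value_counts(normalize=True).head(10)   # fractions instead of counts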
To make use of the fact that the timestamp is now a true datetime object, you can find out the most common dates on which vehicles were towed.
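Something along these lines (the same value_counts-plus-head pattern as before) produces the listing that follows:
df['Tow Date'].value_counts().head(10)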
2018-03-03 215
2018-03-04 195
2018-02-24 165
2018-03-25 148
2018-03-26 140
2018-03-24 135
2018-03-15 126
2018-03-21 122
2018-02-03 120
2018-03-22 117
As you can see, there's a great deal of variation. Maybe that variation is due to the day of the week? In order to find out, you can use the dt proxy object that Pandas provides for datetime columns. You can use it to extract information from the datetime column and then analyze it.
For example, you can say:
df['Tow Date'].dt.dayofweek
This retrieves the day of the week for each of the tow dates. Then you can use value_counts to summarize how many vehicles were towed on each day of the week:
df['Tow Date'].dt.dayofweek.value_counts()
The results are as follows:
5 1143
4 831
3 820
6 798
2 794
0 640
1 534
Pandas counts days of the week starting with Monday as 0, so 5 here means Saturday. In other words, Chicago towed far more cars on Saturdays than on any other day. This might always be true, or it might just be true for this 90-day sample.
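In more recent versions of Pandas, you also can get readable day names directly, rather than translating the numbers yourself; a small sketch:
df['Tow Date'].dt.day_name().value_counts()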
You also can check to see which makes of car are towed most often, or whether there's a correlation between a car's color and its chances of being towed.
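For example, the same value_counts approach works on those columns too (a quick sketch):
df['Make'].value_counts().head(10)    # most commonly towed makes
df['Color'].value_counts().head(10)   # most common colors among towed vehicles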
What if you want to know, week by week, how many cars were towed? For that, you can take advantage of a great Pandas feature, in which you can set the data frame's index to be a timestamp column:
df.set_index('Tow Date', inplace=True)
Now, instead of accessing rows by an integer index, you'll use a timestamp. But the point isn't to access rows that way. Rather, it's to use resample, which works much like a GROUP BY query in SQL. Resampling can be done in a variety of ways; here's how to ask for it on a one-week basis:
df.resample('1W')
By itself, this does nothing. But if you then count the number of rows for each week, you get much more interesting output. For every column, you get the number of entries per week. The numbers differ, because many columns have missing (NaN) information, which isn't included in the count. That's fine; you can just use the Make column, which seems to be filled in for every towed vehicle:
df.resample('1W').count()['Make']
Now you should get a very nice report, showing how many cars were towed each week:
Tow Date
2017-12-31 103
2018-01-07 321
2018-01-14 209
2018-01-21 241
2018-01-28 250
2018-02-04 399
2018-02-11 248
2018-02-18 328
2018-02-25 587
2018-03-04 862
2018-03-11 495
2018-03-18 601
2018-03-25 754
2018-04-01 139
Freq: W-SUN, Name: Make, dtype: int64
Humans see patterns more easily with graphics than with tables of numbers, so you might want to plot this, also using Pandas:
df.resample('1W').count()['Make'].plot()
That makes a line plot showing that the number of towed vehicles shot up at the start of March. Why? You could look at temperatures and a calendar to start to guess, but the fact that you can download, inspect and graph this information so easily is a good starting point.
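If you're working outside of a Jupyter notebook (where %pylab inline takes care of plotting for you), you'll need matplotlib explicitly; a minimal sketch:
import matplotlib.pyplot as plt
df.resample('1W').count()['Make'].plot()   # Pandas draws onto the current matplotlib figure
plt.show()                                 # display the plot in a window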
Conclusion
Pandas is more than merely a tool for data scientists to do numeric analysis. The fact that it can read (and write) formats such as CSV, Excel and JSON makes it my primary tool for working with those formats. Add to that the fact that it can summarize data, including based on timestamps, and you probably can see why Pandas is so popular.
Resources
The home page for Pandas is here, and the Jupyter notebook, which I mentioned in this article, is here.
Finally, the list of "datetime components", which you can access via the dt object on timestamp columns in Pandas, is here.