Pandas
Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren't quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.
And, that's what I'm going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It's hard to overstate the number of people I've met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn't yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.
In my article "Analyzing Data", I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won't necessarily reach any amazing conclusions, you'll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.
Pandas

NumPy is a Python package, downloadable from the Python Package Index (PyPI), which provides a data structure known as a NumPy array. These arrays, although accessible from Python, are mainly implemented in C for maximum speed and efficiency. They also operate on a vector basis, so if you add 1 to a NumPy array, you're adding 1 to every single element in that array. It takes a while to get used to this way of thinking, and to the fact that the array should have a uniform data type.

Now, what can you do with your NumPy array? You could apply any number of functions to it. Fortunately, SciPy has an enormous number of functions defined and available, suitable for nearly every kind of scientific and mathematical investigation you might want to perform.
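To illustrate that vectorized behavior, here's a minimal sketch (the array contents are invented for the example):

import numpy as np

a = np.array([10, 20, 30])   # a NumPy array with a uniform (integer) dtype
a + 1                        # vectorized: array([11, 21, 31])
a * 2                        # also vectorized: array([20, 40, 60])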
But in this case, and in many cases in the data science world, what I really want to do is read data from a variety of formats and then explore that data. The perfect tool for that is Pandas, an extensive library designed for data analysis within Python.
The most basic data structure in Pandas is a "series", which is basically a wrapper around a NumPy array. A series can contain any number of elements, all of which should be of the same type for maximum efficiency (and reasonableness). The big deal with a series is that you can set whatever indexes you want, giving you more expressive power than would be possible in a NumPy array. Pandas also provides some additional functionality for series objects in the form of a large number of methods.
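As a small, hypothetical example of that expressive power, you can give a series whatever index labels you like and then retrieve values by label:

from pandas import Series

s = Series([15, 8, 22], index=['home', 'about', 'feed'])   # invented values and labels
s['feed']      # look up by label -> 22
s.sum()        # one of the many series methods -> 45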
But the real powerhouse of Pandas is the "data frame", which is something like an Excel spreadsheet implemented inside of Python. Once you get a table of information inside a data frame, you can perform a wide variety of manipulations and calculations, often working in similar ways to a relational database. Indeed, many of the methods you can invoke on a data frame are similar or identical in name to the operations you can invoke in SQL.
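For instance, here is a tiny, invented data frame, grouped and summed in much the same way as a SQL GROUP BY:

from pandas import DataFrame

requests = DataFrame({'method': ['GET', 'POST', 'GET'],
                      'bytes':  [512, 1024, 2048]})        # made-up data
requests.groupby('method')['bytes'].sum()   # roughly: SELECT method, SUM(bytes) ... GROUP BY method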
Installing Pandas isn't very difficult, if you have a working Python installation already. It's easiest to use pip, the standard Python installation program, to do so:
sudo pip install -U numpy matplotlib pandas
The above will install a number of different packages, upgrading any that are already installed but out of date.
As good as Pandas is, it's even better when it is integrated with the rest of the SciPy stack and inside of the Jupyter (that is, IPython) notebook. You can install this as well:
sudo pip install -U 'jupyter[notebook]'
Don't forget the quotes, which ensure that the shell doesn't try to interpret the square brackets as a form of shell globbing. Now, once you have installed this, run the Jupyter notebook:
jupyter notebook
If all goes well, the shell window should fill with some logfile output. But soon after that, your Web browser will open, giving you a chance (using the menu on the right side of the page) to create a new Python notebook. The idea is that you'll then interact with this document, entering Python code inside the individual cells rather than putting it in a file. To execute the code inside a cell, just press Shift-Enter; the cell will execute, and the result of evaluating the final line will be displayed.
Even if I weren't working in the area of data science, I would find the Jupyter Notebook to be an extremely clean, easy-to-use and convenient way to work with my Python code. It has replaced my use of the text-based Python interactive shell. If nothing else, the fact that I can save cells and return to them across sessions means that I spend less time re-creating where I was the previous time I worked on a project.
Inside the Jupyter Notebook, you'll want to load NumPy, Pandas and a variety of related functionality. The easiest way to do so is to use a combination of Python import statements and the %pylab magic function within the notebook:
%pylab inline
import pandas as pd
from pandas import Series, DataFrame
The above ensures that everything you'll need is defined. In theory, you don't need to alias Pandas to pd, but everyone else in the Pandas world does so. I must admit that I avoided this alias for some time, but finally decided that if I want my code to integrate nicely with other people's projects, I really should follow their conventions.
Now let's read the CSV file that I created for my previous article. As you might remember, the file contains a number of columns, separated by tabs, which were created from an Apache logfile. It turns out that CSV, although a seemingly primitive format for exchanging information, is one of the most popular methods for doing so in the data science world. As a result, Pandas provides a variety of functions that let you turn a CSV file into a data frame.
The easiest and most common such function is read_csv. As you might expect, read_csv can be handed a filename as a parameter, which it'll read and turn into a data frame. But read_csv, like many of the other read_* functions in Pandas, also can take a file object or even a URL.
I started by trying to read access.csv, the CSV file from my previous article, with the read_csv function:
df = pd.read_csv('access.csv')
Unfortunately, this failed with a rather strange error message, indicating that different lines of the file contained different numbers of fields. After a bit of thought and debugging, it turned out that the error occurred because the file contains tab-separated values, whereas the default for pd.read_csv is to assume comma separators. So, you can retry the load, passing the sep parameter:
df = pd.read_csv('access.csv', sep='\t')
And sure enough, that worked! Moreover, the keys of the data frame you have just created are the headers as they were defined at the top of the file. You can see them by asking the data frame to show you its keys:
df.keys()
Now, you can think of a data frame as a Python version of an Excel spreadsheet or of a table in a two-dimensional relational database, but you also can think of it as a set of Pandas series objects, with each series providing a particular column.
I should note that read_csv and the other read_* functions in Pandas are truly amazing pieces of software. If you're trying to read from a CSV file and Pandas isn't handling it correctly, either you have an extremely strange file format, or you haven't yet found the right option.
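To give you a sense of what's available, here is a sketch showing a handful of read_csv's options; the filename and column names are placeholders, not taken from my actual logfile:

df = pd.read_csv('somefile.csv',
                 sep='\t',                        # field separator
                 header=None,                     # the file has no header row...
                 names=['ip', 'date', 'r', 's'],  # ...so supply column names yourself
                 skiprows=2,                      # ignore the first two lines
                 encoding='utf-8')                # character encoding of the file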
Now that you've loaded the CSV file into a data frame, what can you do with it? First, you can ask to see the entire thing, but in the case of this example CSV file, there are more than 27,000 rows, which means that printing it out and looking through it is probably a bad idea. (That said, when you look at a data frame inside Jupyter, you will see only the first few rows and last few rows, making it easier to deal with.)
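Outside of Jupyter, you can get a similar effect explicitly; head and tail are standard data-frame methods:

df.head()      # the first five rows
df.tail(3)     # the last three rows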
If you think of your data frame as a spreadsheet, you can look at individual rows, columns and combinations of those.
You can ask for an entire column by using the column (key) name in square brackets or even as an attribute. Thus, you can get all of the requested URLs by asking for the "r" column, as follows:
df['r']
Or like this:
df.r
Of course, this still will result in the printing of a very large number of rows. You can ask for only the first few rows by using Python slice syntax—something that's often quite confusing for people when they start with Pandas, but which becomes natural after a short while. (Remember that using an individual column name inside square brackets produces one column, whereas using a slice inside square brackets produces one or more rows.)
So, to see the first ten rows, you can say:
df[:10]
And of course, if you're interested only in seeing the first ten HTTP requests that came into the server, then you can say:
df.r[:10]
When you ask for a single column from a data frame, you're really getting a Pandas series, with all of its abilities.
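For example, all of the usual series methods are available on such a column; here are a few applied to the "r" column:

df['r'].describe()   # count, number of unique values, most frequent value
df['r'].unique()     # the distinct URLs that were requested
df['r'].str.len()    # a new series containing the length of each URL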
One of the things you often will want to do with a data frame is figure out the most popular data. This is especially true when working with logfiles, which are supposed to give you some insights into your work. For example, perhaps you want to find out which URLs were most popular. You can ask to count all of the rows in df:
df.count()
This will give you the number of non-null values in each column—in effect, the total number of rows. But, you also can retrieve a single column (which is a Pandas series) and ask it to count the number of times each value appears:
df['r'].value_counts()
The resulting series has indexes that are the values (that is, URLs) themselves and also a count (in descending order) of the number of times each one appeared.
Plotting

This is already great, but you can do even better and plot the results. For example, you might want to have a bar graph indicating how many times each of the top ten URLs was invoked. You can say:
df['r'].value_counts()[:10].plot.bar()
Notice how you take the original data frame, count the number of times each value appears, take the top ten of those, and then invoke methods for plotting via Matplotlib, producing a simple, but effective, bar chart. If you're using Jupyter and invoked %pylab inline, the chart actually will appear in your browser window, rather than in an external program.
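If you're working outside the notebook, or want to keep the image, you can save the current figure to a file instead; here's a minimal sketch using the standard matplotlib.pyplot interface (the filename is just an example):

import matplotlib.pyplot as plt

df['r'].value_counts()[:10].plot.bar()
plt.savefig('top-urls.png')    # write the chart to a PNG file
plt.show()                     # or display it in an external window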
You similarly can make a pie chart:
df['r'].value_counts()[:10].plot.pie()
But wait a second. This chart indicates that the most popular URL by a long shot was /feed/, a URL used by RSS readers to access my blog. Although that's flattering, it masks the other data I'm interested in. You thus can use "boolean indexing" to retrieve a subset of rows from df and then plot only those rows:
df[~df.r.str.contains('/feed/')]['r'].value_counts()[:10].plot.pie()
Whoa...that looks huge and complicated. Let's break it apart to understand what's going on (a step-by-step version follows the list):

- This used boolean indexing to retrieve some rows and get rid of others. The conditions are expressed using a combination of generic Python and NumPy/Pandas-specific syntax and code.
- The example used the str.contains method provided by Pandas, which enables you to find all of the rows where the URL contained "/feed/".
- Then, the example used the ~ operator (normally used for bitwise inversion) to invert the logic of what you're trying to find.
- Finally, the result is plotted, providing a picture of which URLs were and were not popular.
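Here is the same operation again, broken into named intermediate steps; the variable names are just for illustration:

is_feed = df.r.str.contains('/feed/')           # boolean series: True where the URL contains /feed/
non_feed = df[~is_feed]                         # boolean indexing: keep only the other rows
non_feed['r'].value_counts()[:10].plot.pie()    # count, take the top ten and plot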
Reading the data from CSV and into a data frame gives great flexibility in manipulating the data and, eventually, in plotting it.
Conclusion

In this article, I described how to read logfile data into Pandas and even produced a few small plots with it. In a future article, I plan to explain how you can transform the data even further to provide insights for everyone interested in the logfile.
Resources

Data science is a hot topic, and many people have been writing good books on the subject. I've most recently been reading and enjoying an early release of the Python Data Science Handbook by Jake VanderPlas, which contains great information on data science as well as its use from within Python. Cathy O'Neil and Rachel Schutt's slightly older book, Doing Data Science, also is excellent, approaching problems from a different angle. Both are published by O'Reilly, and both are worth reading if you're interested in data science.
To learn more about the Python tools used in data science, check out the sites for NumPy, SciPy, Pandas and IPython. There is a lot to learn, so be prepared for a deep dive and lots of reading.
Pandas is available from, and documented at, http://pandas.pydata.org.
Python itself is available from https://www.python.org, and the PyPI package index, from which you can download all of the packages mentioned in this article, is at https://pypi.python.org.