There are so many ways to create a graph using Python, but which way is best? When we make visualizations, it is important to ask some questions as to the figure’s objective: Are you trying to get an initial feel for how your data looks? Maybe you are trying to impress someone at a presentation? Perhaps you want to show someone a figure internally and want a middle-of-the-road figure? In this post, I will be walking through a number of popular Python visualization packages, their pros and cons, and situations where they can each shine. I will scope this review to 2D plots, leaving room for 3D figures and dashboards for another time, though many of these packages support both quite well.
I’m going to group Matplotlib, Seaborn, and Pandas together for a few reasons, the first being that Seaborn and Pandas plotting were built on top of Matplotlib: when you use Seaborn or df.plot() in Pandas, you are actually running code that people have written using Matplotlib. The resulting aesthetic from each of these is therefore similar, and the ways of customizing your plots use eerily similar syntax.
When I think of these visualization tools, I think of three words: Exploratory Data Analysis. These packages are fantastic for getting a first look at your data but fall short when it comes to presentation. Matplotlib is a low-level library that allows for incredible levels of customization (so don’t simply rule it out for presentation!), but there are many other tools that make great presentation-worthy graphics much easier. Matplotlib also has a set of style selections which emulate other popular aesthetics like ggplot2 and xkcd. Below are some examples of graphs made using Matplotlib and its cousins:
While working with basketball salary data, I wanted to find the teams with the highest median salary. To show that, I made a bar graph of each team’s median salary, color-coded to show which teams players should target if they want to be on a team that pays well.
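The original code for this chart is not reproduced here, so the following is only a minimal sketch of how such a color-coded bar graph might be built with Pandas and Matplotlib. The team names and salary figures are made-up placeholders, not the real basketball data, and the column names `team` and `salary` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder salaries in $ millions (hypothetical, not real NBA data).
salaries = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C"],
    "salary": [2.0, 6.0, 10.0, 14.0, 5.0, 7.0],
})

# Median salary per team, highest-paying teams first.
median_by_team = (
    salaries.groupby("team")["salary"].median().sort_values(ascending=False)
)

# Scale each median to [0.3, 1.0] and map it onto a colormap,
# so that better-paying teams get darker bars.
values = median_by_team.to_numpy()
scaled = (values - values.min()) / (values.max() - values.min())
colors = plt.cm.Blues(0.3 + 0.7 * scaled)

fig, ax = plt.subplots()
ax.bar(median_by_team.index, values, color=colors)
ax.set_xlabel("Team")
ax.set_ylabel("Median salary ($ millions)")
ax.set_title("Median salary by team")
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or saves the figure.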
This second plot is a Q-Q plot of the residuals of a regression experiment. The main purpose of this visualization is to show how few lines are necessary to make a useful visualization, even if the aesthetics are not eye-popping.
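As a rough illustration of how little code a Q-Q plot needs, here is a sketch using `scipy.stats.probplot`. The original post does not say which library it used, so SciPy is an assumption on my part, and the regression data below is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic data: a noisy straight line (not the original experiment).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

# Fit a simple linear regression and compute its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# probplot draws the ordered residuals against normal quantiles,
# plus a fitted reference line, directly onto the given axes.
fig, ax = plt.subplots()
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title("Q-Q plot of regression residuals")
```

If the residuals are roughly normal, the points should hug the reference line.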
Ultimately, Matplotlib and its relatives are very efficient, but typically not the final product as far as a presentation goes.
Yes, ggplot(2) is on this list, and I groaned as I wrote that. “Aaron, why are you talking about ggplot, the most popular R visualization package? Isn’t this a Python package review?” you may ask. People implemented ggplot2 in Python, replicating everything from the aesthetics to the syntax. From all of the material I have seen, it looks and feels like ggplot2, but with the added “bonus” of depending on the Pandas Python package, which has since deprecated some of the methods it relies on, leaving the Python version effectively obsolete. If you want to use the REAL ggplot in R (which has all of the same looks, feels, and syntax without the dependencies), I talk through some of its perks here! That said, if you truly must use ggplot in Python, you must install Pandas version 0.19.2, but I would caution against downgrading your Pandas so that you can use an inferior plotting package.
What makes ggplot2 (and ggplot for Python, I guess) game-changing is that they use the “Grammar of Graphics” to construct a figure. The basic premise is that you can instantiate your plot and then add different features to it separately, i.e. the title, axes, data points, and trendline are all added separately with their own aesthetic properties. In a simple example, we first instantiate our figure with ggplot, set our aesthetics and data, then add points, a theme, and axis/title labels.
Bokeh is beautiful. Conceptually similar to ggplot in how it uses the grammar of graphics to structure its figures, Bokeh has an easy-to-use interface that makes very professional graphs and dashboards. To illustrate my point (sorry!), below is a sample of code to make a histogram from the 538 Masculinity Survey dataset.
[Figure: Using Bokeh to represent survey responses]
The Bokeh bar graph shows responses to the question “Do you identify as masculine” as asked by 538 in a recent survey. The Bokeh code in lines 9–14 of the gist creates an elegant and professional histogram of response counts, with sensible font sizing, y-ticks, and formatting. The majority of the code I wrote went to labeling the axes and title, along with giving the bars a color and border. When making nice, presentable figures, I lean very heavily towards Bokeh: a lot of the aesthetic work has already been done for us!
[Figure: Using Pandas to represent the same data]
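The gist itself is not reproduced here, so below is a minimal sketch of what a Bokeh bar chart of response counts could look like. The answer labels and counts are placeholders I made up, not the actual 538 survey results.

```python
from bokeh.plotting import figure, show

# Placeholder categories and counts (hypothetical, not the survey data).
answers = ["Very", "Somewhat", "Not very", "Not at all"]
counts = [40, 30, 20, 10]

# A categorical x-axis is created by passing the labels as x_range.
p = figure(x_range=answers, title="Do you identify as masculine?")

# One vertical bar per answer, with a fill color and a border.
p.vbar(x=answers, top=counts, width=0.8,
       fill_color="steelblue", line_color="black")

# Most of the remaining code is just labeling and light styling.
p.xaxis.axis_label = "Response"
p.yaxis.axis_label = "Number of responses"
p.y_range.start = 0
p.xgrid.grid_line_color = None

# show(p)  # uncomment to open the chart in a browser
```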
The blue Pandas plot is what comes from the single line of code on line 17 of the gist. Both histograms have the same values, but serve different purposes. In an exploratory setting, it is much more convenient to write one line with Pandas to see the data, but the aesthetics of Bokeh are pretty clearly superior. Every convenience that Bokeh provides takes customization in Matplotlib, be it angle of x-tick labels, background lines, y-tick spread, font sizing/italicizing/bolding, etc. The graph below shows a few random trends using a few more customizations with legends and different line types and colors:
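A chart along those lines might be put together as follows. This is only a sketch with three invented trends; the labels, colors, and dash styles are arbitrary choices, not the ones from the original figure.

```python
from bokeh.plotting import figure, show

x = list(range(10))

p = figure(title="Three made-up trends",
           x_axis_label="x", y_axis_label="y")

# Each line gets its own legend entry, color, and dash style.
p.line(x, [v for v in x], legend_label="linear",
       line_color="navy", line_width=2)
p.line(x, [v ** 2 / 9 for v in x], legend_label="quadratic",
       line_color="firebrick", line_dash="dashed", line_width=2)
p.line(x, [9 - v for v in x], legend_label="decreasing",
       line_color="green", line_dash="dotted", line_width=2)

# Move the legend and let readers click entries to hide lines.
p.legend.location = "top_left"
p.legend.click_policy = "hide"

# show(p)  # uncomment to open the chart in a browser
```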
Bokeh is also a great tool for making interactive dashboards. I don’t want to get into dashboarding in this post, but there are great posts (like this one) that get more into the application and implementation of Bokeh dashboards.
Plotly is extremely powerful, but both setup and creating figures take a lot of time, and neither is intuitive. After spending the better part of a morning working through Plotly, I went to lunch with almost nothing to show for it. I had created a bar graph without axis labels and a ‘scatterplot’ that had lines that I couldn’t remove. Some notable cons when getting started with Plotly:
- Its online mode, the default before Plotly 4, requires an API key and registration rather than just a pip install (version 4 and later render offline with no account)
- It plots data/layout objects that are unique to Plotly and aren’t intuitive
- The plot layout hasn’t worked for me (40 lines of code for literally nothing!)
For all of its setup cons, however, there are pros and workarounds:
- You can edit plots on the Plotly website as well as in a Python environment
- There is a lot of support for interactive graphs/dashboards
- Plotly is partnered with Mapbox, allowing for customized maps
- There is amazing overall potential for great graphics
It wouldn’t be fair for me to just air my gripes with Plotly without showing some code and what I was able to accomplish versus what people more capable with this package have made.
[Figure: A bar graph representing average turnovers per minute by different NBA teams]
[Figure: An attempt at a scatterplot representing salary as a function of playing time in the NBA]
Overall, the aesthetics out of the box look good, but multiple attempts at fixing the axis labels, copying the documentation verbatim, yielded no change. As I promised before, however, here are some plots that show the potential of Plotly and why spending more than a few hours might be worth it:
[Figure: Some sample plots from the Plotly page]
Pygal is a slightly lesser-known plotting package that, like other popular packages, uses the grammar of graphics framework to construct its images. It is a relatively straightforward package due to how simple the plot objects are. Using Pygal is about as simple as:
- Instantiate your figure
- Format using the figure object’s attributes
- Add data to your figure using figure.add() notation
The main issues I had with Pygal were in actually rendering the figures. I had to use their render_to_file option then open that file in a web browser to see what I had built. It was ultimately worth it, as the figures are interactive and have a pleasant and easily customizable aesthetic. Overall, the package seems good but has some file creation/rendering quirks that limit its appeal.
NetworkX is a great solution for analyzing and visualizing graphs, though its visuals are built on Matplotlib. Graphs and networks are not my area of expertise, but NetworkX allows for quick and easy graphical representations of connected networks. Below are a few different representations of a simple graph I constructed, along with some starter code for plotting a small Facebook network downloaded from SNAP at Stanford.
The code I used to color-code each node by its number (1–10) is below:
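The original snippet is not reproduced here, so this is a sketch of one way to do it. The ring-shaped graph is a stand-in I invented; only the idea of coloring nodes 1 to 10 by their number comes from the text.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Stand-in graph: nodes 1..10 joined in a ring (hypothetical structure).
G = nx.Graph()
G.add_nodes_from(range(1, 11))
G.add_edges_from((n, n + 1) for n in range(1, 10))
G.add_edge(10, 1)

# Passing numbers as node_color together with a colormap shades
# each node according to its number.
node_values = list(G.nodes())
pos = nx.circular_layout(G)

fig, ax = plt.subplots()
nx.draw(G, pos, ax=ax, node_color=node_values, cmap=plt.cm.viridis,
        with_labels=True, font_color="white")
```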
Below is code I wrote to visualize the sparse Facebook graph mentioned above:
[Figure: This graph is quite sparse; NetworkX shows it by giving each cluster maximum separation]
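That code is not reproduced here either. A rough sketch of the general approach, assuming the SNAP download is a plain text edge list with one "node node" pair per line, is below. The helper name `load_and_draw` is hypothetical, and a tiny temporary file stands in for the real dataset.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx


def load_and_draw(path, ax):
    """Read an edge list file and draw it with a spring layout."""
    G = nx.read_edgelist(path, nodetype=int)
    # A spring layout pushes disconnected clusters apart.
    pos = nx.spring_layout(G, seed=42)
    nx.draw(G, pos, ax=ax, node_size=20, width=0.3)
    return G


# Stand-in edge list with two separate clusters (not the Facebook data).
demo_path = os.path.join(tempfile.gettempdir(), "demo_edges.txt")
with open(demo_path, "w") as fh:
    fh.write("0 1\n1 2\n2 0\n3 4\n")

fig, ax = plt.subplots()
G = load_and_draw(demo_path, ax)
```

To try it on the real data, pass the path of the downloaded edge list file instead of `demo_path`.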
There are so many packages out there to visualize data and no clear best package. Hopefully after reading through this review, you can see how some of the various aesthetics and code lend themselves to different situations, from EDA to presentation.
Source: Towards Data Science