video

R tip: Reshape data with tidyr

See how the tidyr R package’s gather and spread functions work. Plus a bonus look at labeling in ggplot2

Similar
Comments
Hi, I’m Sharon Machlis, Director of Editorial Data & Analytics at IDG Communications. I’m here with Episode 12 of Do More With R: Reshaping data with the tidyr package.
It’s Murphy’s Law of Data: The data you have isn’t always in the format that you need. And not all problems have to do with mistakes or gaps in the data. Sometimes you’ve got wide data that needs to be long; or long data that needs to be wide.
Let’s work on an example. I’ll read in a spreadsheet of home prices in 5 U.S. metro areas: Boston, Detroit, Philadelphia, San Francisco, and San Jose (which I’m calling Silicon Valley). More specifically, data about home prices every 2 years, when all cities started with an index of 100 in 1995. This data runs from 2000 to 2018.
Here’s a look at the spreadsheet:
And here’s what it looks like when I import it with the rio package.
This is a pretty human-friendly format. It’s sometimes referred to as a “wide” format. Each metro area has its own column, and you can scan down each column and see the movement for that metro area.
But if you want to graph that with ggplot2, you want the data in so-called tidy, or “long”, format. One observation per row, and no data in column names. So you can easily tell ggplot2 color by city. Right now, the city information is in COLUMN NAMES, not the data itself.
Another example: If I want to calculate which city had the highest index value in each year, it’s pretty easy to calculate which number is highest in each row. But if you want to show which metro area had the highest index value, you have to pull information from the column name.
Here’s what a tidy version of this data looks like.
One observation per row: The quarter, the home-price index value, and the Metro area. Not as easy for a person to scan, but much better for analyzing in R – especially with tidyverse packages.
So, if the only version of your data was the wide version
How do you get the long version? One way is with the tidyr package’s “gather” function.
gather() takes at least 3 arguments: 1st the name of your data frame, second the name you want for your new category column – that’s called the key. And third is the name you want for your new value column, that’s called the value. After that are any columns that you want “gathered” into the new key and value columns. If you don’t supply any column names, all the columns get gathered. In this case, we want all the city columns gathered but not the Quarter column. I can exclude that with minus Quarter.
If I run that
You’ll see I’ve got a long or tidy version of the data. So once again that was gather, my data frame, name of new category column, name of new value column, then outlining the columns I want to gather or excluding the ones not to gather with a minus sign.
This version is much easier to graph with ggplot2.
The first code group is a default ggplot line chart with this data. Just by adding group = MetroArea my chart plots each metro area as its own series, or line. color = MetroArea gives each line a different color.
The second code group adds a little more customization to the plot:
I’ve selected a different theme, and then tweaked that by removing all the background grids and y-axis label, adding a title and subtitle, and centering the title and subtitle. Before I go back to reshaping, I’d like to show you a cool package that works with ggplot2 called directlabels.
I’m using the same customized plot I just made, but storing it in a variable called my_customized_plot. Then I’ll run the direct-dot-label function on it, with the argument last.points and a slight horizontal justification of the text. And see what happens.
Instead of a legend, I’ve got a nice label for each line! I do love that as an option for some plots.
Back to reshaping.
Let’s say we started off with this as tidy data, but wanted to make it “wide” to create a table that’s easier to read. Basically going from the long data frame we had now to that first version we saw with each metro area in its own column. For that, you need the opposite of gather() which is spread().
spread() also takes data, key, and value as arguments. In this case, the data is your tidy data frame. Key is the name of the existing column where you want the values each turned into their own columns. For this data, it’s MetroArea. We have one column with metro areas, and we want each one to be in its own column. Value is the name of the existing column that holds the values that should be spread out into the new columns. R may not know for sure whether that should be the Index column or the Quarter column unless we tell it.
So let’s run that.
And now we’re back to wide data.
That’s it for this episode, thanks for watching! For more R tips, head to the More With R video page at go.infoworld.com/morewithR. That’s https go dot infoworld dot com slash more with R, all lowercase except for the R. Or, you can add the “Do More With R” playlist to your YouTube library. So long, and hope to see you next episode!