Frequencies in R - Part 1

Tuesday, December 6, 2016 - 08:45
I’m often asked to name the most common statistical procedure used in my company’s Data Science work. My answer, only partly in jest, is frequencies and crosstabs — to help with the mundane tasks of profiling and exploring data. Indeed frequency distributions and the dotplots that showcase them, along with histogram/densityplots, scatterplots, and their trellis variants, are the foundations of our Exploratory Data Analysis (EDA) that precedes formal statistical and Machine Learning investigations.

Frequencies can either be for a single attribute or for a crossclassification of multiple attributes. Counts and perhaps percentages of total are the primary outputs. The number of distinct values for the attribute (or crossclassification) represents the cardinality. Generally, I look to include null values in the tallies, but prefer the choice of whether to include or discard them. Critical is the capability to compute counts for the result of a function applied to an attribute. For example, it’s important to readily tally the counts of isnull and isnotnull for each data.frame attribute.

There’s no shortage of functions to compute multivariate frequencies in R, some derived from the core language, others from addon packages developed by the vast R ecosystem. The ones I’ve worked with primarily, and the foci of Part I, are table from the base package, xtabs from the stats package, and count from Hadley Wickham’s plyr package.

So I set out to evaluate the three alternatives for capturing frequencies of attributes in R data frames. I first downloaded three years of Medicare provider utilzation data available from cms.gov to drive the tests. The final R data.frame/data.table, medicarephysician, produced from code in the cell below, is meaty, with over 27M records and 30 attributes. I performed the analyses below on my Wintel notebook with 64GB RAM — though earlier versions of the tests completed on a 16 GB machine.

The 27M+ medicarephysician records load in 3 minutes and consume slightly more than 6GB of memory. Note the usage of addon packages data.table, readr, and dplyr. I especially like the %>% operator for chaining operations in dplyr.

screen-shot-2016-11-29-at-2-43-41-pm

screen-shot-2016-11-29-at-2-44-05-pm

The first test pits, in turn, the table, xtabs, and plyr::count functions for a single column frequency of an attribute from medicarephysician, nppescredentials, which has a relatively high cardinality of 16,000+. Each of the three cells tallies column frequencies and writes them to a data.frame, computing cardinality as a by-product. Actually, table and xtabs return exotic structures that are coerced to data.frames/data.tables with tbl_dt. Both also exclude null values by default. In this example, the performance timings of the three competitors are similar.

screen-shot-2016-11-29-at-2-44-26-pm

screen-shot-2016-11-29-at-2-44-46-pm

screen-shot-2016-11-29-at-2-45-07-pm

When the same frequency functions are applied to the high cardinality npi attribute (over 1M distinct values), plyr::count considerably outperforms it’s older rivals. The results from table, xtabs, and plyr::count are detailed below.

screen-shot-2016-11-29-at-2-45-20-pm

screen-shot-2016-11-29-at-2-45-35-pm

screen-shot-2016-11-29-at-2-45-48-pm

The third test is bivariate frequencies or crosstabs with attributes nppescredentials and providertype. By default, table and xtabs exclude nulls and return all not-null combinations of the two variables, even those which don’t occur in the data, so I specifically request nulls and filter 0’s from the frequencies data.frame. The timings are slightly higher for plyr::count in this example.

screen-shot-2016-11-29-at-2-46-01-pm

screen-shot-2016-11-29-at-2-46-14-pm

screen-shot-2016-11-29-at-2-46-28-pm

Finally, I invoke table, xtabs, and plyr::count on the bivariate attributes year and is.na(stdevsubmittedchrgamt) to determine the frequencies of null and not null by year. As is evident from the output, stdevsubmittedchrgamt was included in 2012 and 2013 raw data, but not in 2014. plyr::count noticably outperforms the competition in this illustration.

screen-shot-2016-11-29-at-2-46-52-pm

screen-shot-2016-11-29-at-2-47-04-pm

screen-shot-2016-11-29-at-2-47-14-pm

These quick tests affirm my recent preference for plyr::count over table and xtabs as a staple for frequencies. In Part II, I’ll present a competing home-brewed frequency function sitting on top of the high performance data.table and dplyr packages.

 

 

Contact us today to find out how Inquidia can show you how to collect, integrate and enrich your data. We do data. You can, too.

Would you like to know more?

Sign up for our fascinating (albeit infrequent) emails. Get the latest news, tips, tricks and other cool info from Inquidia.