In this month's Part 2, I repeat the tests from Part 1, this time using the capabilities and functions of the data.table and dplyr packages.
First, load the medicarephysician data.table.
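The load step might look something like the sketch below. The file name is an assumption, and the tiny synthetic table is a stand-in I've fabricated so the later snippets are runnable; the actual CMS Physician and Other Supplier PUF extract is far larger.

```r
library(data.table)

## fread() is data.table's fast delimited-file reader. The file name here
## is an assumption -- substitute the actual CMS PUF download:
## medicarephysician <- fread("medicarephysician.txt", sep = "\t")

## Small synthetic stand-in with the column names used in the post:
medicarephysician <- data.table(
  npi              = c(1003000126, 1003000126, 1003000142, 1003000159),
  nppescredentials = c("M.D.", "M.D.", "D.O.", NA),
  providertype     = c("Internal Medicine", "Internal Medicine",
                       "Family Practice", "Internal Medicine")
)
str(medicarephysician)
```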
Next, use data.table package functionality to compute frequencies for each of the four examples considered in Part 1. The first cell tallies counts for the nppescredentials attribute of medicarephysician, which has a relatively high cardinality of 16,000+ distinct values. Notice the sub-second performance, easily beating the best from last month.
The same data.table approach applied to the high-cardinality npi attribute (over 1M distinct values) outperforms Part 1's competitors by an order of magnitude.
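The univariate tallies above can be sketched with data.table's grouped `.N` idiom. The miniature table here is hypothetical; only the column names come from the post.

```r
library(data.table)

## Hypothetical miniature of the Medicare table:
medicarephysician <- data.table(
  npi              = c(1, 1, 2, 3, 3, 3),
  nppescredentials = c("M.D.", "M.D.", "D.O.", NA, "M.D.", "M.D.")
)

## Grouped .N counts each distinct value; ordering by descending count
## puts the most common values first. The identical idiom handles the
## 1M+ distinct npi values at scale.
credfreq <- medicarephysician[, .(count = .N),
                              by = nppescredentials][order(-count)]
npifreq  <- medicarephysician[, .(count = .N), by = npi][order(-count)]
credfreq
```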
Similarly, the data.table solution for the third test, bivariate frequencies or crosstabs on the attributes nppescredentials and providertype, substantially bests last month's solutions.
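A crosstab in data.table is just a grouped count over both attributes at once. Again, the data below is a fabricated illustration, not the CMS extract.

```r
library(data.table)

## Hypothetical data with the two attributes named in the post:
dt <- data.table(
  nppescredentials = c("M.D.", "M.D.", "D.O.", "M.D."),
  providertype     = c("Internal Medicine", "Internal Medicine",
                       "Family Practice", "Cardiology")
)

## Bivariate frequencies: group on both columns simultaneously.
bifreq <- dt[, .(count = .N),
             by = .(nppescredentials, providertype)][order(-count)]
bifreq
```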
And finally, the fourth cell, counting null and not-null frequencies, completes in well under a second, again much faster than last month's tests. Indeed, the timings of all four data.table examples sum to around four seconds, a no-contest winner over last month.
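The null/not-null tally follows the same pattern, grouping on `is.na()` rather than on the raw values. The column data here is invented for illustration.

```r
library(data.table)

## Hypothetical column with a few missing values:
dt <- data.table(nppescredentials = c("M.D.", NA, "D.O.", NA, "M.D."))

## Grouping on is.na() splits the column into null and not-null buckets.
nullfreq <- dt[, .(count = .N),
               by = .(isnull = is.na(nppescredentials))]
nullfreq
```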
It’s pretty easy to package these data.table frequencies capabilities in a reusable R function. frequenciesdyn, below, takes two strings as arguments: the first, the name of the data.table; the second, the names of one or more attributes for which tallies are desired. It then builds and dynamically executes strings that tally counts and calculate percent of totals using R’s eval-parse capability. Each of the examples above is re-run using this function, with comparable performance.
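A sketch of what such a function might look like; the body below is my assumption about the eval/parse approach described, not the author's exact code, and the `percent` calculation is likewise an assumed reading of "percent of totals".

```r
library(data.table)

## Sketch of the frequenciesdyn idea: resolve the data.table by name,
## build the grouped-count expression as a string from the supplied
## attribute list, and execute it with eval(parse(...)).
frequenciesdyn <- function(dtname, vars) {
  dt <- get(dtname, envir = parent.frame())
  expr <- sprintf("dt[, .(count = .N), by = .(%s)][order(-count)]", vars)
  freqs <- eval(parse(text = expr))
  freqs[, percent := round(100 * count / sum(count), 2)]
  freqs[]
}

## Usage on a small fabricated table:
medicarephysician <- data.table(
  nppescredentials = c("M.D.", "M.D.", "D.O.", NA)
)
frequenciesdyn("medicarephysician", "nppescredentials")
```

Passing the table's name as a string, rather than the table itself, is what lets the function assemble the whole bracket expression textually before evaluating it.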
With the frequenciesdyn building block in place, it’s straightforward to craft a driver function that loops through a data.table’s columns and tallies frequencies for each in turn. The well-worn genallnullfreq function below returns the null/not-null frequencies for each column in medicarephysician by year.
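A simplified sketch of such a driver follows. The real genallnullfreq reports by year; this assumed version omits the year grouping for brevity and stacks per-column null tallies with `rbindlist`.

```r
library(data.table)

## Sketch only: loop over every column, tally null vs. not-null counts,
## and stack the per-column results into one table. The by-year grouping
## of the actual genallnullfreq is omitted here.
genallnullfreq <- function(dt) {
  rbindlist(lapply(names(dt), function(col)
    dt[, .(column = col, count = .N),
       by = .(isnull = is.na(get(col)))]
  ))
}

## Usage on a fabricated table:
dt <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))
genallnullfreq(dt)
```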
Finally, the generic genallfreq function returns the frequencies of all columns of a data.table. The returned data can be quite large if the input table is big and its columns have high cardinalities. In the example below, I return frequencies on a subset of eleven table columns.
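One plausible shape for genallfreq is sketched below; the body is an assumption. Values are coerced to character so tallies from columns of different types can be stacked into a single result.

```r
library(data.table)

## Sketch: frequencies for every requested column, stacked long.
## Coercing value to character lets numeric and character columns share
## one output table.
genallfreq <- function(dt, cols = names(dt)) {
  rbindlist(lapply(cols, function(col) {
    f <- dt[, .(count = .N), by = col]
    setnames(f, col, "value")
    f[, `:=`(column = col, value = as.character(value))]
    setcolorder(f, c("column", "value", "count"))
    f[order(-count)]
  }))
}

## Usage on a fabricated table, restricting to a column subset:
dt <- data.table(cred = c("M.D.", "M.D.", "D.O."),
                 type = c("IM", "FP", "IM"))
genallfreq(dt, cols = c("cred", "type"))
```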
The use of the community-developed data.table package and Hadley Wickham's tidyr family has significantly enhanced both the capacity and the performance of basic data analysis tasks in R. Indeed, these packages are so fundamental to exploratory analysis that they must now be central to any serious R work.
I’ll continue presenting ideas on data exploration/analysis in R in subsequent blogs.