Data Exploration in R -- Part I

Monday, May 22, 2017 - 17:42

With the analytics work I've been involved with over the last 15 years, I estimate that 70% of the effort has been devoted to data access/integration/wrangling/curation, about 20% focused on data exploration, and the remaining 10% fitting algorithms/models. SQL and the open source platforms R and Python have been primary computation platforms, and now provide strong coverage for all angles of the analytics process -- data integration, exploration, and modeling.
 

Though my analytics company, Inquidia Consulting, has defining expertise in data integration, my favorite area is exploration -- the attempt to preliminarily discern patterns in data using statistical visualizations and simple summarizations such as frequencies and order statistics. Often, such work provides justification and direction for modelling efforts to follow.

 

This blog is the first of a multi-part series to share a few exploratory techniques I've found useful in recent work, though it's not intended to be a comprehensive explication of data exploration. The code below is in R, and uses the data.table package, as well as Hadley Wickham's splendid ggplot2 graphics and tidyverse programming libraries.

 

The data investigated, Ames Iowa Housing Data, was assembled by an Iowa State University professor for an applied regression analysis course he teaches. "The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values."

 

The objective with this data is to predict saleprice of a house/townhome/condo as a function of housing attributes or features. With saleprice a continuous numeric target, the challenge is one of regression. We wish to explore the direction/strength of relationships between the features, which may be either numeric or categorical, with numeric saleprice. Several exploratory techniques are demonstrated below.

 

Let's get started.

 

First, set a few options, load some packages, and identify the file to be loaded from a data website. Read the data into an R data.table named housing.

In [1]:
options(warn=-1)
options(datatable.print.topn=200)
options(scipen=10)


suppressMessages(library(data.table))
suppressMessages(library(readxl))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(gridExtra))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(forcats))
suppressMessages(library(tidyverse))
suppressMessages(library(splines))


dir <- "c:/data/ames housing"
#fname <- "https://www.openintro.org/stat/data/ames.csv"
fname <- "housing.csv"

setwd(dir)
In [2]:
housing <- fread(fname)
setnames(housing,gsub("[ ,.]","",tolower(names(housing))))
str(housing)
 
Classes 'data.table' and 'data.frame':	2930 obs. of  82 variables:
 $ order        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pid          : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
 $ mssubclass   : int  20 20 20 20 60 60 120 120 120 60 ...
 $ mszoning     : chr  "RL" "RH" "RL" "RL" ...
 $ lotfrontage  : int  141 80 81 93 74 78 41 43 39 60 ...
 $ lotarea      : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
 $ street       : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ alley        : chr  "" "" "" "" ...
 $ lotshape     : chr  "IR1" "Reg" "IR1" "Reg" ...
 $ landcontour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ lotconfig    : chr  "Corner" "Inside" "Corner" "Corner" ...
 $ landslope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ neighborhood : chr  "NAmes" "NAmes" "NAmes" "NAmes" ...
 $ condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
 $ condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
 $ bldgtype     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ housestyle   : chr  "1Story" "1Story" "1Story" "1Story" ...
 $ overallqual  : int  6 5 6 7 5 6 8 8 8 7 ...
 $ overallcond  : int  5 6 6 5 5 6 5 5 5 5 ...
 $ yearbuilt    : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ yearremodadd : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
 $ roofstyle    : chr  "Hip" "Gable" "Hip" "Hip" ...
 $ roofmatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
 $ exterior1st  : chr  "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
 $ exterior2nd  : chr  "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
 $ masvnrtype   : chr  "Stone" "None" "BrkFace" "None" ...
 $ masvnrarea   : int  112 0 108 0 0 20 0 0 0 0 ...
 $ exterqual    : chr  "TA" "TA" "TA" "Gd" ...
 $ extercond    : chr  "TA" "TA" "TA" "TA" ...
 $ foundation   : chr  "CBlock" "CBlock" "CBlock" "CBlock" ...
 $ bsmtqual     : chr  "TA" "TA" "TA" "TA" ...
 $ bsmtcond     : chr  "Gd" "TA" "TA" "TA" ...
 $ bsmtexposure : chr  "Gd" "No" "No" "No" ...
 $ bsmtfintype1 : chr  "BLQ" "Rec" "ALQ" "ALQ" ...
 $ bsmtfinsf1   : int  639 468 923 1065 791 602 616 263 1180 0 ...
 $ bsmtfintype2 : chr  "Unf" "LwQ" "Unf" "Unf" ...
 $ bsmtfinsf2   : int  0 144 0 0 0 0 0 0 0 0 ...
 $ bsmtunfsf    : int  441 270 406 1045 137 324 722 1017 415 994 ...
 $ totalbsmtsf  : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
 $ heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
 $ heatingqc    : chr  "Fa" "TA" "TA" "Ex" ...
 $ centralair   : chr  "Y" "Y" "Y" "Y" ...
 $ electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ x1stflrsf    : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
 $ x2ndflrsf    : int  0 0 0 0 701 678 0 0 0 776 ...
 $ lowqualfinsf : int  0 0 0 0 0 0 0 0 0 0 ...
 $ grlivarea    : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
 $ bsmtfullbath : int  1 0 0 1 0 0 1 0 1 0 ...
 $ bsmthalfbath : int  0 0 0 0 0 0 0 0 0 0 ...
 $ fullbath     : int  1 1 1 2 2 2 2 2 2 2 ...
 $ halfbath     : int  0 0 1 1 1 1 0 0 0 1 ...
 $ bedroomabvgr : int  3 2 3 3 3 3 2 2 2 3 ...
 $ kitchenabvgr : int  1 1 1 1 1 1 1 1 1 1 ...
 $ kitchenqual  : chr  "TA" "TA" "Gd" "Ex" ...
 $ totrmsabvgrd : int  7 5 6 8 6 7 6 5 5 7 ...
 $ functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
 $ fireplaces   : int  2 0 0 2 1 1 0 0 1 1 ...
 $ fireplacequ  : chr  "Gd" "" "" "TA" ...
 $ garagetype   : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
 $ garageyrblt  : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ garagefinish : chr  "Fin" "Unf" "Unf" "Fin" ...
 $ garagecars   : int  2 1 1 2 2 2 2 2 2 2 ...
 $ garagearea   : int  528 730 312 522 482 470 582 506 608 442 ...
 $ garagequal   : chr  "TA" "TA" "TA" "TA" ...
 $ garagecond   : chr  "TA" "TA" "TA" "TA" ...
 $ paveddrive   : chr  "P" "Y" "Y" "Y" ...
 $ wooddecksf   : int  210 140 393 0 212 360 0 0 237 140 ...
 $ openporchsf  : int  62 0 36 0 34 36 0 82 152 60 ...
 $ enclosedporch: int  0 0 0 0 0 0 170 0 0 0 ...
 $ x3ssnporch   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ screenporch  : int  0 120 0 0 0 0 0 144 0 0 ...
 $ poolarea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poolqc       : chr  "" "" "" "" ...
 $ fence        : chr  "" "MnPrv" "" "" ...
 $ miscfeature  : chr  "" "" "Gar2" "" ...
 $ miscval      : int  0 0 12500 0 0 0 0 0 0 0 ...
 $ mosold       : int  5 6 6 4 3 6 4 1 3 6 ...
 $ yrsold       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ saletype     : chr  "WD" "WD" "WD" "WD" ...
 $ salecondition: chr  "Normal" "Normal" "Normal" "Normal" ...
 $ saleprice    : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 

Update the housing data.table to set blanks values to NA and make factors/categories of recurring-value character variables. Add a log sale price variable. Finally, create additional decile "bin" attributes for several continuous variables such as above ground living area, "grlivarea". I can hear my old stats professors gasping!

In [3]:
housing[housing==""] <- NA

invisible(housing[,`:=`(bsmtcond=as.factor(bsmtcond),
                           alley=as.factor(alley),
                           mszoning=as.factor(mszoning),
                           street=as.factor(street),
                           lotshape=as.factor(lotshape),
                           landcontour=as.factor(landcontour),
                           utilities=as.factor(utilities),
                           lotconfig=as.factor(lotconfig),
                           landslope=as.factor(landslope),
                           neighborhood=as.factor(neighborhood),
                           condition1=as.factor(condition1),
                           condition2=as.factor(condition1),
                           bldgtype=as.factor(bldgtype),
                           housestyle=as.factor(housestyle),
                           roofstyle=as.factor(roofstyle),
                           roofmatl=as.factor(roofmatl),
                           exterior1st=as.factor(exterior1st),
                           exterior2nd=as.factor(exterior2nd),
                           masvnrtype=as.factor(masvnrtype),
                           exterqual=as.factor(exterqual),
                           extercond=as.factor(extercond),
                           foundation=as.factor(foundation),
                           bsmtqual=as.factor(bsmtqual),
                           bsmtexposure=as.factor(bsmtexposure),
                           bsmtfintype1=as.factor(bsmtfintype1),
                           bsmtfintype2=as.factor(bsmtfintype2),
                           centralair=as.factor(centralair),
                           heating=as.factor(heating),
                           heatingqc=as.factor(heatingqc),
                           electrical=as.factor(electrical),
                           functional=as.factor(functional),
                           kitchenqual=as.factor(kitchenqual),
                           garagetype=as.factor(garagetype),
                           garagefinish=as.factor(garagefinish),
                           garagecond=as.factor(garagecond),
                           garagequal=as.factor(garagequal),
                           paveddrive=as.factor(paveddrive),
                           fence=as.factor(fence),
                           saletype=as.factor(saletype),
                           salecondition=as.factor(salecondition),
                           decx1stflrsf = as.factor(ntile(x1stflrsf, 10)),
                           decx2ndflrsf = as.factor(ntile(x2ndflrsf, 10)),                     
                           declotarea = as.factor(ntile(lotarea, 10)),
                           bath = as.factor(fullbath+halfbath),
                           decgrlivarea = as.factor(ntile(grlivarea, 10)),
                           log10saleprice = log10(saleprice)
                          )])
str(housing)
 
Classes 'data.table' and 'data.frame':	2930 obs. of  88 variables:
 $ order         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pid           : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
 $ mssubclass    : int  20 20 20 20 60 60 120 120 120 60 ...
 $ mszoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
 $ lotfrontage   : int  141 80 81 93 74 78 41 43 39 60 ...
 $ lotarea       : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
 $ street        : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ alley         : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ lotshape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
 $ landcontour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
 $ utilities     : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ lotconfig     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
 $ landslope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ neighborhood  : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
 $ condition1    : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ condition2    : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ bldgtype      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
 $ housestyle    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
 $ overallqual   : int  6 5 6 7 5 6 8 8 8 7 ...
 $ overallcond   : int  5 6 6 5 5 6 5 5 5 5 ...
 $ yearbuilt     : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ yearremodadd  : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
 $ roofstyle     : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
 $ roofmatl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ exterior1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
 $ exterior2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
 $ masvnrtype    : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...
 $ masvnrarea    : int  112 0 108 0 0 20 0 0 0 0 ...
 $ exterqual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
 $ extercond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ foundation    : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
 $ bsmtqual      : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 3 5 3 3 3 5 ...
 $ bsmtcond      : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 5 5 5 5 5 5 5 5 5 ...
 $ bsmtexposure  : Factor w/ 4 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...
 $ bsmtfintype1  : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 2 5 1 1 3 3 3 1 3 6 ...
 $ bsmtfinsf1    : int  639 468 923 1065 791 602 616 263 1180 0 ...
 $ bsmtfintype2  : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 4 6 6 6 6 6 6 6 6 ...
 $ bsmtfinsf2    : int  0 144 0 0 0 0 0 0 0 0 ...
 $ bsmtunfsf     : int  441 270 406 1045 137 324 722 1017 415 994 ...
 $ totalbsmtsf   : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
 $ heating       : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ heatingqc     : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
 $ centralair    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ electrical    : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ x1stflrsf     : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
 $ x2ndflrsf     : int  0 0 0 0 701 678 0 0 0 776 ...
 $ lowqualfinsf  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ grlivarea     : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
 $ bsmtfullbath  : int  1 0 0 1 0 0 1 0 1 0 ...
 $ bsmthalfbath  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ fullbath      : int  1 1 1 2 2 2 2 2 2 2 ...
 $ halfbath      : int  0 0 1 1 1 1 0 0 0 1 ...
 $ bedroomabvgr  : int  3 2 3 3 3 3 2 2 2 3 ...
 $ kitchenabvgr  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ kitchenqual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
 $ totrmsabvgrd  : int  7 5 6 8 6 7 6 5 5 7 ...
 $ functional    : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ fireplaces    : int  2 0 0 2 1 1 0 0 1 1 ...
 $ fireplacequ   : chr  "Gd" NA NA "TA" ...
 $ garagetype    : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ garageyrblt   : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
 $ garagefinish  : Factor w/ 3 levels "Fin","RFn","Unf": 1 3 3 1 1 1 1 2 2 1 ...
 $ garagecars    : int  2 1 1 2 2 2 2 2 2 2 ...
 $ garagearea    : int  528 730 312 522 482 470 582 506 608 442 ...
 $ garagequal    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ garagecond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ paveddrive    : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
 $ wooddecksf    : int  210 140 393 0 212 360 0 0 237 140 ...
 $ openporchsf   : int  62 0 36 0 34 36 0 82 152 60 ...
 $ enclosedporch : int  0 0 0 0 0 0 170 0 0 0 ...
 $ x3ssnporch    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ screenporch   : int  0 120 0 0 0 0 0 144 0 0 ...
 $ poolarea      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poolqc        : chr  NA NA NA NA ...
 $ fence         : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
 $ miscfeature   : chr  NA NA "Gar2" NA ...
 $ miscval       : int  0 0 12500 0 0 0 0 0 0 0 ...
 $ mosold        : int  5 6 6 4 3 6 4 1 3 6 ...
 $ yrsold        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ saletype      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ salecondition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ saleprice     : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
 $ decx1stflrsf  : Factor w/ 10 levels "1","2","3","4",..: 9 3 8 10 4 4 8 7 9 5 ...
 $ decx2ndflrsf  : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 8 8 1 1 1 9 ...
 $ declotarea    : Factor w/ 10 levels "1","2","3","4",..: 10 8 9 8 9 6 2 2 2 3 ...
 $ bath          : Factor w/ 7 levels "0","1","2","3",..: 2 2 3 4 4 4 3 3 3 4 ...
 $ decgrlivarea  : Factor w/ 10 levels "1","2","3","4",..: 7 1 5 9 7 7 5 4 7 8 ...
 $ log10saleprice: num  5.33 5.02 5.24 5.39 5.28 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 

Now randomly apportion housing into housing_train and housing_test data.tables with 70 and 30 percent of records respectively. Henceforth we use only housing_train to explore, avoiding the pitfalls of snooping.

In [4]:
set.seed(123)

TRAIN <- .70
TEST <- 1 - TRAIN
nr <- nrow(housing)

train <- round(nr*TRAIN)

samp <- sample(nr,nr)
housing_train <- housing[samp[1:train]]
housing_test <- housing[samp[(train+1):nr]]

dim(housing_train)
dim(housing_test)
 
  1. 2051
  2. 88
 
  1. 879
  2. 88
 

Define a generic frequencies function and exercise it on several housing_train features.

In [5]:
frequenciesdyn <- function(DTstr, xstr)
{    
        return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>% 
               arrange(desc(count)) %>% mutate(percent=100*count/sum(count)))
}
In [6]:
frequenciesdyn("housing_train","mszoning")
frequenciesdyn("housing_train","decgrlivarea")
frequenciesdyn("housing_train","mszoning,bath")
 
mszoningcountpercent
RL160778.35202340
RM30614.91955144
FV1024.97318381
RH180.87762067
C (all)160.78010726
A (agr)20.09751341
 
decgrlivareacountpercent
621610.531448
521310.385178
421310.385178
720910.190151
22039.897611
12029.848854
92019.800098
82009.751341
31979.605071
101979.605071
 
mszoningbathcountpercent
RL268633.44709898
RL147323.06192101
RL341820.38030229
RM11547.50853242
RM21416.87469527
FV3643.12042906
FV2371.80399805
RL4261.26767431
C (all)1110.53632374
RM3110.53632374
RH290.43881034
RH170.34129693
C (all)240.19502682
RL620.09751341
RH320.09751341
C (all)010.04875670
RL510.04875670
A (agr)110.04875670
FV410.04875670
A (agr)210.04875670
RL010.04875670
 

Define a ggplot2 function that "dot plots" attribute frequencies from largest to smallest. The graphs include text frequency counts and percentages. Exercise the function on attributes "mszoning", "bath", and "neighborhood".

In [7]:
find <- function(invec,outvec) invec %in% outvec

mkdgrph <- function(dtstr,colstr, howmany=1000)
{
   
str1 <- 'frequenciesdyn("%s", "%s")'
nstr1 <- sprintf(str1,dtstr,colstr)
slug <- eval(parse(text=nstr1))
    
nrow <- sum(slug$count)
ncat <- nrow(slug)

maxct <- max(slug$count)
    
str2 <- 'ggplot(slug, 
    aes(y=count, x=reorder(%s,count), ymin=0, ymax=count), col="blue") +
    theme_economist_white() +
    geom_point(size=1, color="blue") + 
    geom_linerange(size=.5, color="blue") +
    theme(plot.title = element_text(size = 10,colour="black")) +  
    coord_flip() +
    theme(axis.text=element_text(size=rel(0.5)),axis.title=element_text(size=rel(0.6))) +  
    geom_text(aes(label = paste(count,round(percent,1),sep=":::"), y = count + .085*maxct), size = 2) +
    ggtitle(paste("%s\n", "%s: ", nrow, " records, ", ncat, " categories", sep="")) +
    ylab("Frequency") +
    xlab("%s\n")'
nstr2 <- sprintf(str2,colstr,dtstr, colstr, colstr)
g1 <- eval(parse(text=nstr2))
    
print(g1)

}
In [8]:
mkdgrph("housing_train","mszoning")
 
In [9]:
mkdgrph("housing_train","bath")
 
In [10]:
mkdgrph("housing_train","neighborhood")
 
 

Define a function to generate graphical frequencies of all factor variables in the housing_train data set. Execute the function and produce a pdf file with the results. This output is my point of departure for exploration. Note that similar generating functions can be used to "loop" over numerical attributes.

In [11]:
mkallgrph <- function(dtstr)
{
    str1 <- 'data.frame(%s)'
    nstr1 <- sprintf(str1, dtstr)
    slug <- eval(parse(text=nstr1))
    map(names(slug), 
          function(nme)
            {
              if (class(slug[,nme]) == "factor") mkdgrph(dtstr, nme)  
            })
    return(NULL)
}
In [12]:
pdf("ames housing_freqs.pdf",height=7)
mkallgrph("housing_train")
dev.off()
 
NULL
 
png: 2
 

Just as frequency plots are central to factor attributes, so are histograms to numerical variables. Let's look at a histogram of main depvar "log10saleprice" with 15 categories or bins.

In [13]:
g <- ggplot(housing_train, aes(x=log10saleprice)) +
    geom_histogram(col="yellow",bins=15)
g
 
 
 

Next do a similar graph with a density histogram where the area under the curve should be seen as summing to 1.

In [14]:
g <- ggplot(housing_train, aes(x=log10saleprice)) +
    geom_histogram(aes(y=..density..),col="yellow",bins=15)
g
 
 
 

Now superimpose a smoothed kernel density plot over the the density histogram.

In [15]:
g <- ggplot(housing_train, aes(x=log10saleprice)) +
    geom_histogram(aes(y=..density..),col="yellow", bins=15) +
    geom_density(col="red") 

g
 
 
 

Finally, note how the histogram and kernel densities begin to converge as the bin size gets smaller (and the number of bins gets larger). This isn't serendipity.

In [16]:
g <- ggplot(housing_train, aes(x=log10saleprice)) +
    geom_histogram(aes(y=..density..),col="yellow", bins=105) +
    geom_density(col="red") 
g
 
 
 

The net-net is that we can use the kernel density curve alone to depict the distribution of numeric variables with the area under the curve equal to 1. Coupling kernel densities with the trellis or facteting capabilities of ggplot2 lets the analyst explore relationships between factor independent and numeric dependent variables.

 

Let's take a look at faceted kernel density plots of "log10saleprice", first by factor "mszoning", and then by binned variable "decgrlivarea". For "mszoning", we omit the category "A (agr)", which has a frequency of just 2. The plots sort the panels left to right, top to bottom by median sale price for each factor category. Each panel's density has a different color, with the overall median "log10saleprice" depicted by an identical vertical solid line in each panel, and the individual category medians indicated by dashed colors. Notice how the density distributions shift in a positive direction as panels proceed along categories left to right, top to bottom.

 

In the "mszoning" example, the differences in the density distributions by category are pretty pronounced -- starting at the "C (all)" level with 16 observations and a median sale price of $66,678, to "FV" with 102 observations and a median sale price of $213,349. This would suggest a pretty strong relationship between "saleprice" (and hence "log10saleprice") and "mszoning".

 

The findings are similar for the binned attribute "decgrlivarea", indicating a positive relationship between "grlivgarea" and "log10saleprice". Of course, in this case we could have shown the relationship in a simple scatterplot, but binning can be helpful when data size is large.

In [17]:
frequenciesdyn("housing_train","mszoning")
frequenciesdyn("housing_train","decgrlivarea")
 
mszoningcountpercent
RL160778.35202340
RM30614.91955144
FV1024.97318381
RH180.87762067
C (all)160.78010726
A (agr)20.09751341
 
decgrlivareacountpercent
621610.531448
521310.385178
421310.385178
720910.190151
22039.897611
12029.848854
92019.800098
82009.751341
31979.605071
101979.605071
In [18]:
whichones <- frequenciesdyn("housing_train","mszoning")

slug <- left_join(whichones[count>5],housing_train)

slug <- slug[,`:=`(howmany=.N,medzone=median(log10saleprice)),.(mszoning)]
slug <- slug[,`:=`(both=paste(mszoning,"::",howmany,"::$",round(10^medzone),sep=""))] 
slug$both <- fct_reorder(slug$both, slug$log10saleprice, median)

ggplot(slug, aes(x=log10saleprice)) +
            geom_density(aes(color=both),  alpha = 0.4) +
            geom_vline(aes(xintercept=median(log10saleprice)),    
                color="black", size=.25) +
            geom_vline(aes(xintercept=medzone,color=both),    
                linetype="dashed", size=.25) +
            facet_wrap(~both) +
            theme(axis.text = element_text(size=5)) +
            theme(axis.title = element_text(size=5)) +
            theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=8)) +
            theme(legend.title=element_blank()) +            
            theme(legend.text=element_text(size=5)) +  
            theme(legend.position="none") +
            theme(strip.text=element_text(size=5)) +
            labs(title="mszoning", x="log10saleprice", y="density")
 
Joining, by = "mszoning"
 
 
In [19]:
whichones <- frequenciesdyn("housing_train","decgrlivarea")

slug <- left_join(whichones[count>5],housing_train)

slug <- slug[,`:=`(howmany=.N,medzone=median(log10saleprice)),.(decgrlivarea)]
slug <- slug[,`:=`(both=paste(decgrlivarea,"::",howmany,"::$",round(10^medzone),sep=""))] 

slug$both <- fct_reorder(slug$both, slug$log10saleprice, median)

ggplot(slug, aes(x=log10saleprice)) +
            geom_density(aes(color=both),  alpha = 0.4) +
            geom_vline(aes(xintercept=median(log10saleprice)),    
                color="black", size=.25) +
            geom_vline(aes(xintercept=medzone,color=both),    
                linetype="dashed", size=.25) +
            facet_wrap(~both) +
            theme(axis.text = element_text(size=5)) +
            theme(axis.title = element_text(size=5)) +
            theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=8)) +
            theme(legend.position="none") +
            theme(legend.text=element_text(size=5)) +            
            theme(strip.text=element_text(size=5)) +
            labs(title="decgrlivarea", x="log10saleprice", y="density")
 
Joining, by = "decgrlivarea"
 
 
 

We define a function that produces the graphs with much less angst than the above repeated cell code. And, of course, we can easily define a function (and do at the end) to invoke mkdens for each factor attribute in a data.table.

In [20]:
mkdens <- function(dtstr,xstr,ystr)
{
str0_ <- sprintf('vars <- c("%s","%s")',xstr,ystr)
eval(parse(text=str0_))
    
str1 <- 'whichones <- frequenciesdyn("%s","%s")'
str1_ <- sprintf(str1,dtstr, vars[1])
eval(parse(text=str1_))
#print(whichones)

str2 <- 'slug <- %s[%s %s whichones[count>=5]%s%s]'
str2_ <- sprintf(str2,dtstr,vars[1],"%in%", "$",vars[1])
eval(parse(text=str2_))

    
str3 <- 'slug <- slug[,`:=`(howmany=.N,medzone=median(%s)),.(%s)]'
str3_ <- sprintf(str3,vars[2], vars[1])
eval(parse(text=str3_))

str4 <- 'slug <- slug[,`:=`(both=paste(%s,"::",howmany,"::$",round(10^medzone),sep=""))]'
str4_ <- sprintf(str4,vars[1])
eval(parse(text=str4_))
    
str5 <- 'slug[, `:=`(both=fct_reorder(both, %s, median))]'
str5_ <- sprintf(str5,vars[2])
eval(parse(text=str5_))
    
str6 <- 'g <- ggplot(slug, aes(x=%s)) +
            geom_density(aes(color=both),  alpha = 0.4) +
            geom_vline(aes(xintercept=median(%s)),    
                color="black", size=.25) +
            geom_vline(aes(xintercept=medzone,color=both),    
                linetype="dashed", size=.25) +
            facet_wrap(~both) +
            theme(axis.text = element_text(size=5)) +
            theme(axis.title = element_text(size=5)) +
            theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=8)) +
            theme(legend.title=element_blank()) +            
            theme(legend.text=element_text(size=5)) +   
            theme(legend.position="none") +
            theme(strip.text=element_text(size=5)) +
            labs(title="%s")
'
str6_ <- sprintf(str6,vars[2],vars[2],vars[1])
eval(parse(text=str6_))
print(g)
g 
}

g <- mkdens("housing_train", "mszoning", "log10saleprice")
 
 

The trellised kernel density plots for factor variables against "log10saleprice" are quite useful, but ggpot2 also provides a "violin" plot which is a mixture of a kernel density and the older boxplot. Consider first the violin for "log10saleprice" in aggregate. The area under the curve is 1, so the girth indicates frequencies. The median, 75% and 25% "log10saleprice" are also denoted as horizontal lines.

In [21]:
g <- ggplot(housing_train, aes(x=rep(1,nrow(housing_train)), y=log10saleprice)) +
     geom_violin(trim = FALSE,draw_quantiles = c(0.25,0.5,0.75)) +
     labs(x="")

g
 
 
 

Next consider the violin plot "log10saleprice" against "mszoning". The distributions of "log10saleprice" by "mszoning" are denoted left to right ordered by median "log10saleprice" for each category. The overall median "log10saleprice" is indicated by the dashed black line, while individual "mszoning" level medians are solid black within each violin. Not surprisingly, as with the kernel density plot, this graph suggests a pretty strong relationship between "mszoning" and "log10saleprice".

 

Ditto for the plot that follows between "decgrlivarea" and "log10saleprice".

In [22]:
whichones <- frequenciesdyn("housing_train","mszoning")

slug <- left_join(whichones[count>=5],housing_train)

slug$mszoning <- with(slug, fct_reorder(mszoning, log10saleprice, median))

f <- slug[,.(count=.N,medmszoning=median(log10saleprice)),.(mszoning)][order(mszoning)][,lab:=paste(mszoning," :: ", count, " :: $", round(10^medmszoning), sep="")]

g <- ggplot(slug, aes(x=mszoning,y=log10saleprice, fill=mszoning)) +
     geom_violin(trim = FALSE,draw_quantiles = c(0.5)) +
            theme(legend.title=element_blank()) +
            theme(legend.position="none") +
            geom_hline(aes(yintercept=median(log10saleprice, na.rm=T)),   
            color="black", linetype="dashed", size=.25) +
            scale_x_discrete(breaks=f$mszoning,labels=f$lab) +
            theme(axis.text.x = element_text(size=6, angle = 45, hjust = 1)) +
            labs(title="mszoning", y="log10saleprice")
g
 
Joining, by = "mszoning"
 
 
In [23]:
whichones <- frequenciesdyn("housing_train","decgrlivarea")[count>=5,c("decgrlivarea"),with=F]

slug <- left_join(whichones,housing_train)

invisible(slug[, `:=`(countzone=.N,medval=median(log10saleprice)), 
               .(decgrlivarea)][,lab:=paste(decgrlivarea," :: ", countzone, " :: $",round(10^medval),sep="")])

slug$decgrlivarea <- with(slug, fct_reorder(decgrlivarea, log10saleprice, median))

f <- slug[,.(count=.N,medbath=median(log10saleprice)),.(decgrlivarea)][order(decgrlivarea)][,lab:=paste(decgrlivarea," :: ", count, " :: $", round(10^medbath), sep="")]

g <- ggplot(slug, aes(x=decgrlivarea,y=log10saleprice, fill=decgrlivarea)) +
        geom_violin(trim = FALSE,draw_quantiles = c(0.5)) +
            theme(legend.title=element_blank()) +
            theme(legend.position="none") +
            geom_hline(aes(yintercept=median(log10saleprice, na.rm=T)),   
            color="black", linetype="dashed", size=.25) +
            theme(strip.text.x = element_text(size = 5)) +
            scale_x_discrete(breaks=f$decgrlivarea,labels=f$lab) +                                                                             
            theme(axis.text.x = element_text(size=7, angle = 45, hjust = 1)) +
            labs(title="decgrlivarea", y="log10saleprice")

g
 
Joining, by = "decgrlivarea"
 
 
 

In contrast to density plots, with the violin plot, it's easier to expand from one dimension to two. In the following example, the factors are "mszoning" and "bath". Within each "mszoning" panel are violins of "bath". The panels are sorted by median "saleprice" for each "mszoning" category. Within each panel, violins are sorted by "bath".

In [24]:
whichones <- frequenciesdyn("housing_train","bath,mszoning")

slug <- left_join(whichones[count>=4],housing_train)

invisible(slug[, `:=`(countzone=.N,medval=median(log10saleprice)), 
               .(mszoning)][,lab:=paste(mszoning," :: ", countzone, " :: $",round(10^medval),sep="")])

slug$bath <- with(slug, fct_reorder(bath, log10saleprice, median))
slug$mszoning <- with(slug, fct_reorder(mszoning, log10saleprice, median))
slug$lab <- with(slug, fct_reorder(lab, log10saleprice, median))

g <- ggplot(slug, aes(x=bath,y=log10saleprice, fill=bath)) +
        geom_violin(trim = FALSE,draw_quantiles = c(0.5)) +
            facet_wrap(~lab,ncol=5) +
            theme(legend.title=element_blank()) +
            theme(legend.position="none") +
            geom_hline(aes(yintercept=median(log10saleprice, na.rm=T)),   
            color="black", linetype="dashed", size=.25) +
            theme(strip.text.x = element_text(size = 5)) +
            theme(axis.text.x = element_text(size=7, angle = 45, hjust = 1)) +
            labs(title="mszoning", y="log10saleprice")

g
 
Joining, by = c("bath", "mszoning")
 
 
 

Last but not least, density plots of "log10saleprice" for each data.table factor can be generated with the function below. A similar wrapper could be readily defined for violin plots.

In [25]:
mkalldens <- function(dtstr,ystr)
{
    str1 <- 'data.frame(%s)'
    nstr1 <- sprintf(str1, dtstr)
    slug <- eval(parse(text=nstr1))
    map(names(slug), 
          function(nme)
            {
              if (class(slug[,nme]) == "factor") mkdens(dtstr, nme, ystr)  
            })
    return(NULL)
}
In [26]:
pdf("ames housing_kerneldensity.pdf", height=7)
mkalldens("housing_train", "log10saleprice")
dev.off()
 
NULL
 
png: 2
 

That's it for now. We'll examine more exploratory techniques in a subsequent column.

 

 

 

 

 

Contact us today to find out how Inquidia can show you how to collect, integrate and enrich your data. We do data. You can, too.

Would you like to know more?

Sign up for our fascinating (albeit infrequent) emails. Get the latest news, tips, tricks and other cool info from Inquidia.