In today’s diverse data environments, it seems like JSON data is everywhere now. Often working with JSON datums means using one of the popular NoSQL databases like MongoDB or Cassandra. These are easy enough to install, but most organizations use SQL databases for analytics. This begs the question: Is it possible to use SQL-based tools to do analytics on a JSON dataset? Well, the short answer is YES!...with a smattering of technical gymnastics and a recent version of PostgreSQL.
The PostgreSQL community has been steadily adding JSON functionality over the last couple of major releases. While a simple web search provides plenty of how-to instructions for storing JSON data in PostgreSQL, few articles actually exist that tell you how to use it. This blog will help you understand how to not only capture, but to analyze JSON data within PostgreSQL.
Finding Some JSON Data
The first thing I had to do was find a firehouse of JSON that would work. While I was forming the idea for this blog, I was 100% focused on a riveting World Series between the Cleveland Indians and Chicago Cubs. It dawned on me that I could use a twitter search (stealing liberally from a previous Inquidia Labs exercise) of World Series tweets to gather my JSON data.
I designed a basic Postgres table to store the tweets. It contains 2 columns: an id and the tweet. Note that the tweet is a JSON datum.
I used Pentaho Data Integration (PDI) to consume the Twitter “data firehouse” and persist it into my new Postgres tweets table. Since my company, Inquidia Consulting, is one of the premier Pentaho consultancies on the planet, and that you are reading this on our website, I’ll assume the majority of you know at least a little about Pentaho. (If not, please explore our myriad blog posts to learn more about this wonderful platform.)
Trick #1: Writing JSON Data to Postgres using Views and Rules
Pentaho has a multitude of output capabilities, some specifically for JSON; Mongo and Cassandra just to name two. But, if you have ever tried to write to a JSON(B) datatype via PDI, you know it doesn’t work…or does it?
Trying to write directly to this table using the traditional Table Output step doesn’t work as PDI does not understand the JSONB data type.
Remember those technical gymnastics I mentioned earlier? It’s time to get limber now.
There are a couple of approaches to solve this. I chose to create a view which does an explicit type cast to text, which PDI can read with ease.
But, how then can I write data through a view? Well, the trick is to use the PostgreSQL rule system! You can present a view to PDI that has plain vanilla text fields. When PDI inserts to this view, Postgres handles the conversion and writes directly to the underlying table.
So, now we can use a simple Table Output step in PDI to insert the JSON configured per the diagram below.
Trick #2: Using PDI with a Little Java Magic to Obtain Historical Tweets
I would be remiss if I didn’t point out how, in the past, we’ve used a PDI extension point, User Defined Java Class, to consume tweets via the Twitter API. Since I was working my little experiment after the World Series actually began, I needed to use the Search API to look backward in time. So, a little Java magic and we have a “step” that calls the Twitter search API outputting rows to its result stream.
Since this API is documented to be non-exhaustive, we took a brute force approach and called it repeatedly, using the Transform Executor step, and ignored Primary Key violations at the Table Output using error handling as the search will return duplicates from its prior invocations.
This resulted in 150,000+ tweets across games 6 & 7. Admittedly, I turned it off after the Cubs victory as I am from Cleveland and have been an avid Indians fan since birth. So, I didn’t get nearly the volume of tweets I expected, and the general vitriol and “spam” made the dataset NSFW. But, the concept is sound and I got enough data to do my experiment.
Trick #3: Using Functions and Views to Unpack JSON for Ubiquitous SQL access
So, we now have a PostgreSQL repository of tweets that we’ve consumed via the Twitter API using Pentaho Data Integration. How do we interrogate/mine that information for meaning? I did not want to write an analytics tool from scratch and most of the packaged OLAP tools on the market are SQL based. I had to do some more technical gymnastics, but I found a way to make the open source Mondrian OLAP engine work.
My first challenge was figuring out how to expose elements within the JSON field such that they were queriable via the SQL an OLAP tool generates The answer: PostgreSQL functions. There are multitudes of functions available for use on JSON(B) fields. After some trial and error, I found a combination of operators and functions that produced a dataset that I could work with. This is by no means the best combination, but it does work for my quick POC analyses.
Having already used Pentaho to populate the data, my first choice was to see if I could create a Mondrian schema, using custom SQL code to expose JSON elements as columns, for use in dimensions and measures. I could and it was relatively easy.
Using the Expression capabilities within Mondrian schemas, I created SQL snippets that extract various elements from the JSON encoded tweet and present them as separate items. I created the schema using Pentaho’s schemaWorkbench tool. The [Tweet Country] field is created by retrieving the place object from within the tweet and then retrieving the country element as text.
Which when presented via Pentaho Analyzer, allows for simple slicing and dicing of JSON data!
Here is the log, just to prove we are analyzing the information encapsulated within the tweet JSON.
Extracting single elements from JSON proved to be a simple task. However, JSON allows for much more complex data structures, including arrays, specifically the Hashtags and User Mentions. Using the same technique above proved to be problematic. I was able to create a dimension that exploded the array contents into individual values.
But, it only worked under ideal conditions, and could not be used within a SQL WHERE clause -- basically invalidating its use within Mondrian, as eventually a user will drill on a value and generate SQL that includes a WHERE component.
This led me down the path of creating a view in PostgreSQL that would bundle all the necessary JSON functions together and present a “ROLAPy” SQL representation to Mondrian. Although not as “sexy” as using the Mondrian schema to parse out the JSON elements, the view approach did circumvent all the issues encountered with the JSON arrays.
With minor modifications to our existing cube, we now reference the view and can act upon each field as if they were table columns
The upside to doing the conversion of the JSON within the database (instead of the Mondrian schema), is that it can be used by any SQL-based analytics tool. I attached QlikSense to the view and was able to analyze JSON data with Qlik!
And it doesn’t stop there! Tableau 10.1 recently announced support for JSON files, so you can assume JSON within a database is on the horizon. Not surprisingly, we can use the same techniques; either using the view, or Custom SQL in the connector to make data available within Tableau. I really like how simple it is to do geographic analysis, and the following image definitely shows, baseball is an American game but has avid followers in Latin America and Asia.
But wait...there is a small problem with our approach
Those of you familiar with RDBMS optimizers will realize there is a flaw in the view approach. The view is fully resolved for every request, even when the Hashtags or User Mentions are not needed for the analytic query.
But, there are tweaks we can make to our view definition and physical structures to improve the situation. We’ll investigate these and the potential performance benefit of using MonetDB in a future post.