Amazon Redshift – Working with JSON Data

Amazon Redshift JSON Functions

Amazon Redshift has built-in JSON functions that allow extracting data out of JSON strings. You can easily modify a JSON string to store additional key=value pairs without needing to add columns to the table. Amazon Redshift’s JSON functions are modeled on the PostgreSQL JSON functions of the same name.

JSON_EXTRACT_PATH_TEXT is the Amazon Redshift function you will use most often when working with JSON data. The JSON path can be nested up to five levels deep. Path elements are case-sensitive. If a path element does not exist in the JSON string, JSON_EXTRACT_PATH_TEXT returns an empty string.
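
For example (the JSON literal below is made up purely for illustration), the function takes a JSON string followed by the path elements and returns the matching value as text:

    -- Returns 'newsletter'
    SELECT JSON_EXTRACT_PATH_TEXT('{"utm": {"source": "newsletter"}}', 'utm', 'source');

    -- A path element that does not exist returns an empty string, not NULL
    SELECT JSON_EXTRACT_PATH_TEXT('{"utm": {"source": "newsletter"}}', 'utm', 'medium');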

In this Amazon Web Services tutorial we will mainly focus on the Amazon Redshift JSON_EXTRACT_PATH_TEXT function.

When to Use Amazon Redshift JSON Queries?

Amazon Redshift JSON queries are very useful in the following cases:

  • Your Redshift schema keeps growing, and you don’t want to add a new column every time a new business requirement or attribute comes along.
  • You have a dynamic list of data that needs to be stored and queried with complex analytic queries.
  • You have a small set of key=value pairs and want to save space by storing them in a single column; NULL values can simply be skipped when storing.
  • You can read more about Amazon Redshift JSON functions in the AWS documentation.

Query Example
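
As an illustration, assume a hypothetical page_events table with a query_parameter_json text column holding the URL parameters as JSON. Individual parameters can then be pulled out without adding any columns to the table:

    SELECT event_id,
           JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_source') AS utm_source,
           JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_type')   AS utm_type
    FROM page_events
    LIMIT 10;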

How to Store Data in Redshift for JSON Queries to Work

We do extensive tracking of every action on our website, and Amazon Redshift was the obvious choice for storing this data. One requirement was to store all URL query parameters in key=value format. Query parameters are not a fixed set; the list is dynamic, so adding a column for each parameter was not an option. We also had to run extensive analytic queries on this data.

Possible Solutions

  • Store them in a text field and run “like” queries on them.
  • Store them in a text field in JSON format and use Amazon Redshift JSON functions.

We chose the second solution because benchmarking showed it was faster. The Amazon Redshift JSON functions also let us extract query parameters as separate columns, which was useful for our business intelligence team when preparing presentations.

Below are a few things to keep in mind for Redshift JSON queries to work (a short setup sketch follows this list):

  • Use a text data type (e.g. VARCHAR) for the column.
  • Store the data as JSON. In Ruby we first convert the key=value list to a hash and then call to_json to turn it into a JSON string before storing it.
  • JSON data can be loaded with the Redshift COPY command; it’s as simple as loading normal text data. You can read more about the COPY command in the AWS documentation.
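
A minimal sketch of this setup (the table, column, S3 path, and IAM role names are placeholders, not the ones we actually use):

    -- A plain text column is enough to hold the JSON string
    CREATE TABLE page_events (
        event_id             BIGINT,
        page_url             VARCHAR(2048),
        query_parameter_json VARCHAR(65535)
    );

    -- Load pipe-delimited files from S3; the JSON string is just another text field.
    -- Pipe works well as a delimiter because the JSON itself contains commas.
    COPY page_events
    FROM 's3://my-bucket/page-events/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftCopyRole'
    DELIMITER '|';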

JSON Query Example

Using Redshift JSON Function in Where Clause

You can also use Amazon Redshift JSON functions in the WHERE clause. Your query can be as complex as the one below:

Query Example
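
A sketch of such a query, reusing the hypothetical page_events table from above:

    SELECT event_id, page_url
    FROM page_events
    WHERE JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_source') = 'newsletter'
      AND JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_type') <> ''
      AND page_url LIKE '%/pricing%';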

Using Redshift JSON Function in Group By Clause

JSON functions are also allowed in the GROUP BY clause. A sample query is below:

Query Example
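
For instance, counting events per traffic source extracted from the JSON column (hypothetical names again):

    SELECT JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_source') AS utm_source,
           COUNT(*) AS events
    FROM page_events
    GROUP BY JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'utm_source')
    ORDER BY events DESC;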

Type Cast JSON Function Result

In the example below we type cast entity_id values to integer. Keep in mind that CAST and COALESCE can cause errors if the JSON is not correctly formatted.
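
One way to write such a cast safely, reusing the hypothetical page_events table: JSON_EXTRACT_PATH_TEXT returns an empty string for a missing key, and CAST('' AS INTEGER) fails, so the empty string is turned into NULL with NULLIF before casting and COALESCE supplies a default. This guard is an illustrative choice, not necessarily how the original query was written:

    SELECT COALESCE(
               CAST(
                   NULLIF(JSON_EXTRACT_PATH_TEXT(query_parameter_json, 'entity_id'), '')
                   AS INTEGER),
               0) AS entity_id
    FROM page_events;
    -- Note: the cast will still fail if the stored value is non-numeric text,
    -- which is the kind of badly formatted JSON mentioned above.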

Amazon Redshift Bulk Update JSON Data

We had a requirement to update more than a million Redshift rows. The field to be updated was a text column in which we stored data in JSON format, and we needed to replace a substring inside it.

query_parameter_json is the column name, and the stored values look like {"utm_source": "campaign", "utm_type": "u"}.

We needed to replace “campaign” with “newsletter” in the 1+ million rows where it was present.

Possible Solutions

  • Create a new column, parse the old column’s data into it, and drop the old column at the end.
  • Write a Ruby script and update the rows in batches using the Amazon Redshift COPY command.
  • Run a bulk update query.

Solutions 1 and 2 were feasible, but both involved significant effort, especially solution 1. We needed to get it done quickly, ideally within a couple of hours.

Amazon Redshift Substring Function to the Rescue

After exploring various options we settled on the solution below. Since the field is a text column, we can rewrite it in place with Amazon Redshift string functions.

Updating 1+ million rows in a single UPDATE can take a long time and can block incoming queries, so we ran the update in batches. You can read more about Amazon Redshift string functions in the AWS documentation.
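
A sketch of this approach: the REPLACE string function rewrites the matching substring, and an id range keeps each batch small (the table name, id boundaries, and exact batching scheme are illustrative):

    -- Rewrite only the utm_source value, one batch of ids at a time.
    -- The LIKE filter skips rows that do not need the change; the string
    -- match relies on the JSON being formatted consistently.
    UPDATE page_events
    SET query_parameter_json = REPLACE(query_parameter_json,
                                       '"utm_source": "campaign"',
                                       '"utm_source": "newsletter"')
    WHERE query_parameter_json LIKE '%"utm_source": "campaign"%'
      AND event_id BETWEEN 1 AND 500000;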

Drawback of Amazon Redshift JSON Functions

We also benchmarked on 1+ million rows in the SQL Workbench tool. Running Amazon Redshift SELECT queries on a JSON column can be 20-30% slower than equivalent queries on regular columns.

If you are a beginner Amazon Web Services developer, you can get started with the other AWS tutorials on this blog.

Thank you for reading my article. If you face any problems or have any doubts, let me know in the comments. If you like this article, please like our Facebook page and follow us on Twitter. For regular updates you can also subscribe with your email.

Please also share it on Facebook and Twitter to help other Amazon Web Services developers.