In my first post, I talked about the importance of site search analytics for e-commerce optimization. In this follow-up, I would like to show one way to easily build a site search analytics system at scale, without spending much time and effort on these ever-present questions:
1. Which database is best for analytics?
2. How do I operate that database at scale?
3. What are the operating costs for the database?
How-To: Site-Search Analytics without the Headache
These questions are important and necessary. Thankfully, in the age of cloud computing, others have already thought about them and built solutions that abstract away the complexity. One of them is Amazon Athena, which will help us build a powerful analysis tool from, in the simplest case, plain CSV files. Amazon Athena, explained in its own words:
“Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.” (Amazon Athena product page)
This introductory sentence from the Amazon website already answers our questions 1 and 2. All that remains is to answer question 3: how much does it cost? This is answered quickly enough:
- $5.00 per TB of data scanned by Athena
- Standard AWS S3 rates for storage, requests, and data transfer
AWS offers a calculator to roughly estimate the cost. Since you pay per terabyte scanned, a query that scans 200 GB costs roughly $1, and the less data a query has to read, the cheaper it gets. Because Amazon Athena uses Presto under the hood, it works with a variety of data formats, including CSV, JSON, ORC, Apache Parquet, and Apache Avro. Choosing the right file format can save you up to a third of the cost.
No data, no DIY analytics
A site search analytics tool requires a foundation: data. Data from an e-commerce system or from any site search tracking tool like the searchhub search-collector will suffice. For now, we will focus on how to convert that data into the best possible format, and leave the question of “how to extract data from the various systems” for a separate post.
As the database needn’t scan a complete row but only the columns which are referenced in the SQL query, a columnar data format is preferred to achieve optimal read performance. And to reduce overall size, the file format should also support data compression algorithms. In the case of Athena, this means we can choose between ORC, Apache Parquet, and Apache Avro. The company bryteflow provides a good comparison of these three formats here. These file formats are efficient and intelligent. Nevertheless, they lack the ability to easily inspect the data in a humanly readable way. For this reason, consider adding an intermediate file format to your ETL pipeline. Use this file to store the original data in an easy-to-read format like CSV or JSON. This will make your life easier when debugging any strange-looking query results.
What are we going to build?
We’ll now build a minimal Spring Boot web application that is capable of the following:
- Creating dummy data in a human-readable format
- Converting that data into Apache Parquet
- Uploading the Parquet files to AWS S3
- Querying the data from AWS Athena via the Athena JDBC driver, using JOOQ to create type-safe SQL queries
Creating the application skeleton
Head over to the Spring Initializr and generate a new application with the following dependencies:
- Spring Boot DevTools
- Lombok
- Spring Web
- JOOQ Access Layer
- Spring Configuration Processor
Hit the generate button to download the project. Afterward, extract the zip file and import the Maven project into your favorite IDE.
Our minimal database table will have the following columns:
- query
- searches
- clicks
- transactions
To build type-safe queries with JOOQ, we will use the jooq-codegen-maven plugin, which will generate the necessary code for us. The plugin can be configured to generate code based on SQL DDL commands. Create a file called jooq.sql inside src/main/resources/db and add the following content to it:
CREATE TABLE analytics (
    query VARCHAR,
    searches INT,
    clicks INT,
    transactions INT,
    dt VARCHAR
);
Next, add the plugin to the existing build/plugins section of our project's pom.xml (the generate flags below are a typical choice, adjust them as needed):
<plugin>
    <groupId>org.jooq</groupId>
    <artifactId>jooq-codegen-maven</artifactId>
    <executions>
        <execution>
            <id>generate-jooq-sources</id>
            <phase>generate-sources</phase>
            <goals><goal>generate</goal></goals>
        </execution>
    </executions>
    <configuration>
        <generator>
            <!-- Boolean generator flags: the exact option names here are an assumption. -->
            <generate>
                <pojos>true</pojos>
                <daos>true</daos>
                <fluentSetters>true</fluentSetters>
            </generate>
            <database>
                <name>org.jooq.meta.extensions.ddl.DDLDatabase</name>
                <inputSchema>PUBLIC</inputSchema>
                <outputSchemaToDefault>true</outputSchemaToDefault>
                <outputCatalogToDefault>true</outputCatalogToDefault>
                <properties>
                    <property><key>sort</key><value>semantic</value></property>
                    <property><key>scripts</key><value>src/main/resources/db/jooq.sql</value></property>
                </properties>
            </database>
            <target>
                <packageName>com.example.searchinsightsdemo.db</packageName>
                <directory>target/generated-sources/jooq</directory>
            </target>
        </generator>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>org.jooq</groupId>
            <artifactId>jooq-meta-extensions</artifactId>
            <version>${jooq.version}</version>
        </dependency>
    </dependencies>
</plugin>
The IDE may require the Maven project to be updated before the new sources are picked up. Once the build has run (for example via mvn generate-sources), you should be able to see the generated code under target/generated-sources/jooq.
Before creating SQL queries with JOOQ, we first need to create a DSLContext using an SQL connection to AWS Athena. This assumes we have a corresponding Athena JDBC driver on our classpath. Unfortunately, Maven Central provides only an older version (2.0.2) of the driver, which isn't an issue for our demo. For production, however, you should use a more recent version, which AWS provides for download.
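As a sketch of where this is heading: once the driver is on the classpath, wiring up a DSLContext and running a first type-safe query against the generated code could look roughly like the following. The region, S3 output location, and environment-variable credentials are placeholders to adapt; the driver class name matches the 2.0.x Simba-based driver and should be verified against the version you actually use; and since JOOQ has no dedicated Athena dialect, SQLDialect.DEFAULT is a pragmatic assumption:

import static com.example.searchinsightsdemo.db.Tables.ANALYTICS;

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class AthenaQueryExample {

    public static void main(String[] args) throws Exception {
        // Explicit driver loading; this class name is that of the 2.0.x driver.
        Class.forName("com.simba.athena.jdbc.Driver");

        // Credentials and the query-results bucket are placeholders for your setup.
        Properties info = new Properties();
        info.put("User", System.getenv("AWS_ACCESS_KEY_ID"));
        info.put("Password", System.getenv("AWS_SECRET_ACCESS_KEY"));
        String url = "jdbc:awsathena://AwsRegion=eu-central-1;"
                + "S3OutputLocation=s3://my-athena-query-results/";

        try (Connection connection = DriverManager.getConnection(url, info)) {
            // No dedicated Athena dialect exists in JOOQ, so we fall back to DEFAULT.
            DSLContext context = DSL.using(connection, SQLDialect.DEFAULT);

            // Type-safe query against the generated code: top 10 queries by searches.
            context.select(ANALYTICS.QUERY, ANALYTICS.SEARCHES)
                   .from(ANALYTICS)
                   .orderBy(ANALYTICS.SEARCHES.desc())
                   .limit(10)
                   .fetch()
                   .forEach(record -> System.out.println(
                           record.get(ANALYTICS.QUERY) + ": " + record.get(ANALYTICS.SEARCHES)));
        }
    }
}

Depending on how identifiers are cased during code generation, you may need to adjust JOOQ's render settings (for example, rendering lower-case, unquoted names) so that the generated SQL matches Athena's case-insensitive identifiers.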