2 000 000 000 Rows Zoined Data Warehouse Performance Benchmark

Goals for the benchmark

Zoined has developed a high performance data warehousing platform for retail chains as part of the Zoined Retail Analytics SaaS service offering.

To demonstrate performance in a real-life, high-volume data warehousing scenario, we ran a benchmark test with the help of one of our customers, a large retail chain. The goal was to demonstrate the performance and response times of the Zoined platform when the data to be analysed consists of 2 000 000 000 real-life receipt rows. That is two thousand million rows, also called 2 billion (short scale) or 2 milliard (long scale).

The data warehouse architecture used in this benchmark consisted of five 32-core servers with 250 GB of memory each.

Architecture explained

Zoined Retail Analytics service architecture consists of 3 main components affecting performance.

Column-oriented cloud based data warehouse cluster

A column-oriented data warehouse stores each table as sections of columns rather than as rows. This enables extremely high query performance but limits writing data back directly to the data warehouse. In typical use cases the distributed server architecture scales performance almost linearly, so the cloud-based architecture can also be implemented to handle much larger datasets than the one in the benchmark described here.
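The difference between the two layouts can be illustrated with a minimal sketch. This is not Zoined's implementation; the receipt fields and the helper names (to_columns, append_row) are hypothetical and chosen only for illustration.

```python
# Hypothetical receipt rows; the field names are assumptions for illustration.
rows = [
    {"store": "A", "product": "milk", "amount_cents": 120},
    {"store": "A", "product": "bread", "amount_cents": 250},
    {"store": "B", "product": "milk", "amount_cents": 110},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def append_row(columns, row):
    """Writing one row touches every column, which hints at the write limitation."""
    for name, value in row.items():
        columns[name].append(value)


columns = to_columns(rows)

# Row layout: every whole row is visited to read a single field.
total_from_rows = sum(row["amount_cents"] for row in rows)

# Column layout: only the "amount_cents" column is scanned.
total_from_columns = sum(columns["amount_cents"])
```

An aggregate such as total sales reads one contiguous column in the second layout, while inserting a single receipt row has to update every column.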

Analytics engine Zorbas

Zorbas is a special analytics query engine developed by Zoined that enables complex queries beyond the limitations of traditional SQL queries. It has been developed from the start with high performance in mind, and its key features include dynamic, metadata-driven query structures that can handle complicated data series combinations and aggregations on the fly.

Caches used in different layers

Various caches on different levels of the architecture are in use to speed up the response times and make user experience smoother.

The cache in the end user's browser is the fastest option. If a user reopens a dashboard or a report with the same query parameters, the server-side queries are not rerun; only a check is made that the data is still up to date.

The first time a query is run, all data must be fetched from the data warehouse cluster. In these cases the perceived response time is the combination of network latency and service response time.

The response time perceived by the end user depends on at least the following:

  • Internet connection speed and latency
  • Previous queries made by the user (browser cache)
  • Previous queries made by other users (server side in-memory cache)
  • Similar queries made (Zorbas analytics engine cache)
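The way a request falls through the cache layers listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Zoined code: the class LayeredCache and the function fake_warehouse are hypothetical, all three layers are modelled as exact-match dictionaries for a single user, and the freshness check is left out.

```python
class LayeredCache:
    """Checks faster layers first and falls back to the warehouse on a full miss."""

    def __init__(self, query_warehouse):
        # Ordered fastest to slowest; the layer names follow the list above.
        self.layers = [("browser", {}), ("server", {}), ("engine", {})]
        self.query_warehouse = query_warehouse

    def get(self, key):
        """Return (result, source), where source names the layer that answered."""
        for index, (name, store) in enumerate(self.layers):
            if key in store:
                result = store[key]
                # Copy the hit into the faster layers so the next lookup is quicker.
                for _, faster_store in self.layers[:index]:
                    faster_store[key] = result
                return result, name
        # Full miss: run the query against the warehouse and fill every layer.
        result = self.query_warehouse(key)
        for _, store in self.layers:
            store[key] = result
        return result, "warehouse"


def fake_warehouse(key):
    # Hypothetical stand-in for a real warehouse query; returns a dummy value.
    return f"result for {key}"


cache = LayeredCache(fake_warehouse)
first = cache.get(("sales", "last_month"))   # full miss, goes to the warehouse
second = cache.get(("sales", "last_month"))  # served from the browser layer
```

Only the first lookup pays the full cost of a warehouse query; repeated lookups are answered by the fastest layer that holds the result.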

The different cache layers are shown in the following architecture diagram:

[Figure: Cache layers]

Benchmark results

Testers from the Zoined customer generated a little over 1 800 real-life use case queries during a 30-minute performance testing window.

Typical queries fetched metrics such as sales, sales margin percentage or average ticket value for time periods from one hour to 24 months, at dimension levels ranging from the whole chain to individual stores and products, all the way down to individual receipt rows.

Web portal server side performance response times

  • 66% of the queries returned the results within the first second
  • 95% of the queries returned the results in under 4.3 seconds
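Figures of this kind (the share of queries answered within one second and the 95th percentile response time) can be derived from a list of measured latencies. The sketch below assumes the nearest-rank percentile method, which may differ from the method used in the benchmark, and the sample values are invented for illustration; they are not benchmark data.

```python
import math


def share_within(latencies, limit_seconds):
    """Fraction of queries that finished within the given limit."""
    return sum(1 for t in latencies if t <= limit_seconds) / len(latencies)


def percentile_nearest_rank(latencies, pct):
    """Smallest latency such that at least pct percent of queries are at or below it."""
    ordered = sorted(latencies)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Invented sample latencies in seconds, NOT taken from the benchmark.
sample = [0.2, 0.4, 0.5, 0.7, 0.9, 1.5, 2.0, 3.1, 4.0, 6.5]

share_under_one_second = share_within(sample, 1.0)
p95 = percentile_nearest_rank(sample, 95)
```

Reporting a high percentile alongside the share of fast queries shows how slow the slowest typical queries are, which an average alone would hide.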

Query response time distribution in seconds (2 000 000 000 real-life receipt rows):

[Chart: Query response time, in seconds]

Query response time distribution in milliseconds for the majority of queries, those returning within the first 0.7 seconds (2 000 000 000 real-life receipt rows):

[Chart: Query response time, in milliseconds]

Analytics engine performance response times

The analytics engine Zorbas handled approximately 1 600 queries. The difference from the total of about 1 800 queries is explained by caching: some queries were served from caches. As a result, the analytics engine response times are somewhat higher, because all 1 600 queries are “heavy” queries for which end user caches were not utilized.

  • 62% of the queries returned the results within the first second
  • 95% of the queries returned the results in under 4.8 seconds
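The gap between the portal query count and the analytics engine query count gives a rough estimate of how many queries were answered from caches. The sketch below uses the rounded figures from the text (about 1 800 and about 1 600), so the result is only an approximation.

```python
# Rounded counts from the text; the exact numbers were not published.
total_queries = 1800    # "a bit over 1 800" portal queries
engine_queries = 1600   # "approximately 1 600" queries handled by Zorbas

cached_queries = total_queries - engine_queries
cache_share = cached_queries / total_queries

print(f"{cached_queries} queries (~{cache_share:.0%}) were answered from caches")
```

With these rounded inputs, roughly one query in nine was served without reaching the analytics engine.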

Query response time distribution in seconds (2 000 000 000 real-life receipt rows):

[Chart: Query response time, in seconds]

Query response time distribution in milliseconds for the majority of queries, those returning within the first 0.7 seconds (2 000 000 000 real-life receipt rows):

[Chart: Query response time, in milliseconds]

Conclusions

The Zoined Data Warehouse platform is a truly scalable analytics platform that can satisfy the high performance requirements of large retail enterprises with thousands of stores.

Zoined's cloud-based data warehouse cluster offers efficient scaling options, so high performance can also be achieved when the amount of data is greater than the dataset used in this benchmark. We expect the Zoined platform to handle analytics needs efficiently for datasets well over 2 000 000 000 rows.

For a smooth dashboard user experience, traditional Hadoop architectures cannot provide response times as fast as a distributed column-oriented data warehouse cluster can. Hadoop definitely has a place in batch-oriented data processing tasks, but analytics needs that demand fast response times are today still better served by in-memory or column-oriented solutions, or by a combination of the two, as in the Zoined platform.

Cost-efficient, high-performance data warehouse solutions are still a rare breed, and Zoined is proud to offer large enterprises, too, the option of using the Zoined platform and the Zoined Retail Analytics SaaS service for their analytics needs.
