Reduce ETL process time 9X with Cassandra and Elasticsearch

A Fortune 1000 company needed to reduce the time it took to ingest, read, and search its data. Read on to learn how partnering with Shadow-Soft helped the company reduce ETL process time 9X and improve search query time 37X.

Challenge

A Fortune 1000 client was holding structured data in NoDB with corresponding lookup tables in Solr. This overcomplicated design contributed to a lengthy 3-day extract, transform, and load (ETL) process.

In addition, timeout errors in the production instances and slow query response times, averaging 7.5 seconds, made anything approaching real-time analytics impossible.

Scaling this architecture with customer demand did not achieve the desired results and created a large compute footprint that increased cost.

Shadow-Soft was contacted to address these issues and recommended replacing NoDB and Solr with Cassandra backed by Elasticsearch.

Solution

The ETL process was reduced from 3 days to 6.5 hours. This was done by removing complexity and having every node process a portion of the net data, then enabling Logstash on each node to call Cassandra. Cassandra then handled the network traffic for updates natively.
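The case study does not describe how the data was divided among nodes. As a minimal sketch of the general idea, assuming a stable hash of a record key decides which node processes it, the split could look like the following. The function names, the key format, and the node count are hypothetical, and the MD5-based hash is for illustration only; it is not Cassandra's own partitioner.

```python
import hashlib
from collections import defaultdict


def assign_node(record_key: str, node_count: int) -> int:
    """Map a record key to a node index using a stable hash.

    Hypothetical helper: illustrates hash-based work splitting only.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap to the node count.
    return int.from_bytes(digest[:8], "big") % node_count


def split_workload(record_keys, node_count):
    """Group record keys by the node that should process them."""
    shares = defaultdict(list)
    for key in record_keys:
        shares[assign_node(key, node_count)].append(key)
    return dict(shares)


# Example: divide 1,000 hypothetical record keys among 4 nodes.
keys = [f"record-{i}" for i in range(1000)]
shares = split_workload(keys, node_count=4)
for node, share in sorted(shares.items()):
    print(f"node {node}: {len(share)} records")
```

Because the hash is deterministic, every node can work out its own share independently, with no coordinator handing out work.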

Response time on production servers improved markedly, with average query time falling from 7.5 seconds to 0.2 seconds. Previously, large volumes of data put unnecessary stress on the production services and occasionally led to a shutdown.

Built-in horizontal scaling functionality of both Elasticsearch and Cassandra makes it easier to spin up nodes. The process of adding nodes went from a complicated multi-week effort to just hours.

The old ETL process required a deduplication step that significantly increased the process time. Shadow-Soft’s solution improved upon this step by using the native features of Cassandra to dedupe the data faster.
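The case study does not say which native Cassandra features were used for deduplication. One feature that could serve this purpose is Cassandra's upsert behavior: a write to an existing primary key updates that row rather than creating a second one. The sketch below models that behavior with an in-memory stand-in, so it runs without a cluster. The class, table, and column names are hypothetical.

```python
class UpsertTable:
    """In-memory stand-in for a table whose writes are upserts by primary key.

    Hypothetical model for illustration; it does not talk to Cassandra.
    """

    def __init__(self, key_columns):
        self.key_columns = tuple(key_columns)
        self.rows = {}

    def insert(self, row):
        key = tuple(row[column] for column in self.key_columns)
        # A write to an existing key merges into that row column by column,
        # so loading the same record twice never yields a duplicate row.
        self.rows.setdefault(key, {}).update(row)


# Hypothetical input containing one repeated record.
incoming = [
    {"customer_id": 1, "event_id": "a", "status": "new"},
    {"customer_id": 1, "event_id": "a", "status": "new"},
    {"customer_id": 2, "event_id": "b", "status": "new"},
]

events = UpsertTable(key_columns=("customer_id", "event_id"))
for row in incoming:
    events.insert(row)

print(len(events.rows), "unique rows from", len(incoming), "writes")
```

Under this assumption, deduplication becomes a side effect of the load itself rather than a separate pass over the data, provided the primary key is chosen so that duplicate records share the same key.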

Reference Architecture

[Figure: reference architecture diagram]

Outcome

By using the native open source capabilities of Cassandra and Elasticsearch, the large code redesign effort was accomplished in a matter of months. The implementation of these solutions reduced ETL process time by 9X, reduced data footprint by 2X, and improved query time by 37X.

Results

  • Reduced ETL process time from 3 days to 6.5 hours
  • Reduced data footprint from 360TB to 180TB
  • Reduced query time from 7.5 seconds to 0.2 seconds

Need help modernizing old processes? Contact Shadow-Soft to learn how we partner with organizations to reduce ETL process time and improve data performance.

