What to do when moving to the cloud... slows you down

Optimizing costs and performance with Hyperscale

Challenge

Reduce processing time for larger and more complex ETL workflows

For one of our clients, we had to migrate their existing Data Warehouse (using Microsoft SSIS as an ETL tool and IBM Netezza PDA, a massively parallel processing (MPP) architecture, as their database infrastructure) from an on-premises environment to the cloud. As performance is a key indicator of successful cloud migration, picking the right cloud services for the target infrastructure is crucial.

Based on our client’s data volume, Azure SQL Server was chosen as the target environment for their Data Warehouse. Azure SQL Server provided a much more cost-efficient option than an MPP architecture in Azure (such as Synapse Analytics) while still delivering sufficient performance for their current workload. Another reason for choosing Azure SQL Server was the relatively short time until the existing Netezza PDA machine reached its end-of-support date.

A lift-and-shift approach proved to be the only way to migrate the on-premises Data Warehouse to the new cloud architecture in time. While migrating the smaller parts, however, we already noticed longer processing times (even though performance remained sufficient for all queries to finish in time). These were caused by sub-optimal queries developed for the on-premises Data Warehouse, where they do not directly impact costs, unlike in a cloud environment.

When migrating the larger parts later in the project, processing times of the larger, more complex ETL workflows grew well beyond what was acceptable. As a result, the data was no longer processed in time for end-users to receive their reports.

Approach

Investigate database monitoring metrics to decide on the best service tier upgrade


First, we carefully tried to reduce processing time by rewriting complex database queries, indexing large data tables, optimizing ETL workflow scheduling, and so on, until we found that these efforts yielded only marginal gains.
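
As an illustration of the kind of indexing check involved, here is a minimal sketch in Python with pyodbc that reads SQL Server's missing-index DMVs; the connection string, database names and ordering are placeholders, not the scripts we actually used.

```python
# Sketch: list index suggestions the engine has recorded for expensive scans.
# Connection details are placeholders; adjust driver/server/credentials as needed.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydwh;"
    "UID=etl_monitor;PWD=<secret>;Encrypt=yes"
)

query = """
SELECT TOP 10
       d.statement                  AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks + s.user_scans  AS potential_uses,
       s.avg_total_user_cost        AS avg_query_cost
FROM   sys.dm_db_missing_index_details d
JOIN   sys.dm_db_missing_index_groups g      ON g.index_handle = d.index_handle
JOIN   sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_total_user_cost * (s.user_seeks + s.user_scans) DESC;
"""

for row in conn.cursor().execute(query):
    print(row.table_name, row.equality_columns, row.potential_uses)
```

These suggestions still need human judgement: every extra index also slows down the ETL loads that write to the table.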

Next, we investigated the standard database monitoring metrics (CPU usage, memory usage, and so on). While none of those metrics came close to the database limits, Log IO did hit its maximum every day during the ETL processing window. Upon further inspection, we found that this happens when the database reaches the IOPS (input/output operations per second) or log throughput limits of its pricing tier.
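
For reference, a minimal sketch (again Python with pyodbc, placeholder connection string) of the check that exposes this: sys.dm_db_resource_stats reports resource usage as a percentage of the current service tier's limits, keeping roughly one hour of 15-second samples.

```python
# Sketch: flag intervals where the database ran close to its service-tier limits.
import pyodbc

conn = pyodbc.connect("<connection string>")  # placeholder

query = """
SELECT end_time,
       avg_cpu_percent,
       avg_memory_usage_percent,
       avg_data_io_percent,      -- data file IO vs. tier limit
       avg_log_write_percent     -- log write throughput ('Log IO') vs. tier limit
FROM   sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

for row in conn.cursor().execute(query):
    if row.avg_log_write_percent > 95 or row.avg_data_io_percent > 95:
        print(f"{row.end_time}: IO near the tier limit "
              f"(log {row.avg_log_write_percent:.0f}%, data {row.avg_data_io_percent:.0f}%)")
```

The 95% threshold is an arbitrary illustration; the Azure portal metrics tell the same story graphically.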

As our client was clearly facing an IO issue, we had 2 options:

  • Option 1: Upgrade the database compute size or service tier,
  • Option 2: Identify and tune the database queries consuming the most IO (a sketch of how such queries can be identified follows this list).
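
For completeness, a minimal sketch of how Option 2 could be approached, ranking Query Store entries by logical IO to find tuning candidates (this assumes Query Store is enabled on the database; the connection string is again a placeholder):

```python
# Sketch: find the queries responsible for the bulk of the IO via Query Store.
import pyodbc

conn = pyodbc.connect("<connection string>")  # placeholder

query = """
SELECT TOP 20
       q.query_id,
       qt.query_sql_text,
       SUM(rs.count_executions)                            AS executions,
       SUM(rs.avg_logical_io_reads * rs.count_executions)  AS total_logical_reads,
       SUM(rs.avg_log_bytes_used   * rs.count_executions)  AS total_log_bytes
FROM   sys.query_store_runtime_stats rs
JOIN   sys.query_store_plan       p  ON p.plan_id        = rs.plan_id
JOIN   sys.query_store_query      q  ON q.query_id       = p.query_id
JOIN   sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_logical_reads DESC;
"""

for row in conn.cursor().execute(query):
    print(f"{int(row.total_logical_reads):>15,}  {row.query_sql_text[:80]}")
```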

Since we were following a lift-and-shift approach under a strict deadline, identifying and tuning those database queries was not feasible: reworking the ETL workflows for the target architecture was planned for phase 2 and could not be pulled forward. Therefore, we investigated whether upgrading the database service tier could solve the issue. As IOPS/throughput was the bottleneck, we compared the IOPS/throughput limits of the different Azure SQL Server service tiers. Because the Business Critical service tier was ruled out due to its relatively high costs, we only considered the General Purpose and Hyperscale service tiers.

While comparing those service tiers, we noticed that scaling up a General Purpose tier database would increase IOPS only to a limited extent. For example, an upgrade from the then current 24 vCores to 40 vCores (at the time, the General Purpose compute size with the highest available IOPS) would raise IOPS from 7,680 to 12,800. This increase of roughly 67% falls short when you consider that a Hyperscale tier database with just 4 vCores more than doubles IOPS, from 7,680 to 16,000, which already exceeds the highest IOPS available in the General Purpose tier.
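
The comparison boils down to simple arithmetic. The sketch below merely restates the figures quoted above; these were the limits that applied at the time of the project, and current Azure limits may differ.

```python
# IOPS figures as quoted in this article (service-tier limits at project time).
gp_iops = {24: 7_680, 40: 12_800}   # General Purpose: vCores -> max IOPS
hs_iops = {4: 16_000}               # Hyperscale:      vCores -> max IOPS

gp_gain = gp_iops[40] / gp_iops[24] - 1   # scaling up within General Purpose
hs_gain = hs_iops[4]  / gp_iops[24] - 1   # switching to Hyperscale with 4 vCores

print(f"General Purpose 24 -> 40 vCores:           +{gp_gain:.1%} IOPS")  # +66.7%
print(f"General Purpose 24 vCores -> Hyperscale 4: +{hs_gain:.1%} IOPS")  # +108.3%
print(f"Hyperscale 4 vCores ({hs_iops[4]:,} IOPS) already beats "
      f"General Purpose 40 vCores ({gp_iops[40]:,} IOPS)")
```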

So, an upgrade to a Hyperscale tier database proved to be the best option to address the IO issues our client was experiencing, all the more because IOPS scales linearly with the number of vCores when scaling a Hyperscale tier database. Upgrading a Hyperscale database from 4 vCores to 18 vCores, for example, would increase IOPS from 16,000 to 72,000.
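
Once the decision fell on Hyperscale, the actual tier change is a single T-SQL statement. A minimal sketch, with an illustrative target service objective (HS_Gen5_4, i.e. Hyperscale on Gen5 hardware with 4 vCores) and placeholder names:

```python
# Sketch: scale the database to the Hyperscale tier.
# [mydwh] and 'HS_Gen5_4' are placeholders; pick the compute size you need.
import pyodbc

# ALTER DATABASE cannot run inside a transaction, hence autocommit=True.
conn = pyodbc.connect("<connection string>", autocommit=True)
conn.cursor().execute("""
    ALTER DATABASE [mydwh]
    MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4');
""")
# The scale operation runs asynchronously; its progress can be followed in the
# sys.dm_operation_status view on the server's master database.
```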

Impact

A significant decrease in our client’s ETL workflows’ processing window

The upgrade to a Hyperscale tier database shortened the processing window of our customer’s ETL workflows by more than 15%, so their data was once again processed in time. The Hyperscale architecture gave our client a distributed setup with a multi-layer caching system that improves both speed and scale, and with compute and storage resources that scale independently, resulting in considerable performance and cost-efficiency advantages.

In our experience, there are 3 major reasons why a Hyperscale service tier database can be the solution for Azure SQL Server performance issues:

  • Higher performance, the Hyperscale service tier being significantly faster than the General Purpose service tier for processing large data sets,
  • Better cost efficiency, especially when long run times caused by lower performance drive up costs on a General Purpose service tier,
  • Easy (reverse) migration from the General Purpose service tier, which makes Hyperscale easy to try out. If it does not deliver the expected gains, migrating back to the General Purpose service tier is straightforward, as sketched below (keep in mind that 2 to 4 ETL runs are needed before drawing conclusions on processing time, because the SQL Server database optimizes its internal processes based on how it is used).
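
A minimal sketch of such a reverse migration, mirroring the upgrade statement shown earlier; the target objective (GP_Gen5_24) and database name are illustrative, and the current Azure documentation should be checked for reverse-migration eligibility:

```python
# Sketch: move the database back to the General Purpose tier.
# [mydwh] and 'GP_Gen5_24' (General Purpose, Gen5, 24 vCores) are placeholders.
import pyodbc

conn = pyodbc.connect("<connection string>", autocommit=True)
conn.cursor().execute("""
    ALTER DATABASE [mydwh]
    MODIFY (EDITION = 'GeneralPurpose', SERVICE_OBJECTIVE = 'GP_Gen5_24');
""")
```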
