Kamil Kosno

SQL Server - improving pivot query performance using auxiliary table

Updated: Mar 21

In this post I focus on implementing an auxiliary lookup table to reduce the execution time of pivot queries run against large datasets.

Preparation


All examples in this post use a test data table populated with the following code:


use testdb;

go


drop table if exists #tmpFeatures;


create table #tmpFeatures

(

FeatureID smallint not null,

FeatureKey varchar(500) primary key

);


insert into #tmpFeatures(FeatureID, FeatureKey)

values

(1, 'population:person:feature:age'),

(2, 'population:person:feature:height'),

(3, 'population:person:feature:weight'),

(4, 'population:person:feature:size'),

(5, 'population:person:feature:body_mass_index');


drop table if exists dbo.PersonDetail;


create table dbo.PersonDetail

(

PersonDetailID bigint identity(1,1) not null,

PersonID bigint not null,

FeatureKey varchar(500) not null,

FeatureValue decimal(24, 8) not null

);


create unique clustered index idx_cl_unq on dbo.PersonDetail(PersonID, PersonDetailID);


declare @population_count bigint = 10000000;


--Using nums table method developed by Itzik Ben-Gan to generate large number of records
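--Each CTE level squares the previous row count: p1 = 2, p2 = 4, p3 = 16,
--p4 = 256, p5 = 65,536 and p6 = 65,536^2 (about 4.3 billion rows available).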

with

p1(a) as(select 1 union all select 2),

p2(a) as(select 1 from p1 as b cross join p1 as c),

p3(a) as(select 1 from p2 as b cross join p2 as c),

p4(a) as(select 1 from p3 as b cross join p3 as c),

p5(a) as(select 1 from p4 as b cross join p4 as c),

p6(a) as(select 1 from p5 as b cross join p5 as c),

nums as (select row_number() over(order by (select null)) as n from p6)

insert into dbo.PersonDetail (PersonID, FeatureKey, FeatureValue)

select nm.n, tf.FeatureKey, fv.FeatureValue

from nums as nm

cross join #tmpFeatures as tf

cross apply

(

select case

when tf.FeatureID = 1 then ABS(CHECKSUM(NewId())) % 80 + 5

when tf.FeatureID = 2 then ABS(CHECKSUM(NewId())) % 50 + 140

when tf.FeatureID = 3 then ABS(CHECKSUM(NewId())) % 60 + 40

when tf.FeatureID = 4 then ABS(CHECKSUM(NewId())) % 12 + 10

else ABS(CHECKSUM(NewId())) % 20 + 18

end as FeatureValue

) as fv

where nm.n <= @population_count;


Test data is generated for a population of 10 million people. Each person's details include age, height, weight, size and body_mass_index. The actual values have little significance for the following tests, so they are generated randomly.


The structure of the table is a key-value pair for each feature, per person ID. PersonDetailID is added to the clustered index to make it unique.


The clustered index on dbo.PersonDetail is defined on PersonID and PersonDetailID; the typical use case is to join to this table on PersonID first.
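To see the shape of the data we can peek at one person's rows (a quick sanity check; the random values will differ on every run):


select *

from dbo.PersonDetail

where PersonID = 1;


--Expect five rows, one per FeatureKey, e.g. the
--population:person:feature:age row holds a random value between 5 and 84.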


In order to create a large number of records we used a brilliant technique originally developed by Itzik Ben-Gan. N.B. Itzik recently posted a challenge taking number series generation to the next level; if you are interested in reading more about it, here is the link to the first article of the series: https://sqlperformance.com/2020/12/t-sql-queries/number-series-challenge.

Classic pivoting


The classic pivoting technique involves three main elements:

  1. pivoted columns

  2. aggregation function

  3. aggregated value/column

Aggregation is processed against one or more grouping columns. When writing a pivot query we need to bear in mind that any columns in the source table or query that are not pivoted or aggregated will be used to group the results. Therefore, in our examples we use a subquery to first eliminate the PersonDetailID column from the grouping column set (a sketch of the failure mode follows).
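For illustration, here is a minimal sketch of pivoting dbo.PersonDetail directly, without the subquery: PersonDetailID is neither pivoted nor aggregated, so it becomes an implicit grouping column and the query returns one sparse row per source row instead of one row per person.


--What to avoid: PersonDetailID silently joins the grouping set,
--so every source row forms its own single-row group.

select

PersonID,

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

from dbo.PersonDetail

pivot(max(FeatureValue) for FeatureKey in

(

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

)) as pv;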


Here is the code for the first example. This is how we would normally write a pivot query, when the pivoted column names are known in advance:


use testdb;

go

set statistics io, time on;


select

PersonID,

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

from

(select PersonID, FeatureKey, FeatureValue from dbo.PersonDetail as pd) as pd

pivot(max(FeatureValue) for FeatureKey in

(

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

)) as pv;


To measure the performance we use the "discard results after execution" option (under Query -> Query Options -> Results).


The query produces the following execution plan:

There is nothing significant to note about this plan except that it is a serial plan. Otherwise it is quite efficient: the clustered index is scanned only once, in an ordered fashion.


The statistics obtained in the test are:

  • scan count 1

  • logical reads 464406

  • physical reads 2

  • read-ahead reads 464422

  • CPU time = 31531 ms

  • elapsed time = 32591 ms

Pivoting on steroids


Sometimes, by rephrasing a query or adding an element to it, we can force the SQL engine to behave differently. Interestingly, in the case of pivoting this can yield a significant performance improvement, as the following example demonstrates.


Let's start with the code:


use testdb;

go

set statistics io, time on;


drop table if exists #tmpFeatures;


create table #tmpFeatures

(

FeatureID smallint not null,

FeatureKey varchar(500) primary key

);


insert into #tmpFeatures(FeatureID, FeatureKey)

values

(1, 'population:person:feature:age'),

(2, 'population:person:feature:height'),

(3, 'population:person:feature:weight'),

(4, 'population:person:feature:size'),

(5, 'population:person:feature:body_mass_index');


select

PersonID,

[1] as [population:person:feature:age],

[2] as [population:person:feature:height],

[3] as [population:person:feature:weight],

[4] as [population:person:feature:size],

[5] as [population:person:feature:body_mass_index]

from

(select PersonID, FeatureID, FeatureValue from dbo.PersonDetail as pd inner join #tmpFeatures as tf on tf.FeatureKey = pd.FeatureKey) as pd

pivot(max(FeatureValue) for FeatureID in

(

[1],

[2],

[3],

[4],

[5]

)) as pv;


At first glance this does seem like over-complicating a simple pivoting task. The steps are:

  1. add all of the expected pivoted column names to a lookup table, with an optional second column holding a replacement value for each of the original values

  2. in the pivot input subquery, add an inner join against the temporary lookup table and, optionally, use the replacement values instead of the original ones (the replacement values are not strictly necessary; I used them for the simplicity of typing short numbers instead of lengthy keys)

  3. in the pivot column list, use the replacement values instead of the original keys (or the keys themselves, if we chose not to use shortened names/numbers)

Here is the execution plan for the above query:

The plan itself is more complex than the original one; however, thanks to parallel execution we have made substantial gains in performance (despite the higher scan count):

  • scan count 9

  • logical reads 465663

  • physical reads 3

  • read-ahead reads 464414

  • CPU time = 18968 ms

  • elapsed time = 10352 ms
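For completeness, the replacement IDs really are optional: we can keep the original keys and still join to the lookup table. Here is a sketch of that variant (whether the optimizer picks the same parallel plan is worth verifying on your own data):


select

PersonID,

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

from

(select pd.PersonID, pd.FeatureKey, pd.FeatureValue from dbo.PersonDetail as pd inner join #tmpFeatures as tf on tf.FeatureKey = pd.FeatureKey) as pd

pivot(max(FeatureValue) for FeatureKey in

(

[population:person:feature:age],

[population:person:feature:height],

[population:person:feature:weight],

[population:person:feature:size],

[population:person:feature:body_mass_index]

)) as pv;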

Dynamic pivot


In some cases we might want to pick the pivoted columns dynamically. Here is example code for those scenarios:


use testdb;

go


drop table if exists #tmpFeatures;


create table #tmpFeatures

(

FeatureID smallint not null,

FeatureKey varchar(500) primary key

);


insert into #tmpFeatures(FeatureID, FeatureKey)

select row_number() over(order by FeatureKey), FeatureKey from dbo.PersonDetail group by FeatureKey;


declare @select_cols nvarchar(max) =

(select stuff((select ',[' + cast(tf.FeatureID as varchar(255)) + '] as [' + tf.FeatureKey + ']' from #tmpFeatures as tf order by FeatureID for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, ''));

declare @pivoted_cols nvarchar(max) =

(select stuff((select ',[' + cast(tf.FeatureID as varchar(255)) + ']' from #tmpFeatures as tf order by FeatureID for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, ''));
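--At this point @pivoted_cols holds '[1],[2],[3],[4],[5]' and @select_cols pairs
--each ID with its key, e.g. '[1] as [population:person:feature:age],...'
--(the IDs follow alphabetical FeatureKey order because of the row_number above).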


declare @sql nvarchar(max) = N'

select

PersonID,' + @select_cols + '

from

(select PersonID, FeatureID, FeatureValue from dbo.PersonDetail as pd inner join #tmpFeatures as tf on tf.FeatureKey = pd.FeatureKey) as pd

pivot(max(FeatureValue) for FeatureID in

(

' + @pivoted_cols + '

)) as pv;';


exec sp_executesql @stmt = @sql;


A dynamic column list requires a change in the way we populate the lookup table: instead of hard-coded values, it is filled from the distinct FeatureKey values in the source table. We also need to build the list of pivoted columns (aliasing them where needed) using dynamic SQL.


Otherwise, we enjoy the same benefits as the auxiliary table approach described in the previous section.
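As a side note, on SQL Server 2017 and later the for xml path concatenation can be swapped for string_agg. A minimal sketch under that assumption (note that quotename returns NULL for inputs longer than 128 characters, which our feature keys stay well under):


declare @select_cols nvarchar(max), @pivoted_cols nvarchar(max);


select

@select_cols = string_agg(cast(quotename(cast(FeatureID as varchar(11))) + ' as ' + quotename(FeatureKey) as nvarchar(max)), ',') within group (order by FeatureID),

@pivoted_cols = string_agg(cast(quotename(cast(FeatureID as varchar(11))) as nvarchar(max)), ',') within group (order by FeatureID)

from #tmpFeatures;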

Summary


Pivoting data is a relatively common task in SQL querying. This article presented an additional step with the potential to improve pivot query performance by encouraging a parallel execution plan. The technique is most useful when large amounts of data are processed and parallel execution brings all of its advantages.
