Ways To Read Indexes Storage Engine Part 6

Hi everyone,

(docs)

today we are going to introduce the following:

  • Ordered scan
  • Merry-go-round or advanced scan
  • Allocation order scan
  • Index seek
  • range scan
  • Singleton lookup or point-lookup
  • SARGable Predicates

Ordered scan

let’s assume we ran the following query:

select*
from heapvsclustered 
order by id asc

Now here what SQL server does to get the data:

  • first, it goes to the root page
  • then it identifies the null range page or the lowest id intermediate page
  • then it goes to that and identifies the lowest key or the first null page
  • then if scans that page and from the doubles links
  • it goes to the next page etc. until it finishes the table data
  • then it is done

and in the physical operator in the execution plan, we can see the following

this is what you call an ordered scan

now it does not have to have the order by clause

it just means that it did scan the needed data through the index’s key order

now this order could be forward or backward since the default in SQL server is to create ascending keys

and our key was created with no specific order in mind

and we requested ascending order we scanned the index forward like we showed in the picture, it starts with the first page and picks it up from there

or it could go backward

now if we check the properties of our plan we can see the following:

Now if we order by desc like:

select*
from heapvsclustered 
order by id desc

now this one can not run in parallel

Merry-go-round or advanced scan

now here, two queries that share the same scan ranges,

for example in Stackoverflow(brent ozar)

if we write the following queries:

use StackOverflow
select EmailHash
from Users
where id >1 and id< 1000000000
select EmailHash
from Users
where id >5000
and id<1500000000

we would get the following plans

now these two could share the scan, by looking into the ordered property:

now these have to be in two different sessions since these would not execute simultaneously

these queries would use the same index scan twice, which would cut I/O costs

now what would happen if we assume the first query started first, then while executing, the second one started scanning the same pages, SQL server would go, and take the first result set and merge it with the remaining range, and find the rest of the range

in our case, it was the range between 5000 and 1000000000

now while doing that it does not respect the index order

and why is that? because it uses an allocation order scan

and there it might respect the page order, but it would not go through the index

more on it below

now since there is no order

we would see that in our execution plan

so we see that in the properties of our plan

now this does not mean we did use it, it just means that we did not stick to the key order

now here they observed the effects using wait_stats and i/o cost for each query

now since the order is not respected, if wanted and order we should use the ORDER BY clause

now, the only way to tell directly we used such a thing is to attach a debugger and take a look at the code, we will cover that in another blog

Allocation order scan

now instead of using the index key, it uses the IAM pages, just like the heaps and it scans the whole table, and the ordered property would be false as well,

same goes here, we don’t know if the order was not respected we scanned twice, but different pages

or we just went to the allocation unit map

and took whatever we wanted

secondly, here we are not scanning heaps

we are scanning clustered index pages

so no forwarding pointers

here we have index pages

so here we might have to scan pages twice or skipped

since it is not going to use the index pages

it might scan for a value that has an old range

but then got moved because of index fragmentation

but the allocation map does not have knowledge of that

which causes it to skip some rows

or scan the same rows twice

since the row has been moved after the select statement was issued

that is why this is not used much unless we specify READ UNCOMMITTED or SERIALIZABLE

the previous example fits this category

(Paul White)

Index seek

now here we have a very specific value or values that could be reached on a small number of pages and we don’t need to scan the whole index, that which reduces the cost

for example:


select id 
from dbo.heapvsclustered
where id between 1 and 5
and 1 = (select 1)

The plan would be like:

Now here it did this:

it looked for the minimum and max value requested in the root page

it identified which intermediate page it needed

then it went there, it looked for the leaf pages that contain our ids

then located the data page we needed, went there, and got the rows for us

so instead of scanning the whole thing it just requested some of them and got them for us

now there are two types of seeks point-lookup or singleton-lookup and range scan

range scan

we see this(Paul White) where we have a range of rows, not a specific one that we defined(like the query in the previous example)

there it defines the range it wants, from root to intermediate to page

then it gets it at one move, like we presented in the previous query,

Now if we check the plan tool tip like:

As you can see SQL server defined the range, then it starts at the start, now here we know from the properties:

The start is at the 1, and in the same plan we know it is forward, so we know it starts:

begin tran
select id 
from dbo.heapvsclustered
where id between 1 and 5
and 1 = (select 1)
OPTION (RECOMPILE, QUERYTRACEON 8605, QUERYTRACEON 8606, QUERYTRACEON 8607,querytraceon 3604)
COMMIT TRAN;

*** Converted Tree: ***
    LogOp_Project QCOL: [TestDatabase].[dbo].[heapVSclustered].id

        LogOp_Select

            LogOp_Get TBL: dbo.heapvsclustered dbo.heapvsclustered TableID=2069582411 TableReferenceID=0 IsRow: COL: IsBaseRow1000 

            ScaOp_Logical x_lopAnd

                ScaOp_Comp x_cmpGe

                    ScaOp_Identifier QCOL: [TestDatabase].[dbo].[heapVSclustered].id

                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)

                ScaOp_Comp x_cmpLe

                    ScaOp_Identifier QCOL: [TestDatabase].[dbo].[heapVSclustered].id

                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=5)

                ScaOp_Comp x_cmpEq

                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)

                    ScaOp_Subquery  COL: Expr1003 

                        LogOp_Project

                            LogOp_ConstTableGet (1) [empty]

                            AncOp_PrjList 

                                AncOp_PrjEl COL: Expr1003 

                                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)

        AncOp_PrjList 

It starts with what it is going to project or:

  1. LogOp_Project QCOL: [TestDatabase].[dbo].[heapVSclustered].id = meaning this is the column that I want to get the rows from, now this one has two child operators
    1. AncOp_PrjList = which is here for patching sake only
    2. LogOp_Select: which is a logical operator that filters our request like the where clause in our query, here it defines which rows should we get and it has 2 child operators
      1. LogOp_Get TBL: that says we should get rows and opens the execution
      2. ScaOp_Logical x_lopAnd: scalar operator(does scalar operations = deals with single-valued math) that defines the and( the where clause) in our operations since we had an and int it, and this one has 3 child operators
        1. ScaOp_Comp x_cmpGe ScaOp_Identifier QCOL: [TestDatabase].[dbo].[heapVSclustered].id ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1): here it says it is a comparison operator that deals with greater than or equalt to(GE) and it defines the value int 1 so it is ≥ 1
        2. ScaOp_Comp x_cmpLe: less than or equal to 5
        3. ScaOp_Comp x_cmpEq: equality operator from equal hence ‘=’ we had in our where clause now this one has two child operators:
          1. ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1): Ti here means type of integer, meaning an integer that has the value of 1, hence our one that had a scalar constant which was 1
          2. ScaOp_Subquery COL: Expr1003: as it sounds it is a subquery logical operator that has a worktable(temporary table created and destroyed by SQL serve during execution) with one column called Expr1003 now this subquery has one child operator:
            1. LogOp_Project: defines what we are going to project= show in this query it has 2 child operators:
              1. LogOp_ConstTableGet (1) [empty]: this one gets from table empty a constant called 1 hence our 1 in the subquery
              2. AncOp_PrjList : shows us calculated columns or constant expressions in a query it has one child operator:
                1. AncOp_PrjEl COL: Expr1003 el at the end means element so here it is going to project an expression or computed column element that would be defined in the child operator
                  1. ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1): we just explained it
      then in simplification, it deletes the subquery since there is no need for it and settles down on the scalar operators that we saw in the execution plan then in optimization, while choosing physical operators it settles down for a seek and this is why we see our scalar operators in the execution plan now we have a start, and since we are in forward-range scans our start would be GE operators same as in the graphical plan ≥ 1 and we have an end which was ≤ 5 so it starts at 1 and keeps going until it finds the end and then it is done if it needed other pages, it would use the double links

Singleton lookup or point-lookup

now here we have a specific row or rows in mind

we generally use in or = in the syntax

so SQL server does a single lookup for each

for example:

select id 
from dbo.heapvsclustered
where id in(1,2)
and 1 = (select 1)
OPTION(RECOMPILE,QUERYTRACEON 8606,QUERYTRACEON 8607,
QUERYTRACEON 8608,QUERYTRACEON 3604,
QUERYTRACEON 8619,QUERYTRACEON 8620,
QUERYTRACEON 8621,QUERYTRACEON 8615,
QUERYTRACEON 8675)

The execution plan had a seek? what kind

if we look up the properties:

so there is no start or end

just two distinct seek operations

Now in the simplified tree:

*** Simplified Tree: ***
        LogOp_Select

            LogOp_Get TBL: dbo.heapvsclustered dbo.heapvsclustered TableID=2069582411 TableReferenceID=0 IsRow: COL: IsBaseRow1000 

            ScaOp_Logical x_lopOr

                ScaOp_Comp x_cmpEq

                    ScaOp_Identifier QCOL: [TestDatabase].[dbo].[heapVSclustered].id

                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)

                ScaOp_Comp x_cmpEq

                    ScaOp_Identifier QCOL: [TestDatabase].[dbo].[heapVSclustered].id

                    ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=2)

  • so we have the regular logop_select that filters the rows for us it has two child operators
    • ScaOp_Logical x_lopOr: so instead of an and we have an ‘OR’ since it could be anything from the IN() we had it had two child operators
      • ScaOp_Comp x_cmpEq: equality it has two child operators
        • ScaOp_Identifier QCOL: now this one shows us where to identify, look for the row
        • ScaOp_Const TI: it says it should be 1
        • the same goes for the other operator

and from there it looks for alternatives, generates a physical operator, and then we have two seeks with two seek predicates

now if we add more values in the in we can see:


select id 
from dbo.heapvsclustered
where id in(1,2,3,4,5,7)
and 1 = (select 1)

So 6 in here now if we ask for 50 values


go
SET STATISTICS TIME ON
go
select id 
from dbo.heapvsclustered
where id in(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
  41, 42, 43, 44, 45, 46, 47, 48, 49, 50)
and 1 = (select 1)
option (recompile)
go

select id 
from dbo.heapvsclustered
where id between 1 and 50
go
SET STATISTICS TIME off

we get 50 seeks and if we compare with range scan seeks while we measure execution time

the output would be:


 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 40 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 33 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

so the range scan was shorter, so if we need 1 value point-lookup might be more effective but once the number increases the range scan might surpass it

both are seeks

but one of them seeks a range, and the other one seeks one row at a time

now if we = like:

select id 
from dbo.heapvsclustered 
where id = 1 

which is what we want

on the other hand, if we scan like this:


go
SET STATISTICS TIME on

go
select id 
from dbo.heapvsclustered
where id>1
and 1=(select 1)

go
select id 
from dbo.heapvsclustered
where 1=(select 1)

go
SET STATISTICS TIME off

The output was like:


 SQL Server Execution Times:
   CPU time = 63 ms,  elapsed time = 216 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

 SQL Server Execution Times:
   CPU time = 32 ms,  elapsed time = 369 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Now here it still takes less but if we use a larger table like salesorderheaderenlarged from Adam Machanic:

use adventureworks2022
go
SET STATISTICS TIME on

go
select SalesOrderID
from Sales.SalesOrderHeaderEnlarged
where SalesOrderID>1
and 1=(select 1)

go
select SalesOrderID
from Sales.SalesOrderHeaderEnlarged
where 1=(select 1)

go
SET STATISTICS TIME off

The output was like:

(1 row affected)

 SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 4177 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(1290065 rows affected)

(1 row affected)

 SQL Server Execution Times:
   CPU time = 610 ms,  elapsed time = 4047 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

so it starts taking less

now if we do that for StackOverflow

use StackOverflow
go
set statistics time on
go
select id 
from Users
where id > 1
and 1 = (select 1)
go
select id 
from Users
where 1 = (select 1)
go
set statistics time off
go

the output

(8917505 rows affected)

 SQL Server Execution Times:
   CPU time = 3515 ms,  elapsed time = 37786 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(8917507 rows affected)

 SQL Server Execution Times:
   CPU time = 4094 ms,  elapsed time = 36262 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

so it starts taking less and less time

SARGable Predicates

now for a query to get an index seek, it has to be SARGable or search argument-able value

so if we want to use the index seek, we need to meet this criterion, meaning, SQL server has to be able to isolate a single value from the index and show it on its own

what arguments could be qualified as SARGable?

  • <
  • =
  • like in the case of prefix matching example: where id like ‘1%’
  • in
  • between

non-SARGable queries would be:

  • like in case of non-prefix matching, for example: when id like’%1%’
  • not
  • <>

SQL server has to do extra work to process each row before matching it leading to an index scan

we will end this blog with an example of this and dive deeper into another blog:


select id 
from dbo.heapVSclustered
where id in (1)
----
select id 
from dbo.heapVSclustered 
where id like'%1%'

now here as you can see it seeked in the first, scanned in the second

now logically, they are different operations with different results

and the second one has a warning that says there is a convert-implicit

we will dive into both in upcoming blogs, goodbye

One response to “Ways To Read Indexes Storage Engine Part 6”

Leave a Reply

Your email address will not be published. Required fields are marked *