Optimizing SQL Server: Understanding Parallelism in Execution Plans

Rafael Rampineli
8 min read · Dec 16, 2024


In this article, we’ll focus on parallelism in execution plans. If you’re looking for guidance on how to tune CTFP (Cost Threshold for Parallelism) or MAXDOP (Max Degree of Parallelism), this isn’t the right place!

Parallelism in SQL Server execution plans divides the query workload across multiple threads, enabling the system to utilize multi-core processors. While parallelism can improve query performance, it also has overhead costs. Dividing work into threads, managing those threads, and combining their results requires additional processing time and resources. For queries with only a few rows, this overhead often outweighs the performance gain of using multiple CPUs.

However, as data volume increases, the costs of certain operators — like Sort — grow significantly, making parallelism an effective strategy for long-running, processor-intensive queries. This is particularly beneficial for large-scale queries in reporting, data warehousing, or business intelligence systems, where columnstore indexes are often used.

On the other hand, OLTP (Online Transaction Processing) systems, which handle many small and fast transactions, might not always benefit from parallelism. Although parallel execution may slightly speed up some queries, it can consume more CPU resources and slow down other concurrent operations, reducing overall system performance. In OLTP environments, parallelism is often avoided by the query optimizer, but occasionally it makes suboptimal decisions.

Despite this, certain queries — such as those for reporting — can still benefit from parallelism in OLTP systems, particularly when their estimated cost exceeds the parallelism threshold. This highlights the importance of configuring the Cost Threshold for Parallelism (CTFP) setting to ensure that parallelism is applied only when it offers performance benefits without excessive overhead.
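If you want to check where these two settings stand on your own instance, the sys.configurations catalog view shows both the configured value and the value currently in use. A minimal sketch:

-- Show the configured value and the running value
-- for both parallelism-related settings.
SELECT name,
       value        AS configured_value,
       value_in_use AS running_value
FROM sys.configurations
WHERE name IN ('cost threshold for parallelism',
               'max degree of parallelism');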

Blockers of Parallel Query Execution

Several factors can prevent a query from executing in parallel in SQL Server, often referred to as blockers of parallelism. These blockers may be related to system settings, query design, or the nature of the data and operations involved.

1. Code Statements that Force Serial Execution:

  • Scalar user-defined functions written in T-SQL
  • CLR user-defined functions that access data
  • Some internal SQL Server functions like ERROR_NUMBER(), IDENT_CURRENT(), and @@TRANCOUNT
  • Accessing system tables
  • Dynamic cursors

2. T-SQL Functions and Objects that Force Serial Execution in Part of the Plan:

  • Recursive CTEs
  • TOP keyword
  • Paging functions like ROW_NUMBER
  • Backward scans
  • Multi-statement table-valued user-defined functions
  • Global scalar aggregates

These parts of a T-SQL statement prevent parallel execution in the affected sections of the query plan. For a full list of blockers, refer to SQL Server Execution Plans by Grant Fritchey (Third Edition).
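To see one of these blockers in action, here’s a minimal sketch against AdventureWorks2019. The scalar UDF below (a hypothetical dbo.ufn_OrderYear, not part of the sample database) does nothing interesting, but its mere presence typically forces a serial plan; the root operator’s properties will usually show NonParallelPlanReason = CouldNotGenerateValidParallelPlan.

-- Hypothetical scalar UDF used only to demonstrate the blocker.
CREATE OR ALTER FUNCTION dbo.ufn_OrderYear (@d DATETIME)
RETURNS INT
WITH INLINE = OFF -- SQL Server 2019+ would otherwise inline the UDF and lift the blocker
AS
BEGIN
    RETURN YEAR(@d);
END;
GO

-- Even on a sizable table, this query gets a serial plan;
-- check the root operator for NonParallelPlanReason.
SELECT sod.ProductID,
       COUNT(*) AS Order_Count,
       MAX(dbo.ufn_OrderYear(sod.ModifiedDate)) AS Latest_Year
FROM Sales.SalesOrderDetail AS sod
GROUP BY sod.ProductID;
GO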

Parallel Query Execution

When SQL Server’s optimizer detects that a query can benefit from parallelism, it generates a parallel execution plan. In this plan, standard operators are marked with a yellow double arrow icon, indicating that the work is distributed across multiple processors. These operators perform the same tasks as in a serial plan, but on smaller subsets of data. Additionally, parallel plans include Parallelism (Exchange) operators, which distribute data across threads, coordinate processing, and merge the results.

Most operators don’t change in parallel plans; they simply process a portion of the rows instead of all of them. The key exceptions are scans and seeks that retrieve ranges of consecutive rows, where the threads cooperate, each picking up the next available range of rows on demand.

Let’s Get Hands-On

Let’s set up the demo by configuring CTFP to 1, which makes virtually every query eligible for a parallel plan, and capping MAXDOP at 4:

-- Both settings are advanced options, so make them visible first.
EXEC sys.sp_configure @configname = 'show advanced options', @configvalue = 1;
RECONFIGURE WITH OVERRIDE;
GO
EXEC sys.sp_configure @configname = 'cost threshold for parallelism', @configvalue = 1;
GO
EXEC sys.sp_configure @configname = 'max degree of parallelism', @configvalue = 4;
RECONFIGURE WITH OVERRIDE;
GO

To revert to the default values after testing:


-- Revert to the default values: 50 and 0 (all processors).
-- Run this once you've finished experimenting.
EXEC sys.sp_configure @configname = 'cost threshold for parallelism', @configvalue = 50;
GO
EXEC sys.sp_configure @configname = 'max degree of parallelism', @configvalue = 0;
RECONFIGURE WITH OVERRIDE;
GO

Now, let’s run an example query against the AdventureWorks2019 database:

USE AdventureWorks2019;
GO

SELECT so.ProductID,
       COUNT(*) AS Order_Count
FROM Sales.SalesOrderDetail AS so
WHERE so.ModifiedDate >= 'March 1, 2014'
  AND so.ModifiedDate < DATEADD(mm, 3, 'March 1, 2014')
GROUP BY so.ProductID
ORDER BY so.ProductID;

GO
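If you’re following along, be sure to capture the actual execution plan rather than the estimated one: in SSMS, enable Include Actual Execution Plan (Ctrl+M) before running the query, or wrap the run in SET STATISTICS XML, as sketched below:

-- Returns the actual plan as a clickable XML link in SSMS,
-- alongside the query results.
SET STATISTICS XML ON;
GO

SELECT so.ProductID,
       COUNT(*) AS Order_Count
FROM Sales.SalesOrderDetail AS so
WHERE so.ModifiedDate >= 'March 1, 2014'
  AND so.ModifiedDate < DATEADD(mm, 3, 'March 1, 2014')
GROUP BY so.ProductID
ORDER BY so.ProductID;
GO

SET STATISTICS XML OFF;
GO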

Analyzing the Execution Plan

Looking at the SELECT operator, we can see the Degree of Parallelism property set to 4, which means the query’s work was split across four threads. This reflects the parallelism in the actual execution plan.

In the Clustered Index Scan operator, the Parallel property is set to TRUE, and the Number of Executions is 4, meaning the scan operator ran once on each of the four threads, each reading its own share of the pages. The Actual Number of Rows Read is 121,317 rows, distributed across the four threads.

The data passes on to a Hash Match operator, which performs an aggregate count for each ProductID value, as defined by the GROUP BY clause in the T-SQL, but only for the rows on its own thread (the Hash Match is not parallelism-aware). The result is one row for each ProductID value that appears on a given thread. Rows for the same ProductID are likely to appear on other threads as well, so these aggregates are not yet final values, which is why the logical operation performed by the Hash Match is listed in the execution plan as a Partial Aggregate.

If you inspect the Properties of the Hash Match (Partial Aggregate) operator, you’ll see that it was executed 4 times, and again you can see how the partially aggregated rows are distributed across the threads.

You may see different row counts at this stage of the plan, depending on the degree of parallelism in your own tests and on how the rows happen to be distributed across the threads.
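If you want to experiment with that yourself, a per-query MAXDOP hint is an easy way to change the thread count for just this statement, without touching the server-wide setting:

-- Same query, capped at two threads: the per-thread row counts in the
-- Clustered Index Scan and Hash Match properties change accordingly.
SELECT so.ProductID,
       COUNT(*) AS Order_Count
FROM Sales.SalesOrderDetail AS so
WHERE so.ModifiedDate >= 'March 1, 2014'
  AND so.ModifiedDate < DATEADD(mm, 3, 'March 1, 2014')
GROUP BY so.ProductID
ORDER BY so.ProductID
OPTION (MAXDOP 2);
GO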

Looking at the execution plan, the rows next pass to a Parallelism operator performing an operation called Repartition Streams, which routes rows to the appropriate thread for processing. This can be done in two main ways:

  • Balancing streams: distributing rows evenly across threads so that each thread has a similar amount of work; and
  • Grouping data: ensuring that rows which need to be processed together are sent to the same thread.

In our case, the operator groups rows by the ProductID column, so all rows with the same ProductID end up on the same thread. This allows the final global aggregation to process the data correctly.
The Actual Number of Rows property shows how the rows were distributed across threads, with a roughly even balance. In this case, the even distribution was a fortunate outcome: had the hash algorithm applied to the ProductID values produced an unequal spread, we would have seen data skew instead.

The operator’s properties show that the partitioning type used here is Hash, where rows are distributed based on a hash of the ProductID column. Other partitioning types include Round Robin (even distribution without considering data) and Broadcast (sending all rows to every thread).
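If you’d like to see Broadcast partitioning for yourself, a join between a small table and a large one is a good candidate. A hedged sketch, with CTFP still at 1 (the exact plan shape depends on your hardware, statistics, and SQL Server version, so you may get a different exchange type):

-- With a small build input (Production.Product has ~500 rows), the
-- optimizer may choose a Parallelism (Distribute Streams) operator
-- with Broadcast partitioning, sending every Product row to every thread.
SELECT p.Name,
       SUM(sod.LineTotal) AS Revenue
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS sod
    ON sod.ProductID = p.ProductID
GROUP BY p.Name;
GO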

After the partial aggregation and repartitioning, all rows with the same ProductID are grouped onto the same thread. Since each of the four threads may have produced its own partial aggregate for a given ProductID, the receiving thread now holds up to four rows per ProductID. These rows must be aggregated again to complete the “local-global aggregation.”
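Conceptually, this local-global pattern is just two stacked GROUP BY aggregations. The sketch below illustrates the arithmetic in plain T-SQL; it is not how the engine physically executes the plan, since the real partial aggregates are split by thread rather than by a subquery:

-- The inner query plays the role of the per-thread partial aggregates;
-- the outer query is the global aggregate that combines them.
-- Summing the partial counts yields the same result as COUNT(*).
SELECT pa.ProductID,
       SUM(pa.Partial_Count) AS Order_Count
FROM (
    SELECT ProductID,
           COUNT(*) AS Partial_Count
    FROM Sales.SalesOrderDetail
    WHERE ModifiedDate >= 'March 1, 2014'
      AND ModifiedDate < DATEADD(mm, 3, 'March 1, 2014')
    GROUP BY ProductID -- imagine this running once per thread
) AS pa
GROUP BY pa.ProductID;
GO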

With the number of rows reduced through partial aggregation, the optimizer determines that sorting the data into the correct order and finishing with a Stream Aggregate is more cost-effective than using another Hash Match for the final aggregation. So a Sort followed by a Stream Aggregate is used instead.

A Sort operator benefits greatly from parallelization and often shows a significant reduction in total cost compared to the equivalent serial sort.

The next operator is another Parallelism operator, this time performing the Gather Streams operation: it merges the parallel streams back together, returning the data as a single stream to the query (or to whatever operator calls it). We can see that the output from this operator is now a single thread.

In the previous Parallelism operator (Repartition Streams), there was no guarantee of preserving the order of data, meaning rows were processed and passed to output threads as soon as they were available. This allowed for faster processing, but the order of the data might not be maintained.

However, if the order-preserving Exchange operator is used, the behavior changes. When the data in each thread is already sorted correctly, the operator will wait for all input threads to provide data before merging them into a single stream, while preserving the order. This can make the process slower compared to a non-order-preserving exchange, but because parallel sorting is highly efficient, the optimizer usually prefers a plan that uses parallel sorting and an order-preserving Parallelism operator over a plan that uses a non-order-preserving operator with a serial sort.

After this, the plan continues as a regular serial plan, processing the data on a single thread. The next step is the Compute Scalar operator, which converts the aggregated column to an integer.
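That final conversion exists because COUNT(*) returns an int, while the engine computes the count internally as a bigint (a COUNT_BIG). If you rewrite the query with COUNT_BIG(*) yourself, the Compute Scalar typically disappears from the plan:

-- COUNT_BIG(*) keeps the aggregate as bigint, so no narrowing
-- conversion (and usually no Compute Scalar) is needed.
SELECT so.ProductID,
       COUNT_BIG(*) AS Order_Count
FROM Sales.SalesOrderDetail AS so
WHERE so.ModifiedDate >= 'March 1, 2014'
  AND so.ModifiedDate < DATEADD(mm, 3, 'March 1, 2014')
GROUP BY so.ProductID
ORDER BY so.ProductID;
GO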

Finally, the data is returned through the SELECT operator.

Conclusion

I hope this article has helped you understand the basics of interpreting a parallel execution plan. While parallelism doesn’t fundamentally change how you read execution plans, it does require familiarity with a few new Parallelism operators and an understanding of their impact on other operators. This knowledge allows you to identify which queries benefit from parallelism and recognize situations where the overhead may outweigh the advantages.

Parallel query execution can boost performance, but it can also degrade it. To make the most of parallelism, it’s crucial to properly configure your system, especially the Max Degree of Parallelism and Cost Threshold for Parallelism settings. When set correctly, these values will help ensure that parallel queries are used only when they truly offer performance improvements.

Reference Source: SQL Server Execution Plans by Grant Fritchey (Third Edition).

Written by Rafael Rampineli

I specialize in managing and optimizing complex database environments to ensure high availability, performance, and data integrity.