Amazon Redshift — Fast query performance, scalability, and integrations

Rafael Rampineli
8 min read · Dec 17, 2024


Amazon Redshift is a cloud-based data warehousing service provided by Amazon Web Services (AWS). It is designed to handle and analyze massive volumes of structured data efficiently. Redshift enables users to perform complex queries and analyses on data stored in a columnar format, which significantly accelerates the speed of data retrieval compared to traditional row-based databases.

At its core, Redshift is built on the Massively Parallel Processing (MPP) architecture, which allows it to distribute data and queries across multiple nodes (servers). This architecture ensures high performance and scalability, making Redshift suitable for businesses of all sizes, from startups to large enterprises.

Key Features of Amazon Redshift

  1. Scalability Amazon Redshift is highly scalable, allowing users to adjust resources based on their needs. You can start with a small cluster and scale up or down as your data grows. The service supports both vertical scaling (adding more power to a single node) and horizontal scaling (adding more nodes to the cluster). This elasticity makes it easy to accommodate ever-expanding data and evolving business needs.
  2. High Performance Redshift’s performance is driven by a combination of its MPP architecture, columnar storage, and data compression. By storing data in a columnar format instead of row-based storage, Redshift minimizes the I/O required for analytical queries. Additionally, it utilizes techniques like data compression and result caching to reduce query execution times and optimize storage.
  3. Fully Managed As a fully managed service, Amazon Redshift eliminates the need for customers to handle the complex tasks of managing, maintaining, and optimizing their data warehouse infrastructure. AWS automatically handles routine tasks like provisioning, patching, backups, and scaling, which allows businesses to focus on leveraging their data rather than managing the underlying infrastructure.
  4. Integration with AWS Ecosystem Redshift integrates seamlessly with other AWS services like AWS S3, AWS Glue, and AWS Lambda. For instance, data can be loaded directly from S3 into Redshift using the COPY command, and data transformations can be automated with AWS Glue. This interoperability with AWS services simplifies the process of building end-to-end data pipelines, from data ingestion to analytics.
  5. Security Amazon Redshift offers a robust security framework that includes encryption, identity and access management, and network isolation. It supports AES-256 encryption for data at rest and SSL encryption for data in transit. Redshift also integrates with AWS Identity and Access Management (IAM), enabling fine-grained access control to data and resources.
  6. Cost-Effective One of the attractive features of Amazon Redshift is its pay-as-you-go pricing model, which means businesses only pay for the resources they use. Redshift offers different pricing options based on the type of nodes used and the region in which the data warehouse is hosted. Users can choose between on-demand pricing or reserved instances for cost savings over long-term usage.
  7. Advanced Analytics and Machine Learning Redshift provides native support for SQL-based queries and integrates with popular data visualization tools like Tableau, Looker, and Power BI. Additionally, it offers integration with machine learning models through Amazon SageMaker. This enables businesses to run predictive analytics and gain deeper insights from their data directly within Redshift.
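As a concrete illustration of the S3 integration mentioned above, loading a file into Redshift with the COPY command typically looks like the sketch below. The table name, bucket path, and IAM role ARN are hypothetical placeholders you would replace with your own:

```sql
-- Load a CSV file from S3 into an existing Redshift table.
-- The table, bucket path, and role ARN below are example values.
COPY sales
FROM 's3://my-bucket/data/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
```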

Redshift Spectrum

Amazon Redshift also supports Redshift Spectrum, a feature that allows you to run SQL queries directly against data stored in Amazon S3 without having to load it into Redshift first. This extends the capabilities of Redshift by enabling the analysis of both structured and unstructured data across a unified platform. This is especially useful for businesses that need to work with massive datasets stored in data lakes or need to combine historical data with real-time data.

Steps to do:

  1. Create an S3 bucket in the same region as your Redshift cluster;
  2. Upload your data to the bucket;
  3. Create an IAM role for Redshift and associate it with the cluster;
  4. Create an external schema;
  5. Create an external table;
  6. Enjoy running your queries!

Here is a short code example for steps 4 and 5:

create external schema schema_new
from data catalog
database '{Database Source Name}'
iam_role '{Your arn:aws:iam Role}'
create external database if not exists;

create external table schema_new.employees (
    EmployeeID bigint,
    LastName varchar,
    FirstName varchar,
    Address varchar,
    City varchar,
    Region varchar,
    PostalCode varchar,
    Country varchar
)
stored as parquet
location 's3://{Your_Bucket}/';

After this, you can query the external table just like a regular table inside Amazon Redshift, even though the data lives in your S3 bucket!
This is a nice feature when the same data needs to be used by several different services: you can store it in a single place, such as an S3 bucket, and let every service consume it from there!
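Once the external table exists, it participates in queries like any other table, including joins with local Redshift tables. A minimal sketch, assuming a hypothetical local `orders` table with an `employee_id` column:

```sql
-- Join S3-backed external data with a regular Redshift table.
-- "orders" is an assumed local table used only for illustration.
SELECT e.FirstName, e.LastName, COUNT(o.order_id) AS order_count
FROM schema_new.employees e
JOIN orders o ON o.employee_id = e.EmployeeID
GROUP BY e.FirstName, e.LastName;
```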

Sort Key, Dist Key, Dist Style

In Amazon Redshift, key concepts such as Sort Key, Dist Key, Dist Style, and Result Cache are crucial for optimizing query performance and managing large datasets efficiently. Let’s dive into each of these components to understand their functions and how they impact the performance of a Redshift data warehouse.

1. Sort Key

A Sort Key is used to determine how data is physically organized in the storage of Amazon Redshift. It plays a critical role in optimizing query performance, especially for range-based queries (such as queries filtering on date or numeric ranges). When you define a Sort Key on one or more columns, Redshift stores the data in sorted order based on the key, allowing it to perform fast lookups and reduce the number of disk reads during query execution.

Types of Sort Keys:

  • Compound Sort Key: The default sort key type. With this approach, all the columns listed in the sort key are used to physically organize the data. When a query uses the leading column(s) of the compound sort key, Redshift can efficiently skip over large blocks of data, which improves query performance.
  • Interleaved Sort Key: Allows Redshift to sort data by multiple columns and lets the query planner decide which columns to use for each specific query. This is more flexible but can result in less efficient performance than compound sort keys for queries that consistently filter on the same set of columns. However, interleaved sort keys are useful for workloads with multiple query patterns.

Best Practices:

  • Choose columns that are frequently used in WHERE clauses or as part of JOIN operations.
  • If your queries typically filter on different columns, Interleaved Sort Keys might be a better option than compound sort keys.
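The two sort-key types above are declared in the table DDL. A minimal sketch, with hypothetical table and column names:

```sql
-- Compound sort key: fast when queries filter on the leading column(s).
CREATE TABLE events_compound (
    event_date date,
    user_id    bigint,
    payload    varchar(256)
)
COMPOUND SORTKEY (event_date, user_id);

-- Interleaved sort key: gives equal weight to each column,
-- useful when different queries filter on different columns.
CREATE TABLE events_interleaved (
    event_date date,
    user_id    bigint,
    payload    varchar(256)
)
INTERLEAVED SORTKEY (event_date, user_id);
```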

2. Dist Key (Distribution Key)

The Dist Key determines how data is distributed across the nodes in an Amazon Redshift cluster. In a distributed system, efficient data distribution is crucial to avoiding network bottlenecks and improving query performance, especially for complex joins.

When you choose a Dist Key, Redshift uses it to decide how to distribute data across the compute nodes. There are three primary methods of distributing data:

Distribution Styles:

  • Key Distribution (DISTKEY): Redshift distributes rows of the table based on the values of the selected distribution key column. Rows with the same value of the distribution key are placed on the same node. This is beneficial when joining large tables on the distribution key column because the system can colocate related data, minimizing data shuffling.
  • Even Distribution (DISTSTYLE EVEN): This method distributes the data evenly across all nodes in the cluster. Each row is randomly assigned to a node, and no column is chosen as a distribution key. This is useful when there isn’t a natural key for distribution, and no specific column is frequently used in joins or filtering.
  • All Distribution (DISTSTYLE ALL): Redshift copies the entire table to every node in the cluster. This is most efficient for small dimension tables that are frequently joined with larger tables (i.e., lookup tables). It minimizes data movement since every node already has a copy of the small table.

Best Practices:

  • Choose the correct distribution key: Use the column that is most frequently involved in joins to minimize data shuffling.
  • Use ALL distribution for small dimension tables: For small lookup tables, the DISTSTYLE ALL is the best approach because it avoids shuffling large amounts of data across nodes.
  • For large tables that don’t join frequently, or in cases where no specific column stands out as a frequent join column, the Even distribution style can be a safe choice.
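To illustrate the first best practice, distributing two tables on their shared join column colocates matching rows on the same node, so the join needs no data shuffling. Table and column names below are hypothetical:

```sql
-- Both tables are distributed on customer_id, so rows that join
-- together already live on the same compute node.
CREATE TABLE customers (
    customer_id bigint,
    name        varchar(100)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    amount      decimal(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

SELECT c.name, SUM(s.amount) AS total_amount
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id
GROUP BY c.name;
```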

3. Dist Style (Distribution Style)

The Dist Style defines how the data is physically distributed across the nodes in a Redshift cluster. There are three main distribution styles, and choosing the right one is essential for optimizing query performance and reducing network traffic:

  • DISTSTYLE EVEN: Data is distributed evenly across all nodes, and there is no primary key for distribution. This is useful when the table does not have a clear column to act as a distribution key.
  • DISTSTYLE KEY: Data is distributed based on a specific column (the distribution key). All rows with the same value of the distribution key will be placed on the same node. This is best used for large tables that will be frequently joined on the distribution key column.
  • DISTSTYLE ALL: A copy of the entire table is stored on each node, suitable for small dimension tables that are frequently used in joins with larger fact tables.

Best Practices:

  • Use DISTSTYLE EVEN for large tables with no common join key.
  • Use DISTSTYLE KEY for large fact tables that are frequently joined on a specific column.
  • Use DISTSTYLE ALL for small lookup tables.
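Following these best practices, the EVEN and ALL styles are declared directly in the DDL. A short sketch with hypothetical tables:

```sql
-- Large table with no dominant join column: spread rows evenly.
CREATE TABLE clickstream_raw (
    event_time timestamp,
    url        varchar(2048)
)
DISTSTYLE EVEN;

-- Small lookup table: replicate a full copy to every node.
CREATE TABLE country_codes (
    code varchar(2),
    name varchar(100)
)
DISTSTYLE ALL;
```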

TIP: To check the distkey, sortkey, and encoding of a table, you can use the query below (note that pg_table_def only shows tables in schemas that are on your search_path):

SELECT "column", type, encoding, distkey, sortkey, "notnull"
FROM pg_table_def
WHERE tablename = 'your_table_name';

4. Result Cache

The Result Cache is a feature in Amazon Redshift that helps speed up query execution by storing the results of previously run queries. When a query is executed, Redshift checks the result cache to see if the exact same query has been run before and if the underlying data hasn’t changed. If the data hasn’t changed, Redshift can return the cached result immediately without having to recompute the query, which greatly reduces query time.

Key Points about Result Cache:

  • Query Results Storage: The cache is stored in memory and is automatically populated when queries are run.
  • Automatic Management: The cache is managed automatically by Redshift, and users don’t need to configure or control it directly.
  • Cache Invalidations: The cache is invalidated if the data underlying the query has changed (e.g., new data has been loaded or a DELETE, UPDATE, or INSERT operation has been performed).

Best Practices:

  • If the same query is executed frequently with unchanged data, the result cache will provide significant performance benefits.
  • To ensure that data is fresh, remember that the cache will be invalidated when the data changes, so there might be a slight performance hit on the first query after data updates.
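To see the cache in action, you can disable it for the current session and compare timings; `enable_result_cache_for_session` is a Redshift session parameter, while the `sales` table below is a hypothetical example:

```sql
-- Run the same query twice: the second run is normally
-- served from the result cache and returns almost instantly.
SELECT COUNT(*) FROM sales;
SELECT COUNT(*) FROM sales;

-- Disable the result cache for this session to force recomputation
-- (useful when benchmarking raw query performance).
SET enable_result_cache_for_session TO off;
SELECT COUNT(*) FROM sales;
```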

Conclusion

Amazon Redshift is a powerful, fully managed data warehousing solution that offers businesses scalability, performance, and security for complex data analyses. By leveraging features like Sort Keys, Dist Keys, Dist Styles, and the Result Cache, Redshift optimizes query performance, improves data distribution, minimizes data shuffling, and accelerates query execution. These optimizations help enhance performance and cost-efficiency, making Redshift an ideal choice for large-scale data warehousing, analytics, and business intelligence. Integrated seamlessly into the broader AWS ecosystem, Redshift supports diverse workloads, from building data pipelines to running advanced analytics, empowering businesses to fully harness the potential of their data while maintaining flexibility and efficiency.
