Amazon Redshift Tips & Tricks: Top 10 Tips & Tricks for Using Amazon Redshift
By: Chun Wu
Last month’s webinar, Working with Redshift: Amazon’s Affordable MPP Analytic Database in the Cloud, was well received and sparked a lot of interest from viewers. Based on the feedback, interaction, and questions received, Chun Wu was happy to create a more in-depth set of Tips & Tricks for our Analytic Minute readers.
For those of you who may have missed last month’s webinar, you can see a replay of the training by following the link to the right.
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and by parallelizing queries across multiple nodes.
Below you will find 10 Tips & Tricks to help you get started with Redshift, along with some sample COPY commands.
10 Amazon Redshift Tips & Tricks:
| # | Tips & Tricks | Additional Comments |
|---|---|---|
| 1. | Launch the Redshift cluster in the same AWS region as the S3 bucket to improve load performance of the COPY command. | |
| 2. | An encrypted Redshift cluster provides extra security, but it can also impact performance. If possible, launch an encrypted cluster with Dense Compute (SSD) nodes. | Encryption has no performance impact on a Dense Compute (SSD) Redshift cluster, but it adds about 20% overhead to a Dense Storage (HDD) Redshift cluster. The SSD Redshift cluster also runs at least 30% faster than the HDD Redshift cluster. |
| 3. | Choose the correct data distribution style to avoid data hot spots. The default is EVEN (round-robin) distribution, which may not be the best choice depending on the size of the tables being joined. | Redshift distribution styles: EVEN, KEY, and ALL. |
| 4. | Collocate data from joined tables as much as possible to avoid data broadcasting. | See Example 1 below for an illustration of data broadcasting. |
| 5. | Assign the appropriate sort key for faster table scans. | For example, sort the data by state for frequent geo/location analysis. |
| 6. | Take advantage of column compression to reduce storage space and the amount of data read/scanned during queries, improving performance. | Compression can be applied manually when the table is created or automatically by the COPY command. |
| 7. | Use a single COPY command to load multiple split data files in parallel. DO NOT run multiple COPY commands at the same time. | Issue: `copy … from 's3://my_bucket/data'`. Do not issue: `copy … from 's3://my_bucket/data1'`; `copy … from 's3://my_bucket/data2'`; `copy … from 's3://my_bucket/data3'`. |
| 8. | Use a manifest (control) file to ensure all data files are processed. | See Example 2 below for a complete COPY command example. |
| 9. | Run the COPY command with the NOLOAD option for a preload check. | See Example 2 below for a complete COPY command example. |
| 10. | Redshift doesn’t have a merge or upsert command. Two alternatives: (1) use a staging table to replace existing rows (delete the matching rows, then insert all staged rows); (2) use a staging table to update matching rows and insert new rows. | See Example 3 below for sample pseudo code for method #2. |
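Tips 3 and 5 come together at table-creation time. As a minimal sketch (the table and column names here are hypothetical, not from the webinar), a fact table can be distributed on its join key and sorted on the column used for frequent filtering:

```sql
-- Hypothetical sales table: distribute on customer_id so joins to a
-- customers table distributed the same way are collocated (tips 3-4),
-- and sort on state for frequent geo/location analysis (tip 5).
create table sales (
    sale_id     bigint,
    customer_id bigint,
    state       char(2),
    amount      decimal(12,2)
)
diststyle key
distkey (customer_id)
sortkey (state);
```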
Amazon Redshift Example 1: Data Broadcasting
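As a sketch of how data broadcasting shows up in practice (the `orders` and `customers` tables below are hypothetical): when the inner table of a join is not distributed on the join key, the EXPLAIN plan contains DS_BCAST_INNER, meaning the entire inner table is copied to every compute node for the join. When both tables are distributed on the join key, the plan shows DS_DIST_NONE and no data moves between nodes.

```sql
-- Hypothetical tables joined on customer_id.
explain
select o.sale_id, c.customer_name
from orders o
join customers c on o.customer_id = c.customer_id;

-- If customers is not distributed on customer_id, the plan typically
-- includes:  XN Hash Join DS_BCAST_INNER   (inner table broadcast: costly)
-- If both tables use DISTKEY(customer_id), the plan instead shows:
--            XN Hash Join DS_DIST_NONE    (collocated join: no broadcast)
```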
Amazon Redshift Example 2: A sample COPY command
copy tab_tgt
-- S3 location is a prefix
-- s3://my_bucket/data = s3://my_bucket/data*
from 's3://my_bucket/data'
credentials 'aws_access_key_id=…;aws_secret_access_key=…'
-- specify the incoming date and timestamp format
dateformat as 'YYYYMMDD'
timeformat as 'YYYY-MM-DD HH:MI:SS'
-- replace invalid characters
acceptinvchars as '^'
-- specify the compression method used, if any
gzip
-- run a test load but don't actually load the data
noload
-- specify the manifest file name
-- if the manifest option is used, the data file name will be replaced by the manifest file name
manifest;
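The manifest referenced above is a small JSON file stored in S3 that lists every data file to load. A minimal example (the bucket and file names are hypothetical); with "mandatory": true, the COPY fails if a listed file is missing, which is what guarantees all data files are processed (tip 8):

```json
{
  "entries": [
    {"url": "s3://my_bucket/data1", "mandatory": true},
    {"url": "s3://my_bucket/data2", "mandatory": true},
    {"url": "s3://my_bucket/data3", "mandatory": true}
  ]
}
```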
Amazon Redshift Example 3: Merge sample pseudo codes
-- load the data into the staging table
copy tab_stg from 's3://my_bucket/data' …;

-- update the target table with new values from the staging table
update tab_tgt tgt set col1 = stg.col1
from tab_stg stg
-- explicit join on the primary key and distribution key
where tgt.pkey = stg.pkey and tgt.dkey = stg.dkey;

-- insert the new rows
insert into tab_tgt
select …
from tab_stg stg left join tab_tgt tgt
on stg.pkey = tgt.pkey and stg.dkey = tgt.dkey
where tgt.pkey is null;
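Because the update and the insert in method #2 are two separate statements, a practical refinement (not part of the pseudo code above, but a common pattern) is to wrap them in a single transaction so queries never see a half-merged table, and to clear the staging table afterwards:

```sql
begin;
-- run the update and the insert from Example 3 here, unchanged
commit;
-- TRUNCATE commits implicitly in Redshift, so it must run outside
-- the transaction; it clears the staging table for the next load
truncate tab_stg;
```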
We hope you’ve enjoyed these added Tips & Tricks for working with Amazon Redshift and trust that they will make your next data warehouse implementation more efficient. We thank Steve Dine and Chun Wu for putting together the advanced Tips & Tricks list. Feel free to contact us for additional tips and tricks for working with Amazon Redshift.
Datasource Consulting is a premier leader in the Enterprise Data Management and Business Intelligence industry. We specialize in delivering lean, innovative, end-to-end solutions that leverage and express the full value of data.
Are you ready to discuss your project?
Let's chat about it, we look forward to helping and becoming your data partner.