ds-blog-icon.png

THE DATA DOWNLOAD

As a business intelligence consulting company, we pride ourselves on being able
to deliver on our projects as well as provide good quality content to our readers.

Amazon Redshift Tips & Tricks

Posted by Chun Wu on Dec 20, 2016 10:20:00 AM

Amazon Redshift Tips & Tricks:  Top 10 Tips & Tricks for Using Amazon Redshift


By: Chun Wu

Last month’s webinar, Working with Redshift: Amazon’s Affordable MPP Analytic Database in the Cloud, was well received and sparked a lot of interest from viewers.  Based on the feedback, interaction and questions received, Chun Wu was happy to create a more in depth level of Tips & Tricks for our Analytic Minute readers.

data_edited.jpg

For those of you who may have missed last month’s webinar, you can see a replay of the training by following the link to the right.

Amazon Redshift is a fast, fully-managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiently and parallelizing queries across multiple nodes.

Below you will find 10 Tips & Tricks to help you get started working with Redshift and some sample copy commands that will help you while working with Redshift.

Below you will find 10 Tips & Tricks and some sample copy commands that will help you while working with Redshift.

10 Amazon Redshift Tips & Tricks:

# Tips & Tricks Additional Comments
1. Launch the Redshift cluster in the same AWS region as the S3 bucket to improve load performance of the COPY command.  
2. An encrypted Redshift cluster provides extra security but it also impacts performance. If possible, launch an encrypted cluster with Dense Compute (SSD) nodes Encryption has no performance impact on Dense Compute (SSD) Redshift cluster but it adds 20% to Dense Storage (HDD) Redshift cluster. The SSD Redshift cluster runs at least 30% faster than the HDD Redshift cluster
3. Choose the correct data distribution style to avoid data hot spot. The default is EVEN (round-robin) distribution which may not be the best choice depending on the size of tables being joined.

Redshift distribution styles:

  • Even Distribution
  • Key Distribution
  • All Distribution
4. Collocate data from joined tables as much as possible to avoid data broadcasting See Example 1 below for an illustration of data broadcasting
5. Assign the appropriate sortkey for faster table scan For example, sort the data by state for frequent geo/location analysis
6. Take advantage of column compression to reduce storage space and to reduce amount of data read/scanned during query to improve performance Compression can be applied manually when table is created or automatically using the COPY command
7. Use a single COPY command to load multiple split data files to improve performance. DO NOT run multiple COPY commands at the same time Issue:copy s3://my_bucket/data Do not issue:copy s3://my_bucket/data1copy s3://my_bucket/data2copy s3://my_bucket/data3
8. Use a manifest or control file to ensure all data files will be processed

For example:

  1. Parallel transfer data files to S3
  2. Transfer the manifest file to S3
  3. Use COPY with the manifest option

See Example 2 below for a complete COPY command example

9. Run the COPY command with the noload option for a preload check See Example 2 below for a complete COPY command example
10.

Redshift doesn’t have a merge or upsert command. Two alternatives:

  1. Delete the old records and re-insert the latest version of the records
  2. Use a staging table to load the data first and then perform an UPDATE and an INSERT statement
See Example 3 below for a sample pseudo codes for method #2

 

 

Amazon Redshift Example 1: Data Broadcasting

Amazon Redshift Tips & Tricks

 

Amazon Redshift Example 2: A sample COPY command

COPY schema.table

-- S3 location is a prefix
-- s3://my_bucket/data = s3://my_bucket/data*

from ‘<your_s3_bucket>’

credentials 'aws_access_key_id=<your_key_id>;aws_secret_access_key=<your_access_key>'

-- specify the incoming date and timestamp format

dateformat as 'YYYYMMDD'

timeformat as 'YYYY-MM-DD HH:MI:SS‘

  -- replace invalid characters

acceptinvchars as '^'

csv

  -- specify the compression method used if any

gzip

  -- run a test load but don’t actually load the data

noload

  -- specify the manifest file name
  -- if the manifest option is used, the data file name will be replaced by the manifest file name

manifest;

 

Amazon Redshift Example 3: Merge sample pseudo codes

-- load the data into the staging table

copy tab_stg

from 's3://my_bucket/data_files'

-- update the target table with new values from the staging table

update tab_tgt  tgt set col1 = stg.col1

from tab_stg stg

-- explicit join on the primary key and distribution key

where tgt.pkey = stg.pkey and tgt.dkey = stg.dkey

-- insert the new rows

insert into tab_tgt

select … from tab_stg

from tab_stg stg left join tab_tgt tgt

on stg.pkey = tgt.pkey and stg.dkey = tgt.dkey

where tgt.pkey is null

 

We hope you’ve enjoyed the added Tips & Tricks for working with Amazon’s Redshift and trust that it will make your next data warehouse implementation more efficient.  We appreciate Steve Dine & Chun Wu for putting together the advanced Tips & Tricks list.  Feel free to contact us for additional tips and tricks for working with Amazon Redshift.

Datasource Consulting is a premier leader in the Enterprise Data Management and Business Intelligence industry.  We specialize in delivering lean, innovative, end-to-end solutions that leverage and express the full value of data.

Are you ready to discuss your project?

Let's chat about it, we look forward to helping and becoming your data partner. 

request-call-with-practice-lead

 

Topics: Amazon Redshift, Data Warehousing, Blog, Cloud

Written by Chun Wu