
Getting Data into Amazon Redshift

Blog Post created by solomon_waters on Sep 8, 2016

In this article, I'll walk through Amazon's recommended process for getting data into Redshift – loading data files into Amazon S3 and then copying them into Redshift – as opposed to inserting into Redshift directly.

 

 

Ahh, the life of the Sales Engineer. Several months ago, I was preparing for a POC (Proof of Concept) exercise with a prospective customer. As is typical for these engagements, this POC would require figuring out something new to me – in this case, working with Amazon Redshift. I took a look at the Redshift documentation and found that Redshift is basically just a Postgres variant. Yes, optimized for analytics, blah blah blah, but still basically a Postgres database, and therefore connecting to Redshift and interacting with it should be just like working with any other JDBC-compliant database. Easy. I closed out my Redshift browser tabs and started daydreaming about a relaxing trip to Seattle.

 

Well, fast forward a few days. There are four of us locked in a small conference room – myself, the Account Executive (the "sales guy"), the business owner for the project, and a business analyst who will own the Boomi integrations from a technical perspective. We're walking through Boomi from beginning to end, and everything is going great. We've connected to a LIMS (Laboratory Information Management System), we're querying data, we're seeing the test data, we haven't had to write a line of code, we're ordering lunches, and everything is going great. Until we go to insert data into Redshift. And…everything…comes…to…a…halt.

 

Why is it taking so long? Is Boomi broken? The sales guy is not looking happy (there go my free drinks at the hotel bar)… and there's no good explanation. If memory serves, we were loading 100 rows of data – I forget the actual size of the payload, but quite light – and it was taking so long to get data into Redshift that it just didn't make sense. We load data into various databases all the time – business as usual – and I couldn't think of a reason why performance was so ugly.

 

So, Google to the rescue. I start banging out Google searches and lo and behold, there are a ton of hits on this topic, and it turns out that Redshift is well known for poor INSERT performance. As a matter of fact, Redshift is designed to ingest data via COPY:

 

We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. 

 

It turns out that the preferred method to get data into Redshift is to load data files to an Amazon S3 Bucket, and to then execute a COPY command to ingest that data into Redshift. I’ll cover that in detail below.

 

Interestingly enough, I polled my team on this topic and found some variation in response. One commonality: in any and all cases where our customers had been in communication with Amazon, the consistent recommendation directly from Amazon was to use COPY vs INSERT. Beyond that though, it does seem that INSERT performance varies fairly widely. In fact, a colleague of mine had just completed a Redshift POC in which they loaded their test cases via INSERT. In running various tests while writing up this blog post, I was able to insert two different 100-row datasets in about a minute – far better performance than I had expected based on previous experience.
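As an aside, if COPY really isn't an option, the Redshift documentation recommends batching rows into multi-row INSERT statements rather than issuing one INSERT per row. A minimal sketch (the table and values here are hypothetical, not from my tests):

-- One INSERT carrying many rows is far cheaper in Redshift than many single-row INSERTs
INSERT INTO my_table (col_a, col_b)
VALUES
  ('value1', 1),
  ('value2', 2),
  ('value3', 3);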

 

Regardless, the fact remains that COPY is the preferred method over INSERT, and so here we’ll take a look at using Boomi to COPY data into Redshift.

 

Use Case

My use case is super simple. I just want to test the COPY command and verify that I can load a data file into an S3 Bucket, and then COPY it into Redshift.

 

Approach

I’m going to break this down to the following sections:

  • Redshift environment
  • Generate and upload the datafile
  • Develop a working COPY command
  • Build a Boomi process that puts it all together
  • Discuss detours, errors, etc. encountered along the way

 

Redshift environment

 Being new to Redshift, I ran through Amazon’s Redshift Getting Started Guide. In addition to configuring and launching the Redshift cluster, this guide provides good instructions for Creating an IAM Role and associating that role to your Redshift cluster, which will come into play when we need to execute the COPY command. In addition, the guide provides helpful information on downloading the Redshift JDBC driver and configuring a connection with a SQL query tool. I highly recommend this last step, as I find it faster and easier to build and test SQL statements (including the COPY command) via a SQL tool, and then to load the resulting SQL into Boomi, vs developing and testing the SQL in Boomi. For this exercise I followed Amazon’s instructions and downloaded the SQL Workbench/J tool and found it more than adequate for the task.

 

Generate and upload the data file

 

General overview – data file

Essentially, what I want to do is to create a data file that:

  • Is CSV formatted
  • Has the same number of columns, and same column order as my target Redshift table
  • Resides in an S3 Bucket

 

CSV format: By default Redshift expects a pipe (|) delimited data file. I prefer to work with comma delimited data files as they’re simply more commonly used than pipe delimited or other formats.

 

Column number and order: Per the Redshift documentation, it's possible to specify mapping rules between the file's columns and the Redshift table's columns (for example, via an explicit column list in the COPY command). However, to keep things easy I'll simply ensure the file columns match my table and skip this part. This will greatly simplify my process.
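For reference, here's roughly what that column-list option would look like if the file and table didn't line up – a quick sketch with hypothetical names, not something I use in this process:

-- Map the file's columns onto an explicit list of table columns
COPY my_table (col_a, col_b, col_c)
FROM 's3://my-bucket/my_file.csv'
CREDENTIALS 'aws_iam_role=<role-arn>'
CSV IGNOREHEADER 1;   -- auth and format parameters are covered later in this post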

 

Resides in an S3 Bucket: I’ll simply use the Boomi Amazon S3 Connector to load the data file into an S3 Bucket, where it will be available to be copied into Redshift.

 

In a real world use case, I'd be developing a Boomi process that exports data from some system(s) and formats that data into a CSV. However, here I'm focusing on the mechanics of loading data into Redshift, so I'll make use of an existing data set. Tableau has a number of nice data sets available on their Resources page; I'll use one of those. For my test, I picked what looked to be the simplest of the sample data sets, "Top Baby Names in the US".

 

Uploading a Data file to an S3 bucket

I covered the process of uploading data into an S3 bucket in detail in a previous blog post, so I won't repeat it here.

 

However, one update worth mentioning: the S3 connector's properties are now exposed as connector document properties rather than free-text dynamic document properties (e.g., "S3_KEY"). As shown in the screenshot below, this update makes the S3 connector more intuitive and easier to use.

 

 

 

I've copied my data file to an FTP server, and then built a quick Boomi process to upload the file to an S3 Bucket:

 

Now I should be able to simply COPY the data into Redshift.

  

Redshift COPY command

The COPY command is well documented and I recommend reviewing the docs. That said, the docs state:

 

You can perform a COPY operation with as few as three parameters: a table name, a data source, and authorization to access the data.

 

Table Name

For my COPY test, I'll create a table that matches the data file that I've selected. (In practice, you'd probably generate a file that matches the existing Redshift table.) Again, make sure the number and sequence of columns match between the file and the table. I'm working with the "Top Baby Names in the US" data set, so I created a "topbabynames" table in Redshift.
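The DDL itself is nothing special; something along these lines (the column names and types are my assumptions about the Tableau data set, not copied from my actual table):

-- Target table whose column count and order match the CSV
CREATE TABLE topbabynames (
  state        VARCHAR(2),
  gender       VARCHAR(1),
  birth_year   INTEGER,
  top_name     VARCHAR(64),
  occurrences  INTEGER
);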

 

Data Source

I loaded my data file into an S3 Bucket in the previous step, so that’s ready to go.

 

Authorization to access the data

Amazon specifically recommends using an IAM role to authenticate the COPY command. I’ll cover that below. However, it is possible to use AWS Access keys for authentication as well.

 

Note: While this is not the recommended way to authenticate the COPY command, I include it because it uses the same auth information as the Boomi S3 Connector.

 

 

 

The auth syntax for the COPY command looks like this:

 

CREDENTIALS 'aws_access_key_id=<value>;aws_secret_access_key=<value>'
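Plugged into a full statement, a key-based COPY would look something like this (placeholder values only – I didn't run this variant):

-- COPY authenticated with AWS access keys instead of an IAM role
COPY my_table
FROM 's3://my-bucket/my_file.csv'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV IGNOREHEADER 1;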

 

That said, I’ll stick with Amazon’s recommendation and use an IAM role.

 

Amazon IAM Role

Amazon’s Redshift Getting Started Guide provides solid documentation here including good instructions for Creating an IAM Role and associating that role to your Redshift cluster. As a quick review, start on the AWS IAM page and create or modify your desired role. I created “seRedshiftRole” with permission “AmazonS3ReadOnlyAccess”, which is the only permission required to execute the COPY command. I’ll also copy the ARN from the top right of the page, to be used later in my COPY command:

 

 

Next, I’ll flip back over to the Redshift page. Select the appropriate cluster, click “Manage IAM Roles” and associate your new role to the Redshift cluster.

 

 

 

Redshift is now able to assume that IAM Role, in this case enabling it to read the S3 data file. All that remains is to specify the role in the COPY syntax, which looks like this:

 

'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'

 

The ARN copied in the previous step contains the account id and role name. In other words, the credentials string is essentially:

 

'aws_iam_role=<paste ARN string here>'

 

Redshift COPY command, again

 I started off this section with this quote from the docs:

 

You can perform a COPY operation with as few as three parameters: a table name, a data source, and authorization to access the data.

 

And now that we have the table name, data source, and an auth string we should be ready to go.

 

My data is comma delimited with a header row and looks like this:

 

 

 

so I'll need to specify the following in the COPY command:

 

CSV IGNOREHEADER 1

 

This tells COPY to expect CSV formatting and to ignore the header line. By the way, the CSV syntax is explained in the CSV section of the COPY command's Data Format Parameters documentation, whereas IGNOREHEADER is explained in the JSON section. It works just fine in conjunction with CSV though.
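Other format and error-handling parameters can be appended in the same way if you need them; for example (illustrative only – I didn't need these for my load):

-- Optional extras: treat empty strings as NULLs and tolerate up to 10 bad rows
COPY my_table
FROM 's3://my-bucket/my_file.csv'
CREDENTIALS 'aws_iam_role=<role-arn>'
CSV IGNOREHEADER 1
EMPTYASNULL
MAXERROR 10;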

 

Now we have all of the bits and pieces required to make a COPY work.

 

COPY table-name
[ column-list ]
FROM data_source
[ WITH ] CREDENTIALS [AS] 'aws-auth-args'
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]

 

My final copy command becomes:

 

COPY topbabynames FROM 's3://boomi-se-demo/TopBabyNamesbyState.csv'
CREDENTIALS 'aws_iam_role=arn:aws:iam::382029670213:role/seRedshiftRole'
REGION 'us-east-1'
CSV IGNOREHEADER 1;

 

Where topbabynames is the name of my Redshift table, boomi-se-demo is the name of my S3 Bucket, and TopBabyNamesbyState.csv is my data file. See the Appendix below about the REGION parameter.

 

I'll quickly test this command in SQL Workbench/J to make sure it's correct. Redshift should copy roughly 10,500 rows into my table:

 

 

 

The screenshot is a little hard to read but the output message reads:

 

Warnings:
Load into table 'topbabynames' completed, 10506 record(s) loaded successfully.

0 rows affected
COPY executed successfully

Execution time: 2.45s

 

The statement works! Now all that I need to do is put all of this together in Boomi.

 

Making it work in Boomi

I’ve got my data file; I’ve got my COPY command; now I need to configure a Boomi Redshift connection and I can put the pieces together.

 

Amazon Redshift Connection

For Redshift, we'll use the Boomi Database Connector, and we'll use a Database type of "Custom". If you're running the process in the hosted Dell Boomi Atom Cloud, the Redshift PostgreSQL JDBC driver has already been installed. However, if you're using a locally installed Boomi Atom, you'll need to download the driver from Amazon and add it to the Atom. This means dropping the JDBC driver .jar file into the ../<atom_installation_directory>/userlib/database directory. Note that if the userlib/database directory does not exist, you must create it and then restart the Atom.

 

Once the JDBC driver has been added to the Atom, we can configure the connection. If you’re using a tool like SQL Workbench/J most of the information that you need is already present in the connection profile:

 

 

 

If not, it is very easy to get your connection string through the AWS Console. Then, setting up the Boomi database connection is straightforward: 

  1. Set the Driver Type to “Custom”
  2. Enter your username and password
  3. Enter the class name – driver class names vary per driver version and are listed on the driver download page
  4. Paste your connection string (JDBC URL)

 

 

 

Testing the Redshift Connection

At this point, you're going to want to test your database connection and make sure that Boomi can successfully connect to Redshift. Unfortunately, there isn't an easy "Test database connection" feature in Boomi. I've documented the process, but to keep us on track here, the step-by-step instructions for testing a database connection can be found in the final section of this post.

Boomi Process

OK, now that we’ve got all of the pieces working, it will be a piece of cake to assemble this into a Boomi process.

 

I’ve already got a process that uploads my data file to S3:

 

 

By default Amazon S3 will overwrite files, so I can keep re-running this process without issue. I’ll just add a second branch to execute the COPY command and I’ll be all set.

 

 

 

I already have a working SQL statement, and I just set up and tested my Boomi Redshift connection. Now I just need to drop both of those into a Program Command shape to execute my COPY command.

 

 

 

And that’s it – a working Boomi process:

 

 

The Execution report shows one document in/out, but remember that this represents a single data file containing 10k+ records.

 

 

 

 

 

The screenshot is once again difficult to read, but it shows that a SELECT COUNT(*) returns 10,506 rows.
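That check is just a quick count against the target table:

-- Row count should match the number of records the COPY reported loading
SELECT COUNT(*) FROM topbabynames;
-- expected: 10506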

 

And I am able to do some important data analysis:
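The screenshot shows a simple query over the loaded data – something along these lines, with column names that are my guess at the data set, not taken from the screenshot:

-- Which names topped a state/year ranking most often? (hypothetical column names)
SELECT top_name, COUNT(*) AS times_ranked_first
FROM topbabynames
GROUP BY top_name
ORDER BY times_ranked_first DESC
LIMIT 10;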

 

 

Conclusion

That’s it, folks.

 

I started this journey scrambling to figure out why inserting data into Redshift was so **** slow. And I'll admit, when I first started reading about uploading files to S3 and copying them over into Redshift, my initial thought was "how hard is this going to be?" Happily though, I've found it to be a very straightforward process. Between the fact that we've got a nice Amazon S3 connector to upload the data, and that the COPY command is a straightforward SQL statement that's easily handled by Boomi's database connector, loading data into Redshift is no more challenging than any other data source – it's just a little different than a typical database load process.

 

Appendix: Tips and Lessons Learned

And finally, a roundup of things worth mentioning but that didn’t really fit into the story:

  • How to test a database connection in Boomi
  • A tip for keeping a clean dataset while building and testing your process
  • “REGION” error encountered while testing the COPY command
  • Need to use a statically-defined COPY command

 

Testing the database connection in Boomi

Unfortunately, Boomi doesn't have a simple "Test Connection" feature (although we do have an Idea for that, so feel free to upvote). Here's how to do it instead: the easiest way to test a database connection is to create a new database operation and, within it, attempt to import a new database profile. The import operation acts as the connection test in this case.

 

So, to test my Redshift connection I’ll just create a new integration process, set the Start shape to be a database connector, plug in my new Redshift connection, and click the “+” button to create a new operation:

 

 

 

Inside the operation, I’ll click the “+” button to create a new database profile:

 

 

 

Finally, click on the “Statement” at the top of the profile tree to activate the Import button, and click Import:

 

 

 

Did I say finally? In the Database Import Wizard, be sure to select the radio button to Browse in Atom vs Cloud, and select the Atom to which you added the Redshift JDBC driver. Plug in your Redshift connection and press Next.

 

 

 

 

And that's the easy way to test a database connection in Boomi! The Database Import Wizard will either connect to Redshift and present a list of tables, or it will return a database error. An "Unable to instantiate driver class com.amazon.Driver, please make sure the appropriate jar files are loaded" error typically indicates that either a) your driver was not copied to the correct directory in the Atom installation, b) you browsed using a different Atom than the one to which you copied the driver, or c) you attempted to browse via the Cloud. If you are presented with a list of tables, then you have a viable database connection.

Keeping a clean dataset

I followed a typical, iterative process in building out this integration, including that I first loaded a data file into S3, then tested the COPY command in SQL Workbench/J, then ported that COPY command to Boomi and tested it there.

 

Following the Boomi test, I did a quick count on the number of rows in my table, expecting 10,506.

 

[screenshot shows 21012 rows]

 

Clearly, I neglected to delete the original data set from my table. I saw the same issue during a POC in which we were querying data from a LIMS and copying the data to Redshift: the first test to see whether the data made it is to check how many rows are in the table, and it's easiest to see how you're doing if you have a clear picture of what that last operation loaded. During that POC, my prospective customer simply added another branch to the Boomi process, executing a DELETE before the COPY command and thus ensuring that we were getting a clean data set with each execution.

 

Here I’ve added another branch and Program Command step to my process, which executes the DELETE before the COPY:
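The statement behind that Program Command shape is just a one-liner (TRUNCATE would be a reasonable alternative if you're clearing the whole table):

-- Clear out the previous load so each run starts from an empty table
DELETE FROM topbabynames;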

 

 

 

And now I see a clean result with each execution.

 

“REGION” error (COPY)

While building out and testing my COPY command, I ran into an unexpected error:

 

 

 

The error message reads:

An error occurred when executing the SQL command:
copy topbabynames from 's3://boomi-se-demo/TopBabyNamesbyState_100.csv'
CREDENTIALS
'aws_iam_role=arn:aws:iam::382029670213:role/seRedshiftRole'
CSV I...

[Amazon](500310) Invalid operation: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid 62FCCEE90D15177B,ExtRid WFQzecf8byQch2URDdakYETVSX2e0eDt6hIh
Details:
-----------------------------------------------
  error:  S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid 62FCCEE90D15177B,ExtRid WFQzecf8byQch2URDdakYETVSX2e0eDt6hIh
  code:      8001
  context:   Listing bucket=boomi-se-demo prefix=TopBabyNamesbyState_100.csv
  query:     3733
  location:  s3_utility.cpp:532
  process:   padbmaster [pid=16200]
  -----------------------------------------------;
1 statement failed.

Execution time: 0.37s

 

This one took a little bit of research.

 

It turns out that the Amazon S3 connection (including the S3 user) that I used to upload my data file was set up by my colleague Thameem, who lives on the East coast. I live on the West coast and set up the Redshift cluster. So what we have here is an AWS region mismatch. My data file resides in an S3 bucket in US_EAST_1:

 

 

 

While my Redshift instance is us-west-2.

 

According to the docs I need to add a REGION parameter:

 

Important

If the S3 bucket that holds the data files does not reside in the same region as your cluster, you must use the REGION parameter to specify the region in which the data is located.

 

Note that it doesn't really make sense to have the bucket and the Redshift cluster in different regions, and doing so also incurs an additional cross-region data transfer cost from Amazon. In practice, it would make more sense to configure a bucket in the same region as my Redshift cluster. For completeness though, we can get past this error by adding the REGION parameter:

 

REGION 'us-east-1'

 

By the way, you might notice that the region in my S3 connection screenshot is 'US_EAST_1' while in my statement I specify it as 'us-east-1'. What's the difference, and does it matter? I have no idea. However, the Regions table in the Amazon doc linked above lists region names in lower case/hyphen format rather than upper case/underscore, so I'm using the format from the Amazon docs.

 

Need to use a statically-defined COPY command

You may have noticed I had to statically define the bucket, file name, credentials string, etc. in the COPY command. This is because the Redshift driver doesn't allow prepared statements for COPY commands, so you can't build the command dynamically within the Program Command shape. Trying to build the entire command ahead of time and dropping the "current data" into the Program Command doesn't work either.

 

No worries though: because the S3 upload overwrites files by default, you can simply use a static name per process (or individual file if you need to route and handle multiple files differently).
