
Amazon S3 Made Easy

Blog post created by solomon_waters on Feb 15, 2017

Editor's Note: This post was updated Feb 15, 2017 with the following changes: a discussion of a more complex use case was added, including an explanation of retrieving an object from S3.

 

Dell Boomi recently released an Amazon S3 connector that makes it really easy to integrate with Amazon’s S3 service. Let’s see what it takes to get up and running with an S3 integration.

 

Overview

This post was originally written to provide an overview of loading data files into S3, primarily to facilitate loading data into Amazon Redshift. After working with the S3 connector for some time, I realized that there were additional requirements, common to many of the use cases that I have seen, which I'll detail here. So, below you'll see the original post - a basic use case of writing a file to S3 - then an example demonstrating writing to dynamically created folders (Year, Month, Day), and finally an exploration of topics related to retrieving data from S3.

 

 

Basic Use Case - writing to S3

To begin with, I'll examine the steps required to simply create a CSV-formatted data file and upload it to the S3 service. Ultimately, this will be part of a flow that will allow me to COPY data into Amazon Redshift. I've documented the COPY scenario in a separate post; here I'll focus on creating the data file and uploading it to an S3 Bucket.

 

Amazon S3 Connection

The Amazon S3 connection uses Amazon access keys, which consist of an access key ID and a secret access key. You can create these keys using the AWS Management Console via the security credentials page.

 

 

Note that both the Amazon S3 Bucket name and AWS Region are configured at the connection level. If you have multiple accounts or Buckets, or multiple Regions, you’ll use a separate connection for each and configure the Amazon S3 connection accordingly.

 

Although there is a 1:1 relationship between the S3 Bucket and the connection (i.e., you need a different connection for each different Bucket), you can create and write to multiple folders within a Bucket using the same single connection. So – to read or write to multiple Buckets you need multiple connections, but to read or write to multiple folders within the same Bucket you only need one connection.

 

Process Flow

The Amazon S3 operations include Get, Create, and Delete. Here we’ll test the S3 connector by creating a file in an S3 Bucket. Again, in this example I’m creating and uploading a CSV data file to the S3 service, and in a later post I’ll review how I can then COPY that file into Amazon Redshift.

 

 

1) Get source data/file. I’ve configured a Salesforce connector to query Accounts (I limited the operation to 100 records to run in Test Mode without hitting the Test Mode data limits). I've then mapped the results to a flat file profile (CSV) and used a Data Process shape to combine the documents into a single flat file document.

 

2) Set file properties. The S3 Operation supports three properties which can be used to set the file name and folder name within S3:

 

Parameter Name   Optional   Default Value   Description
File Key         Yes        None            Full key for the Amazon S3 object.
File Name        Yes        UUID            File name for the target document.
Folder           Yes        None            Directory path for the target document.

 

Note: When you set the S3_KEY parameter, the other parameters are ignored. If the S3_KEY is blank, the Amazon Key is created by concatenating the S3_FILENAME and S3_FOLDER.
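
To make that precedence concrete, here's a small Python sketch of how I understand the key resolution to work. This is my own illustration of the rule above, not connector source code; I'm assuming the folder ends up as the key prefix (as the examples below show), and the uuid default mirrors the File Name default in the table:

import uuid

def resolve_s3_key(s3_key=None, s3_folder=None, s3_filename=None):
    # Illustrative only: approximates the key resolution described above.
    if s3_key:                                   # S3_KEY wins; the other properties are ignored
        return s3_key
    name = s3_filename or str(uuid.uuid4())      # File Name defaults to a UUID
    if s3_folder:                                # Folder becomes the key prefix
        return s3_folder.rstrip("/") + "/" + name
    return name

print(resolve_s3_key(s3_folder="LOAD", s3_filename="S3_Connector_test.csv"))
# -> LOAD/S3_Connector_test.csv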

 

Directory structure - File Name vs Folder Names

In S3, "folders" are really just part of the file name. From the S3 docs:

 

In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.

For example, you can create a folder in the console called photos, and store an object called myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.

Here are two more examples:

  • If you have three objects in your bucket—logs/date1.txt, logs/date2.txt, and logs/date3.txt—the console will show a folder named logs. If you open the folder in the console, you will see three objects: date1.txt, date2.txt, and date3.txt.

  • If you have an object named photos/2013/example.jpg, the console will show you a folder named photos containing the folder 2013 and the object example.jpg.
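
To see the same behavior outside of Boomi, a quick boto3 sketch makes the point: there is no "create folder" call, you simply put an object whose key contains the prefix (the bucket name below is a placeholder, and locally configured AWS credentials are assumed):

import boto3

s3 = boto3.client("s3")   # assumes AWS credentials are configured locally

# No folder is created explicitly; the "photos/" prefix in the key is all
# the S3 console needs in order to render a "photos" folder.
s3.put_object(
    Bucket="my-example-bucket",       # placeholder bucket name
    Key="photos/myphoto.jpg",
    Body=b"example image bytes",      # stand-in for the real file contents
)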

 

For convenience, the Boomi S3 connector provides Properties for both File Name and Folder Name; the connector then concatenates these values. In practice, this means you have the option to use both Properties, or to build up a directory structure and file name using just the File Name property.

 

 

Here I’ve set the properties to produce a file named “S3_Connector_test.csv”:

 

 

 

By adding a second property, I can create the file within a folder in my Bucket:

 

 

However, I can also achieve the same result by setting both the folder name and file name with the "File Name" property, either as a parameter or, in this case, using multiple parameters: 

Later in this post, I'll use this technique to dynamically build a directory structure with separate folders for Year, Month, and Day.

 

3) Configuring the S3 operation. The operation must be imported--this is an easy step to miss. The Import action initializes the operation and generates a response XML profile. Note that there is no request profile for the connector, which means you can send data in any format to S3 (e.g., CSV, XML, binary).

 

Before import:

 

After import:

 

4) S3 response. The S3 service provides a response that includes the file name, bucket name, and file URL, which may be of use within your integration process.

 

<s3UploadResult>
   <key>S3_Connector_test.csv</key>
   <bucketName>boomi-se-demo</bucketName>
   <uploadedFileUrl>http://boomi-se-demo.s3.amazonaws.com/S3_Connector_test.csv</uploadedFileUrl>
</s3UploadResult>

 

I’ll use the S3 response to add an email alert to be sent when the file upload is complete:

 

 

I've configured a Message step to create a message based on the S3 response:
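
Outside of Boomi, the equivalent of that Message step is simply pulling a couple of elements out of the s3UploadResult XML and dropping them into a template. A minimal Python sketch (the element names come from the response above; the wording of the email body is my own):

import xml.etree.ElementTree as ET

response_xml = """<s3UploadResult>
   <key>S3_Connector_test.csv</key>
   <bucketName>boomi-se-demo</bucketName>
   <uploadedFileUrl>http://boomi-se-demo.s3.amazonaws.com/S3_Connector_test.csv</uploadedFileUrl>
</s3UploadResult>"""

root = ET.fromstring(response_xml)
body = (
    f"File {root.findtext('key')} was uploaded to bucket {root.findtext('bucketName')}.\n"
    f"URL: {root.findtext('uploadedFileUrl')}"
)
print(body)   # this text would become the body of the alert email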

 

 

Which produces the following email:

 

 

Results

Success! Checking in S3, we can see the uploaded file in the bucket folder I specified above:

 

 

The file is now visible and available within S3. Again, the premise here is that writing to S3 would be the first step in loading data into Amazon Redshift, which is covered in some detail in this post.

 

Complex use case 

So, that's all well and good. We've loaded a file into a directory. However, in practice your use cases are bound to be more complex, and so here we'll add a few more details to our use case, and then figure out how to address those details with Boomi.

 

In real life, the use cases that I'm seeing typically revolve around loading data into a data lake, and read something like this: "We need to retrieve all recently modified Accounts from Salesforce and write them to S3 as JSON files. The process needs to run every five minutes, the files need to be written to a different directory for each year, month, and day, and the files need to be timestamped. For example, we need to see [bucket]/LOAD/Accounts/2017/02/07/accounts_20170207144314.json. Later, we'll need a second process that retrieves all the files from a particular directory for processing."

 

Based on this use case, we'll examine the following items:

  • Dynamically creating a directory structure
  • Retrieving files from S3 directories
  • Decoding the contents of retrieved S3 objects (the objectContent returned by the connector is Base64 encoded)

 

Folder or Directory Structure

Based on the use case above, we need to figure out how to create S3 folders, and how to dynamically write to the appropriate folder.


First, I've created a basic process which will query recently modified Salesforce accounts, and map the results to a JSON file.

Next, I need to leverage the S3 properties to build my folder structure and file name.

 

This is a typical Boomi task of using the Set Properties step to concatenate a series of static and dynamic parameters. Again, with the S3 connector you have the option of using the Folder Name and File Name Properties, or you can simply specify the folder as part of the File Name Property. Here I'll take the latter approach, using the File Name Property to build up the folder structure and file name. As you can see, I begin with a static value - my top-level "LOAD" folder - then use a Current Date parameter with the appropriate mask to create the Year, Month, and Day folders, being careful to add "/" between each folder. Finally, I build up the file name in the same manner.

Resulting in the desired folder structure and file name.
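
The concatenation itself is straightforward. Here's a Python sketch of the same logic the Set Properties step performs, using the folder layout from the use case above (the "Accounts" segment and the timestamp mask are taken from the example path; in Boomi these are Current Date parameters rather than Python):

from datetime import datetime

now = datetime.now()

# LOAD/Accounts/<yyyy>/<MM>/<dd>/accounts_<yyyyMMddHHmmss>.json
file_key = (
    "LOAD/Accounts/"
    + now.strftime("%Y/%m/%d/")
    + "accounts_" + now.strftime("%Y%m%d%H%M%S") + ".json"
)
print(file_key)   # e.g. LOAD/Accounts/2017/02/07/accounts_20170207144314.json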

 

 

Next, we need to figure out how to retrieve files back from S3.

 

Retrieving files from S3

 

The Boomi S3 connector provides three operations for getting objects: Get S3 Object, Get S3 Binary Object, and Get Object Listing. Get S3 Object and Get S3 Binary Object expect an Object ID as an input parameter; in other words, if you know exactly what you want to Get, you can use these operations directly. However, in many cases you'll need to first get an Object Listing, then use the results (a list) to retrieve specific objects.

Working with S3 Object Listing

Per the documentation, Get Object Listing "retrieves a list of files and folders in the specified bucket, but not the contents of a file"; each entry in the listing includes key, folderName, fileName, isDirectory, lastModified, size, and bucketName. So, we'll get a directory listing, and then use the results to retrieve specific files. Similar to Get S3 Object and Get S3 Binary Object, Get Object Listing expects an ID as an input parameter. You can use "*" to get a listing of everything in your bucket, or you can provide a folder path to get a directory listing for a particular folder. For instance, based on my example above, I can provide an ID of "LOAD/2017/02/" to retrieve a directory listing of all files under the "02" directory.
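
For anyone comparing against the raw AWS SDK, the Object Listing behaves like a plain prefix listing. Here's a boto3 sketch, separate from the connector, that lists everything under the same prefix (locally configured AWS credentials are assumed):

import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(
    Bucket="boomi-se-demo",     # the bucket used in the examples above
    Prefix="LOAD/2017/02/",     # equivalent of the Object Listing ID parameter
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["LastModified"], obj["Size"])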

 

The return from the Object Listing operation:

<s3ObjectListing>
   <s3ObjectSummary>
      <key>LOAD/2017/02/07/accounts_20170207152832.json</key>
      <fileName>accounts_20170207152832.json</fileName>
      <isDirectory>false</isDirectory>
      <lastModified>2017-02-07T15:28:36-08:00</lastModified>
      <size>71</size>
      <bucketName>boomi-se-demo</bucketName>
   </s3ObjectSummary>
   <s3ObjectSummary>
      <key>LOAD/2017/02/08/accounts_20170208085443.json</key>
      <fileName>accounts_20170208085443.json</fileName>
      <isDirectory>false</isDirectory>
      <lastModified>2017-02-08T08:54:45-08:00</lastModified>
      <size>71</size>
      <bucketName>boomi-se-demo</bucketName>
   </s3ObjectSummary>
   <s3ObjectSummary>
      <key>LOAD/2017/02/08/accounts_20170208085444.json</key>
      <fileName>accounts_20170208085444.json</fileName>
      <isDirectory>false</isDirectory>
      <lastModified>2017-02-08T08:54:45-08:00</lastModified>
      <size>70</size>
      <bucketName>boomi-se-demo</bucketName>
   </s3ObjectSummary>
</s3ObjectListing>

 

What if I need to create a scheduled process that will always retrieve today's (or yesterday's) files? Similar to creating the folder structure, you can use a Set Properties step to build up a property consisting of static text and relative dates, and then use that property as the input parameter for the Object Listing operation.
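
As a sketch of that relative-date prefix (my own illustration; in a Boomi Set Properties step this would be a Current Date parameter with an offset rather than Python):

from datetime import datetime, timedelta

yesterday = datetime.now() - timedelta(days=1)
listing_id = "LOAD/" + yesterday.strftime("%Y/%m/%d/")
print(listing_id)   # e.g. LOAD/2017/02/07/ -- used as the ID parameter for Get Object Listing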

 

Typically, the next step is to add a Data Process/Split Documents step, splitting on <s3ObjectSummary>, so that we can route or otherwise act on a specific file. Then, the <key> element can be used as the input to a Get S3 Object or Get S3 Binary Object operation, as sketched below.
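
Sketched outside of Boomi, the listing-then-get pattern looks roughly like this in Python. This assumes the <s3ObjectListing> XML shown above has been saved locally as s3_object_listing.xml, a hypothetical file name used purely for illustration:

import boto3
import xml.etree.ElementTree as ET

s3 = boto3.client("s3")

# Parse the object listing and fetch each object by its key.
listing = ET.parse("s3_object_listing.xml").getroot()
for summary in listing.findall("s3ObjectSummary"):
    key = summary.findtext("key")
    obj = s3.get_object(Bucket=summary.findtext("bucketName"), Key=key)
    print(key, obj["Body"].read()[:50])   # first 50 bytes of each file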

 

Working with S3 Get Object

Continuing with this use case, I've built out the following process:

As above, I'm returning an object listing, and then splitting the return into separate documents. I'm then passing the <key> element to the Get S3 Object operation to retrieve each specific document.

However, upon executing my process, the results are not quite what I expected:

<s3Object>
   <key>LOAD/2017/02/07/accounts_20170207152832.json</key>
   <folderName>LOAD/2017/02/07</folderName>
   <fileName>accounts_20170207152832.json</fileName>
   <objectContent>Ww0KICAiTGFtIFJlc2VhcmNoIiwNCiAgIjAwMTBNMDAwMDFRT3lZV1FBMSIsDQogICJQcm9zcGVjdCIsDQogIG51bGwNCl0=</objectContent>
</s3Object>

The key, folderName, and fileName are returned as expected. But the objectContent has been Base64 encoded (which allows any file contents, including binary, to be embedded in the XML response). If you want to use the objectContent in your Boomi process, this presents something of a challenge: some elements are returned un-encoded while the objectContent element is encoded. If the entire return were encoded, we could simply use a Data Process step to decode it. But how can we decode a single element?

 

The simplest solution is to capture whichever un-encoded elements you'll need as Dynamic Document Properties, use a Message step to isolate the objectContent, and then decode it with a Data Process step. The solution looks like this:

  1. As above, I'm using my Object Listing to enumerate the files in my bucket, then using Get Object to get the specific files, and finally capturing the key and fileName, to be used later in my process flow, as Dynamic Document Properties.
  2. Here I'm using a Message step to capture the Base64 encoded objectContent:
  3. Next, I can simply use the Data Process step's built in Base64 decode to decode the objectContent. I now have a clear-text document which I can use within my Boomi process - for instance, I can send this into a Map step - and I have my key and fileName captured in DDP's in case I need those as well. 
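
For reference, the decode itself is trivial once the objectContent is isolated. A Python sketch using the encoded value from the response above:

import base64

object_content = "Ww0KICAiTGFtIFJlc2VhcmNoIiwNCiAgIjAwMTBNMDAwMDFRT3lZV1FBMSIsDQogICJQcm9zcGVjdCIsDQogIG51bGwNCl0="

print(base64.b64decode(object_content).decode("utf-8"))
# [
#   "Lam Research",
#   "0010M00001QOyYWQA1",
#   "Prospect",
#   null
# ]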

 

Conclusion

That wraps up the basics of working with the S3 Connector. Overall, it's pretty standard Boomi process development, though I find a few aspects - such as the Base64 encoded objectContent - to be a little less than intuitive. As always, please let me know if this article was helpful for you, and if you have any use cases that are not addressed in this post, feel free to comment and I'll do my best to respond in a timely manner.

 

 

Sol Waters is a presales solution engineer with Dell Boomi where he is responsible for helping customers and prospects make best use of the features, capabilities, and benefits of the Dell Boomi products. Sol brings over a decade of experience in data management to the role.
