With this new AWS Lambda function, it’s never been easier to get file data into Amazon Redshift. You simply push files into a variety of locations on Amazon S3 and have them automatically loaded into your Amazon Redshift clusters.
Using AWS Lambda with Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse available for less than $1000/TB/YR that provides AWS customers with an extremely powerful way to analyze their applications and businesses as a whole. To load clusters, customers ingest data from a large number of sources,such as FTP locations managed by third parties, or internal applications generating load files. The best practice for loading Amazon Redshift is to use the COPY command, which loads data in parallel from Amazon S3, Amazon DynamoDB or an HDFS file system on Amazon EMR.
Whatever the input, customers must run servers that look for new data on the file system, and manage the workflow of loading new data and dealing with any issues that might arise. That’s why we created the AWS Lambda-based Amazon Redshift loader (http://github.com/awslabs/aws-lambda-redshift-loader), to offer you the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain. This is possible because AWS Lambda provides an event-driven, zero-administration compute service. It allows developers to create applications that are automatically hosted and scaled, while providing you with a fine-grained pricing structure.
The function maintains a list of all the files to be loaded from S3 into an Amazon Redshift cluster with DynamoDB. This list ensures that a file is loaded only one time, and allows you to determine when a file was loaded and into which table. Files found in Input locations are buffered up to a specified batch size that you control, or you can specify a time-based threshold that triggers a load.
You can specify any of the many COPY options available, and we support loading both CSV files (of any delimiter), as well as JSON files (with or without JSON path specifications). All passwords and access keys are encrypted for security. With AWS Lambda, you get automatic scaling, high availability, and built-in Amazon CloudWatch logging.
Finally, we’ve provided tools to manage the status of your load processes, with built-in configuration management and the ability to monitor batch status and troubleshoot issues. We also support sending notifications of load status through Amazon SNS, so you have visibility into how your loads are progressing over time.
Accessing the AWS Lambda Amazon Redshift Database Loader
You can download this AWS Lambda function today from AWSLabs: http://github.com/awslabs/aws-lambda-redshift-loader. For example, perform the following steps to complete local setup:
git clone https://github.com/awslabs/aws-lambda-redshift-loader.git
Getting Started: Preparing your Amazon Redshift Cluster
In order to load a cluster, we’ll have to enable AWS Lambda to connect. To do this, we must enable the cluster security group to allow access from the public internet. In the future, AWS Lambda will support presenting the service as though it was inside your own VPC.
To configure your cluster security group for access:
- Log in to the Amazon Redshift console.
- Select Security in the navigation pane on the left.
- Choose the cluster security group in which your cluster is configured.
- Add a new Connection Type of CIDR/IP and enter the value 0.0.0.0/0.
- Select Authorize to save your changes.
We recommend granting Amazon Redshift users only INSERT rights on tables to be loaded. Create a user with a complex password using the CREATE USER command, and grant INSERT using GRANT.
Getting Started: Deploying the AWS Lambda Function
To deploy the function:
- Go to the AWS Lambda console in the same region as your S3 bucket and Amazon Redshift cluster.
- Select Create a Lambda function and enter the name MyLambdaDBLoader (for example).
- Under Code entry type, select Upload a zip file and upload the AWSLambdaRedshiftLoader-1.1.zip from GitHub.
- Use the default values of index.js for the filename and handler for the handler, and follow the wizard for creating the AWS Lambda execution role. We also recommend using the max timeout for the function, which is 60 seconds in preview.
Next, configure an event source, which delivers S3 PUT events to your AWS Lambda function.
- On the deployed function, select Configure Event Source and select the bucket you want to use for input data. Select either the lambda_invoke_role or use the Create/Select function to create the default invocation role.
- Click Submit to save the changes.
When you’re done, you’ll see that the AWS Lambda function is deployed and you can submit test events and view the CloudWatch Logs log streams.
A Note on Versions
We previously released version 1.0 in distribution AWSLambdaRedshiftLoader.zip. This version didn’t use the Amazon Key Management Service for encryption. If you’ve previously deployed and used version 1.0 and want to upgrade to version 1.1, do the following:
- Recreate your configuration by running node setup.js.
- Re-enter the previous values, including connection password and S3 secret key.
- Upgrade the IAM policy for the AWS Lambda Execution Role as described in the next section, because it now has permissions to talk to the Key Management Service.
Getting Started: Lambda Execution Role
You also need to add an IAM policy as shown below to the role that AWS Lambda uses when it runs. After your function is deployed, add the following policy to the lambda_exec_role to enable AWS Lambda to call SNS, use DynamoDB, and perform encryption with the AWS Key Management Service
Getting Started: Support for Notifications
This function can send notifications on completion of batch processing, if required. Using SNS, you can receive notifications through email and HTTP Push to an application, or put them into a queue for later processing. If you would like to receive SNS notifications for succeeded loads, failed loads, or both, create SNS topics and take note of their IDs in the form of Amazon Resource Notations (ARN).
Getting Started: Entering the Configuration
cd aws-lambda-redshift-loader && npm install
Note: To ensure communication with the correct AWS Region, you’ll need to set an environment variable AWS_REGION to the desired location. For example, for US East use us-east=1, and for EU West 1 use eu-west-1.
Next, run the setup.js script by entering node setup.js. The script asks questions about how the load should be done, including those outlined in the setup appendix as the end of this document. The database password, as well as the secret key used by Amazon Redshift to access S3 will be encrypted by the Amazon Key Management Service. Setup will create a new Customer Master Key with an alias named `alias/LambaRedshiftLoaderKey`.
Viewing Previous Batches & Status
If you ever need to see what happened to batch loads into your cluster, you can use the queryBatches.js script to look into the LambdaRedshiftBatches DynamoDB table . It takes the following arguments:
- region: The region in which the AWS Lambda function is deployed
- status: The status you are querying for, including ‘error’, ‘complete’, ‘pending’, or ‘locked’
- date: Optional date argument to use as a start date for querying batches
Running node queryBatches.js eu-west-1 error would return a list of all batches with a status of error in the EU (Ireland) region, such as:
If you require more detail on a specific batch, you can use describeBatch.js to show all detail for a batch. It takes the following arguments:
- region: The region in which the AWS Lambda function is deployed
- batchId: The batch you would like to see the detail for
- s3Prefix: The S3 prefix the batch was created for
These arguments return the batch information as it is stored in DynamoDB:
Clearing Processed Files
We’ll only load a file one time by default, but in certain rare cases you might want to re-process a file, such as when a batch goes into error state for some reason. If so, use the processedFiles.js script to query or delete processed files entries. The script takes an operation type and filename as arguments. Use -q to query if a file has been processed, and -d to delete a file entry. Below is an example of the processed files store:
Reprocessing a Batch
If you need to reprocess a batch, for example if it failed to load the required files for some reason, you can use the reprocessBatch.js script. This takes the same arguments as describeBatch.js (region, batch ID and input location). The original input batch is not affected; instead, each of the input files that were part of the batch are removed from the LambdaRedshiftProcessedFiles table, and then the script forces an S3 event to be generated for the file location in S3. This will be captured and reprocessed by the function as it was originally. You can only reprocess batches that are not in “open” status.
Ensuring Periodic Loads
If you have a prefix that doesn’t receive files very often, and want to ensure that files are loaded every N minutes, use the following process to force periodic loads.
When you create the configuration, add a filenameFilterRegex such as .*.csv, which only loads CSV files that are put into the specified S3 prefix. Then every N minutes, schedule the included dummy file generator through a cron job.
./path/to/function/dir/generate-dummy-file.py <region> <input bucket> <input prefix> <local working directory>
- region: The region in which the input bucket for loads resides
- input bucket: The bucket that is configured as an input location
- input prefix: The prefix that is configured as an input location
- local working directory: The location where the stub dummy file will be kept prior to upload into S3
This writes a file called lambda-redshift-trigger-file.dummy to the configured input prefix, which causes your deployed function to scan the open pending batch and load the contents if the timeout seconds limit has been reached.
Under normal operations, you won’t have to do anything from an administration perspective. Files placed into the configured S3 locations will be loaded when the number of new files equals the configured batch size. You may want to create an operational process to deal with failure notifications, but you can also just view the performance of your loader by looking at Amazon CloudWatch. Open the CloudWatch console and then click Logs in the navigation pane on the left. You can then select the log group for your function, with a name such as /aws/lambda/<My Function>.