Tuesday, December 16, 2008

Amazon S3 from the command line

Amazon.com offers a variety of web services to make building highly scalable web applications easier. One of those services is called the Simple Storage Service (S3). It provides a way to store and retrieve files (up to 5 GB each) on Amazon's distributed servers. S3 has both a SOAP and a REST API. You can use the S3 REST API with only the native command-line utilities installed by default in OS X Leopard (and Linux).

Why use Amazon S3?

Amazon S3 is designed to be robust, fast, and inexpensive. It is ideal for serving small images or small files for web sites. It is also a good solution for off-site backups (under 5 GB per file) because the storage is redundant and inexpensive. For very large backup files, it is not the best solution because of the file size limitation (which can be worked around in some cases), and because the bandwidth required becomes prohibitive.

How does S3 work?

While S3 is cheap, it is not free. To use it, you first need to sign up with Amazon Web Services and obtain an access key ID and secret key. The access keys are required to authenticate to Amazon and to use the API.

Conceptually, S3 stores files (objects) in buckets. First you create a bucket, then you store objects in it. Bucket names are globally unique, and each object ID is unique within its bucket, so the combination of bucket and object ID identifies exactly one object. To S3, all objects are binary blobs. You can attach some metadata to an object when it is stored, but S3 doesn't verify that the metadata accurately describes the object.
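For example, an object stored under the key objectname in a bucket named bucketname (placeholder names) is addressed with a simple path-style URL:

http://s3.amazonaws.com/bucketname/objectname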

With the REST API, standard HTTP transactions are used to GET, PUT, and DELETE objects in buckets. There is no incremental update option. To update an object, you upload a new version that replaces the old one.
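To make that concrete, here is a rough sketch of what a hand-built signed GET request looks like using curl and openssl. The access key, secret key, bucket, and object values are placeholders, and the signing steps follow the S3 REST authentication scheme (an HMAC-SHA1 over the request details) as I understand it:

#!/bin/bash
# sketch: fetch an object with a hand-built signed GET request
# (access key, secret key, bucket, and object values are placeholders)
access_key="your-access-key-id"
secret_key="your-secret-key"
bucket="bucketname"
object="objectname"

date=$(date -u '+%a, %d %b %Y %H:%M:%S GMT')
# the string to sign: verb, content-md5, content-type, date, resource
string_to_sign=$(printf "GET\n\n\n%s\n/%s/%s" "$date" "$bucket" "$object")
# HMAC-SHA1 the string with the secret key, then base64 encode it
signature=$(printf "%s" "$string_to_sign" | openssl dgst -sha1 -hmac "$secret_key" -binary | openssl base64)

curl -H "Date: $date" \
     -H "Authorization: AWS $access_key:$signature" \
     "https://s3.amazonaws.com/$bucket/$object" -o "$object"

This is exactly the kind of plumbing the scripts described below take care of for you.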

Using S3 from BASH

As part of a consulting project, I was asked to upload rotating backup files to S3. The backup files consisted of compressed source code and database dumps.

I started by looking at the sample code on the Amazon developer site. I downloaded some working PHP code that used the REST API and began hacking on it. However, I ran into a serious problem when I tested uploading one of the backups. The sample code was designed for small files that could be uploaded in a single HTTP transaction: it had a limit of 1 MB, while my backup files were around 250 MB. I could have tried to retool the code to break the backup file apart and upload it in chunks, but that seemed daunting. Instead, I went back to the developer site and found a set of BASH scripts that use the curl and openssl utilities to handle the heavy lifting.

There are two scripts in the package, provided by the developer "nescafe5": hmac and s3. The s3 script is the main script; it calls hmac to calculate the hashes needed to authenticate to S3, then uses curl to upload, download, list, or delete objects in a bucket. It can also create and destroy buckets.

To use the s3 script, you need to store your Amazon secret key in a text file and set two environment variables. The INSTALL file included with the package has all the details. The only tricky part I ran into (and, judging by the comments on Amazon, others ran into as well) is creating the secret key text file.

If you open a text editor like vim or nano, copy in your secret key, and save the file, the editor will add a newline character (hex 0A) to the end of the file. The script requires the file to be exactly 40 bytes (the length of the secret key) or it will complain and stop. To create the text file without a newline, use the echo command:

echo -n "secret-key" > secret-key-file.txt

The -n switch tells echo not to append a newline character, which results in a text file of exactly 40 bytes. Once I got the key file created correctly, the s3 script started working, and I was able to upload, download, and list objects in S3.
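You can double-check the result with wc, and printf is a more portable alternative if your shell's echo ignores the -n switch (the "secret-key" value is a placeholder, as above):

# should print 40
wc -c < secret-key-file.txt

# portable alternative to echo -n
printf '%s' "secret-key" > secret-key-file.txt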

Here is an example of a test script I used:

#!/bin/bash
# export required variables
export S3_ACCESS_KEY_ID="99XC79990C39996AR999"
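# note: the secret key variable holds the path to the key file, not the key itself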
export S3_SECRET_ACCESS_KEY="secret-key-file.txt"

# store a file
./s3 put bucketname objectname /path/to/local/file

# list objects in a bucket
./s3 ls bucketname

# download a file
./s3 get bucketname objectname /path/to/local/file
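
For a rotating backup job, a small wrapper along these lines could do the upload. The bucket name, paths, and date-stamped naming scheme here are hypothetical; only the put syntax comes from the s3 script itself:

#!/bin/bash
# assumes S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY are already exported
# upload tonight's backup under a date-stamped object name
backup_file="/path/to/backups/site-backup.tar.gz"
object_name="site-backup-$(date +%Y%m%d).tar.gz"

./s3 put backupbucket "$object_name" "$backup_file"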

Mac and Linux friendly

I tested the s3 scripts on OS X Leopard and Red Hat Enterprise Linux 4, and they worked perfectly on both. Amazon Web Services offers powerful solutions to some tough problems. Other services include the Elastic Compute Cloud (EC2) for virtual servers, the Simple Queue Service (SQS) for passing messages between applications, and Mechanical Turk for delegating work to humans. There is some truly cutting-edge stuff going on at Amazon.