Friday, March 25, 2011

ec2 ami, ebs and s3 storage

Following up on my previous post, I am trying to replicate DataWrangling's excellent TrendingTopics website using Amazon's EC2 cloud and Hadoop. The first part of DW's instructions was to get Cloudera's Hadoop installed and successfully tested. So far, so good, as that post showed.

Today's Goal
Next, DataWrangling's instructions would have me fire up a virtual machine in the EC2 cloud, create a chunk of storage and copy over some files to later munge through with Hadoop. I thought this would be easy, but Amazon's litany of access keys, secret keys, certificates and keypairs proved to be a roadblock until I read the appropriate section of the manual, which told me what each one did:
  1. Amazon login and password to launch and administer Amazon EC2 instances through the AWS Management Console
  2. Access Key ID and Secret Access Key to launch and administer Amazon EC2 instances through the Query API and many UI-based tools (e.g., ElasticFox)
  3. X.509 certificate and private key to launch and administer Amazon EC2 instances through the SOAP API and command line interface
  4. Amazon EC2 Key Pair (SSH) enables you to connect to Linux/UNIX instances through SSH (see the example after this list)
  5. Tag key-value pairs to simplify EC2 administration
Unbelievable. Of course, it is better to be more secure than less.
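
For example, item 4 in that list is the keypair file you hand to ssh when logging into a running instance. A quick sketch (the keypair filename and hostname here are made up):
[sodo@ogre ~]$ ssh -i ~/.ssh/my-ec2-keypair.pem root@ec2-XX-XX-XX-XX.compute-1.amazonaws.com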

A few years back I had played around with EC2, but my memory has grown foggy since then. It was time to bite the bullet and figure out this latest installment of Amazon Web Services.

AWS Management
The main thing to learn about Amazon Web Services is that you can manage the various services like EC2 and S3 in two ways:
1) via the AWS Management Console
2) via the Amazon EC2 command line tools, downloadable from here

I started off using the management console to fire up a virtual machine, or what Amazon calls an Amazon Machine Image (AMI). Doing this via the console was easy enough. I simply logged in and started configuring an AMI. However, I noticed that most of DataWrangling's instructions used the command line tools. So I decided to tackle the AWS Toolkit install.

Again, the download for the command line tools is here. The tools are Java-based, so they run on any platform. Prereqs are a Java JRE along with a couple of environment variables, like so:
export EC2_HOME=/usr/bin/ec2-api-tools-1.4.1.2
export JAVA_HOME=/usr
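
The tools also need to be on your path. Assuming the toolkit's standard layout, with the scripts under a bin/ directory, one more line takes care of that:
export PATH=$PATH:$EC2_HOME/bin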


Once the toolkit was installed and the tools were accessible via the path, they needed another two environment variables in order to talk to the AWS environment:
export EC2_PRIVATE_KEY=/mnt/doc/software/amazon/certs/pk-QRHO7BYVS7ZI3ACWFSEZOB.pem
export EC2_CERT=/mnt/doc/software/amazon/certs/cert-BYVS7ZI3C2CWFSEBC7ZOB.pem


The X.509 cert and private key above are found under Account -> Security Credentials. Once the X.509 certificates are created, the AWS tools come to life! If you don't have the proper certs, you'll see error messages like this:
[sodo@ogre ~]$ ec2-describe-instances
Client.MalformedSOAPSignature: Invalid SOAP Signature. Failed to check signature with X.509 cert


Here's a very nice explanation of the purpose and use of the different AWS certs and keys needed. Look for Mitch's response @ Oct 30, 2009 5:03 PM.

"Keep the x509 private key safe, because there is NO WAY to redownload it if you've lost it" http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=200123040

A handy test to validate that the public and private keys match is to compare the modulus output of these two commands:
[sodo@ogre 509cert]$ openssl x509 -in cert-BYVS7ZI3C2CWFSEBC7ZOB.pem -text
[sodo@ogre 509cert]$ openssl rsa -in pk-QRHO7BYVS7ZI3ACWFSEZOB.pem -text


As long as they match, you're good to go!
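
If you'd rather not eyeball the full -text dumps, hashing just the modulus makes the comparison a one-liner for each file (same cert and key filenames as above):
[sodo@ogre 509cert]$ openssl x509 -noout -modulus -in cert-BYVS7ZI3C2CWFSEBC7ZOB.pem | openssl md5
[sodo@ogre 509cert]$ openssl rsa -noout -modulus -in pk-QRHO7BYVS7ZI3ACWFSEZOB.pem | openssl md5
If the two md5 sums are identical, the cert and key belong together.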

Some EC2 commands
[sodo@ogre ~]$ ec2-describe-regions
REGION eu-west-1 ec2.eu-west-1.amazonaws.com
REGION us-east-1 ec2.us-east-1.amazonaws.com
REGION ap-northeast-1 ec2.ap-northeast-1.amazonaws.com
REGION us-west-1 ec2.us-west-1.amazonaws.com
REGION ap-southeast-1 ec2.ap-southeast-1.amazonaws.com

[sodo@ogre ~]$ ec2-describe-instances
RESERVATION r-e6cb234b 73784173 default
INSTANCE i-2134419 ami-5394733a ec2-XX-XX-XX-XX.compute-1.amazonaws.com running rook 0 m1.small us-east-1d
BLOCKDEVICE /dev/sdf vol-3e3c05 2011-03-26T03:53:39.000Z
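
Two more commands from the same toolkit are worth knowing about for the volume and snapshot work in the next section (output omitted here):
[sodo@ogre ~]$ ec2-describe-volumes
[sodo@ogre ~]$ ec2-describe-snapshots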


Elastic Block Storage hosting the Wikipedia data
Once I had the base AMI running, I added an Elastic Block Storage (EBS) volume created from a publicly available snapshot of the Wikipedia logfile dataset...all 300GB of it! Neato.
[sodo@ogre ~]$ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1d
VOLUME vol-6d3e0c05 320 snap-753dfc1c us-east-1d creating 2011-03-26T03:51:28+0000
[sodo@ogre ~]$ ec2-attach-volume vol-6d3e0c05 -i i-73e19 -d /dev/sdf
ATTACHMENT vol-6d3e0c05 i-73e19 /dev/sdf attaching 2011-03-26T03:53:31+0000
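
Attaching alone isn't enough; the volume still has to be mounted on the instance before the data shows up. Roughly (device name from the attach command above, mount point matching what I use later):
root@ec2-production:~# mkdir -p /mnt/wikidata
root@ec2-production:~# mount /dev/sdf /mnt/wikidata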


Writing to S3 Storage
I used the AWS Management Console to provision a storage bucket for myself. Here are a few s3cmd commands I learned along the way:
root@ec2-67-202-43-31:~# s3cmd info s3://$MYBUCKET
Bucket 'sodotrendingtopics':
Location: any
root@ec2-67-202-43-31:~# s3cmd ls s3://$MYBUCKET
Bucket 'sodotrendingtopics':
2011-03-26 04:48 39158 s3://sodotrendingtopics/wikistats


I had access to the public dataset via the EBS volume, but the idea is that you attach the EBS volume and then copy the data off to an S3 storage bucket. At that point, you can munge through the data with Hadoop, R or any other statistical tools you have. It was at the data copying stage that I hit a couple of snags:
1) I received a socket error trying to write files to my S3 storage:
root@ec2-production:/mnt/wikidata# time s3cmd put --force /mnt/wikidata/wikistats/pagecounts/pagecounts-20090401* s3://$MYBUCKET/wikistats/
Traceback (most recent call last):
File "/usr/bin/s3cmd", line 740, in
cmd_func(args)
..
File "", line 1, in sendall
socket.error: (32, 'Broken pipe')


This was because I had a capital letter in my S3 storage bucket name! ARGH! Luckily, someone had already encountered the problem, so I simply deleted my empty bucket and created a new one with an all-lowercase name. Silly. By the way, an environment variable like $MYBUCKET can be set so you don't have to type the name of your storage bucket all the time, as shown below.
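
Setting that up again from scratch looks something like this (the bucket name is mine; yours will differ):
root@ec2-production:~# export MYBUCKET=sodotrendingtopics   # all lowercase to avoid the broken pipe error above
root@ec2-production:~# s3cmd mb s3://$MYBUCKET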

2) The DataWrangling command to copy over the Wikipedia log dataset did not work as expected. This command:
/mnt# time s3cmd put --force wikidata/wikistats/pagecounts/pagecounts-200904* s3://$MYBUCKET/wikistats/

Just kept on overwriting the wikistats directory and did not plop any files into the bucket. I changed the command to this:
for FILE in $(ls -1 pagecounts-200904*); do ls $FILE;time s3cmd put --force $FILE s3://sodotrendingtopics/wikistats/$FILE;echo;done

This way, the pagecounts files get plopped into the S3 storage bucket properly. I decided to only copy over one month of data, which came to about 40GB and took almost two hours to copy from EBS to my S3 bucket.
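
To sanity-check that the month of pagecounts actually landed where I expected, a couple of quick s3cmd queries should do the trick (bucket name as above):
root@ec2-production:~# s3cmd ls s3://sodotrendingtopics/wikistats/ | head
root@ec2-production:~# s3cmd du s3://sodotrendingtopics/wikistats/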

Long Day, but Success
In any case, that's been today's progress. Slow but sure. Next up:
Customizing the Cloudera Hadoop Ubuntu launch scripts

References
Amazon Web Services
Amazon S3 Beginner's Guide
DataWrangling's instructions
Amazon S3 FAQ (including charge calculator)
Understanding AWS Access Credentials
Download S3cmd
