Amazon AWS is great for MapReduce jobs. You can take advantage of the AWS Elastic MapReduce service and load and transform data directly from/to S3 buckets, or you can roll your own and create a Hadoop image in EC2 and spin up instances as required.
There are two ways to create a Hadoop image: you can customize a raw instance and once you have it working the way you would like, take a snapshot of the volume. Further instances can be created by cloning the snapshot. Obviously we need a way to differentiate instances into NameNodes, DataNodes, JobTrackers, and TaskTrackers. The easiest way to do this is probably to set up an initialization script in your image which reads instance user data/tags and customizes the configuration accordingly. User data/tags can be set when the instance is created, either through the AWS console or ec2-api command line.
Alternatively, a startup script can be supplied to a raw instance that will be executed when the instance initializes and configure the instance from scratch. User scripts can be injected via the user-data field in the AWS console or ec2-api command line. How the script is handled exactly depends on the operating system image selected. Official Ubuntu images from Canonical are configured to check the first two characters of the user-data script for #! in which case it executes the user data as a shell script. As in the case above, the initialization script can use user tags and other user data fields to customize the instance into the specific type of Hadoop node required.
In addition to the facilities described above, Canonical Ubuntu images also include a package called cloud-init. Cloud-init is the Ubuntu package that handles early initialization of a cloud instance. It is installed in the UEC Images and also in the official Ubuntu images available on EC2. Cloud-init operates on user data received during initialization and accepts this data in multiple formats: gzipped compressed content, cloud-config scripts, shell scripts, and MIME multi part archive, among others. Shell scripts function pretty much the same as described above while cloud-config scripts use a custom syntax specifically for configuring a Ubuntu instance via the built-in administrative interfaces (mostly APT package manager). Gzip content is just a way to compress the user-data payload and MIME multi part archive is a method to combine cloud-init scripts and shell scripts into a single payload.
The cloud-init script below performs an APT update and upgrade as well as installing specific packages required for the Hadoop install, including OpenJDK. The Oracle/Sun JDK is a bit harder to install since the related packages are no longer distributed by Canonical and setup requires configuring a thirdparty repository and pre-accepting the Oracle license, all tasks which can be performed using Cloud-init fucntions albeit with a bit more effort. The script also sets the server timezone and configures log sinks, very useful for debugging.
In addition to the cloud-init script, we created a shell script to download the Hadoop source, compile, install and configure. In this case we configured a single node as a pseudo cluster and have not used user data/tags to customize settings as described above. The script also downloads and installs the LZO compression modules for HDFS, imports SSH keys to allow communication between the JobTracker and TaskManagers, although not required in this single server deployment. Finally, the script disables IPv6, formats the HDFS filesystem and starts the Hadoop daemons.
The two scripts above are packaged as a MIME multi part message to be delivered to cloud-init via the EC2 user data payload. The cloud-init script write-mime-multipart is used to perform this function as shown in the simple Makefile below.