Friday, June 15, 2012

Bootstrapping an EC2 Hadoop Instance with a User-Data script

Amazon AWS is great for MapReduce jobs. You can take advantage of the AWS Elastic MapReduce service and load and transform data directly from/to S3 buckets, or you can roll your own and create a Hadoop image in EC2 and spin up instances as required.

There are two ways to create a Hadoop image. The first is to customize a raw instance and, once it is working the way you would like, take a snapshot of the volume; further instances can then be created by cloning the snapshot. Obviously we need a way to differentiate instances into NameNodes, DataNodes, JobTrackers, and TaskTrackers. The easiest way to do this is probably to bake an initialization script into the image which reads the instance user data/tags and customizes the configuration accordingly (a rough sketch follows). User data/tags can be set when the instance is created, either through the AWS console or the ec2-api command line.

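As an illustrative sketch only, such an initialization script could read the user data from the instance metadata service and decide which daemons to start. The hadoop_role key below is an assumption made for this example, not something defined in this post; any convention for encoding the node type in user data or tags would do:

#!/bin/bash
# Illustrative sketch: pick this node's role from a "hadoop_role=..." line in
# the EC2 user data. The key name "hadoop_role" is an assumption for this
# example; the install path matches the script later in this post.
HADOOP_INSTALL=/usr/local/hadoop/hadoop-0.20.203.0
ROLE=$(curl -s http://169.254.169.254/latest/user-data | grep '^hadoop_role=' | cut -d= -f2)

case "${ROLE}" in
    namenode)    su hadoop -c "${HADOOP_INSTALL}/bin/hadoop-daemon.sh start namenode" ;;
    datanode)    su hadoop -c "${HADOOP_INSTALL}/bin/hadoop-daemon.sh start datanode" ;;
    jobtracker)  su hadoop -c "${HADOOP_INSTALL}/bin/hadoop-daemon.sh start jobtracker" ;;
    tasktracker) su hadoop -c "${HADOOP_INSTALL}/bin/hadoop-daemon.sh start tasktracker" ;;
    *)           echo "Unrecognised hadoop_role: '${ROLE}'" >&2 ;;
esac
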
Alternatively, a startup script can be supplied to a raw instance; it is executed when the instance initializes and configures the instance from scratch. User scripts can be injected via the user-data field in the AWS console or the ec2-api command line, as in the example below. Exactly how the script is handled depends on the operating system image selected. Official Ubuntu images from Canonical check whether the user data begins with #!, in which case the user data is executed as a shell script. As in the case above, the initialization script can use user tags and other user-data fields to customize the instance into the specific type of Hadoop node required.

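With the ec2-api-tools this is just a matter of pointing the launch command at the script file; something along these lines, where the AMI ID, key pair, and bootstrap.sh file name are placeholders:

# Launch a raw Ubuntu instance and hand it our bootstrap script as user data
# (ami-xxxxxxxx, my-keypair and bootstrap.sh are placeholders)
ec2-run-instances ami-xxxxxxxx -t m1.large -k my-keypair -f bootstrap.sh
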
In addition to the facilities described above, Canonical Ubuntu images also include a package called cloud-init, which handles early initialization of a cloud instance. It is installed in the UEC images and in the official Ubuntu images available on EC2. Cloud-init operates on user data received during initialization and accepts this data in multiple formats: gzip-compressed content, cloud-config scripts, shell scripts, and MIME multi-part archives, among others. Shell scripts work much as described above, while cloud-config scripts use a custom syntax specifically for configuring an Ubuntu instance via the built-in administrative interfaces (mostly the APT package manager). Gzip content is just a way to compress the user-data payload, and a MIME multi-part archive is a way to combine cloud-config scripts and shell scripts into a single payload, as illustrated below.

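The combined payload is an ordinary MIME message; cloud-init dispatches each part according to its declared content type (text/cloud-config, text/x-shellscript, and so on). Trimmed down, a payload combining the two scripts from this post looks roughly like this (boundary string and elided bodies are illustrative):

Content-Type: multipart/mixed; boundary="===============0123456789=="
MIME-Version: 1.0

--===============0123456789==
Content-Type: text/cloud-config; charset="us-ascii"
Content-Disposition: attachment; filename="ubuntu-config.txt"

#cloud-config
...

--===============0123456789==
Content-Type: text/x-shellscript; charset="us-ascii"
Content-Disposition: attachment; filename="hadoop-setup.sh"

#!/bin/bash
...

--===============0123456789==--
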
The cloud-config script below performs an APT update and upgrade and installs the packages required for the Hadoop install, including OpenJDK. The Oracle/Sun JDK is a bit harder to install since the related packages are no longer distributed by Canonical; setup requires configuring a third-party repository and pre-accepting the Oracle license, all tasks that can be performed with cloud-init functions, albeit with a bit more effort (a sketch follows the cloud-config script). The script also sets the server timezone and configures log sinks, which is very useful for debugging.

#cloud-config
# Provide debconf answers
debconf_selections: |
  debconf debconf/priority select low
  debconf debconf/frontend select readline
# Update apt database on first boot
apt_update: true
# Upgrade the instance on first boot
apt_upgrade: true
# Install additional packages on first boot
packages:
- lzop
- liblzo2-2
- liblzo2-dev
- git-core
- openjdk-6-jre-headless
- openjdk-6-jdk
- ant
- make
- jflex
# timezone: set the timezone for this instance
# the value of 'timezone' must exist in /usr/share/zoneinfo
timezone: America/Halifax
# manage_etc_hosts:
# default: false
# Setting this config variable to 'true' will mean that on every
# boot, /etc/hosts will be re-written from /etc/cloud/templates/hosts.tmpl
# The strings 'hostname' and 'fqdn' are replaced in the template
# with the appropriate values
manage_etc_hosts: true
# Print message at the end of cloud-init job
final_message: "The system is finally up, after $UPTIME seconds"
# configure where output will go
# 'output' entry is a dict with 'init', 'config', 'final' or 'all'
# entries. Each one defines where
# cloud-init, cloud-config, cloud-config-final or all output will go
# each entry in the dict can be a string, list or dict.
# if it is a string, it refers to stdout and stderr
# if it is a list, entry 0 is stdout, entry 1 is stderr
# if it is a dict, it is expected to have 'output' and 'error' fields
# default is to write to console only
# the special entry "&1" for an error means "same location as stdout"
# (Note, that '&1' has meaning in yaml, so it must be quoted)
output:
  init: "> /tmp/cloud-init.out"
  config: [ "> /tmp/cloud-config.out", "> /tmp/cloud-config.err" ]
  final:
    output: "> /tmp/cloud-final.out"
    error: "&1"
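
For comparison, the Oracle/Sun JDK route mentioned earlier might look something like the cloud-config fragment below. This is only a sketch, assuming the third-party webupd8team/java PPA and its oracle-java7-installer package; it is not what the rest of this post uses:

#cloud-config
# Sketch only: install the Oracle JDK via a third-party PPA. The PPA and the
# installer package are assumptions, not part of the setup described here.
apt_sources:
  - source: "ppa:webupd8team/java"
# Pre-accept the Oracle license so the installer does not stop at a prompt
debconf_selections: |
  oracle-java7-installer shared/accepted-oracle-license-v1-1 select true
  oracle-java7-installer shared/accepted-oracle-license-v1-1 seen true
packages:
  - oracle-java7-installer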

In addition to the cloud-config script, we created a shell script to download, install, and configure Hadoop. In this case we configure a single node as a pseudo-cluster and have not used user data/tags to customize settings as described above. The script also downloads and builds the LZO compression modules for HDFS and imports SSH keys for passphraseless SSH between the JobTracker and TaskTrackers, although this is not required in a single-server deployment. Finally, the script disables IPv6, formats the HDFS filesystem, and starts the Hadoop daemons; a quick smoke test is shown after the script.

#!/bin/bash
#
# Install and configure Hadoop on this node
#
# This script will be run as root
#
CONFIGURED_USER=ubuntu
HADOOP_HOME=/usr/local/hadoop
HADOOP_VERSION=0.20.203.0
HADOOP_INSTALL=${HADOOP_HOME}/hadoop-${HADOOP_VERSION}
HADOOP_TMP_DIR=${HADOOP_INSTALL}/datastore
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export CFLAGS=-m32
export CXXFLAGS=-m32
export HADOOP_INSTALL=${HADOOP_HOME}/hadoop-${HADOOP_VERSION}
export PATH=${JAVA_HOME}/bin:${HADOOP_INSTALL}/bin:${PATH}
# Instance meta-data
# Add profile options
cat >>/etc/bash.bashrc <<EOF
# Added by cloud-init
export JAVA_HOME=${JAVA_HOME}
export LD_LIBRARY_PATH=${JAVA_HOME}/lib:\$LD_LIBRARY_PATH
export HADOOP_HOME=${HADOOP_HOME}
export HADOOP_INSTALL=${HADOOP_INSTALL}
export PATH=\$JAVA_HOME/bin:\$HADOOP_INSTALL/bin:\$PATH
EOF
# Create hadoop user
useradd -s /bin/bash -c "Hadoop user" -m -d ${HADOOP_HOME} hadoop
# Download hadoop distribution
wget http://mirror.csclub.uwaterloo.ca/apache//hadoop/common/hadoop-0.20.203.0/hadoop-${HADOOP_VERSION}rc1.tar.gz -O /mnt/hadoop-${HADOOP_VERSION}rc1.tar.gz
tar zxvf /mnt/hadoop-${HADOOP_VERSION}rc1.tar.gz -C ${HADOOP_HOME}
# Hadoop config: pseudo cluster
cat >>${HADOOP_INSTALL}/conf/hadoop-env.sh <<EOF
# Added by cloud-init
export JAVA_HOME=${JAVA_HOME}
EOF
cat >${HADOOP_INSTALL}/conf/core-site.xml <<\EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/datastore/hadoop-${user.name}</value>
  </property>
</configuration>
EOF
cat >${HADOOP_INSTALL}/conf/mapred-site.xml <<\EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
  </property>
</configuration>
EOF
cat >${HADOOP_INSTALL}/conf/hdfs-site.xml <<\EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
# Download hadoop-lzo distribution
echo "### Downloading hadoop-lzo distibution"
sleep 300
HADOOP_PLATFORM=$(hadoop org.apache.hadoop.util.PlatformName)
pushd /mnt
git clone https://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant compile-native tar
cp build/hadoop-lzo-*.jar ${HADOOP_INSTALL}/lib
cp -r build/native/${HADOOP_PLATFORM}/lib/libgplcompression.* ${HADOOP_INSTALL}/lib/native/${HADOOP_PLATFORM}
popd
# Change file ownership to hadoop
chown -R hadoop:hadoop ${HADOOP_INSTALL}
# Set up Hadoop tmp dir
mkdir ${HADOOP_TMP_DIR}
chown hadoop:hadoop ${HADOOP_TMP_DIR}
# Clean up
#echo "### Cleaning up hadoop tarball"
#rm -rf /mnt/hadoop-${HADOOP_VERSION}rc1.tar.gz
#rm -rf /mnt/hadoop-lzo
# Get mapreduce code
#su -l ${CONFIGURED_USER} -c 'git clone git://github.com/justinkamerman/fiddlenet.git'
# Set up passphraseless ssh
echo "### Setting up passphraseless ssh"
mkdir ${HADOOP_HOME}/.ssh
cat >${HADOOP_HOME}/.ssh/config <<EOF
Host *
StrictHostKeyChecking no
EOF
cat >${HADOOP_HOME}/.ssh/id_rsa <<EOF
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAqq3q4eQi9xqPRoTa+gsNIfnB1tIvnNIu6P4B/rejcAgmfxNa
TLILqSDBWYwI6xsL3MaH0hBY+c/SghcxrT3JL2z0TKfmexaUtC3+js8h++F+vPv9
fZaXojelNyXsOlgTBa+e4thIn7j/44ozMs/Uvy6PkGVukrPXRV57b9KEriAtEYJF
3DdW/Y6S89+gsV7Scwty3pOJ7eeNJEirTCdl0ox+/aNFNxlM29sOqqB43q1fKLZx
223CginVtd+fO8VLzU0+PdyxFZ9qCMVPuEP+bl3uSWCdlU9cfj+LY5cXch+kFdoN
Key7VqH2aTxXBDL8t1dwssUnlSkwtOt4Wg1qzQIDAQABAoIBAFpFXeNXa/7Rd1HO
1ppE2g9ML29U/4WrzM/B+IAl1DVeui2fqLTDvlMXVevsmpLuXRnJjvBVYRnPBwFz
Dv0Xnp6Mu7EHZGlZihC5+tbBSrITk5qUlH+l9FEBqUo/rm81QepR9nD3/4EqsXxB
Dc8kCNuM3rV6UD8bCxJPZG3CJBaLZId2xueaejWgAnx9L6bx4G8hTfBGgGpPeZ6t
sfCtukMZnAAADZRImjxfIUjDTNbdGVTCmYgWL9cg7j8VsFa7nISbPCmmyyTU6wQa
6nXEdeazejYmGcjZopPlojcQJ6V6uQMljIrI+3mx7UzQuEz/Fjqa4vvTV7H+t6fJ
qSSErgECgYEA3f092vTWF3yRRVV6MSyrvvw4DfPgFZs1IXIP72WxmK4SfWY1vwEe
/pQprSvctd6jW71G9K3SZsFoDQmuDrYmda/66YC61jDgeInbTxb9Cc9Wiy11D4v1
ddeqkKX2yVBUAhd80bhAXFJNDWhesd6nXzgB9QuPHKnmLYSr1WMwO8ECgYEAxNQ5
cJpLH0W2AG94cfRjYAfdxT421tsYX0W/MRtq3Rqj0L74UVeYgG+YOl30GJkg5H1i
D04EWk94hfNviLg6HfIDo5iFqIbFWuzXjI194JzGAMPaEPJi3w/PgJAVc1QiDjkq
DV7BQXgvz/InTQnWznOVIp8sU6SBLB2kUzpa4g0CgYEA3d7kWdlnuaW5FFEwhcGe
Do7L/7YF+9JasgjswFslu/IPbOIhSbx3G/8+AGTcfbH+GAz/xEGPD0CzHITWQMHx
gqLW51bQZpAHarJuTYgudAWU/Bn87AL43EUnptcZ52+v5z9Oc9XyDdP8SzBLpP9i
zZqO6joZWY6+DjSSAf7XEIECgYEAifFaGCpqP45xkTiOJv7prmGU8Sk68bU3DX4q
ElZuvGpxKFjOWuOTA2AyRaWW7q5SuQ+Oa793mXtcsjP7lMvYHyh/mGXKNmPNaH3Y
Sq7W61W0BtE7wOi+linUePuBrQPnoiQ57ojb0/BRQeEp3fnpS2MBv/Ph8vS1ep+D
jLi2/PkCgYAzqknNfAkItNFRljQ1VDRpq8vrCANvQ1OaRa22zuUe7g8Vi6qr8uiB
eAWfWirpwIB2YnvKcpKlgM6DASO5a8t6tyCC6l1iV3BMgXLcUeMwmbi5ocwpCVoy
8mlCSVDCFzkwPSPr7h2OFI+fYPery0y9L3Vs19m+pbcN==
-----END RSA PRIVATE KEY-----
EOF
cat >${HADOOP_HOME}/.ssh/authorized_keys <<EOF
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCqrerh5CL3Go9GhNr6Cw0h+cHW0i+c0i7o/gH+t6NwCCZ/E1pMsgupIMFZjAjrGwvcxofSEFj5z9KCFzGtPckvbPRMp+Z7FpS0Lf6OzyH74X68+/19lpeiN6U3Jew6WBMFr57i2EifuP/jijMyz9S/Lo+QZW6Ss9dFXntv0oSuIC0RgkXcN1b9jpLz36CxXtJzC3Lek4nt540kSKtMJ2XSjH79o0U3GUzb2w6qoHjerV8otnHbbcKCKdW13587xUvNTT493LEVn2oIxU+4Q/5uXe5JYJ2VT1x+P4tjlxdyH6QV2g0p7LtWofZpPFcEMvy3V3CyxSeVKT
EOF
chown -R hadoop:hadoop ${HADOOP_HOME}/.ssh
chmod 700 ${HADOOP_HOME}/.ssh
chmod 600 ${HADOOP_HOME}/.ssh/id_rsa
# Disable ipv6
echo "### Disabling ipv6 stack"
cat >>/etc/sysctl.conf <<EOF
# IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
sysctl -p
# Namenode
# Format hdfs filesystem
echo "### Formatting hdfs filesystem"
su hadoop -c "${HADOOP_INSTALL}/bin/hadoop namenode -format"
# Start hadoop daemons
echo "### Starting Hadoop daemons"
su hadoop -c "${HADOOP_INSTALL}/bin/start-all.sh"
# Create user directory
echo "### Creating hdfs directory for user ${CONFIGURED_USER}"
su hadoop -c "{HADOOP_INSTALL}/bin/hadoop fs -mkdir /user/{CONFIGURED_USER}"
# Change ownership of user directory
echo "### Changing ownership of hdfs directory for user ${CONFIGURED_USER}"
su hadoop -c "{HADOOP_INSTALL}/bin/hadoop fs -chown {CONFIGURED_USER} /user/${CONFIGURED_USER}"
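
Once cloud-init has finished (the log sinks configured earlier, /tmp/cloud-final.out in particular, are handy for watching progress), a quick smoke test of the pseudo-cluster might look like the following; the pi estimator is the stock examples jar shipped in the 0.20.203.0 tarball:

# List the running Hadoop daemons (NameNode, DataNode, JobTracker, TaskTracker, ...)
sudo su hadoop -c "/usr/lib/jvm/java-6-openjdk-amd64/bin/jps"

# Confirm HDFS is up and the user directory was created
sudo su hadoop -c "/usr/local/hadoop/hadoop-0.20.203.0/bin/hadoop fs -ls /user"

# Run the bundled pi estimator as an end-to-end MapReduce check
sudo su hadoop -c "/usr/local/hadoop/hadoop-0.20.203.0/bin/hadoop jar \
    /usr/local/hadoop/hadoop-0.20.203.0/hadoop-examples-0.20.203.0.jar pi 2 10"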

The two scripts above are packaged as a MIME multi-part message and delivered to cloud-init via the EC2 user-data payload. The write-mime-multipart helper distributed with cloud-init is used to perform this packaging, as shown in the simple Makefile below.

WRITE-MIME_MULTIPART=./bin/write-mime-multipart

.PHONY: clean

cloud-config.txt: ubuntu-config.txt hadoop-setup.sh
	$(WRITE-MIME_MULTIPART) --output=$@ $^

clean:
	$(RM) cloud-config.txt
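
Running make produces cloud-config.txt, which is then supplied as the user data at launch time, just like a plain user-data script. Note that EC2 caps user data at 16 KB, which is one reason cloud-init also accepts gzip-compressed payloads. A launch might look roughly like this (the AMI ID and key pair are placeholders):

# Build the combined payload and launch an instance with it
make cloud-config.txt
ec2-run-instances ami-xxxxxxxx -t m1.large -k my-keypair -f cloud-config.txt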