Cloudera(tm) Hadoop Integration Guide

To get Cloudera Hadoop running within your QuantaStor system, the steps are essentially the same as they would be on a standard Ubuntu Precise server, as this is the distribution that QuantaStor v3 builds upon. As such, if you're following instructions from the Cloudera web site, be sure to refer to the sections regarding Ubuntu Precise.

Also keep in mind that you'll still need to set up the other Hadoop services and components, such as the name node and job tracker. The intent of this how-to is simply to enable you to use your QuantaStor storage appliance nodes as data nodes for better Hadoop performance with less hardware.

To get started, log in to your system using SSH or the console, then run the commands shown below in this HOW-TO guide. Note that all of these commands should be run as root, so be sure to run 'sudo -i' to get super-user privileges before you start.

Step 1 - Add the Cloudera package server

Run the following commands to add the GPG key for Cloudera's packages:

sudo -i 
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
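
If you'd like to verify that the key was imported, you can list apt's trusted keys and look for Cloudera's entry (the exact key description may vary by CDH release):

apt-key list | grep -i cloudera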


Next you'll need to tell QuantaStor where the Cloudera package servers are located by creating a sources file called /etc/apt/sources.list.d/cloudera.list with the following contents:

deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib

You can create the above file using nano or vi, or you can generate it with a couple of echo commands:

echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" > /etc/apt/sources.list.d
echo "deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" >> /etc/apt/sources.list.d

Step 2 - Update the package repository

Now we can update the local system's package repository database like so:

apt-get update
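
If the update completes without errors, you can confirm that apt now sees the Cloudera repository by checking the candidate version of one of the packages from Step 4 below, for example:

apt-cache policy hadoop-hdfs-datanode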

Step 3 - Configure Java Path (JAVA_HOME)

Hadoop includes a script that tries to detect where the Java runtime is installed, but unfortunately it has trouble finding the amd64 JDK, so you'll need to create this symbolic link to help it out:

ln -s /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-openjdk

If you skip this step, you'll see an error like this when Hadoop tries to start up:

root@QSDGRID-node001:/etc/apt/sources.list.d# service hadoop-hdfs-datanode start
 * Starting Hadoop datanode:
Error: JAVA_HOME is not set and could not be found.
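
As a quick sanity check, assuming the OpenJDK 6 package is installed at the path above, you can confirm that the symlink resolves to a working Java binary:

# Verify the symlink and check that the JDK it points at runs
ls -l /usr/lib/jvm/java-openjdk
/usr/lib/jvm/java-openjdk/bin/java -version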

Step 4 - Install Hadoop

Finally, we're ready to install Hadoop. You can find the full installation instructions here:

https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation#CDH4Installation-Step2%3AInstallCDH4withMRv1

Generally speaking, you'll run your name nodes and job trackers on separate servers, with your QuantaStor nodes running just the data node and task tracker services, like so:

sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
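
Once the packages are installed, you can bring the services up with the standard init scripts. Note that the tasktracker will likely fail with the JAVA_HOME error shown earlier until you apply the fix in Step 5:

service hadoop-hdfs-datanode start
service hadoop-0.20-mapreduce-tasktracker start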

Step 5 - Fixup JAVA_HOME for MapReduce Task Tracker service

The startup script for the mapreduce-tasktracker service doesn't have its JAVA_HOME set up properly. You can fix this by adding the following line to the top of the file '/etc/init.d/hadoop-0.20-mapreduce-tasktracker', right after the line containing '### END INIT INFO':

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

Alternatively, you can add the following JAVA_HOME auto-detection logic (the same logic already included in '/etc/init.d/hadoop-hdfs-datanode') to the top of '/etc/init.d/hadoop-0.20-mapreduce-tasktracker':

# Autodetect JAVA_HOME if not defined
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
  . /usr/libexec/bigtop-detect-javahome
elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
  . /usr/lib/bigtop-utils/bigtop-detect-javahome
fi
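
With either fix in place, restart the tasktracker service and it should come up without the JAVA_HOME error:

service hadoop-0.20-mapreduce-tasktracker restart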

Step 6 - Create a Network Share via QuantaStor web management interface

When you create a network share in QuantaStor, it creates a folder within the storage pool and mounts it at /export/<SHARENAME>, where SHARENAME is the name you gave your network share. This folder is where you should store the Hadoop data, rather than in a folder on the QuantaStor system/boot drive. Instead of the /export/<SHARENAME> path, you can also use the full absolute path to the folder, which is '/mnt/storage-pools/qs-<POOL-GUID>/<SHARENAME>'.
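
One note on permissions: under CDH4 the datanode service runs as the 'hdfs' user, so before pointing HDFS at the share you'll want to make sure that user owns the share folder. Assuming you named your share 'hadoopdata' (the name used in the next step):

# Give the hdfs user ownership of the share folder and verify
chown -R hdfs:hdfs /export/hadoopdata
ls -ld /export/hadoopdata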

Step 7 - Setting the Hadoop DFS data directory

By default, hdfs-site.xml is not configured to use the network share that you created to house your Hadoop data. To set this, you'll need to edit the /etc/hadoop/conf/hdfs-site.xml file and add a property to it. If you named your new share 'hadoopdata', you'd add this property to the config file like so:

  <property>
     <name>dfs.data.dir</name>
     <value>/export/hadoopdata</value>
  </property>

After you've added the entry, the default config file will look something like this:

<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.data.dir</name>
     <value>/export/hadoopdata</value>
  </property>
</configuration>
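
After saving the change, restart the datanode service so it picks up the new data directory:

service hadoop-hdfs-datanode restart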

Summary

Those are the basics of getting Hadoop running with QuantaStor. We're looking at adding deeper integration features to monitor Hadoop this year and would appreciate any feedback on what would help you most. If you have ideas you'd like to share, or would like to be a beta customer for new Hadoop integration features, please write us at support@osnexus.com or write Steve directly at steve (at) osnexus.com.

Thanks and happy hadooping!