Cloudera™ Hadoop Integration Guide
[[Category:integration_guide]]

This integration guide is focused on configuring your QuantaStor storage grid as a [http://www.cloudera.com/content/cloudera/en/home.html Cloudera® Hadoop™] data cluster for better Hadoop performance with less hardware. Note that you can still use your QuantaStor system normally as a SAN/NAS system with the Hadoop services installed.
  
Setting up [http://www.cloudera.com/content/cloudera/en/home.html Cloudera® Hadoop™] within your [http://www.osnexus.com QuantaStor®] storage system is very similar to the steps you would take to set up CDH4 on a standard Ubuntu™ 12.04 (Precise) server. This is because [http://www.osnexus.com/trynow-iso QuantaStor v3] is built on top of Ubuntu Server.
  
However, there are some important differences, and this guide covers them. Also note that if you're following instructions from the Cloudera web site for steps outside of this How-To, be sure to refer to the sections regarding Ubuntu Server v12.04/Precise.

The Hadoop installation procedure begins by invoking an installation script called ''hadoop-install'', which is included with the current version of [http://www.osnexus.com QuantaStor®]. From there the CDH installation proceeds in stages, illustrated below in a series of screen shots. Note that the illustrated example consists of a cluster of four QuantaStor® nodes, one of which is designated the "Hadoop Manager" node.
  
 
Note also that on the Hadoop Manager node, the Hadoop management functionality will impose additional resource costs on the system. For example, when testing the installation using [http://www.osnexus.com QuantaStor®] nodes based on Virtual Machines, the Manager node required at least 10 GB of memory.
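If you want to confirm a node's memory before designating it as the Manager, a quick check from the shell works (a minimal check, not part of the original guide; 'free' is standard on Ubuntu-based systems):

<pre>
# Report memory totals in gigabytes; the Manager node should show >= 10 GB
free -g
</pre>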
  
To get started you'll need to log in to your QuantaStor v3 storage system(s) using SSH or via the console. Note that all of the commands shown in this How-To guide should be run as ''root'', so be sure to run 'sudo -i' to get super-user privileges before you begin.
  
Last, if you don't yet have a QuantaStor v3 storage system set up, you can get the [http://www.osnexus.com/trynow-iso CD-ROM ISO and a license key here].
<br>
<br>
  
=== Phase 1. Manager Installation Script / Text Mode Configuration Steps ===
  
 
From the superuser command-line on the manager node (hostname "osn-grid2-mgr1" in this example), simply invoke the "hadoop-install" script. This executable is located in the /bin directory, but it is in the path so it can be run from anywhere.
  
 
<pre>
$ hadoop-install
</pre>
  
This script will take you through a series of screens; simply accept the EULA dialogs and allow it to proceed. This stage can take 10-15 minutes, and will install the web server for the Hadoop™ Management interface on the Manager node. The script will end with instructions for browsing into that interface to begin the next stage of the installation.
  
 
See the following screen shot examples.
<br>
[[File:Hadoop_install_textmode_053.jpg | Figure 1-01]]
<br>
Figure 1-01

<br>
[[File:Hadoop_install_textmode_054.jpg | Figure 1-02]]
<br>
Figure 1-02
  
<br>
[[File:Hadoop_install_textmode_055.jpg | Figure 1-03]]
<br>
Figure 1-03
  
<br>
[[File:Hadoop_install_textmode_056.jpg | Figure 1-04]]
<br>
Figure 1-04
  
<br>
[[File:Hadoop_install_textmode_057.jpg | Figure 1-05]]
<br>
Figure 1-05
  
<br>
[[File:Hadoop_install_textmode_058.jpg | Figure 1-06]]
<br>
Figure 1-06
  
<br>
[[File:Hadoop_install_textmode_060.jpg | Figure 1-07]]
<br>
Figure 1-07
  
<br>
==== Step 1.1 Package Dependencies / Troubleshooting ====
  
Note that this step includes the installation of a Java package, using the apt-get utility. Successful installation requires that the target node meet the package dependency requirements. If this stage fails and the log shows package dependency errors:

- Return to the command line
- Run 'apt-get -f install'
- Respond with 'Y' (must be uppercase) and allow this to complete
- Accept the PAM configuration modification screen, if offered
- Retry the hadoop-install script
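Run together, the recovery sequence looks something like the following (a minimal sketch of the steps listed above; the actual prompts will vary with the package state of the node):

<pre>
# Repair broken package dependencies, then re-run the installer
apt-get -f install
hadoop-install
</pre>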
  
See the following two screens as examples:

<br>
[[File:Hadoop_install_textmode_059.jpg | Figure 1-08]]
<br>
Figure 1-08
  
<br>
[[File:Hadoop_install_textmode_070.jpg| Figure 1-09]]
<br>
Figure 1-09
  
=== Phase 2. Initial Web-Based Installation ===

Follow the instructions seen on the last screen of the previous stage, for example:
<br>
  
<pre>
Point your web browser to http://<hostname or IP>:7180/. Log in to the Cloudera Manager with
the username and password set to 'admin' to continue installation
</pre>
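If the browser can't reach that address, you can first confirm that the Cloudera Manager web service is listening (an optional check, not part of the Cloudera instructions; assumes curl is available where you run it):

<pre>
# Should return an HTTP response header once the Manager web service is up
curl -I http://<hostname or IP>:7180/
</pre>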
  
Log into the interface using the 'admin/admin' user/password. You will see the initial screen asking you which edition you wish to install.

<br>
[[File:Hadoop_install_webintf_061.jpg| Figure 2-01]]
<br>
Figure 2-01
  
<br>
[[File:Hadoop_install_webintf_062.jpg| Figure 2-02]]
<br>
Figure 2-02
<br>

For the purposes of this example, the option for a minimal "Standard" edition is shown throughout. On this screen, as on all subsequent screens, hit 'Continue' to move on to the next step.
The next screen allows you to enter the host addresses of the nodes onto which Hadoop is to be installed. In this example, four IP addresses are shown, corresponding to the nodes 'osn-grid2-mgr1', 'osn-grid2a', 'osn-grid2b', and 'osn-grid2c'.
  
By invoking the 'Search' button, the installation will test those nodes/addresses and return status, which should indicate that the nodes are ready and available for installation. When complete, hit 'Continue'.
  
<br>
[[File:Hadoop_install_webintf_063.jpg| Figure 2-03]]
<br>
Figure 2-03
  
<br>
[[File:Hadoop_install_webintf_064.jpg| Figure 2-04]]
<br>
Figure 2-04
  
<br>
[[File:Hadoop_install_webintf_065.jpg| Figure 2-05]]
<br>
Figure 2-05
  
<br>
The next screen, "Cluster Installation Screen 1", offers some installation options. Again, in this example we are installing the bare minimum, selecting the base CDH package only. The screen after that, "Cluster Installation Screen 2", requests login options. In this case, we are using the 'root' user, where all nodes have the same root password.
  
<br>
[[File:Hadoop_install_webintf_066.jpg| Figure 2-06]]
<br>
Figure 2-06
  
<br>
[[File:Hadoop_install_webintf_067.jpg| Figure 2-07]]
<br>
Figure 2-07
  
<br>
The next screen, "Cluster Installation Screen 3", shows the cluster installation progress.
<br>
NOTE: if the "Abort" popup dialog appears, do NOT hit "OK"; doing so will abort the installation. Simply dismiss the popup.
  
<br>
[[File:Hadoop_install_webintf_069.jpg| Figure 2-08]]
<br>
Figure 2-08
  
<br>
If all goes well, this should progress as shown in the following example figures, resulting in "Installation completed successfully".
  
<br>
[[File:Hadoop_install_webintf_071.jpg| Figure 2-09]]
<br>
Figure 2-09
  
<br>
[[File:Hadoop_install_webintf_074.jpg| Figure 2-10]]
<br>
Figure 2-10
  
<br>
[[File:Hadoop_install_webintf_076.jpg| Figure 2-11]]
<br>
Figure 2-11

<br>
Hit 'Continue' to move to the next phase.
  
==== Step 2.1. Potential error, package dependency ====
<br>
As in Step 1.1 above, this step includes using the apt-get utility to install a Java package on all of the other nodes in the cluster (besides the manager node), and the same package dependency errors can cause the cluster installation to fail on the non-manager nodes, as shown in the example figure below.
<br>
[[File:Hadoop_install_webintf_072.jpg| Figure 2-12]]
<br>
Figure 2-12

<br>
To correct this, log into each of the nodes that failed and perform the same steps as shown in Step 1.1 (one way to do this over SSH is sketched below). When this is complete, hit the 'Retry Failed Nodes' button on the screen.
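If several nodes failed, the fix can be applied from the manager node with a small SSH loop (a hypothetical sketch: it assumes root SSH access as used by the cluster installer, the node names are examples, and the '-y' flag auto-confirms the prompts described in Step 1.1):

<pre>
# Repair package dependencies on each failed node (node names are examples)
for NODE in osn-grid2a osn-grid2b osn-grid2c; do
    ssh root@${NODE} 'apt-get -y -f install'
done
</pre>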

==== Step 2.2 Potential error, heartbeat detection failure ====
<br>
If installation succeeds past the package installation, the last phase of this step consists of heartbeat tests, as shown in the example figure below.

<br>
[[File:Hadoop_install_webintf_074.jpg| Figure 2-13]]
<br>
Figure 2-13

<br>
Sometimes these all succeed on the first try; other times some or all of the nodes fail.

<br>
[[File:Hadoop_install_webintf_075.jpg| Figure 2-14]]
<br>
Figure 2-14

<br>
Our testing showed that hitting "Retry Failed Nodes" (sometimes 2-3 times) resulted in heartbeat detection success.

<br>
[[File:Hadoop_install_webintf_076.jpg| Figure 2-15]]
<br>
Figure 2-15

<br>
The next screen, "Cluster Installation Screen 4", shows the progress of the "Installing Selected Parcels" stage. This is quite a lengthy phase, typically taking half an hour to an hour.
<br>
[[File:Hadoop_install_webintf_078.jpg| Figure 2-16]]
<br>
Figure 2-16

<br>
[[File:Hadoop_install_webintf_079.jpg| Figure 2-17]]
<br>
Figure 2-17

<br>
Hitting 'Continue' at the completion of that phase takes you to "Cluster Installation Screen 5", the "Inspect hosts for correctness" phase. This should result in a comprehensive report, as shown in the following example screens.

<br>
[[File:Hadoop_install_webintf_080.jpg| Figure 2-18]]
<br>
Figure 2-18

<br>
[[File:Hadoop_install_webintf_081.jpg| Figure 2-19]]
<br>
Figure 2-19

=== Phase 3. Web-Based Installation, CDH4 Services ===
<br>
This screen allows you to select the services you wish to install. As before, in this example we are showing the installation of only the basic core Hadoop service. Immediately following that screen is the screen for Database Setup; here we simply accept the defaults, hit 'Test Connection', and hit 'Continue' after this returns 'Success' as shown.

<br>
[[File:Hadoop_install_webintf_083.jpg| Figure 3-01]]
<br>
Figure 3-01

<br>
[[File:Hadoop_install_webintf_084.jpg| Figure 3-02]]
<br>
Figure 3-02

<br>
[[File:Hadoop_install_webintf_085.jpg| Figure 3-03]]
<br>
Figure 3-03

<br>
[[File:Hadoop_install_webintf_086.jpg| Figure 3-04]]
<br>
Figure 3-04
=== Phase 4. Setting Configuration Parameters ===
<br>

The following screen allows you to change certain configuration parameters. This is a large scrollable screen, shown below in two parts.

<br>
[[File:Hadoop_install_webintf_088.jpg| Figure 4-01a]]
<br>
Figure 4-01a

<br>
[[File:Hadoop_install_webintf_089.jpg| Figure 4-01b]]
<br>
Figure 4-01b
<br>
These configuration options bear special attention on QuantaStor installations; the intent here is to specifically set up Hadoop to use the storage on the QuantaStor Storage Pools. By default, the Hadoop installation creates top-level directories on the nodes' root filesystem, such as '/dfs' and '/mapred'. So part of the pre-installation setup of the QuantaStor systems includes, on each node in the cluster, the creation of a Storage Pool and a Network Share on that Pool to function as the filesystem root for the Hadoop data store within the Pool.

<br>
The following screens show examples of Storage Pool creation and Network Share creation on a QuantaStor node:

<br>
[[File:CDH-Install-Fig-3-05c.jpg| Figure 4-01c]]
<br>
Figure 4-01c

<br>
[[File:CDH-Install-Fig-3-05d.jpg| Figure 4-01d]]
<br>
Figure 4-01d
<br>
Having created Storage Pools, and Network Shares on those pools, the directories for these data destinations would be as in the following examples:
<br>
<br>
/export/osn-grid2-mgr1-pool1-share1
<br>
/export/osn-grid2a-pool1-share1
<br>
/export/osn-grid2b-pool1-share1
<br>
/export/osn-grid2c-pool1-share1
<br>
The Hadoop installation on the QuantaStor nodes must be altered to reflect these changes.

There are two ways this could be accomplished:

1. Change the Hadoop configuration directly. For example, in the "Review configuration changes" screen shown above, the "DataNode (Default)" parameter could be changed from its default value of '/dfs/dn' to '/export/osn-grid2a-pool1-share1/dfs/dn', and similarly for the other paths related to 'mapred'.

2. By contrast, the approach shown in this example is to use symbolic links into the QuantaStor Pools, thereby allowing the Hadoop configuration default values to remain.

The steps to accomplish option #2, run on each of the cluster nodes, are:
<br>
• $ mkdir /export/<sharename>/dfs
<br>
• $ mkdir /export/<sharename>/mapred
<br>
• $ ln -s /export/<sharename>/dfs /dfs
<br>
• $ ln -s /export/<sharename>/mapred /mapred
<br>

For example:
 
<pre>
$ mkdir /export/osn-grid2a-pool1-share1/dfs
$ ln -s /export/osn-grid2a-pool1-share1/dfs /dfs
$ mkdir /export/osn-grid2a-pool1-share1/mapred
$ ln -s /export/osn-grid2a-pool1-share1/mapred /mapred
</pre>
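Since the share name differs on each node, the same steps can be parameterized to reduce typos (a minimal sketch of the commands above; SHARE must be set to the node's own share name, and the 'mkdir -p' flag and final 'ls -ld' verification are small additions not shown in the original steps):

<pre>
# Run as root on each node, substituting that node's share name
SHARE=osn-grid2a-pool1-share1
mkdir -p /export/${SHARE}/dfs /export/${SHARE}/mapred
ln -s /export/${SHARE}/dfs /dfs
ln -s /export/${SHARE}/mapred /mapred
ls -ld /dfs /mapred   # both links should point into the Storage Pool share
</pre>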
<br>
After accepting the "Review configuration changes" screen and hitting 'Continue', the installation proceeds to the "Starting your cluster services" screen (you may be prompted to re-enter your 'admin/admin' login credentials), where it progresses through numerous steps, as shown in the example figures below. After all services are started, hitting 'Continue' should result in the "Congratulations Success" screen.

=== Phase 5. Starting Cluster Services ===
<br>
[[File:Hadoop_install_webintf_093.jpg| Figure 4-02]]
<br>
Figure 4-02

<br>
[[File:Hadoop_install_webintf_094.jpg| Figure 4-03]]
<br>
Figure 4-03

<br>
[[File:Hadoop_install_webintf_095.jpg| Figure 4-04]]
<br>
Figure 4-04

<br>
[[File:Hadoop_install_webintf_096.jpg| Figure 4-05]]
<br>
Figure 4-05

<br>
Hitting 'Continue' from here should then take you to the Manager Home screen, as shown in the example below.

<br>
[[File:Hadoop_install_webintf_097.jpg| Figure 4-06]]
<br>
Figure 4-06
  
 
=== Summary ===

That's the basics of getting Hadoop running with QuantaStor. Now you're ready to deploy [https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation#CDH4Installation-Step4%3ADeployCDHandInstallComponents CDH and install components] if you haven't done so already.
  
We are looking forward to automating some of the above configuration steps and adding deeper integration features to monitor CDH in future releases. If you have specific needs or ideas, or would like to share your feedback, please write us at engineering@osnexus.com.
  
 
Thanks and Happy Hadooping!
  
 
''Cloudera is a registered trademark of Cloudera Corporation, Hadoop is a registered trademark of the Apache Software Foundation, Ubuntu is a registered trademark of Canonical, and QuantaStor is a registered trademark of OS NEXUS Corporation.''

[[Category:QuantaStor_Guide]]
