Installing and Configuring Hadoop 2.6.0 in Single-Node Pseudo-Distributed Mode

I actually installed Hadoop about a year ago, but back then I just followed a tutorial for the standalone version. This time around I ran into plenty of problems, most of which really came down to my own unfamiliarity with shell commands on Linux. My previous system was also the Chinese edition; figuring it would be good practice for my English, I gritted my teeth, reinstalled a pure English edition, and then set everything up again.
Although I still hit a few problems during this round of setup, I got a much better grasp of the details, and I came away with one lesson:

If something you do goes completely smoothly, it was probably dispensable in the first place.

Adding a hadoop user

Add a hadoop-group group

Use the following command to add a group named hadoop-group:

$ sudo addgroup hadoop-group

On success it prints:

Adding group `hadoop-group' (GID 1001) ...
Done...

Add a hadoop user and put it in hadoop-group

$ sudo adduser --ingroup hadoop-group hadoop

On success it prints:

Adding user `hadoop' ...
Adding new user `hadoop' (1001) with group `hadoop-group' ...
Creating home directory `/home/hadoop' ...

Copying files from `/etc/skel' ...
Enter new UNIX password:

When prompted, type a password for the hadoop user and press Enter, then type it again to confirm and press Enter.
On success it prints:

passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
Full Name []: hadoop
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] y

At each prompt, answer as shown above; finally type y and press Enter, and the user is created.

Switch to the hadoop user for the following steps

Method 1: switch users directly in the terminal

$ su hadoop

Enter the hadoop user's password when prompted and you will be switched to the hadoop user.

Method 2: log in as the hadoop user from the graphical interface

Click the power button in the top-right corner; its drop-down menu has an option to switch accounts directly.

Configuring SSH for the hadoop user

Generate a key pair on the command line

$ ssh-keygen -t rsa

Just press Enter at every prompt; on success you will see something like:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
26:81:b7:0e:22:92:49:0f:ee:8a:18:ea:f7:24:2f:85 hadoop@LX-ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
| |
| . |
| o . o |
|ooo . o |
|=...o o S |
|o. E + o |
|.. .... |
|=. o+ |
|*.. oo |
+-----------------+

Copy the public key to authorized_keys

In the .ssh/ directory you will see two files: id_rsa and id_rsa.pub.
Run the following command to copy the public key id_rsa.pub to authorized_keys:

$ cp id_rsa.pub authorized_keys
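
If ssh localhost still asks for a password later on, the usual culprit is file permissions; OpenSSH expects the .ssh directory and authorized_keys to be readable only by their owner, which you can enforce with (a standard precaution, not strictly part of the original steps):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys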

Test passwordless SSH login

$ ssh localhost

If openssh-server is not installed, the connection will fail with the following message:

ssh: connect to host localhost port 22: Connection refused

Run the following command to check whether an sshd process is running:

$ ps -e|grep ssh

If there is no sshd process, switch to an account with sudo privileges and install it:

$ sudo apt-get install openssh-server
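
On Ubuntu the ssh service normally starts by itself right after installation; if ps still shows no sshd afterwards, you can start it by hand (assuming the stock Ubuntu 14.04 service name):

$ sudo service ssh start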

Then test the SSH connection again; a successful login looks like this:

The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 64:25:7c:a1:59:92:80:e6:a5:30:8e:a0:83:23:ac:f4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com/

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

After connecting successfully, log out with:

$ exit

From the second SSH connection onwards, some of the warnings disappear:

Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com/

0 packages can be updated.
0 updates are security updates.

Last login: Tue Mar 17 21:17:18 2015 from localhost

Downloading and installing Hadoop 2.6.0

Download hadoop-2.6.0.tar.gz from the Apache Hadoop website (http://hadoop.apache.org); it is simply a compressed archive.

Extract hadoop-2.6.0.tar.gz

Switch to an account with sudo privileges (needed when extracting into a system directory). Here I extract hadoop-2.6.0.tar.gz into /usr/local/:

$ sudo tar -zxvf hadoop-2.6.0.tar.gz -C /usr/local/

Rename the hadoop-2.6.0 folder to hadoop (run this inside /usr/local):

$ sudo mv hadoop-2.6.0/ hadoop/

Change the owner of the hadoop directory to the hadoop user:

$ sudo chown -R hadoop hadoop/

Change the group of the hadoop directory to hadoop-group:

$ sudo chgrp -R hadoop-group hadoop/
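
The two commands above can also be collapsed into a single chown call that sets owner and group at once (same effect, just less typing):

$ sudo chown -R hadoop:hadoop-group hadoop/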

Configuring the Hadoop 2.6.0 files

hadoop-env.sh configuration

Open etc/hadoop/hadoop-env.sh under the hadoop directory with vi/vim, change the value after "export JAVA_HOME=" to the location of your JDK, then save and exit.
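
For example, assuming the JDK is installed under /usr/lib/java (an example path, use the location of your own JDK), the line would read:

# in etc/hadoop/hadoop-env.sh; the path below is only an example
export JAVA_HOME=/usr/lib/java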

At this point the standalone Hadoop installation is actually complete; you can check that it works by running the standalone wordcount example described at the end of this post.

core-site.xml configuration

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml configuration

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

mapred-site.xml configuration

At this step you will notice there is no mapred-site.xml; instead there is a file named mapred-site.xml.template, so an extra copy step is needed:

$ cp mapred-site.xml.template mapred-site.xml

After copying, edit mapred-site.xml as follows (mapred.job.tracker is a legacy Hadoop 1.x property; with mapreduce.framework.name set to yarn it is essentially ignored, but it does no harm here):

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml configuration

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Preparation and starting/stopping Hadoop

Format the HDFS filesystem

$ bin/hadoop namenode -format

If the format succeeds, the log ends with "Exiting with status 0"; if you see "Exiting with status 1" instead, the format failed. (In Hadoop 2.x, bin/hdfs namenode -format is the non-deprecated form of this command.)

Start the Hadoop daemons

$ sbin/start-all.sh
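
Note that start-all.sh is marked as deprecated in Hadoop 2.x; it still works, but the recommended equivalent is to start HDFS and YARN separately:

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh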

You can open the following two addresses in a browser to verify that everything started:
1. http://localhost:50070 (the NameNode web UI)
2. http://localhost:8088 (the YARN ResourceManager web UI)
You can also run jps to see which daemons are running:

$ jps

If running jps instead gives the following message:

The program 'jps' can be found in the following packages:
* openjdk-7-jdk
* openjdk-6-jdk
Ask your administrator to install one of them

This means the Java environment is not configured for this user; one way to work around it is:

$ alias jps='/usr/lib/java/bin/jps'
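
The alias only lasts for the current shell session. A more permanent fix (a sketch, assuming the JDK lives under /usr/lib/java as in the alias above; adjust the path to your installation) is to export JAVA_HOME and put its bin directory on the PATH in the hadoop user's ~/.bashrc:

# append to /home/hadoop/.bashrc; the JDK path is an assumption
export JAVA_HOME=/usr/lib/java
export PATH=$PATH:$JAVA_HOME/bin

Run source ~/.bashrc (or log in again) for it to take effect.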

Then run jps again and you should see the following components running:

20672 NameNode
24081 Jps
21058 SecondaryNameNode
21254 ResourceManager
21405 NodeManager
20845 DataNode

Stopping Hadoop

$ sbin/stop-all.sh
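
As with start-all.sh, the non-deprecated way to stop everything is to stop the two layers separately:

$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh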

Running examples in standalone and pseudo-distributed mode

Running wordcount in standalone mode

Create an input folder under the hadoop directory and put the text you want to count into it.
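For example, a minimal input can be created like this (the file name and contents are arbitrary; judging from the log further down, my run used a small text file called hello):

$ mkdir input
$ echo "hello world hello hadoop" > input/hello
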
Then run:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output

A successful run prints output like the following:

15/03/18 00:32:43 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/03/18 00:32:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/03/18 00:32:43 INFO input.FileInputFormat: Total input paths to process : 1
15/03/18 00:32:43 INFO mapreduce.JobSubmitter: number of splits:1
15/03/18 00:32:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local571613329_0001
15/03/18 00:32:43 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/03/18 00:32:43 INFO mapreduce.Job: Running job: job_local571613329_0001
15/03/18 00:32:43 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/03/18 00:32:43 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/03/18 00:32:43 INFO mapred.LocalJobRunner: Waiting for map tasks
15/03/18 00:32:43 INFO mapred.LocalJobRunner: Starting task: attempt_local571613329_0001_m_000000_0
15/03/18 00:32:43 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/03/18 00:32:43 INFO mapred.MapTask: Processing split: file:/usr/local/hadoop/input/hello:0+48
15/03/18 00:32:43 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/03/18 00:32:43 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/03/18 00:32:43 INFO mapred.MapTask: soft limit at 83886080
15/03/18 00:32:43 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/03/18 00:32:43 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/03/18 00:32:43 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/03/18 00:32:43 INFO mapred.LocalJobRunner:
15/03/18 00:32:43 INFO mapred.MapTask: Starting flush of map output
15/03/18 00:32:43 INFO mapred.MapTask: Spilling map output
15/03/18 00:32:43 INFO mapred.MapTask: bufstart = 0; bufend = 92; bufvoid = 104857600
15/03/18 00:32:43 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214356(104857424); length = 41/6553600
15/03/18 00:32:43 INFO mapred.MapTask: Finished spill 0
15/03/18 00:32:43 INFO mapred.Task: Task:attempt_local571613329_0001_m_000000_0 is done. And is in the process of committing
15/03/18 00:32:43 INFO mapred.LocalJobRunner: map
15/03/18 00:32:43 INFO mapred.Task: Task 'attempt_local571613329_0001_m_000000_0' done.
15/03/18 00:32:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local571613329_0001_m_000000_0
15/03/18 00:32:43 INFO mapred.LocalJobRunner: map task executor complete.
15/03/18 00:32:43 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/03/18 00:32:43 INFO mapred.LocalJobRunner: Starting task: attempt_local571613329_0001_r_000000_0
15/03/18 00:32:43 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/03/18 00:32:43 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@6b1d6655
15/03/18 00:32:43 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/03/18 00:32:43 INFO reduce.EventFetcher: attempt_local571613329_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/03/18 00:32:44 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local571613329_0001_m_000000_0 decomp: 83 len: 87 to MEMORY
15/03/18 00:32:44 INFO reduce.InMemoryMapOutput: Read 83 bytes from map-output for attempt_local571613329_0001_m_000000_0
15/03/18 00:32:44 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 83, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->83
15/03/18 00:32:44 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/03/18 00:32:44 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/03/18 00:32:44 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
15/03/18 00:32:44 INFO mapred.Merger: Merging 1 sorted segments
15/03/18 00:32:44 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
15/03/18 00:32:44 INFO reduce.MergeManagerImpl: Merged 1 segments, 83 bytes to disk to satisfy reduce memory limit
15/03/18 00:32:44 INFO reduce.MergeManagerImpl: Merging 1 files, 87 bytes from disk
15/03/18 00:32:44 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/03/18 00:32:44 INFO mapred.Merger: Merging 1 sorted segments
15/03/18 00:32:44 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
15/03/18 00:32:44 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/03/18 00:32:44 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/03/18 00:32:44 INFO mapred.Task: Task:attempt_local571613329_0001_r_000000_0 is done. And is in the process of committing
15/03/18 00:32:44 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/03/18 00:32:44 INFO mapred.Task: Task attempt_local571613329_0001_r_000000_0 is allowed to commit now
15/03/18 00:32:44 INFO output.FileOutputCommitter: Saved output of task 'attempt_local571613329_0001_r_000000_0' to file:/usr/local/hadoop/output/_temporary/0/task_local571613329_0001_r_000000
15/03/18 00:32:44 INFO mapred.LocalJobRunner: reduce > reduce
15/03/18 00:32:44 INFO mapred.Task: Task 'attempt_local571613329_0001_r_000000_0' done.
15/03/18 00:32:44 INFO mapred.LocalJobRunner: Finishing task: attempt_local571613329_0001_r_000000_0
15/03/18 00:32:44 INFO mapred.LocalJobRunner: reduce task executor complete.
15/03/18 00:32:44 INFO mapreduce.Job: Job job_local571613329_0001 running in uber mode : false
15/03/18 00:32:44 INFO mapreduce.Job: map 100% reduce 100%
15/03/18 00:32:44 INFO mapreduce.Job: Job job_local571613329_0001 completed successfully
15/03/18 00:32:44 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=541252
FILE: Number of bytes written=1047700
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=11
Map output bytes=92
Map output materialized bytes=87
Input split bytes=99
Combine input records=11
Combine output records=8
Reduce input groups=8
Reduce shuffle bytes=87
Reduce input records=8
Reduce output records=8
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=525336576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=48
File Output Format Counters
Bytes Written=61

Then view the word counts written to the output folder:

$ cat output/*

Running wordcount in pseudo-distributed mode

Put the folder you want to word-count onto the HDFS distributed filesystem:

$ bin/hdfs dfs -put input /input
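
You can optionally confirm that the upload worked before running the job:

$ bin/hdfs dfs -ls /input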

Run the wordcount job:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input /output

View the wordcount results:

$ bin/hdfs dfs -cat /output/*
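
One thing to keep in mind: MapReduce refuses to write into an output path that already exists, so if you want to run the job again, delete /output on HDFS first:

$ bin/hdfs dfs -rm -r /output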