MapReduce Word Count Example

Date: 2024-06-20 03:57:49

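I. The design ideas behind Hadoop MapReduce show up in three areas:

1. How to cope with big data processing: divide and conquer

2. Building an abstract model: Map and Reduce

Map: applies the same processing repeatedly to each element of a data set;

Reduce: further aggregates and organizes the intermediate results produced by Map.

The data MapReduce processes takes the form of <key,value> pairs.

3. A unified framework that hides system-level details

The biggest strength of MapReduce is that, through its abstract model and computing framework, it separates what needs to be done (what need to do) from how it is actually done (how to do), giving programmers an abstract, high-level programming interface and framework.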
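The divide-and-conquer idea and the Map/Reduce split described above can be sketched in plain Java, without Hadoop. This is only an illustration under simple assumptions: the class `DivideAndConquerSketch`, its method names, and the fixed chunk size are hypothetical and not part of any Hadoop API. Each chunk is counted independently (the "map" side), and the partial results are then merged (the "reduce" side).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
final class DivideAndConquerSketch {

    // "Map" side: count the words of one chunk, independently of other chunks.
    static Map<String, Integer> countChunk(List<String> chunk) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : chunk) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // "Reduce" side: merge the partial counts of all chunks into one result.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Split the data into fixed-size chunks, count each chunk, then merge.
    static Map<String, Integer> wordCount(List<String> words, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int start = 0; start < words.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, words.size());
            partials.add(countChunk(words.subList(start, end)));
        }
        return mergeCounts(partials);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "spark", "hadoop", "hive", "hadoop");
        System.out.println(wordCount(words, 2)); // {hadoop=3, hive=1, spark=1}
    }
}
```

Because each chunk is counted without looking at any other chunk, the chunks could be processed on different machines; only the merge step needs to see all partial results.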

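II. MapReduce framework structure:

A complete MapReduce program runs as three kinds of instance processes:

1. MRAppMaster: responsible for scheduling the whole program and coordinating its state.

2. MapTask: responsible for all data processing in the map phase.

3. ReduceTask: responsible for all data processing in the reduce phase.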

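III. MapReduce coding conventions:

(1) A user program consists of three parts: Mapper, Reducer, and Driver (the client that submits and runs the MR program)

(2) The Mapper's input data is in the form of KV pairs (the KV types can be customized)

(3) The Mapper's output data is in the form of KV pairs (the KV types can be customized)

(4) The Mapper's business logic goes in the map() method

(5) The map() method (in the MapTask process) is called once for each <K,V> pair

(6) The Reducer's input data types correspond to the Mapper's output data types, and are also KV pairs

(7) The Reducer's business logic goes in the reduce() method

(8) The ReduceTask process calls reduce() once for each group of <k,v> pairs that share the same k

(9) User-defined Mapper and Reducer classes must extend their respective parent classes

(10) The whole program needs a Driver to submit it; what gets submitted is a job object that describes all the necessary information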
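Rules (5) and (8) above can be made concrete with a small plain-Java model. This is a sketch, not how Hadoop is implemented: the class `ContractSketch`, its counters, and the in-memory `TreeMap` standing in for the shuffle are all hypothetical. It only shows that map() runs once per input record and reduce() runs once per distinct key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical mini-model of the Mapper/Reducer contract; none of these names are Hadoop APIs.
final class ContractSketch {
    int mapCalls = 0;
    int reduceCalls = 0;

    // Stand-in for the shuffle: map output values grouped by key, keys kept sorted.
    private final Map<String, List<Integer>> grouped = new TreeMap<>();
    private final Map<String, Integer> output = new TreeMap<>();

    // Called once per input record (offset, line); emits <word, 1> for each word.
    void map(long offset, String line) {
        mapCalls++;
        for (String word : line.split(" ")) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
    }

    // Called once per key, with every value emitted for that key.
    void reduce(String key, Iterable<Integer> values) {
        reduceCalls++;
        int count = 0;
        for (int v : values) {
            count += v;
        }
        output.put(key, count);
    }

    // Feed every line to map(), then every key group to reduce().
    Map<String, Integer> run(List<String> lines) {
        long offset = 0;
        for (String line : lines) {
            map(offset, line);
            offset += line.length() + 1; // assumes single-byte characters and '\n' line endings
        }
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            reduce(group.getKey(), group.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        ContractSketch job = new ContractSketch();
        System.out.println(job.run(Arrays.asList("hello hadoop", "hello world")));
        // {hadoop=1, hello=2, world=1}
        System.out.println(job.mapCalls + " map calls, " + job.reduceCalls + " reduce calls");
        // 2 map calls, 3 reduce calls
    }
}
```

Two input lines give two map() calls, while three distinct words give three reduce() calls, regardless of how often each word occurs.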

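A small MapReduce WordCount program:

1. Create a Maven project and add the following to pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.4</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>cn.itcast.mapreduce.WordCountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>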

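2. Create a WrodCountMapper class that extends Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

KEYIN: the key of the map input KV

With the default input component TextInputFormat (which reads line by line):

Key: the starting byte offset of the line within the file (long -> LongWritable)

Value: the content of that line

VALUEIN: the value of the map input KV

With the default input component TextInputFormat (which reads line by line),

this is the line content (String -> Text)

KEYOUT: the key of the map output KV

In our case the word is the output key (String -> Text)

VALUEOUT: the value of the map output KV

In our case the count 1 for each word is the output value (int -> IntWritable)

In short: KEYIN is the offset of the line being read (the map input key); VALUEIN is the content of that line (the map input value); KEYOUT is the map output key (the word, Text); VALUEOUT is the map output value (a number, IntWritable).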
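To make the (offset, line) input pairs concrete, here is a plain-Java sketch that computes them for a small text. It is not TextInputFormat itself: the class `LineOffsetSketch` and its method are hypothetical, and the sketch assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the (KEYIN, VALUEIN) pairs a line-by-line reader produces.
final class LineOffsetSketch {

    // Returns "byte offset of the line start -> line content", in file order.
    // Assumes UTF-8 content and '\n' line endings.
    static Map<Long, String> records(String fileContent) {
        Map<Long, String> result = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            result.put(offset, line);
            // The next line starts after this line's bytes plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(records("hello hadoop\nhello world\n"));
        // {0=hello hadoop, 13=hello world}
    }
}
```

The first line starts at offset 0; "hello hadoop" is 12 bytes plus one newline byte, so the second line starts at offset 13. In the word count job this key is ignored and only the line content is used.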

Core commands:

hadoop fs -mkdir -p /wordcount/input

hadoop fs -put -p /root/wenben/a.txt /wordcount/input

hadoop fs -ls /

hadoop fs -cat /wordcount/input/a.txt

hadoop fs -rm -r /wordcount/output

hadoop jar <jar file>    // run the jar with hadoop

HDFS web UI: note1:50070

YARN web UI: note1:8088

MapReduce word count example:

Create a Maven project

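1.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * key: the line offset (not used here)
 * value: one line of the text file, read as the map input
 * context: the output object provided by MapReduce
 */
public class WrodCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the line to a String and split it into words on single spaces
        String line = value.toString();
        String[] words = line.split(" ");
        // Emit <word, 1> for every word
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}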
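One detail of the mapper worth checking is how `line.split(" ")` behaves when the input has consecutive spaces: it yields empty tokens, which would then be counted as a "word". The sketch below compares that with splitting on runs of whitespace. The class `TokenizeSketch` and its method names are hypothetical, and the whitespace-based variant is an alternative for illustration, not what the mapper uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing two ways of tokenizing a line.
final class TokenizeSketch {

    // Same splitting as the mapper: one token boundary per single space.
    static String[] splitOnSingleSpace(String line) {
        return line.split(" ");
    }

    // Alternative: split on runs of whitespace and drop empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  hadoop"; // two spaces between the words
        System.out.println(splitOnSingleSpace(line).length); // 3 (the middle token is empty)
        System.out.println(splitOnWhitespace(line));         // [hello, hadoop]
    }
}
```

If the input file is known to use exactly one space between words, both approaches give the same tokens.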

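2.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * key: a single word
 * values: the values grouped for that word after the map phase, e.g. hadoop -> [1,1,1]
 * context: the output object provided by MapReduce
 */
public class wordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        // Sum all the 1s emitted for this word
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}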

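3.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf);

        // Main class of this MR program
        job.setJarByClass(WordCountDriver.class);
        // Mapper and reducer of this MR program
        job.setMapperClass(WrodCountMapper.class);
        job.setReducerClass(wordCountReduce.class);
        // Output types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Output types of the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Location of the data to process; this directory must exist on HDFS
        FileInputFormat.setInputPaths(job, new Path("hdfs://note1:9000/wordcount/input"));
        // Output location; note that this directory must not exist yet
        FileOutputFormat.setOutputPath(job, new Path("hdfs://note1:9000/wordcount/output"));
        // Submit the job to the YARN cluster. job.submit() shows no logs, so it is
        // rarely used; waitForCompletion(true) monitors the job and prints its progress.
        job.waitForCompletion(true);
    }
}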

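Problems encountered:

Problem 1: the job does not show up on the YARN web UI at note1:8088.

Fix: edit the cluster configuration file yarn-site.xml and add the following lines (Hadoop's own setup guide places this property in mapred-site.xml):

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>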

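Problem 2: the result is written successfully, but errors such as inter… thread wait join are reported.

Fix: add the following line to the driver class. It specifies where the MapReduce job is submitted, and is the same <property> that was configured in yarn-site.xml:

conf.set("mapreduce.framework.name","yarn");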

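This also means the job runs on the YARN cluster.

However, if the code has a problem, we have to repackage it and repeat the whole series of steps, which is tedious.

Instead, we can run it locally first to check for bugs.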