Flume Deployment Guide


Operating user: hadoop
Working directory: /home/hadoop/apps
Host: hadoop1

1. Download the installation package

wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

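2. Extract and rename

# Extract the archive
tar -zxvf apache-flume-1.9.0-bin.tar.gz
# Rename the directory
mv apache-flume-1.9.0-bin apache-flume-1.9.0

# Rename flume-env.sh.template under conf
mv apache-flume-1.9.0/conf/flume-env.sh.template apache-flume-1.9.0/conf/flume-env.sh

# Edit flume-env.sh and set JAVA_HOME
export JAVA_HOME=/opt/jdk1.8.0_212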
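
Before starting an agent, it can help to confirm that the directory assigned to JAVA_HOME really contains a Java runtime. The following is a minimal sketch, not part of the Flume distribution; the helper name is_java_home is hypothetical, and /opt/jdk1.8.0_212 is simply the path used in this guide.

```shell
#!/bin/sh
# Return 0 when the given directory looks like a JDK/JRE home,
# i.e. it contains an executable bin/java. (Hypothetical helper.)
is_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Fall back to the path used in this guide when JAVA_HOME is not set.
candidate="${JAVA_HOME:-/opt/jdk1.8.0_212}"
if is_java_home "$candidate"; then
    echo "JAVA_HOME looks valid: $candidate"
else
    echo "No executable bin/java under: $candidate"
fi
```

If the second message appears, adjust the export line in flume-env.sh to match the JDK location on your host.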

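3. Edit the configuration files
Case 1: Listening on a port
Description:
1. Use the netcat tool to send data to port 6666 on the local machine.
2. Flume listens on local port 6666 and reads the data through its source.
3. Flume writes the received data to the console through its sink.
Install netcat on the host: yum install -y nc
netcat usage: run nc -lk 6666 as the server and nc host 6666 as the client; the two sides can then exchange messages.

Steps:
Configuration in conf/flume-netcat-conf.properties: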

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

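The configuration file has five parts, separated by blank lines:
Name the agent's components. a1 is the agent name, r1 the source name, c1 the channel name, and k1 the sink name. The property names are plural (sources, channels, sinks), which indicates that an agent can have several of each.
Configure the source. Source r1 is of type netcat; it listens on host localhost, port 6666.
Configure the sink. Sink k1 is of type logger, which prints events to the console.
Configure the channel. Channel c1 is an in-memory channel with a capacity of 1000 events (the Event is Flume's unit of transfer) and a transaction capacity of 100 events (the number of events moved per transaction).
Bind the three components. Because there can be several sources, channels, and sinks, each source and sink must be bound to a channel explicitly. Note that the source side uses the plural channels while the sink side uses the singular channel: one source can be bound to several channels, one channel can be bound to several sinks, but one sink can be bound to only one channel.
In the startup command below, INFO in -Dflume.root.logger=INFO,console means that messages at level INFO and above are logged.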
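
The binding rules above can be checked mechanically before an agent is started. The sketch below is an illustration only: get_prop and check_bindings are hypothetical helpers written for this guide, not Flume tools, and they assume simple "key = value" lines like those in the configuration above.

```shell
#!/bin/sh
# Print the value of an exact key from a properties file ("key = value" lines).
get_prop() {
    # $1 = file, $2 = key
    awk -F'=' -v key="$2" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
        k == key {
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            print v
        }' "$1"
}

# Succeed only if every channel referenced by source $3 and sink $4
# is declared in "<agent>.channels".
check_bindings() {
    file=$1; agent=$2; src=$3; sink=$4
    declared=" $(get_prop "$file" "$agent.channels") "
    for ch in $(get_prop "$file" "$agent.sources.$src.channels") \
              $(get_prop "$file" "$agent.sinks.$sink.channel"); do
        case "$declared" in
            *" $ch "*) ;;
            *) echo "undeclared channel: $ch"; return 1 ;;
        esac
    done
    echo "bindings ok"
}

# Demo with the bindings used in this case.
conf=$(mktemp)
cat > "$conf" <<'EOF'
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
check_bindings "$conf" a1 r1 k1
rm -f "$conf"
```

The check only compares names; it does not validate component types or any other property.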

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-netcat-conf.properties --name a1 -Dflume.root.logger=INFO,console

Open a new terminal, run nc localhost 6666, and type a string.

The string then appears as an event in the Flume agent's console output.

Case 2: Monitoring a local file and uploading to HDFS
Description:
1. Flume monitors changes to a local file.
2. Flume writes the data to HDFS.
Configuration file: flume-localfile2hdfs-conf.properties

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/testdata/data.log

# Describe the sink
a1.sinks.k1.type = hdfs
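# Path under which files are created
a1.sinks.k1.hdfs.path = hdfs://ns1/flume/%Y%m%d/%H
# Prefix of the generated files
a1.sinks.k1.hdfs.filePrefix = logs
# Whether to roll directories by time
# Configure the following three parameters together
a1.sinks.k1.hdfs.round = true
# How often to create a new directory
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for the value above
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp (required)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 10
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
# This value is for experimentation only; use a larger one in real deployments
# Configure the following three parameters together
# Roll every 30 s
a1.sinks.k1.hdfs.rollInterval = 30
# Roll size of each file, in bytes (slightly below the 128 MB block size)
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of events (0 disables this trigger)
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.writeFormat = Text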

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
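
Two values in the sink configuration above are easier to judge with a little arithmetic: how far hdfs.rollSize sits below a 128 MB block, and which directory the %Y%m%d/%H pattern in hdfs.path produces for a given time. The sketch below is only an illustration. The function name hdfs_dir_for is hypothetical, it relies on GNU date (the -d @SECONDS form), and it forces UTC purely to keep the output reproducible, whereas the agent uses its own local time when hdfs.useLocalTimeStamp is true.

```shell
#!/bin/sh
# 128 MiB expressed in bytes, the HDFS block size assumed in this guide.
block_size=$((128 * 1024 * 1024))
roll_size=134217700   # value of hdfs.rollSize in the configuration

echo "block size: $block_size bytes"
echo "headroom:   $((block_size - roll_size)) bytes"

# Expand the directory part of hdfs.path for a Unix timestamp.
# The %Y, %m, %d and %H escapes have the same meaning in date(1).
hdfs_dir_for() {
    TZ=UTC date -d "@$1" +"/flume/%Y%m%d/%H"
}

hdfs_dir_for 0      # prints /flume/19700101/00
```

With these settings the size trigger fires 28 bytes before a file would reach the block boundary, although in this experiment the 30-second rollInterval will almost always fire first.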

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-localfile2hdfs-conf.properties --name a1 -Dflume.root.logger=INFO,console

Append data to data.log

echo "aaa" >> data.log
echo "bbb" >> data.log
echo "ccc" >> data.log

Check the result in the HDFS web UI.

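Case 3: Monitoring a local directory for new files and uploading to HDFS
Description:
1. Flume monitors the specified directory; the monitored directory is scanned for file changes every 500 milliseconds.
2. New files are added to the directory.
3. Flume writes the data it reads to HDFS. Once a file has been uploaded, it is renamed locally with the default suffix .COMPLETED. Files on HDFS that are still being written carry the .tmp suffix.

Flume uses the suffix to decide whether a file is new. If the directory already contains a file that has not been uploaded but whose name ends in .COMPLETED, Flume will not upload it.
Likewise, if a file with the .COMPLETED suffix is modified, Flume will not upload it to HDFS again, because the suffix tells Flume that the file has already been uploaded.
This approach therefore cannot track data that keeps changing.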
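
The suffix rule described above can be imitated in a few lines of shell to see which files a scan would still consider. This is a simplified sketch of the idea, not Flume's implementation; the function name pending_files is hypothetical, and skipping .tmp files mirrors the ignore pattern configured later in this case.

```shell
#!/bin/sh
# List regular files in directory $1 whose names do not end in
# .COMPLETED (already handled) or .tmp (ignored by configuration).
pending_files() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.COMPLETED|*.tmp) ;;      # skipped by the suffix rule
            *) basename "$f" ;;
        esac
    done
}

demo=$(mktemp -d)
touch "$demo/one.txt" "$demo/two.txt.COMPLETED" "$demo/three.tmp"
pending_files "$demo"      # prints: one.txt
rm -rf "$demo"
```

Because the decision rests on the name alone, a file that already ends in .COMPLETED is never reported, whatever its contents.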

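Steps:
Create a new directory to monitor: mkdir directory. (The configuration below monitors /home/hadoop/testdata/; spoolDir must point to whichever directory you use.)
Create the Flume agent configuration file flume-spooldir2hdfs-conf.properties, with a source of type spooling directory (spooldir) and a sink of type hdfs.

Configuration: flume-spooldir2hdfs-conf.properties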

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
# Directory to monitor
a1.sources.r1.spoolDir = /home/hadoop/testdata/
# Ignore all files ending in .tmp; they are not uploaded
a1.sources.r1.ignorePattern = ([^ ]*\\.tmp)

# Describe the sink
a1.sinks.k1.type = hdfs
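# Path under which files are created
a1.sinks.k1.hdfs.path = hdfs://ns1/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = spool-
# Whether to roll directories by time
# Configure the following three parameters together
a1.sinks.k1.hdfs.round = true
# How often to create a new directory
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for the value above
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp (required)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 10
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
# This value is for experimentation only; use a larger one in real deployments
# Configure the following three parameters together
a1.sinks.k1.hdfs.rollInterval = 30
# Roll size of each file, in bytes (slightly below the 128 MB block size)
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of events (0 disables this trigger)
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.writeFormat = Text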

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-spooldir2hdfs-conf.properties --name a1 -Dflume.root.logger=INFO,console

Create files in the monitored directory:

date > one.txt
date > two.txt

Check the result in the HDFS web UI.

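Case 4: Monitoring appended files (resuming from the last position)
Description
Taildir Source can resume from where it stopped, helps ensure that no data is lost, and monitors files in real time.

Steps
1. Create a new directory: mkdir file.
2. Create two files in the file directory: touch one.txt, touch two.txt
3. Create the Flume agent configuration file flume-taildir2logger.properties, with a source of type taildir. Despite the file name, the sink configured below is of type hdfs.

Configuration: flume-taildir2logger.properties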

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = taildir
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/testdata/.*\\.txt
a1.sources.r1.positionFile = /home/hadoop/testdata/position.json

# Describe the sink
a1.sinks.k1.type = hdfs
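# Path under which files are created
a1.sinks.k1.hdfs.path = hdfs://ns1/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = taildir
# Whether to roll directories by time
# Configure the following three parameters together
a1.sinks.k1.hdfs.round = true
# How often to create a new directory
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for the value above
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp (required)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 10
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
# This value is for experimentation only; use a larger one in real deployments
# Configure the following three parameters together
a1.sinks.k1.hdfs.rollInterval = 30
# Roll size of each file, in bytes (slightly below the 128 MB block size)
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of events (0 disables this trigger)
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.writeFormat = Text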

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-taildir2logger.properties --name a1 -Dflume.root.logger=INFO,console

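Open a new terminal and append data to the files:

echo "aaa" >> 1.txt
echo "bbb" >> 2.txt

Stop Flume, then continue appending to the files:

echo "ccc" >> 1.txt
echo "ddd" >> 2.txt

Start Flume again and inspect position.json: the file records the position up to which each monitored file was last read, which is what makes resuming possible. (Unix/Linux systems identify files internally by inode rather than by file name.)

Check the result in the HDFS web UI.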
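
The remark about inodes can be verified directly: renaming a file or appending to it on the same filesystem leaves its inode number unchanged, which is why a position keyed by inode survives a rename. The following is a small illustrative sketch that does not involve Flume; the helper name inode_of is hypothetical.

```shell
#!/bin/sh
# Print the inode number of a file (ls -i prints "<inode> <name>").
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

demo=$(mktemp -d)
echo "aaa" >> "$demo/1.txt"
before=$(inode_of "$demo/1.txt")

mv "$demo/1.txt" "$demo/renamed.txt"   # same filesystem: only the name changes
echo "bbb" >> "$demo/renamed.txt"      # appending keeps the same inode as well
after=$(inode_of "$demo/renamed.txt")

echo "inode before: $before, after: $after"
rm -rf "$demo"
```

In position.json, each entry pairs an inode with a read position and a file path, so comparing the recorded inode with the output of ls -i shows which file an entry refers to.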

Case 5: Monitoring files and writing to Kafka
Description
Monitor changes to file contents and write the captured data to a Kafka cluster.

Steps
Start the Kafka cluster and create a new topic named flume-test.
# Start Kafka

bin/kafka-server-start.sh -daemon config/server.properties

# Create the topic

bin/kafka-topics.sh --create --zookeeper hadoop1:2181,hadoop2:2181,hadoop3:2181 --replication-factor 3 --partitions 3 --topic flume-test

Configuration: flume-sink2kafka-conf.properties

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
#a1.sources.r1.type = netcat
#a1.sources.r1.bind = localhost
#a1.sources.r1.port = 6666
a1.sources.r1.type = taildir
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/testdata/.*\\.txt
a1.sources.r1.positionFile = /home/hadoop/testdata/position.json

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = flume-test
a1.sinks.k1.kafka.bootstrap.servers = hadoop1:9092,hadoop2:9092,hadoop3:9092
a1.sinks.k1.kafka.flumeBatchSize = 10
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1