flume 1.9 源码阅读(四) Source配置大数据

上一篇文章最后说到了AgentConfiguration这个类，他是flume配置文件中最核心的东西。我们先放一放看一下flume配置文件到底都有哪些东西然后回头再来看这个类以及sources、channels、sinks等组件是如何实例化的。
从官网上扒下来的一些资料：
AvroSource

属性名称	默认值	描述
channels	-	sources要连接的channels
type	-	source的类型，这里是avro
bind	-	要监听的主机名称或者是ip地址
port	-	监听的的端口
threads	-	最大线程数
selector.type
selector.*
interceptiors	-	用空格分开的拦截器
interceptiors.*	-
compression-type	none	这个值应该为 none 或者 deflate，压缩类型必须匹配AvroSource的压缩类型
ssl	false	secure socket layer(安全套接层)，当ssl被启用时，还必须同时制定keystore和keystore-password，可以在component级别设置，也可以在全局进行设置。
keystore	-	java keystore 文件路径，如果这里面没有指定，将使用全局keystore,如果全局keystore也没有配置，将会报错。
keystore-password	-	java keystore的密码，如果这里面没有指定，将使用全局keystore-password,如果全局keystore-password也没有配置，将会报错。
keystore-type	JKS	java keystore 类型，可以取值为 jks 或者 pkcs12。如果这里没有指定将使用全局keystore-type，如果全局keystore-type没有配置，将使用默认值JKS
exclude-protocols	SSLv3	要排除的 SSL/TLS 协议列表(以空格分开)，SSLv3协议会一直被排除在外
include-protocols	-	要引入的 SSL/TLS协议列表(以空格分开), 当此配置为空的时候，将引用所有支持的协议
exclude-cipher-suites	-	要排除的加密方法(以空格分开)
include-cipher-suites	-	要引入的加密方法(以空格分开)，当此配置为空的时候，将引入所有支持的加密方式
ipFilter	false	为netty添加 ip 过滤
ipFilterRules	-	定义netty ipFilter 匹配规则

例如:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Exec Source

属性名称	默认值	描述
channels	-	sources要连接的channels
type	-	source的类型，这里是exec
command	-	要执行的命令
shell		执行命令的shell
restartThrottle	10000	尝试重新启动之前要等待的时间（以毫秒为单位）
restart	false	当执行命令的cmd挂掉是否重启
logStdErr	false	标准错误的输出是否要被记录到日志
batchSize	20	可以一次读到的最大行数并发送给channel
batchTimeout	3000	超时时间（以毫秒为单位，当缓存没有达到刷新大小时
selector.type	replicating	这个配置为 replicatiing 或者 multiplexing
selector.*		根据 selector.type值决定
interceptors	-	拦截器列表(以空格分开)
interceptors.*

例如:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

shell配置用来通过命令行调用命令，通常shell配置为 /bin/sh -c，/bin/ksh -c，cmd /c，powershell -Command等
例如：

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

Taildir Source 监视指定的文件，近实时查看文件尾部，当有新一行被追加的时候，这个source会一直读取这一行直到整个行写完，这个source非常可靠，即使在文件回滚的时候他也不会丢失任何数据，他周期性的将最后读到的位置信息已json格式存到文件中，当flume因为一些原因停止后，他还可以通过记录位置文件重新获取上次读到的位置。

属性名称	默认值	描述
channels	-	sources要连接的channels
type	-	source的类型，这里是TAILDIR
filegroups	-	一些用空格分开的文件集合，每一个集合代表一些列要被读取的文件
filegroups.	-	文件集合的绝对路径，正则表达式只能用于文件名
positionFile	~/.flume/taildir_position.json	以json格式记录的inode，记录每个文件的绝对路径以及该文件最后读到的位置信息
heders..	-	Header value which is the set with header key. Multiple headers can be specified for one file group.
byteOffsetHeader	false	Whether to add the byte offset of a tailed line to a header called ‘byteoffset’.
skipToEnd	false	是否直接定位到文件尾部，以防没有写入位置文件
idleTimeout	120000	超时时间(ms)，用来关闭长时间未激活的文件，如果已经被关闭的文件有新的行被追加，这个source将会自动重新打开该文件
writePosInterval	3000	将位置信息写入位置文件的间隔时间
batchSize	100	每一次读到最大的行数并发送给channel, 一般推荐使用默认值
maxBatchCount	Long.MAX_VLUE	控制从同一文件连续读取的批次数。如果source读取多个文件，并且其中一个文件的写入速度很快，则它可能会阻止其他文件被处理，因为繁忙文件将被无休止地读取。在这种情况下，请降低此值。
backoffSleepIncrement	1000	最后一次尝试未找到任何新数据时，重新尝试轮询新数据之前的时间延迟增量。
maxBackoffSleep	5000	当最后一次尝试未找到任何新数据时，每次重新尝试轮询新数据之间的最大时间延迟。
cachePatternMatching	true	Listing directories and applying the filename regex pattern may be time consuming for directories containing thousands of files. Caching the list of matching files can improve performance. The order in which files are consumed will also be cached. Requires that the file system keeps track of modification times with at least a 1-second granularity.
fileHeader	false	是否增加一个header用来存储文件的绝对路径
fileHeaderKey	file	Header key to use when appending absolute path filename to event header.

例如：

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.log.
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.ri.maxBatchCount = 1000

Kafka Source kafka source就是一个消费者用于消费kafka 某些topics的，如果有多个kafka sources运行，我们可以配置他们为同一个消费者组，这样每一个消费者将会读取一个唯一的partitions集合，这个source只支持0.10.1.0以上版本

属性名称	默认值	描述
channels	-	sources要连接的channels
type	-	source的类型，这里是org.apache.flume.source.kafka.KafkaSource
kafka.bootstrap.servers	-	要执行的命令
kafka.topics	-	逗号分隔topics
kafka.topics.regex	-	正则表达式要订阅的topics，这个配置的优先级高于kafka.topics，他会覆盖kafka.topics
kafka.consumer.group.id	-	消费者id，当有多个source或者agent的时候使用相同的此配置代表他们的consumer group相同
batchSize	1000	一次batch中最大可写入channel的消息数
batchDurationMillis	1000ms	当batchSize没有达到最大值的时候，超过这个时间也会写入到 channel
backoffSleepIncrement	1000	Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
maxBackoffSleep	5000	Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
useFlumeEventFormat	false	By default events are taken as bytes from the Kafka topic directly into the event body. Set to true to read events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSink or with the parseAsFlumeEvent property on the Kafka Channel this will preserve any Flume headers sent on the producing side.
setTopicHeader	true	When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property.
topicHeader	topic	Defines the name of the header in which to store the name of the topic the message was received from, if the setTopicHeader property is set to true. Care should be taken if combining with the Kafka Sink topicHeader property so as to avoid sending the message back to the same topic in a loop.
kafka.consumer.security.protocol	PLAINTEXT	Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup.
mere consumer security props		If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on consumer.
Other kafka Consumer Properties	-	These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.consumer. For example: kafka.consumer.auto.offset.reset

例如：

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# the default kafka.consumer.group.id=flume is used

以上是我们项目中用到的一些source，https://flume.apache.org/FlumeUserGuide.html#setting-up-an-agent 官网上可以找到所有source的完整配置。需要的朋友可以去看一下。
【flume 1.9 源码阅读(四) Source配置】flume 通过这些配置，将用户想使用的 source 实例化。