spark|spark structured-streaming 最全的使用总结

一、spark structured-streaming介绍
我们都知道spark streaming在v2.4.5 之后 就进入了维护阶段,不再有新的大版本出现,而且 spark streaming一直是按照微批来处理streaming数据的,只能做到准实时,无法像flink一样做到数据的实时数据处理。所以在spark streaming进入到不再更新的维护阶段后,spark 推出了 structured-streaming 来同flink 进行竞争,structured-streaming 支持窗口计算,watermark、state 等flink 一样的特性。
二、spark structured-streaming如何对接kafka 数据源
jar依赖:

groupId = org.apache.spark artifactId = spark-sql-kafka-0-10_2.12 version = 3.2.0

1、从kafka数据源读取数据
// Subscribe to 1 topic val df = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribe", "topic1") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]// Subscribe to 1 topic, with headers val df = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribe", "topic1") .option("includeHeaders", "true") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers") .as[(String, String, Array[(String, Array[Byte])])]// Subscribe to multiple topics val df = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribe", "topic1,topic2") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]// Subscribe to a pattern val df = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribePattern", "topic.*") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]

2、Creating a Kafka Source for Batch Queries
// Subscribe to 1 topic defaults to the earliest and latest offsets val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribe", "topic1") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]// Subscribe to multiple topics, specifying explicit Kafka offsets val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribe", "topic1,topic2") .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]// Subscribe to a pattern, at the earliest and latest offsets val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribePattern", "topic.*") .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .as[(String, String)]

schema:
Column Type
key binary
value binary
topic string
partition int
offset long
timestamp timestamp
timestampType int
headers (optional) array
 Option参数:
Option value meaning
assign json string {"topicA":[0,1],"topicB":[2,4]} Specific TopicPartitions to consume. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source.
subscribe A comma-separated list of topics The topic list to subscribe. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source.
subscribePattern Java regex string The pattern used to subscribe to topic(s). Only one of "assign, "subscribe" or "subscribePattern" options can be specified for Kafka source.
kafka.bootstrap.servers A comma-separated list of host:port The Kafka "bootstrap.servers" configuration.
Option可选参数:
Option value default query type meaning
startingTimestamp timestamp string e.g. "1000" none (next preference is startingOffsetsByTimestamp) streaming and batch The start point of timestamp when a query is started, a string specifying a starting timestamp for all partitions in topics being subscribed. Please refer the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior will follow to the value of the option startingOffsetsByTimestampStrategy
Note1: startingTimestamp takes precedence over startingOffsetsByTimestamp and startingOffsets.
Note2: For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
startingOffsetsByTimestamp json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ none (next preference is startingOffsets) streaming and batch The start point of timestamp when a query is started, a json string specifying a starting timestamp for each TopicPartition. Please refer the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior will follow to the value of the option startingOffsetsByTimestampStrategy
Note1: startingOffsetsByTimestamp takes precedence over startingOffsets.
Note2: For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
startingOffsets "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """ "latest" for streaming, "earliest" for batch streaming and batch The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
endingTimestamp timestamp string e.g. "1000" none (next preference is endingOffsetsByTimestamp) batch query The end point when a batch query is ended, a json string specifying an ending timestamp for all partitions in topics being subscribed. Please refer the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. Note: endingTimestamp takes precedence over endingOffsetsByTimestamp and endingOffsets.

endingOffsetsByTimestamp json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ none (next preference is endingOffsets) batch query The end point when a batch query is ended, a json string specifying an ending timestamp for each TopicPartition. Please refer the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. Note: endingOffsetsByTimestamp takes precedence over endingOffsets.
endingOffsets latest or json string {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}} latest batch query The end point when a batch query is ended, either "latest" which is just referred to the latest, or a json string specifying an ending offset for each TopicPartition. In the json, -1 as an offset can be used to refer to latest, and -2 (earliest) as an offset is not allowed.
failOnDataLoss true or false true streaming and batch Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
kafkaConsumer.pollTimeoutMs long 120000 streaming and batch The timeout in milliseconds to poll data from Kafka in executors. When not defined it falls back to spark.network.timeout.
fetchOffset.numRetries int 3 streaming and batch Number of times to retry before giving up fetching Kafka offsets.
fetchOffset.retryIntervalMs long 10 streaming and batch milliseconds to wait before retrying to fetch Kafka offsets
maxOffsetsPerTrigger long none streaming query Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
minOffsetsPerTrigger long none streaming query Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger.
maxTriggerDelay time with units 15m streaming query Maximum amount of time for which trigger can be delayed between two triggers provided some data is available from the source. This option is only applicable if minOffsetsPerTrigger is set.
minPartitions int none streaming and batch Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
groupIdPrefix string spark-kafka-source streaming and batch Prefix of consumer group identifiers (group.id) that are generated by structured streaming queries. If "kafka.group.id" is set, this option will be ignored.
kafka.group.id string none streaming and batch The Kafka group id to use in Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both, batch and streaming) or sources with the same group id are likely interfere with each other causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored.
includeHeaders boolean false streaming and batch Whether to include the Kafka headers in the row.
startingOffsetsByTimestampStrategy "error" or "latest" "error" streaming and batch The strategy will be used when the specified starting offset by timestamp (either global or per partition) doesn't match with the offset Kafka returned. Here's the strategy name and corresponding descriptions:
"error": fail the query and end users have to deal with workarounds requiring manual steps.
"latest": assigns the latest offset for these partitions, so that Spark can read newer records from these partitions in further micro-batches.
Consumer Caching
It’s time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time is a key factor. Because of this, Spark pools Kafka consumers on executors, by leveraging Apache Commons Pool.
The caching key is built up from the following information:
  • Topic name
  • Topic partition
  • Group ID
The following properties are available to configure the consumer pool:
Property Name Default Meaning Since Version
spark.kafka.consumer.cache.capacity 64 The maximum number of consumers cached. Please note that it's a soft limit. 3.0.0
spark.kafka.consumer.cache.timeout 5m (5 minutes) The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor. 3.0.0
spark.kafka.consumer.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for consumer pool. When non-positive, no idle evictor thread will be run. 3.0.0
spark.kafka.consumer.cache.jmx.enable false Enable or disable JMX for pools created with this configuration instance. Statistics of the pool are available via JMX instance. The prefix of JMX name is set to "kafka010-cached-simple-kafka-consumer-pool". 3.0.0
The size of the pool is limited by spark.kafka.consumer.cache.capacity, but it works as “soft-limit” to not block Spark tasks.
Idle eviction thread periodically removes consumers which are not used longer than given timeout. If this threshold is reached when borrowing, it tries to remove the least-used entry that is currently not in use.
If it cannot be removed, then the pool will keep growing. In the worst case, the pool will grow to the max number of concurrent tasks that can run in the executor (that is, number of task slots).
If a task fails for any reason, the new task is executed with a newly created Kafka consumer for safety reasons. At the same time, we invalidate all consumers in pool which have same caching key, to remove consumer which was used in failed execution. Consumers which any other tasks are using will not be closed, but will be invalidated as well when they are returned into pool.
Along with consumers, Spark pools the records fetched from Kafka separately, to let Kafka consumers stateless in point of Spark’s view, and maximize the efficiency of pooling. It leverages same cache key with Kafka consumers pool. Note that it doesn’t leverage Apache Commons Pool due to the difference of characteristics.
The following properties are available to configure the fetched data pool:
Property Name Default Meaning Since Version
spark.kafka.consumer.fetchedData.cache.timeout 5m (5 minutes) The minimum amount of time a fetched data may sit idle in the pool before it is eligible for eviction by the evictor. 3.0.0
spark.kafka.consumer.fetchedData.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for fetched data pool. When non-positive, no idle evictor thread will be run. 3.0.0
特别注意:
//"kafka.auto.offset.reset" -> "latest", //latest自动重置偏移量为最新的偏移量earliest,structured-streaming not support
"auto.offset.reset" -> "latest", //latest自动重置偏移量为最新的偏移量earliest无效,应该写为 kafka.auto.offset.reset
//"kafka.enable.auto.commit" -> "true", //如果是true,则这个消费者的偏移量会在后台自动提交structured-streaming not support ,通过.option("startingOffsets","latest")
"enable.auto.commit" -> "true", //如果是true,则这个消费者的偏移量会在后台自动提交无效,应该写为kafka.enable.auto.commit
其他的一些可选参数:比如连接kafka走ssl,

"kafka.security.protocol" -> "SSL",
"kafka.ssl.keystore.location" -> "/dbfs/FileStore/xxxxx/client_keystore.jks",
"kafka.ssl.keystore.password" -> "xxxxx",
"kafka.ssl.key.password" -> "xxxxxx",
"kafka.ssl.truststore.location" -> "/dbfs/FileStore/xxxxx/client_truststore.jks",
"kafka.ssl.truststore.password" -> "kafka",
"kafka.ssl.endpoint.identification.algorithm" -> "",
"kafka.request.timeout.ms" -> "100000",
"kafka.max.poll.records" -> "100",
"kafka.fetch.max.bytes" -> "12428800"
1、可以通过.option("maxOffsetsPerTrigger","30000") 限制每秒从kafka读取的最大数据条数。
2、.option("startingOffsets", "latest") 只针对首次没有进行过任何消费的时候生效,如果之前已经提交过offset,那么此时不会再从latest进行消费,而是从上次的offser进行消费。
3、切记,structured-streaming 实际无法支持"enable.auto.commit" -> "true",都是由structured-streaming自身进行维护的,在 streaming 写入sink的时候,设置.option("checkpointLocation", "")
,类似flink的checkpoint 一样。设置后,structured-streaming 会自己负责维护offset提交到checkpointLocation路径下面。
spark|spark structured-streaming 最全的使用总结
文章图片


3、streaming数据写入kafka sink
Kafka Sink for Streaming
/ Write key-value data from a DataFrame to a specific Kafka topic specified in an option val ds = df .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .writeStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("topic", "topic1") .start()// Write key-value data from a DataFrame to Kafka using a topic specified in the data val ds = df .selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") .writeStream .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .start()

Writing the output of Batch Queries to Kafka
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .write .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("topic", "topic1") .save()// Write key-value data from a DataFrame to Kafka using a topic specified in the data df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") .write .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .save()


The Dataframe being written to Kafka should have the following columns in schema:
Column Type
key (optional) string or binary
value (required) string or binary
headers (optional) array
topic (*optional) string
partition (optional) int
* The topic column is required if the “topic” configuration option is not specified.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the “topic” configuration option is set i.e., the “topic” configuration option overrides the topic column. If a “partition” column is not specified (or its value is null) then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the kafka.partitioner.class option. If not present, Kafka default partitioner will be used.
The following options must be set for the Kafka sink for both batch and streaming queries.
Option value meaning
kafka.bootstrap.servers A comma-separated list of host:port The Kafka "bootstrap.servers" configuration.
The following configurations are optional:
Option value default query type meaning
topic string none streaming and batch Sets the topic that all rows will be written to in Kafka. This option overrides any topic column that may exist in the data.
includeHeaders boolean false streaming and batch Whether to include the Kafka headers in the row.
Producer Caching
Given Kafka producer instance is designed to be thread-safe, Spark initializes a Kafka producer instance and co-use across tasks for same caching key.
The caching key is built up from the following information:
  • Kafka producer configuration
This includes configuration for authorization, which Spark will automatically include when delegation token is being used. Even we take authorization into account, you can expect same Kafka producer instance will be used among same Kafka producer configuration. It will use different Kafka producer when delegation token is renewed; Kafka producer instance for old delegation token will be evicted according to the cache policy.
The following properties are available to configure the producer pool:
Property Name Default Meaning Since Version
spark.kafka.producer.cache.timeout 10m (10 minutes) The minimum amount of time a producer may sit idle in the pool before it is eligible for eviction by the evictor. 2.2.1
spark.kafka.producer.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run. 3.0.0
Idle eviction thread periodically removes producers which are not used longer than given timeout. Note that the producer is shared and used concurrently, so the last used timestamp is determined by the moment the producer instance is returned and reference count is 0.
Kafka Specific Configurations Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option("kafka.bootstrap.servers", "host:port"). For possible kafka parameters, see Kafka consumer config docs for parameters related to reading data, and Kafka producer config docs for parameters related to writing data.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
  • group.id: Kafka source will create a unique group id for each query automatically. The user can set the prefix of the automatically generated group.id’s via the optional source option groupIdPrefix, default value is “spark-kafka-source”. You can also set “kafka.group.id” to force Spark to use a special group id, however, please read warnings for this option and use it with caution.
  • auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off. Note that when the offsets consumed by a streaming application no longer exist in Kafka (e.g., topics are deleted, offsets are out of range, or offsets are removed after retention period), the offsets will not be reset and the streaming application will see data loss. In extreme cases, for example the throughput of the streaming application cannot catch up the retention speed of Kafka, the input rows of a batch might be gradually reduced until zero when the offset ranges of the batch are completely not in Kafka. Enabling failOnDataLoss option can ask Structured Streaming to fail the query for such cases.
  • key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
  • value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
  • key.serializer: Keys are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
  • value.serializer: values are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the values into either strings or byte arrays.
  • enable.auto.commit: Kafka source doesn’t commit any offset.
  • interceptor.classes: Kafka source always read keys and values as byte arrays. It’s not safe to use ConsumerInterceptor as it may break the query.
三、spark structured-streaming如何对接Azure Event hub数据源

jar依赖:

com.microsoft.azure
azure-eventhubs-spark_${scala.binary.version}
2.3.21
compile

1、从event hub数据源读取数据
val eventHubsConf = EventHubsConf("xxxx")//连接字符串 //https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf .setConsumerGroup("xxxx") .setMaxEventsPerTrigger(3000) // .setStartingPosition(EventPosition.fromEndOfStream)//EventPosition.fromOffset("246812") //.setStartingPosition(EventPosition.fromStartOfStream) .setReceiverTimeout(30)

val df = spark
.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load().withColumnRenamed("value", "body").select(col("body").cast("string"))

import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }// To connect to an Event Hub, EntityPath is required as part of the connection string. // Here, we assume that the connection string from the Azure portal does not have the EntityPath part. val connectionString = ConnectionStringBuilder("{EVENT HUB CONNECTION STRING FROM AZURE PORTAL}") .setEventHubName("{EVENT HUB NAME}") .build val eventHubsConf = EventHubsConf(connectionString) .setStartingPosition(EventPosition.fromEndOfStream)var streamingInputDF = spark.readStream .format("eventhubs") .options(eventHubsConf.toMap) .load()

2、streaming数据写入 event hub
df.writeStream .format("eventhubs") .outputMode("append") .options(eventHubsConfWrite.toMap) .option("checkpointLocation", "xxxxxxxxxxx") .trigger(ProcessingTime("25 seconds")) .start().awaitTermination()

import org.apache.spark.sql.streaming.Trigger.ProcessingTimeval query = streamingSelectDF .writeStream .format("parquet") .option("path", "/mnt/sample/test-data") .option("checkpointLocation", "/mnt/sample/check") .partitionBy("zip", "day") .trigger(ProcessingTime("25 seconds")) .start()

import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition } import org.apache.spark.sql.streaming.Trigger.ProcessingTime// The connection string for the Event Hub you will WRTIE to. val connString = "{EVENT HUB CONNECTION STRING}" val eventHubsConfWrite = EventHubsConf(connString)val source = spark.readStream .format("rate") .option("rowsPerSecond", 100) .load() .withColumnRenamed("value", "body") .select($"body" cast "string")val query = source .writeStream .format("eventhubs") .outputMode("update") .options(eventHubsConfWrite.toMap) .trigger(ProcessingTime("25 seconds")) .option("checkpointLocation", "/checkpoint/") .start()

JDBC Sink
import java.sql._classJDBCSink(url:String, user:String, pwd:String) extends ForeachWriter[(String, String)] { val driver = "com.mysql.jdbc.Driver" var connection:Connection = _ var statement:Statement = _def open(partitionId: Long,version: Long): Boolean = { Class.forName(driver) connection = DriverManager.getConnection(url, user, pwd) statement = connection.createStatement true }def process(value: (String, String)): Unit = { statement.executeUpdate("INSERT INTO zip_test " + "VALUES (" + value._1 + "," + value._2 + ")") }def close(errorOrNull: Throwable): Unit = { connection.close } }

val url="jdbc:mysql://:3306/test" val user ="user" val pwd = "pwd"val writer = new JDBCSink(url,user, pwd) val query = streamingSelectDF .writeStream .foreach(writer) .outputMode("update") .trigger(ProcessingTime("25 seconds")) .start()

更多详情请参考:https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf
https://docs.microsoft.com/zh-cn/azure/databricks/_static/notebooks/structured-streaming-event-hubs-integration.html
注意:可以通过.setMaxEventsPerTrigger(3000) 控制每次从event hub读取的数据条数
四、spark structured-streaming的write 输出模式
There are a few types of output modes.
  • Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported for only those queries where rows added to the Result Table is never going to change. Hence, this mode guarantees that each row will be output only once (assuming fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.
  • Complete mode - The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries.
  • Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. More information to be added in future releases.
Different types of streaming queries support different output modes. Here is the compatibility matrix.

Query Type Supported Output Modes Notes
Queries with aggregation Aggregation on event-time with watermark Append, Update, Complete Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed the late threshold specified in withWatermark() as by the modes semantics, rows can be added to the Result Table only once after they are finalized (i.e. after watermark is crossed). See the Late Data section for more details.

Update mode uses watermark to drop old aggregation state.

Complete mode does not drop old aggregation state since by definition this mode preserves all data in the Result Table.
Other aggregations Complete, Update Since no watermark is defined (only defined in other category), old aggregation state is not dropped.

Append mode is not supported as aggregates can update thus violating the semantics of this mode.
Queries with mapGroupsWithState Update Aggregations not allowed in a query with mapGroupsWithState.
Queries with flatMapGroupsWithState Append operation mode Append Aggregations are allowed after flatMapGroupsWithState.
Update operation mode Update Aggregations not allowed in a query with flatMapGroupsWithState.
Queries with joins Append Update and Complete mode not supported yet. See the support matrix in the Join Operations section for more details on what types of joins are supported.
Other queries Append, Update Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
五、spark structured-streaming的其他sink
Sink Supported Output Modes Options Fault-tolerant Notes
File Sink Append path: path to the output directory, must be specified.
retention: time to live (TTL) for output files. Output files which batches were committed older than TTL will be eventually excluded in metadata log. This means reader queries which read the sink's output directory may not process them. You can provide the value as string format of the time. (like "12h", "7d", etc.) By default it's disabled.

For file-format-specific options, see the related methods in DataFrameWriter (Scala/Java/Python/R). E.g. for "parquet" format options see DataFrameWriter.parquet()
Yes (exactly-once) Supports writes to partitioned tables. Partitioning by time may be useful.
Kafka Sink Append, Update, Complete See the Kafka Integration Guide Yes (at-least-once) More details in the Kafka Integration Guide
Foreach Sink Append, Update, Complete None Yes (at-least-once) More details in the next section
ForeachBatch Sink Append, Update, Complete None Depends on the implementation More details in the next section
Console Sink Append, Update, Complete numRows: Number of rows to print every trigger (default: 20)
truncate: Whether to truncate the output if too long (default: true)
No
Memory Sink Append, Complete None No. But in Complete Mode, restarted query will recreate the full table. Table name is the query name.
// ========== DF with no aggregations ========== val noAggDF = deviceDataDf.select("device").where("signal > 10")// Print new data to console noAggDF .writeStream .format("console") .start()// Write new data to Parquet files noAggDF .writeStream .format("parquet") .option("checkpointLocation", "path/to/checkpoint/dir") .option("path", "path/to/destination/dir") .start()// ========== DF with aggregation ========== val aggDF = df.groupBy("device").count()// Print updated aggregations to console aggDF .writeStream .outputMode("complete") .format("console") .start()// Have all the aggregates in an in-memory table aggDF .writeStream .queryName("aggregates")// this query name will be the table name .outputMode("complete") .format("memory") .start()spark.sql("select * from aggregates").show()// interactively query in-memory table

Using Foreach and ForeachBatch The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch. Let’s understand their usages in more detail.
ForeachBatch foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => // Transform and write batchDF }.start()streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.persist() batchDF.write.format(...).save(...)// location 1 batchDF.write.format(...).save(...)// location 2 batchDF.unpersist() } 

Foreach If foreachBatch is not an option (for example, corresponding batch data writer does not exist, or continuous processing mode), then you can express your custom writer logic using foreach. Specifically, you can express the data writing logic by dividing it into three methods: open, process, and close. Since Spark 2.4, foreach is available in Scala, Java and Python.
streamingDatasetOfString.writeStream.foreach( new ForeachWriter[String] {def open(partitionId: Long, version: Long): Boolean = { // Open connection }def process(record: String): Unit = { // Write string to connection }def close(errorOrNull: Throwable): Unit = { // Close the connection } } ).start()val spark: SparkSession = ...// Create a streaming DataFrame val df = spark.readStream .format("rate") .option("rowsPerSecond", 10) .load()// Write the streaming DataFrame to a table df.writeStream .option("checkpointLocation", "path/to/checkpoint/dir") .toTable("myTable")// Check the table result spark.read.table("myTable").show()// Transform the source dataset and write to a new table spark.readStream .table("myTable") .select("value") .writeStream .option("checkpointLocation", "path/to/checkpoint/dir") .format("parquet") .toTable("newTable")// Check the new table result spark.read.table("newTable").show()

六、spark structured-streaming的其他State Store
State store is a versioned key-value store which provides both read and write operations. In Structured Streaming, we use the state store provider to handle the stateful operations across batches. There are two built-in state store provider implementations. End users can also implement their own state store provider by extending StateStoreProvider interface.

HDFS state store provider
The HDFS backend state store provider is the default implementation of [[StateStoreProvider]] and [[StateStore]] in which all the data is stored in memory map in the first stage, and then backed by files in an HDFS-compatible file system. All updates to the store have to be done in sets transactionally, and each set of updates increments the store’s version. These versions can be used to re-execute the updates (by retries in RDD operations) on the correct version of the store, and regenerate the store version.

RocksDB state store implementation
As of Spark 3.2, we add a new built-in state store implementation, RocksDB state store provider.

If you have stateful operations in your streaming query (for example, streaming aggregation, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState) and you want to maintain millions of keys in the state, then you may face issues related to large JVM garbage collection (GC) pauses causing high variations in the micro-batch processing times. This occurs because, by the implementation of HDFSBackedStateStore, the state data is maintained in the JVM memory of the executors and large number of state objects puts memory pressure on the JVM causing high GC pauses.

In such cases, you can choose to use a more optimized state management solution based on RocksDB. Rather than keeping the state in the JVM memory, this solution uses RocksDB to efficiently manage the state in the native memory and the local disk. Furthermore, any changes to this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus providing full fault-tolerance guarantees (the same as default state management).

To enable the new build-in state store implementation, set spark.sql.streaming.stateStore.providerClass to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.

Here are the configs regarding to RocksDB instance of the state store provider:

Config Name Description Default Value
spark.sql.streaming.stateStore.rocksdb.compactOnCommit Whether we perform a range compaction of RocksDB instance for commit operation False
spark.sql.streaming.stateStore.rocksdb.blockSizeKB Approximate size in KB of user data packed per block for a RocksDB BlockBasedTable, which is a RocksDB's default SST file format. 4
spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB The size capacity in MB for a cache of blocks. 8
spark.sql.streaming.stateStore.rocksdb.lockAcquireTimeoutMs The waiting time in millisecond for acquiring lock in the load operation for RocksDB instance. 60000
spark.sql.streaming.stateStore.rocksdb.resetStatsOnLoad Whether we resets all ticker and histogram stats for RocksDB on load. True

State Store and task locality
The stateful operations store states for events in state stores of executors. State stores occupy resources such as memory and disk space to store the states. So it is more efficient to keep a state store provider running in the same executor across different streaming batches. Changing the location of a state store provider requires the extra overhead of loading checkpointed states. The overhead of loading state from checkpoint depends on the external storage and the size of the state, which tends to hurt the latency of micro-batch run. For some use cases such as processing very large state data, loading new state store providers from checkpointed states can be very time-consuming and inefficient.

The stateful operations in Structured Streaming queries rely on the preferred location feature of Spark’s RDD to run the state store provider on the same executor. If in the next batch the corresponding state store provider is scheduled on this executor again, it could reuse the previous states and save the time of loading checkpointed states.

However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to the executors other than the preferred ones. In this case, Spark will load state store providers from checkpointed states on new executors. The state store providers run in the previous batch will not be unloaded immediately. Spark runs a maintenance task which checks and unloads the state store providers that are inactive on the executors.

By changing the Spark configurations related to task scheduling, for example spark.locality.wait, users can configure Spark how long to wait to launch a data-local task. For stateful operations in Structured Streaming, it can be used to let state store providers running on the same executors across batches.

Specifically for built-in HDFS state store provider, users can check the state store metrics such as loadedMapCacheHitCount and loadedMapCacheMissCount. Ideally, it is best if cache missing count is minimized that means Spark won’t waste too much time on loading checkpointed state. User can increase Spark locality waiting configurations to avoid loading state store providers in different executors across batches.
【spark|spark structured-streaming 最全的使用总结】
七、spark structured-streaming Join
未完待续

    推荐阅读