https://seatunnel.apache.org/
https://github.com/apache/incubator-seatunnel
https://github.com/InterestingLab/guardian
https://interestinglab.github.io/seatunnel-docs/
---
jdk1.8
Spark https://spark.apache.org/downloads.html
---Install seatunnel
wget https://github.com/apache/incubator-seatunnel/archive/refs/tags/2.1.3.zip -O seatunnel-2.1.3.zip
unzip seatunnel-2.1.3.zip
ln -s seatunnel-2.1.3 seatunnel
---Configuration
A complete seatunnel configuration contains four sections, spark, input, filter, and output:
spark {
...
}
input {
...
}
filter {
...
}
output {
...
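}
spark holds the Spark-related settings.
For the configurable Spark parameters, see Spark Configuration. The master and deploy-mode parameters cannot be set here; they must be specified in the seatunnel startup script.
input can hold any input plugins and their parameters; the parameters vary by input plugin.
filter can hold any filter plugins and their parameters; the parameters vary by filter plugin.
Multiple plugins in filter form a data-processing pipeline in the order they are configured: the output of one filter is the input of the next.
output can hold any output plugins and their parameters; the parameters vary by output plugin.
Data that has passed through the filters is sent to every plugin configured in output.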
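The filter and output semantics described above can be sketched in a few lines of Python. This is only an illustrative model, not seatunnel's implementation: run_pipeline, upper, add_length, sink_a and sink_b are hypothetical names invented for the example, and the raw_message field name is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Filter = Callable[[Row], Row]
Output = Callable[[Row], None]


def run_pipeline(rows: List[Row], filters: List[Filter], outputs: List[Output]) -> None:
    """Apply filters in configured order, then send each row to every output."""
    for row in rows:
        for f in filters:
            row = f(row)  # the output of one filter is the input of the next
        for out in outputs:
            out(row)  # every configured output receives the filtered row


# Hypothetical filters, for illustration only.
def upper(row: Row) -> Row:
    return {**row, "raw_message": row["raw_message"].upper()}


def add_length(row: Row) -> Row:
    return {**row, "length": str(len(row["raw_message"]))}


# Two in-memory lists stand in for two output plugins.
sink_a: List[Row] = []
sink_b: List[Row] = []

run_pipeline(
    [{"raw_message": "hello"}],
    filters=[upper, add_length],
    outputs=[sink_a.append, sink_b.append],
)
```

Swapping the order of upper and add_length would not change the result here, but with filters that depend on each other's fields the configured order matters, which is the point of the pipeline rule.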
---Configuration file
spark {
# You can set spark configuration here
# seatunnel defined streaming batch duration in seconds
spark.streaming.batchDuration = 5
# see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
spark.app.name = "seatunnel"
spark.executor.instances = 2
spark.executor.cores = 1
spark.executor.memory = "1g"
}
input {
# This is an example input plugin **only for testing and demonstrating the input plugin feature**
fakestream {
content = ["Hello World, InterestingLab"]
rate = 1
}
# If you would like to get more information about how to configure seatunnel and see full list of input plugins,
# please go to https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/configuration/base
}
filter {
split {
fields = ["msg", "name"]
delimiter = ","
}
# If you would like to get more information about how to configure seatunnel and see full list of filter plugins,
# please go to https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/configuration/base
}
output {
stdout {}
# If you would like to get more information about how to configure seatunnel and see full list of output plugins,
# please go to https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/configuration/base
}
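To make the effect of the split filter in this configuration concrete, here is a rough Python sketch of what splitting the fakestream content on "," into the fields msg and name would produce. It is an assumption-laden model, not the plugin's code: split_filter is a hypothetical helper, and whether the real plugin trims surrounding whitespace is not confirmed by these notes, so the sketch leaves whitespace untouched.

```python
from typing import Dict, List


def split_filter(raw_message: str, fields: List[str], delimiter: str) -> Dict[str, str]:
    """Split raw_message on delimiter and pair the parts with the field names.

    zip() stops at the shorter sequence, so surplus parts are dropped and
    fields without a matching part are simply absent from the result.
    """
    parts = raw_message.split(delimiter)
    return dict(zip(fields, parts))


# Values taken from the configuration file above.
row = split_filter("Hello World, InterestingLab", ["msg", "name"], ",")
print(row)
```

In this sketch the name value keeps the space that follows the comma in the original content.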
---Guardian
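Guardian is a subproject of seatunnel: a monitoring and alerting tool that monitors whether seatunnel is alive and how far its scheduling is delayed. Guardian can load configuration files dynamically at runtime and provides an HTTP API for modifying the configuration in real time. It currently supports only seatunnel on Yarn.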
wget https://github.com/InterestingLab/guardian/releases/download/v1.0.0/guardian_1.0.0.tar.gz
tar -xvf guardian_1.0.0.tar.gz
cd guardian_1.0.0
./bin/guardian check config.json
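SeaTunnel is a distributed, high-performance data integration platform for synchronizing and transforming massive amounts of data (offline and real-time).
---SeaTunnel use cases
Massive data synchronization
Massive data integration
ETL on massive data
Massive data aggregation
Multi-source data processing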