为了使得水位在实时核算中更精准,咱们规划出一种中心化的水位办理思路,即实时核算的各个节点,包括source、operator、sinker等都会把自己核算的水位信息,一致上报给大局的Watermark Server,由Watermark Server 来进行水位信息的一致办理。
△中心化水位规划
Watermark Server :保护一个水位的信息表(hash_table),包括实时核算程序(APP)全体拓扑信息(Source、Operator 和Sinker等)各个层级对应的水位信息,以便于进行大局水位(比方low watermark)的核算,Watermark Server 定时和state做交互,以确保水位信息的不丢失。
Watermark Client:水位更新客户端,在source、worker和sinker等实时算子中,担任向Watermark Server 上报和恳求水位信息(比方上游或许大局水位),经过baidu-rpc服务恳求回调。
Low watermark(低水位):Low watermark是一个时刻戳,用来符号实时数据处理过程中最早(oldest)的没有处理的数据的时刻(Low watermark, which pessimistically attempt to capture the event time of the oldest unprocessed record the system is aware of.)。它许诺未来不会有早于该时刻戳的数据抵达。这里的时刻核算一般依据eventtime,即事情产生时刻,例如日志中用户行为产生的时刻,而较少运用数据处理时刻(processing time,某些场景也能够用),watermark核算的公式为(来自Google MillWheel 论文):
Low Watermark of A = min(oldest work of A, low watermark of C : C outputs to A)
可是在实践体系规划中,low watermark又能够按照算子处理的鸿沟区分如下:
Input Low Watermark: Oldest work not yet sent to this streaming stage.
InputLowWatermark(Stage) = min { OutputLowWatermark(Stage’) | Stage’ is upstream of Stage}
输入最低水位,能够理解为将要输入当时算子,即上游算子处理过的数据的watermark。
Output Low Watermark: Oldest work not yet completed by this streaming stage.
OutputLowWatermark(Stage) = min { InputLowWatermark(Stage), OldestWork(Stage) }
[1] T. Akidau, A. Balikov, K. Bekirolu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. Millwheel: Fault-tolerant stream processing at internet scale. Proc. VLDB Endow., 6(11):1033–1044, Aug. 2013.
[2] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernndez-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792–1803, 2015.
[3] T. Akidau, S. Chernyak, and R. Lax. Streaming Systems. O’Reilly Media, Inc., 1st edition, 2018.
[4] “Watermarks – Measuring Time and Progress in Streaming Pipelines”, Slava Chernyak , Google Inc
[5] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.