
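The Art of Spark Kernel Design: Architecture Design and Implementation (Spark核心设计的艺术:架构设计与实现)
The Art of Spark Kernel Design: Architecture Design and Implementation is jointly recommended by several experts and written by a big data expert at 360. Based on Spark 2.1.0, it dissects the essentials of Spark's architecture and implementation. The analysis goes down to the method level and is distilled into numerous flowcharts, giving a multi-dimensional view of seven core design areas: architecture, environment, scheduling, storage, computation, deployment, and API.
Features of the book:
The book is organized the way one would naturally read source code: it starts with script analysis, moves on to initialization, and then reaches the core content, progressing from simple to complex throughout. Each chapter opens with an overview, then analyzes the implementation of each component in depth, and finally shows how the components relate to one another by walking through the execution flow. Diagrams are used wherever possible to illustrate the principles and speed up understanding. Many of the implementations and principles covered are worth borrowing and can help readers improve their architecture design and programming skills. The book keeps as much source code as possible so that beginners can read comfortably away from the office, for example on the subway or bus.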
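Basic information
- Title: The Art of Spark Kernel Design (Spark核心设计的艺术)
- Also known as: The Art of Spark Kernel Design: Architecture Design and Implementation (Spark核心设计的艺术:架构设计与实现)
- Author: Geng Jia'an (耿嘉安)
- ISBN: 978-7-111-58439-1
- Pages: 690
- List price: 139
- Publisher: China Machine Press (机械工业出版社)
- Publication date: 2018-01-01
- Binding: Paperback
- Format: 16开 (16mo)
- Technical category: Big data
- Foreign-language title: The Art of Spark Kernel Design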
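Summary
Based on Spark 2.1.0, The Art of Spark Kernel Design: Architecture Design and Implementation dissects the essentials of Spark's architecture and implementation, and aims to provide principled guidance for optimizing, customizing, and extending Spark.
The book has 10 chapters, organized into the following parts.
Preparation (Chapters 1-2): A brief introduction to setting up a Spark environment and to Spark's basic principles. Through detailed descriptions, this part lowers the barrier to entering the world of Spark and gives readers a high-level view of Spark's background and overall design.
Foundations (Chapters 3-5): Covers Spark's infrastructure (configuration, RPC, metrics, and so on), the initialization of SparkContext, and the environment Spark needs in order to execute. After this part, readers will have an in-depth understanding of the design of the RPC framework and the functions of the execution environment, which is a prerequisite for understanding the core content.
Core (Chapters 6-9): The heart of Spark, including the storage system, the scheduling system, the compute engine, and deployment modes. After this part, readers will understand the details of Spark's data processing system and be able to extend Spark's core functionality, tune performance, and pinpoint problems in production.
API (Chapter 10): Compares Spark's old and new APIs and briefly introduces the new API.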
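About the author
Geng Jia'an (耿嘉安) has more than 10 years of experience in the IT industry, having worked at Alibaba, eLong, and 360 with a focus on open source and big data. Extensive hands-on work has led to in-depth study of J2EE, JVM, Tomcat, Spring, Hadoop, Spark, MySQL, and Redis, along with a particular fondness for dissecting the source code of open source projects. Geng's early work was in J2EE enterprise application development, which brought distinctive insights into Java-related technologies. Geng is also the author of 深入理解Spark:核心思想与源码分析 (Understanding Spark in Depth: Core Ideas and Source Code Analysis).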
Table of contents
Praise for this book
Preface
Chapter 1 Environment Setup
1.1 Preparing the runtime environment
1.1.1 Installing the JDK
1.1.2 Installing Scala
1.1.3 Installing Spark
1.2 First experience with Spark
1.2.1 Running spark-shell
1.2.2 Running word count
1.2.3 Dissecting spark-shell
1.3 Preparing the code-reading environment
1.3.1 Installing SBT
1.3.2 Installing Git
1.3.3 Installing the Eclipse Scala IDE plugin
1.4 Compiling and debugging the Spark source code
1.5 Summary
Chapter 2 Design Philosophy and Basic Architecture
2.1 Introducing Spark
2.1.1 Limitations of Hadoop MRv1
2.1.2 Features of Spark
2.1.3 Spark use cases
2.2 Spark fundamentals
2.3 Basic design ideas of Spark
2.3.1 Spark module design
2.3.2 Spark model design
2.4 Basic architecture of Spark
2.5 Summary
Chapter 3 Spark Infrastructure
3.1 Spark configuration
3.1.1 Configuration from system properties
3.1.2 Using the SparkConf configuration API
3.1.3 Cloning a SparkConf configuration
3.2 Spark's built-in RPC framework
3.2.1 RPC configuration: TransportConf
3.2.2 RPC client factory: TransportClientFactory
3.2.3 RPC server: TransportServer
3.2.4 Pipeline initialization
3.2.5 TransportChannelHandler in detail
3.2.6 Server-side RpcHandler in detail
3.2.7 Server bootstrap: TransportServerBootstrap
3.2.8 Client-side TransportClient in detail
3.3 Event bus
3.3.1 The ListenerBus inheritance hierarchy
3.3.2 SparkListenerBus in detail
3.3.3 LiveListenerBus in detail
3.4 Metrics system
3.4.1 The Source inheritance hierarchy
3.4.2 The Sink inheritance hierarchy
3.5 Summary
Chapter 4 Initializing SparkContext
4.1 SparkContext overview
4.2 Creating the Spark environment
4.3 Implementation of SparkUI
4.3.1 SparkUI overview
4.3.2 The WebUI framework
4.3.3 Creating SparkUI
4.4 Creating the heartbeat receiver
4.5 Creating and starting the scheduling system
4.6 Initializing the block manager BlockManager
4.7 Starting the metrics system
4.8 Creating the event log listener
4.9 Creating and starting ExecutorAllocationManager
4.10 Creating and starting ContextCleaner
4.10.1 Creating ContextCleaner
4.10.2 Starting ContextCleaner
4.11 Additional SparkListeners and starting the event bus
4.12 Updating the Spark environment
4.13 Finishing SparkContext initialization
4.14 Common methods provided by SparkContext
4.15 The SparkContext companion object
4.16 Summary
Chapter 5 Spark Execution Environment
5.1 SparkEnv overview
5.2 Security manager: SecurityManager
5.3 RPC environment
5.3.1 RPC endpoint: RpcEndpoint
5.3.2 RPC endpoint reference: RpcEndpointRef
5.3.3 Creating the transport context: TransportConf
5.3.4 Message dispatcher: Dispatcher
5.3.5 Creating the transport context: TransportContext
5.3.6 Creating the transport client factory: TransportClientFactory
5.3.7 Creating TransportServer
5.3.8 Sending client requests
5.3.9 Common methods in NettyRpcEnv
5.4 Serializer manager: SerializerManager
5.5 Broadcast manager: BroadcastManager
5.6 Map task output tracker
5.6.1 Implementation of MapOutputTracker
5.6.2 How MapOutputTrackerMaster works
5.7 Building the storage system
5.8 Creating the metrics system
5.8.1 MetricsConfig in detail
5.8.2 Common methods in MetricsSystem
5.8.3 Starting MetricsSystem
5.9 Output commit coordinator
5.9.1 Implementation of OutputCommitCoordinatorEndpoint
5.9.2 Implementation of OutputCommitCoordinator
5.9.3 How OutputCommitCoordinator works
5.10 Creating SparkEnv
5.11 Summary
Chapter 6 Storage System
6.1 Storage system overview
6.1.1 Storage system architecture
6.1.2 Basic concepts
6.2 Block information manager
6.2.1 Basic concepts of Block locks
6.2.2 Implementation of Block locks
6.3 Disk Block manager
6.3.1 Local directory structure
6.3.2 Methods provided by DiskBlockManager
6.4 Disk storage: DiskStore
6.5 Memory manager
6.5.1 The memory pool model
6.5.2 StorageMemoryPool in detail
6.5.3 The MemoryManager model
6.5.4 UnifiedMemoryManager in detail
6.6 Memory storage: MemoryStore
6.6.1 The memory model of MemoryStore
6.6.2 Methods provided by MemoryStore
6.7 Block manager: BlockManager
6.7.1 Initializing BlockManager
6.7.2 Methods provided by BlockManager
6.8 How BlockManagerMaster manages BlockManager
6.8.1 Responsibilities of BlockManagerMaster
6.8.2 BlockManagerMasterEndpoint in detail
6.8.3 BlockManagerSlaveEndpoint in detail
6.9 Block transfer service
6.9.1 Initializing NettyBlockTransferService
6.9.2 NettyBlockRpcServer in detail
6.9.3 Shuffle client
6.10 DiskBlockObjectWriter in detail
6.11 Summary
Chapter 7 Scheduling System
7.1 Scheduling system overview
7.2 RDD in detail
7.2.1 Why RDD is needed
7.2.2 A first look at the RDD implementation
7.2.3 RDD dependencies
7.2.4 Partition calculator: Partitioner
7.2.5 RDDInfo
7.3 Stage in detail
7.3.1 Implementation of ResultStage
7.3.2 Implementation of ShuffleMapStage
7.3.3 StageInfo
7.4 The DAG-oriented scheduler: DAGScheduler
7.4.1 JobListener and JobWaiter
7.4.2 ActiveJob in detail
7.4.3 A brief introduction to DAGSchedulerEventProcessLoop
7.4.4 Components of DAGScheduler
7.4.5 Common methods provided by DAGScheduler
7.4.6 DAGScheduler and Job submission
7.4.7 Building Stages
7.4.8 Submitting the ResultStage
7.4.9 Submitting Tasks that have not yet been computed
7.4.10 The DAGScheduler scheduling flow
7.4.11 Handling Task execution results
7.5 Scheduling pool: Pool
7.5.1 Scheduling algorithms
7.5.2 Implementation of Pool
7.5.3 Scheduling pool builder
7.6 Task set manager: TaskSetManager
7.6.1 Task sets
7.6.2 Member properties of TaskSetManager
7.6.3 Scheduling pools and speculative execution
7.6.4 Task locality
7.6.5 Common methods of TaskSetManager
7.7 Launcher backend interface: LauncherBackend
7.7.1 Implementation of BackendConnection
7.7.2 Implementation of LauncherBackend
7.8 Scheduler backend interface: SchedulerBackend
7.8.1 Definition of SchedulerBackend
7.8.2 Analysis of the LocalSchedulerBackend implementation
7.9 Task result getter: TaskResultGetter
7.9.1 Handling successful Tasks
7.9.2 Handling failed Tasks
7.10 Task scheduler: TaskScheduler
7.10.1 Properties of TaskSchedulerImpl
7.10.2 Initializing TaskSchedulerImpl
7.10.3 Starting TaskSchedulerImpl
7.10.4 TaskSchedulerImpl and Task submission
7.10.5 TaskSchedulerImpl and resource allocation
7.10.6 The TaskSchedulerImpl scheduling flow
7.10.7 How TaskSchedulerImpl handles execution results
7.10.8 Common methods of TaskSchedulerImpl
7.11 Summary
Chapter 8 Compute Engine
8.1 Compute engine overview
8.2 Memory manager and execution memory
8.2.1 ExecutionMemoryPool in detail
8.2.2 The MemoryManager model and execution memory
8.2.3 UnifiedMemoryManager and execution memory
8.3 Memory manager and Tungsten
8.3.1 MemoryBlock in detail
8.3.2 The MemoryManager model and Tungsten
8.3.3 Tungsten's memory allocator
8.4 Task memory manager
8.4.1 TaskMemoryManager in detail
8.4.2 Memory consumers
8.4.3 Overall architecture of execution memory
8.5 Task in detail
8.5.1 Task context: TaskContext
8.5.2 Definition of Task
8.5.3 Implementation of ShuffleMapTask
8.5.4 Implementation of ResultTask
8.6 IndexShuffleBlockResolver in detail
8.7 Sampling and estimation
8.7.1 Analysis of the SizeTracker implementation
8.7.2 How SizeTracker works
8.8 The trait WritablePartitionedPairCollection
8.9 Analysis of the AppendOnlyMap implementation
8.9.1 Capacity growth of AppendOnlyMap
8.9.2 Data updates in AppendOnlyMap
8.9.3 The cached aggregation algorithm of AppendOnlyMap
8.9.4 Built-in sorting in AppendOnlyMap
8.9.5 Extensions of AppendOnlyMap
8.10 Analysis of the PartitionedPairBuffer implementation
8.10.1 Capacity growth of PartitionedPairBuffer
8.10.2 Insertion into PartitionedPairBuffer
8.10.3 The PartitionedPairBuffer iterator
8.11 External sorters
8.11.1 ExternalSorter in detail
8.11.2 ShuffleExternalSorter in detail
8.12 Shuffle manager
8.12.1 ShuffleWriter in detail
8.12.2 ShuffleBlockFetcherIterator in detail
8.12.3 BlockStoreShuffleReader in detail
8.12.4 SortShuffleManager in detail
8.13 Shuffle combinations on the map side and the reduce side
8.14 Summary
Chapter 9 Deployment Modes
9.1 Heartbeat receiver: HeartbeatReceiver
9.2 Analysis of the Executor implementation
9.2.1 Executor heartbeat reporting
9.2.2 Running Tasks
9.3 The local deployment mode
9.4 Persistence engine: PersistenceEngine
9.4.1 File-system-based persistence engine
9.4.2 ZooKeeper-based persistence engine
9.5 Leader election agent
9.6 Master in detail
9.6.1 Starting the Master
9.6.2 Checking for Worker timeouts
9.6.3 Handling election as leader
9.6.4 First-level resource scheduling
9.6.5 Registering Workers
9.6.6 Updating the latest state of a Worker
9.6.7 Handling Worker heartbeats
9.6.8 Registering Applications
9.6.9 Handling Executor requests
9.6.10 Handling Executor state changes
9.6.11 Common methods of Master
9.7 Worker in detail
9.7.1 Starting the Worker
9.7.2 Registering the Worker with the Master
9.7.3 Sending heartbeats to the Master
9.7.4 Worker and leader election
9.7.5 Running the Driver
9.7.6 Running Executors
9.7.7 Handling Executor state changes
9.8 Implementation of StandaloneAppClient
9.8.1 Analysis of the ClientEndpoint implementation
9.8.2 Analysis of the StandaloneAppClient implementation
9.9 Analysis of the StandaloneSchedulerBackend implementation
9.9.1 Properties of StandaloneSchedulerBackend
9.9.2 Analysis of the DriverEndpoint implementation
9.9.3 Starting StandaloneSchedulerBackend
9.9.4 Stopping StandaloneSchedulerBackend
9.9.5 StandaloneSchedulerBackend and resource allocation
9.10 CoarseGrainedExecutorBackend in detail
9.10.1 The CoarseGrainedExecutorBackend process
9.10.2 Functional analysis of CoarseGrainedExecutorBackend
9.11 The local-cluster deployment mode
9.11.1 Starting the local cluster
9.11.2 Startup process of the local-cluster deployment mode
9.11.3 Executor allocation in the local-cluster deployment mode
9.11.4 Task submission and execution in the local-cluster deployment mode
9.12 The Standalone deployment mode
9.12.1 Startup process of the Standalone deployment mode
9.12.2 Executor allocation in the Standalone deployment mode
9.12.3 Resource reclamation in the Standalone deployment mode
9.12.4 Fault-tolerance mechanism of the Standalone deployment mode
9.13 Other deployment options
9.13.1 YARN
9.13.2 Mesos
9.14 Summary
Chapter 10 Spark API
10.1 Basic concepts
10.2 Data source: DataSource
10.2.1 DataSourceRegister in detail
10.2.2 DataSource in detail
10.3 Implementation of checkpoints
10.3.1 Implementation of CheckpointRDD
10.3.2 Implementation of RDDCheckpointData
10.3.3 Implementation of ReliableRDDCheckpointData
10.4 RDD revisited
10.4.1 Transformation API
10.4.2 Action API
10.4.3 Analysis of the checkpoint API implementation
10.4.4 Iterative computation
10.5 Data collection: Dataset
10.6 DataFrameReader in detail
10.7 SparkSession in detail
10.7.1 The SparkSession builder: Builder
10.7.2 The SparkSession API
10.8 Word count example
10.8.1 Job preparation phase
10.8.2 Job submission and scheduling
10.9 Summary
Appendix