1.3. Pacemaker Architecture

At the highest level, the cluster is made up of three pieces:

Non-cluster-aware components (illustrated in green): These pieces include the resources themselves, the scripts that start, stop and monitor them, and a local daemon that masks the differences between the standards these scripts implement.
Resource management: Pacemaker provides the brain (illustrated in blue) that processes and reacts to events regarding the cluster. These events include nodes joining or leaving the cluster; resource events caused by failures, maintenance or scheduled activities; and other administrative actions. After any of these events, Pacemaker computes the ideal state of the cluster and plots a path to achieve it. This may include moving resources, stopping nodes and even forcing them offline with remote power switches.
Low-level infrastructure: Corosync provides reliable messaging, membership and quorum information about the cluster (illustrated in red).

Conceptual overview of the cluster stack

Figure 1.1. Conceptual Stack Overview

When combined with Corosync, Pacemaker also supports popular open source cluster filesystems. ^[3]

Due to recent standardization within the cluster filesystem community, these filesystems share a common distributed lock manager, which uses Corosync for its messaging capabilities and Pacemaker for its membership (which nodes are up/down) and fencing services.

The Pacemaker stack when running on Corosync

Figure 1.2. The Pacemaker Stack

1.3.1. Internal Components

Pacemaker itself is composed of four key components:

CIB (aka. Cluster Information Base)
CRMd (aka. Cluster Resource Management daemon)
PEngine (aka. PE or Policy Engine)
STONITHd

Subsystems of a Pacemaker cluster running on Corosync

Figure 1.3. Internal Components

The CIB uses XML to represent both the cluster’s configuration and current state of all resources in the cluster. The contents of the CIB are automatically kept in sync across the entire cluster and are used by the PEngine to compute the ideal state of the cluster and how it should be achieved.
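To make the idea of an XML document holding both configuration and status more concrete, the following is a minimal sketch in Python. The XML fragment is hand-written and heavily simplified for illustration (a real CIB carries many more elements and attributes), and the node names, the resource, and the `summarize_cib` helper are assumptions made for this example, not part of Pacemaker.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written CIB-like fragment (illustrative only).
# It keeps the split described above: a configuration section and a
# status section inside a single XML document.
CIB_XML = """
<cib>
  <configuration>
    <nodes>
      <node id="1" uname="node-a"/>
      <node id="2" uname="node-b"/>
    </nodes>
    <resources>
      <primitive id="ClusterIP" class="ocf" provider="heartbeat" type="IPaddr2"/>
    </resources>
  </configuration>
  <status>
    <node_state id="1" uname="node-a"/>
  </status>
</cib>
"""


def summarize_cib(xml_text):
    """Return configured nodes, configured resources, and nodes with recorded status."""
    root = ET.fromstring(xml_text)
    nodes = [n.get("uname") for n in root.findall("./configuration/nodes/node")]
    resources = [p.get("id") for p in root.findall("./configuration/resources/primitive")]
    status_for = [s.get("uname") for s in root.findall("./status/node_state")]
    return {"nodes": nodes, "resources": resources, "status_for": status_for}


summary = summarize_cib(CIB_XML)
print(summary)
```

Reading both sections from one document mirrors why the PEngine can work from the CIB alone: what should exist and what is currently known about each node sit side by side.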

This list of instructions is then fed to the DC (Designated Co-ordinator). Pacemaker centralizes all cluster decision making by electing one of the CRMd instances to act as a master. Should the elected CRMd process, or the node it is on, fail, a new one is quickly established.

The DC carries out the PEngine’s instructions in the required order by passing them to either the LRMd (Local Resource Management daemon) or CRMd peers on other nodes via the cluster messaging infrastructure (which in turn passes them on to their LRMd process).

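The peer nodes all report the results of their operations back to the DC. Based on the difference between the expected and actual results, the DC either executes the next pending action, or aborts processing and asks the PEngine to recalculate the ideal cluster state based on the unexpected results.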
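The execute-compare-abort loop described above can be sketched as follows. This is a simplified illustration, not Pacemaker's implementation: it assumes the instructions form a plain ordered list, and the `run_transition` function, the action names, and the return-code convention (0 meaning success) are all hypothetical.

```python
def run_transition(actions, execute):
    """Run actions in order, stopping at the first unexpected result.

    actions: list of (action_name, expected_result) pairs.
    execute: callable taking an action name and returning the actual result
             (a stand-in for a node reporting back to the DC).
    """
    completed = []
    for name, expected in actions:
        actual = execute(name)
        if actual != expected:
            # Unexpected result: abort and signal that the ideal
            # cluster state must be recalculated.
            return {"completed": completed, "aborted_at": name, "recalculate": True}
        completed.append(name)
    return {"completed": completed, "aborted_at": None, "recalculate": False}


# Hypothetical plan: move a resource from node-a to node-b.
plan = [("stop ClusterIP on node-a", 0), ("start ClusterIP on node-b", 0)]

# Simulated reports: the stop succeeds, the start fails with code 1.
reported = {"stop ClusterIP on node-a": 0, "start ClusterIP on node-b": 1}

outcome = run_transition(plan, lambda name: reported[name])
print(outcome)
```

In this run the first action completes, the second reports a result that differs from the expected one, and the loop stops with `recalculate` set, which corresponds to handing the situation back to the PEngine.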

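In some cases, it may be necessary to power off nodes in order to protect the integrity of shared data or to fully recover resources. For this, Pacemaker comes with STONITHd. STONITH is an acronym for Shoot-The-Other-Node-In-The-Head and is usually implemented with a remote power switch. In Pacemaker, STONITH devices are modeled as resources (and configured in the CIB) so that they can be easily monitored; however, STONITHd takes care of understanding the STONITH topology, so that when one of its clients requests that a node be fenced, it restarts that machine. (Translator's note: that is, different STONITH device drivers may interpret the same request differently; this is all defined in the drivers.)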
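The separation between a client that simply asks for a node to be fenced and a daemon that knows which devices to use can be sketched as below. This is only an illustration under stated assumptions: the `TOPOLOGY` table, the device names, the `fence_node` function, and the rule that levels are tried in order with every device in a level required to succeed are choices made for this example, not a description of STONITHd's internals.

```python
# Hypothetical topology: node name -> ordered fencing levels,
# each level being a list of device names.
TOPOLOGY = {
    "node-b": [["ipmi-b"], ["pdu-1", "pdu-2"]],
}


def fence_node(node, topology, fire_device):
    """Fence a node on behalf of a client that only names the target.

    fire_device: callable (device, node) -> bool, a stand-in for a device driver.
    Returns the 1-based level that succeeded, or None if no level did.
    """
    for level_index, devices in enumerate(topology.get(node, []), start=1):
        # Assumption: a level succeeds only if every device in it succeeds.
        if all(fire_device(device, node) for device in devices):
            return level_index
    return None


# Simulated drivers: the IPMI device is unreachable, both PDUs respond.
result = fence_node("node-b", TOPOLOGY, lambda device, node: device != "ipmi-b")
print(result)
```

The caller never mentions a device; it passes only the node name, and the lookup decides how the request is carried out.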

^[3] Even though Pacemaker also supports Heartbeat, the filesystems need to use the stack for messaging and membership, and Corosync seems to be what they’re standardizing on. Technically it would be possible for them to support Heartbeat as well; however, there seems to be little interest in this.