A Study of the Heartbeat Mechanism in Ceph Clusters, Part 3

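The earlier parts looked at the heartbeat mechanism and the failure detection mechanism of a Ceph cluster. So what does the path from a heartbeat to a failure verdict look like? It spans several source files and the function calls are nested many levels deep, so it is not easy to follow, which is why I left it for last.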

From "heartbeat check no reply" to the final OSD failure verdict

The heartbeat logic is implemented in the source file OSD.cc.

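OSD::heartbeat()
(send heartbeats) -> OSD::heartbeat_check()
...
(some intermediate steps omitted)
OSD::send_failures() -> computes failed_for from failure_queue -> wraps failed_for in an MOSDFailure message (tagged with the message type MSG_OSD_FAILURE) and sends it to the mon. failed_for is how long the OSD has been considered failed. The call looks like this: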
monc->send_mon_message(new MOSDFailure(monc->get_fsid(), i, failed_for,
					     osdmap->get_epoch()));
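
To make the failed_for calculation concrete, here is a minimal standalone sketch. It is not the Ceph code: FailureReport and drain_failure_queue are hypothetical names I made up, and I am assuming that failure_queue maps an OSD id to the time at which the failure was first noticed, with failed_for being the elapsed time truncated to whole seconds.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for the fields an MOSDFailure message carries.
struct FailureReport {
  int target_osd;   // id of the OSD being reported
  int failed_for;   // whole seconds the peer has been unresponsive
  unsigned epoch;   // osdmap epoch the report is based on
};

// Drain the queue (assumed layout: osd id -> time the failure was first
// noticed, in seconds) and build one report per queued OSD.
std::vector<FailureReport> drain_failure_queue(
    std::map<int, double>& failure_queue, double now, unsigned epoch) {
  std::vector<FailureReport> reports;
  for (const auto& entry : failure_queue) {
    assert(now >= entry.second);  // a failure cannot start in the future
    // Truncate the elapsed time to whole seconds.
    int failed_for = static_cast<int>(now - entry.second);
    reports.push_back({entry.first, failed_for, epoch});
  }
  failure_queue.clear();  // every queued failure has now been reported
  return reports;
}
```

In the real OSD, each report would be handed to monc->send_mon_message() instead of being collected in a vector.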

The message is then handled in the source file OSDMonitor.cc.

OSDMonitor::prepare_update() -> case MSG_OSD_FAILURE. It takes an op argument and then enters a switch on the message type, as follows:
bool OSDMonitor::prepare_update(MonOpRequestRef op)
{
  op->mark_osdmon_event(__func__);
  PaxosServiceMessage *m = static_cast<PaxosServiceMessage*>(op->get_req());
  dout(7) << "prepare_update " << *m << " from " << m->get_orig_source_inst() << dendl;

  switch (m->get_type()) {
    // damp updates
  case MSG_OSD_MARK_ME_DOWN:
    return prepare_mark_me_down(op);
  case MSG_OSD_FAILURE:
    return prepare_failure(op);
...
}

Finally, it enters the following two stages of failure determination:

OSDMonitor::prepare_failure()
OSDMonitor::check_failure()
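
The two stages can be pictured with a simplified model. This is only a sketch under my own assumptions, not the monitor's actual code: FailureTracker, record_report and should_mark_down are hypothetical names, and the rule used here (the longest reported failed_for reaches a grace period, and enough distinct OSDs have reported the target) is a simplification of what the real check does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>

// Simplified, hypothetical model of the two stages; not Ceph's real types.
struct FailureInfo {
  std::set<int> reporters;  // ids of the OSDs that reported this target
  int max_failed_for = 0;   // longest failed_for among the reports
};

class FailureTracker {
 public:
  FailureTracker(int grace_seconds, std::size_t min_reporters)
      : grace_(grace_seconds), min_reporters_(min_reporters) {}

  // Stage 1 (plays the role of prepare_failure): record one report.
  void record_report(int target, int reporter, int failed_for) {
    assert(failed_for >= 0);
    FailureInfo& fi = pending_[target];
    fi.reporters.insert(reporter);  // duplicate reporters count once
    if (failed_for > fi.max_failed_for) fi.max_failed_for = failed_for;
  }

  // Stage 2 (plays the role of check_failure): decide whether the target
  // has been down long enough and was reported by enough distinct OSDs.
  bool should_mark_down(int target) const {
    auto it = pending_.find(target);
    if (it == pending_.end()) return false;  // nobody reported it
    return it->second.max_failed_for >= grace_ &&
           it->second.reporters.size() >= min_reporters_;
  }

 private:
  int grace_;                  // assumed grace period, in seconds
  std::size_t min_reporters_;  // assumed minimum number of reporters
  std::map<int, FailureInfo> pending_;
};
```

With a grace of 20 seconds and a minimum of two reporters, a single report with failed_for = 25 is not enough; the target is only judged failed once a second OSD reports it as well.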

December 20, 2017