A Study of the Heartbeat Mechanism in a Ceph Cluster

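This post studies the heartbeat mechanism in a Ceph cluster, based on Ceph 0.94.5.
In Ceph, heartbeats are implemented with pings and serve as the cluster's failure-detection method. There are two kinds: heartbeats between OSDs, and heartbeats between OSDs and monitors. Let's take a closer look.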

Heartbeats between OSDs

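An OSD exchanges heartbeats with two kinds of peers:
1. Its neighboring OSDs. "Neighboring" is defined by OSD ID: the nearest live OSD before it and the nearest live OSD after it. By default the heartbeat interval is 6s.
2. The OSDs associated with the PGs that it hosts.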
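
The neighbor rule can be illustrated with a small sketch. The helper neighbor_osds below is hypothetical and is not Ceph's API; it also assumes that the search wraps around at both ends of the ID range, which the text above does not state.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <utility>

// Hypothetical helper, not Ceph's API: given the ids of the OSDs that are
// currently up and our own id, return {previous, next} up OSD by id.
// Assumption: the search wraps around at both ends of the id range.
// Returns {-1, -1} when no other OSD is up.
std::pair<int, int> neighbor_osds(const std::set<int>& up_osds, int whoami) {
  assert(whoami >= 0);
  std::set<int> others(up_osds);
  others.erase(whoami);
  if (others.empty())
    return {-1, -1};

  // Next: first id greater than ours, else wrap to the smallest id.
  auto next_it = others.upper_bound(whoami);
  int next = (next_it != others.end()) ? *next_it : *others.begin();

  // Previous: last id smaller than ours, else wrap to the largest id.
  auto lb = others.lower_bound(whoami);
  int prev = (lb != others.begin()) ? *std::prev(lb) : *others.rbegin();

  return {prev, next};
}
```

For example, with up OSDs {0, 1, 3, 5}, osd.3 would pick osd.1 and osd.5 as neighbors; osd.2 and osd.4 are skipped because they are not alive.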

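If an OSD receives no heartbeat from a neighboring OSD within the 20s grace period, it considers that neighbor down and reports it to the monitor.
If two OSDs on different hosts report the same OSD as down, the monitor accepts that the OSD is down.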
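
The two rules above (grace timeout on the OSD side, two reporters from different hosts on the monitor side) can be sketched as follows. This is a simplified model, not the monitor's actual code: peer_timed_out and FailureTally are hypothetical names, and times are plain seconds rather than utime_t.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Grace period in seconds, taken from the text above.
const double kHeartbeatGrace = 20.0;

// OSD side: a peer is treated as down when no heartbeat has arrived
// for longer than the grace period.
bool peer_timed_out(double last_rx, double now, double grace = kHeartbeatGrace) {
  return now - last_rx > grace;
}

// Hypothetical monitor-side tally: target OSD id -> hosts that reported it.
class FailureTally {
 public:
  // Record one failure report. Returns true once reporters from at least
  // min_hosts distinct hosts have reported the same target OSD.
  bool report(int target_osd, const std::string& reporter_host,
              std::size_t min_hosts = 2) {
    std::set<std::string>& hosts = reports_[target_osd];
    hosts.insert(reporter_host);  // repeated reports from one host count once
    return hosts.size() >= min_hosts;
  }

 private:
  std::map<int, std::set<std::string>> reports_;
};
```

Keying the tally by host rather than by reporting OSD is what keeps a single faulty host from getting a healthy OSD marked down on its own.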

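The update_heartbeat_peers function below shows the second way of choosing peers: every OSD that appears in a valid acting set or up set is added to the peers. I did not fully understand how the entries of map<pg_shard_t,pg_info_t> are added, so that part needs further study.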

void PG::update_heartbeat_peers()
{
  assert(is_locked());

  // Only the primary collects peers; on a replica new_peers stays empty.
  set<int> new_peers;
  if (is_primary()) {
    // Every valid OSD in the acting set.
    for (unsigned i=0; i<acting.size(); i++) {
      if (acting[i] != CRUSH_ITEM_NONE)
        new_peers.insert(acting[i]);
    }
    // Every valid OSD in the up set.
    for (unsigned i=0; i<up.size(); i++) {
      if (up[i] != CRUSH_ITEM_NONE)
        new_peers.insert(up[i]);
    }
    // The OSD id of every shard that has an entry in peer_info.
    for (map<pg_shard_t,pg_info_t>::iterator p = peer_info.begin();
         p != peer_info.end();
         ++p)
      new_peers.insert(p->first.osd);
  }

  // Swap in the new set under the lock, and only if it actually changed.
  bool need_update = false;
  heartbeat_peer_lock.Lock();
  if (new_peers == heartbeat_peers) {
    dout(10) << "update_heartbeat_peers " << heartbeat_peers << " unchanged" << dendl;
  } else {
    dout(10) << "update_heartbeat_peers " << heartbeat_peers << " -> " << new_peers << dendl;
    heartbeat_peers.swap(new_peers);
    need_update = true;
  }
  heartbeat_peer_lock.Unlock();

  // Tell the OSD to refresh its heartbeat peers after the lock is released.
  if (need_update)
    osd->need_heartbeat_peer_update();
}
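
To experiment with the peer-selection logic outside of Ceph, here is a standalone sketch of the same idea. The names collect_peers, swap_if_changed and kItemNone are made up for illustration; peer_info is reduced to a set of OSD ids, and locking and logging are left out.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Stand-in for CRUSH_ITEM_NONE; the exact value is an assumption of this sketch.
const int kItemNone = 0x7fffffff;

// Union of the valid acting OSDs, the valid up OSDs and the OSDs known
// from peer_info. A non-primary PG gets an empty set, as in the function above.
std::set<int> collect_peers(bool is_primary,
                            const std::vector<int>& acting,
                            const std::vector<int>& up,
                            const std::set<int>& peer_info_osds) {
  std::set<int> peers;
  if (!is_primary)
    return peers;
  for (int osd : acting)
    if (osd != kItemNone)
      peers.insert(osd);
  for (int osd : up)
    if (osd != kItemNone)
      peers.insert(osd);
  peers.insert(peer_info_osds.begin(), peer_info_osds.end());
  return peers;
}

// Replace current with fresh only when they differ.
// Returns true when a swap happened, i.e. when an update is needed.
bool swap_if_changed(std::set<int>& current, std::set<int>& fresh) {
  if (current == fresh)
    return false;
  current.swap(fresh);
  return true;
}
```

Using a set means an OSD that appears in both the acting set and the up set is counted once, which is why the original code can insert from each source without checking for duplicates.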

Heartbeats between OSDs and monitors

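If an OSD cannot peer with any other OSD, it pings the monitor every 30s and fetches the latest cluster map.
If an OSD does not report to the monitor on its own, the monitor considers it down after mon_osd_report_timeout. Two related settings are osd_mon_report_interval_min and osd_mon_report_interval_max: once interval_max is reached, the OSD reports to the monitor whether or not anything has changed.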

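An OSD proactively reports to the monitor in the following cases:
a failure occurs, PG state changes, up_thru changes, or the OSD booted within the last 5s.
When an OSD finds that its set of peers is empty, it also sends a heartbeat to the monitor and fetches a new osdmap (osdmap_subscribe).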
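
The reporting rules can be summarized in a small sketch. This is only a model of the behavior described above, not Ceph code: should_report and monitor_marks_down are hypothetical names, times are plain seconds, and treating interval_min as a lower bound between two reports is an assumption.

```cpp
#include <cassert>

// OSD side: decide whether to report to the monitor now.
// - Once interval_max has elapsed, report even if nothing changed.
// - Otherwise report only when there is a reportable event (failure,
//   PG state change, up_thru change, recent boot), and, as an assumption
//   of this sketch, not more often than every interval_min seconds.
bool should_report(double now, double last_report, bool has_reportable_event,
                   double interval_min, double interval_max) {
  assert(interval_min <= interval_max);
  double elapsed = now - last_report;
  if (elapsed >= interval_max)
    return true;
  return has_reportable_event && elapsed >= interval_min;
}

// Monitor side: an OSD that has been silent for longer than the report
// timeout is considered down.
bool monitor_marks_down(double now, double last_report, double report_timeout) {
  return now - last_report > report_timeout;
}
```

The interval and timeout values are passed in as parameters on purpose, so the sketch does not hard-code any defaults.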

struct HeartbeatInfo {
    int peer;           ///< peer
    ConnectionRef con_front;   ///< peer connection (front)
    ConnectionRef con_back;    ///< peer connection (back)
    utime_t first_tx;   ///< time we sent our first ping request
    utime_t last_tx;    ///< last time we sent a ping request
    utime_t last_rx_front;  ///< last time we got a ping reply on the front side
    utime_t last_rx_back;   ///< last time we got a ping reply on the back side
    epoch_t epoch;      ///< most recent epoch we wanted this peer
   ......
}

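In the HeartbeatInfo struct, con_front is the front connection, that is, the connection over the public network, and con_back is the back connection, that is, the connection over the cluster network. When an OSD sends a heartbeat ping, it always sends it over the back connection, and if the front connection is valid it sends it over the front connection as well. In other words, under normal conditions two pings are sent at the same time.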

i->second.con_back->send_message(new MOSDPing(monc->get_fsid(),
                                              service.get_osdmap()->get_epoch(),
                                              MOSDPing::PING,
                                              now));

if (i->second.con_front)
  i->second.con_front->send_message(new MOSDPing(monc->get_fsid(),
                                                 service.get_osdmap()->get_epoch(),
                                                 MOSDPing::PING,
                                                 now));
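
The dual-ping behavior can be tried out with mock connections. FakeConnection, PeerLinks and send_ping below are made-up stand-ins for illustration; they only mimic the shape of the snippet above and are not Ceph's Connection or MOSDPing classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Mock connection that just records what was sent on it.
struct FakeConnection {
  std::vector<std::string> sent;
  void send_message(const std::string& msg) { sent.push_back(msg); }
};
using FakeConnectionRef = std::shared_ptr<FakeConnection>;

// Reduced stand-in for the two connection fields of HeartbeatInfo.
struct PeerLinks {
  FakeConnectionRef con_front;  // public network, may be null
  FakeConnectionRef con_back;   // cluster network, always present
};

// Send one ping on the back connection and, if the front connection
// exists, a second ping on the front. Returns the number of pings sent.
int send_ping(PeerLinks& peer, const std::string& msg) {
  assert(peer.con_back != nullptr);
  int sent = 0;
  peer.con_back->send_message(msg);
  ++sent;
  if (peer.con_front) {
    peer.con_front->send_message(msg);
    ++sent;
  }
  return sent;
}
```

With both connections set, send_ping returns 2 and each mock records one message; with a null con_front it returns 1 and only the back connection is used.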

To be continued...
September 11, 2017, 01:32:24