  Fabric / FAB-15557

Peer gossip status keeps going online/offline although those peers are online


Details

    • (Please add steps to reproduce)

    Description

      We have two organizations in a channel; let's call them OrgA and OrgB, and the channel ChannelA.
      OrgA has 4 peers and OrgB has 2 peers.
      After OrgB's two peers joined the channel, everything seemed to work fine.
      However, once we ran "docker stop" and then "docker start" on both of OrgB's peers, OrgA's Peer1/2/3/4 started producing the errors shown in the log excerpt below (we stopped the Docker containers to simulate the case where OrgB's network is somehow disconnected).
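      For reference, this is roughly the restart we performed; a minimal sketch, and the container names here are placeholders rather than our real OrgB container names:

        # stop both OrgB peer containers, then bring them back up
        docker stop peer1-orgb peer2-orgb
        docker start peer1-orgb peer2-orgb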

      -------- log starts:
      2019-05-26T15:01:51.051024427Z 2019-05-26 15:01:51.050 UTC [gossip.comm] func1 -> WARN 36e OrgB.PEER1.IP:7051 , PKIid:ba3542dce022011a600b6867ce22750dae2f4c5d2fcb9ed62cc8d31a3047eca8 isn't responsive: EOF
      2019-05-26T15:01:51.051241455Z 2019-05-26 15:01:51.051 UTC [gossip.discovery] expireDeadMembers -> WARN 36f Entering [ba3542dce022011a600b6867ce22750dae2f4c5d2fcb9ed62cc8d31a3047eca8]
      2019-05-26T15:01:51.051253492Z 2019-05-26 15:01:51.051 UTC [gossip.discovery] expireDeadMembers -> WARN 370 Closing connection to Endpoint: OrgB.PEER1.IP:7051 , InternalEndpoint: , PKI-ID: ba3542dce022011a600b6867ce22750dae2f4c5d2fcb9ed62cc8d31a3047eca8, Metadata:
      2019-05-26T15:01:51.051258827Z 2019-05-26 15:01:51.051 UTC [gossip.discovery] expireDeadMembers -> WARN 371 Exiting

      2019-05-26T15:05:41.170617434Z 2019-05-26 15:05:41.170 UTC [gossip.channel] reportMembershipChanges -> INFO 2bd Membership view has changed. peers went offline: [[OrgB.PEER1.IP:7051 ]] , current view: [[OrgB.PEER2.IP:7051 ] [OrgA.PEER1.IP:7051] [OrgA.PEER2.IP:7051] [OrgA.PEER3.IP:7051]]
      2019-05-26T15:05:41.399609880Z 2019-05-26 15:05:41.399 UTC [comm.grpc.server] 1 -> INFO 2be unary call completed grpc.service=gossip.Gossip grpc.method=Ping grpc.request_deadline=2019-05-26T15:05:43.399Z grpc.peer_address=52.79.240.46:45326 grpc.code=OK grpc.call_duration=105.988µs
      2019-05-26T15:05:46.170562199Z 2019-05-26 15:05:46.170 UTC [gossip.channel] reportMembershipChanges -> INFO 2bf Membership view has changed. peers went online: [[OrgB.PEER1.IP:7051 ]] , current view: [[OrgB.PEER1.IP:7051] [OrgB.PEER2.IP:7051] [OrgA.PEER1.IP:7051] [OrgA.PEER2.IP:7051] [OrgA.PEER3.IP:7051]]

      ------- log ended.

      It keeps saying OrgB's Peer1 and Peer2 are going online and offline, roughly every minute, and it never stops producing this error, yet those OrgB peers appear to be working fine. We confirmed that OrgB's Peer1 and Peer2 are still joined to the channel and that ledger sync / private data dissemination works just fine.

      We tried completely removing OrgB's Peer1 and Peer2, recreating the peers, and rejoining the channel (with the usual fetch/join commands, sketched below), and we still see those warnings even though the peers joined just fine.
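      A minimal sketch of the rejoin we performed on each recreated OrgB peer; the channel block file name and orderer address here are placeholders, not our real values:

        # fetch the genesis block of ChannelA and join the recreated peer to it
        peer channel fetch 0 channela.block -c channela -o orderer.example.com:7050
        peer channel join -b channela.block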

      As long as OrgB's peers keep working despite those warning/error messages, we could live with it. However, the most critical part is this:

      After we started seeing this online/offline issue with OrgB's Peer1 and Peer2, roughly 5 hours later (not an exact time; sometimes it happens at random), we observe that in OrgA a random peer gets isolated from the other peers: that peer's current membership view is empty and it can no longer receive private data over the gossip protocol. Since every peer in OrgA is a leader, the ledger height stays fine, but because the isolated peer cannot see the other peers, private data is no longer disseminated to it. If we restart the isolated peer's Docker container, it works fine for a while, but after several more hours another random peer (or peers) gets isolated, and this happens again and again.

      This is our configuration for both OrgA and OrgB:

      • We set anchor peers for both OrgA and OrgB.
      • We are using the Kafka orderer.
      • The gossip bootstrap setting is configured in both orgs.
      • Every peer in OrgA is a static leader, while OrgB uses dynamic leader election (a sketch of the relevant gossip settings follows this list).
      • We confirmed that both orgs joined the channel successfully, and both the ledger height and the private data stay in sync until the isolation happens.
      • Peer docker image version: 1.4.1
      • CouchDB docker image version: 0.4.15
      • Peers in OrgA are registered/enrolled with Fabric CA1
      • Peers in OrgB are registered/enrolled with Fabric CA2
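      For clarity, here is a minimal sketch of the leader and bootstrap setup described above, expressed with the standard CORE_PEER_GOSSIP_* environment overrides; the peer addresses are placeholders, not our real endpoints:

        # OrgA peers: static leader, no election
        CORE_PEER_GOSSIP_USELEADERELECTION=false
        CORE_PEER_GOSSIP_ORGLEADER=true
        CORE_PEER_GOSSIP_BOOTSTRAP=peer1.orga.example.com:7051

        # OrgB peers: dynamic leader election
        CORE_PEER_GOSSIP_USELEADERELECTION=true
        CORE_PEER_GOSSIP_ORGLEADER=false
        CORE_PEER_GOSSIP_BOOTSTRAP=peer1.orgb.example.com:7051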

      Someone else is reporting a similar issue: https://lists.hyperledger.org/g/fabric/message/6003

      I strongly suspect this is related to the multi-organization setup; in a single-organization setting this error is never observed.

      If we stop/restart one of OrgA's peers, nothing goes wrong. Everything is fine.

      It only happens when we stop/restart one of OrgB's peers; then the OrgA peers start producing the error messages and the isolation of OrgA peers follows.

      Attachments

        1. gossip_on_off_debug.log
          86 kB
        2. logs.zip
          4.55 MB
        3. org1.peer1.log
          241 kB
        4. org1.peer2.log
          183 kB
        5. org2.peer1.log
          4.71 MB
        6. org2.peer2.log
          4.12 MB

            People

              Assignee: Unassigned
              Reporter: Jason Kim (jooskim1)
              Votes: 2
              Watchers: 12
