Fabric / FAB-18244

Prevent intermittent WAL snapshot corruption that keeps an orderer from starting.


Details

      Given my understanding of the problem, the only way to reproduce is to get into a state where the WAL file is large and being persisted to disk, then keep bouncing the orderer until you hit this case. Yellick may have better steps though.

    Description

      We had a user with a single-node ordering service whose orderer crashed and would not come back up. After debugging the issue, Yellick root-caused it as follows:
      The issue is, I think, that the commit process looks like this (see the sketch after the list):

      1. Raft tells us "this is the set of stuff that's been consented on"
      2. We persist that stuff to disk
      3. If our WAL is getting too big, we also persist a snapshot to disk
      4. We apply the stuff to our blockchain.
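
      A minimal, self-contained Go sketch of this ordering makes the hazard concrete. All names here (entry, store, commit, walLimit) are hypothetical stand-ins for illustration, not Fabric's actual etcdraft code. If the process dies between the snapshot write and the apply, the snapshot index runs ahead of the chain:

```go
package main

import "fmt"

// entry stands in for a raft log entry.
type entry struct{ index uint64 }

// store is a hypothetical stand-in for the orderer's persistent state
// (WAL, snapshot, blockchain); it is not Fabric's actual API.
type store struct {
	walSize       int
	snapshotIndex uint64 // highest entry index captured in a snapshot
	chainHeight   uint64 // highest entry index applied to the blockchain
}

func (s *store) persistWAL(ents []entry)    { s.walSize += len(ents) }
func (s *store) persistSnapshot(idx uint64) { s.snapshotIndex = idx; s.walSize = 0 }
func (s *store) applyToChain(ents []entry)  { s.chainHeight = ents[len(ents)-1].index }

const walLimit = 4

// commit follows steps 2-4 from the list above: persist to the WAL,
// snapshot if the WAL has grown too big, then apply to the blockchain.
func (s *store) commit(ents []entry, crashBeforeApply bool) {
	s.persistWAL(ents) // step 2
	if s.walSize > walLimit {
		s.persistSnapshot(ents[len(ents)-1].index) // step 3
	}
	if crashBeforeApply {
		return // simulated crash between steps 3 and 4
	}
	s.applyToChain(ents) // step 4
}

func main() {
	s := &store{}
	s.commit([]entry{{1}, {2}, {3}, {4}, {5}}, true)
	// The snapshot says index 5 is durable, but the chain never advanced:
	fmt.Printf("snapshot at %d, chain at %d\n", s.snapshotIndex, s.chainHeight)
}
```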

      What is happening is that we are crashing between 3 and 4: at startup we read "there's a snapshot for block n+1", but our blockchain is only at n, so we decide to go replicate before we try to start consenting. But we actually already have that entry; we just haven't committed it to the blockchain yet.
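
      The startup decision that goes wrong here is, in spirit, just a comparison. A hedged sketch (a hypothetical function, not Fabric's actual check):

```go
// needsReplication sketches the startup decision described above: if the
// latest snapshot claims an index beyond the chain height, the node assumes
// it is behind and tries to replicate, even though the "missing" entries
// are already sitting in its own WAL.
func needsReplication(snapshotIndex, chainHeight uint64) bool {
	return snapshotIndex > chainHeight
}
```

      On a single-node ordering service there is no other orderer to replicate from, which is why the orderer never came back up.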

      This Jira is to correct the problem by, we believe, flipping steps 3 and 4 in the process above (sketched below).
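
      Continuing the hypothetical sketch above (same entry and store types), the proposed ordering applies the entries before snapshotting, so a crash at any point leaves the snapshot index at or behind the chain height:

```go
// commitFixed swaps steps 3 and 4: apply to the blockchain first, then
// snapshot. A crash before the snapshot merely means the oversized WAL is
// compacted on a later pass; the snapshot can never outrun the chain.
func (s *store) commitFixed(ents []entry, crashBeforeSnapshot bool) {
	s.persistWAL(ents)   // step 2
	s.applyToChain(ents) // step 4, moved before the snapshot
	if crashBeforeSnapshot {
		return // safe: chainHeight >= snapshotIndex on restart
	}
	if s.walSize > walLimit {
		s.persistSnapshot(ents[len(ents)-1].index) // step 3, now last
	}
}
```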


            People

              guoger Jay Guo
              ptippett Paul Tippett
              Votes: 0
              Watchers: 4
