  1. Indy Node
  2. INDY-1640

Repeated demotion and promotion of non-primary nodes result in eventual consensus failure

Details

    • Bug
    • Status: Complete
    • Medium
    • Resolution: Done
    • 1.6.73
    • 1.6.79
    • test-automation

    Description

      Running the following experiment repeatedly (demote, then promote, a random non-primary node) causes a pool of 10 nodes to fall out of consensus after approximately 6 to 8 iterations:

      Configuration (reset the pool):

      • 10 node pool - Node1 through Node10
      • Node1 is the primary
      • Node2, Node3, and Node4 are initially the backup primaries
      • f_value = 3
      • Count_of_replicas = 4
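
      The f_value and Count_of_replicas figures quoted in this report follow from the pool size. A minimal sketch, assuming the usual BFT bound f = floor((n - 1) / 3) and f + 1 replicas (one master plus f backups); the function names are illustrative and are not taken from the Indy code base:

```python
def max_faulty_nodes(node_count: int) -> int:
    """Largest f such that node_count >= 3f + 1 (assumed BFT bound)."""
    if node_count < 1:
        raise ValueError("pool must contain at least one node")
    return (node_count - 1) // 3


def replica_count(node_count: int) -> int:
    """One master instance plus f backup instances (assumption)."""
    return max_faulty_nodes(node_count) + 1


# 10 nodes is the full pool; 9 nodes is the pool after one demotion.
for n in (10, 9):
    print(n, max_faulty_nodes(n), replica_count(n))
# prints "10 3 4" and then "9 2 3"
```

      Under these assumptions, a single demotion takes the pool from f = 3 to f = 2 and removes one backup instance, and the promotion restores both.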

      Steps:
      1. Guarantee that the pool is in consensus by writing a NYM to the domain ledger (a.k.a. steady state hypothesis). Allow up to 60 seconds for the operation (write to the domain ledger) to complete.
      2. Pick a non-primary node at random and demote it (set the node's services attribute to blank, ""). Expect this operation to succeed. Allow up to 120 seconds for the operation (write to the pool ledger) to complete.

      Configuration at this point:

      • 9 node pool. Node1 (the primary) is guaranteed to still be in the pool. The remaining 8 nodes depend on which node was randomly selected to be demoted.
      • Node1 is the primary
      • The set of 2 backup primaries depends on which node was randomly selected and demoted, because one of the backup primaries could have been selected.
      • f_value = 2
      • Count_of_replicas = 3

      3. Check if the pool is still in consensus by writing a NYM to the domain ledger (a.k.a. steady state hypothesis). Allow up to 60 seconds for the operation (write to the domain ledger) to complete.
      4. Promote the demoted node (a.k.a. rollback). Allow up to 20 seconds for the operation (write to the pool ledger) to complete.
      5. Sleep 10 seconds before restarting the node being promoted.
      6. Stop the indy-node service on the node being promoted.
      7. Make sure the indy-node service is stopped on the node being promoted.
      8. Start the indy-node service on the node being promoted.
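
      Steps 5 through 8 can be sketched as follows. This assumes that indy-node runs as a systemd unit and that the commands are executed on the node being promoted; the report does not say how the service was actually controlled. The names restart_promoted_node, run_local, and Runner are hypothetical:

```python
import subprocess
import time
from typing import Callable, List

SERVICE = "indy-node"  # service name as written in the steps above

Runner = Callable[[List[str]], int]  # runs a command, returns its exit code


def run_local(cmd: List[str]) -> int:
    """Default runner: execute the command on the local machine."""
    return subprocess.run(cmd, check=False).returncode


def restart_promoted_node(run: Runner = run_local,
                          settle_seconds: float = 10.0,
                          sleep: Callable[[float], None] = time.sleep) -> None:
    # Step 5: wait before restarting the node being promoted.
    sleep(settle_seconds)
    # Step 6: stop the service.
    if run(["systemctl", "stop", SERVICE]) != 0:
        raise RuntimeError("could not stop " + SERVICE)
    # Step 7: `systemctl is-active --quiet` exits 0 only while the unit is active.
    if run(["systemctl", "is-active", "--quiet", SERVICE]) == 0:
        raise RuntimeError(SERVICE + " is still running")
    # Step 8: start the service again.
    if run(["systemctl", "start", SERVICE]) != 0:
        raise RuntimeError("could not start " + SERVICE)
```

      Passing the runner as a parameter lets the same function drive a remote node (for example through an SSH wrapper) or a fake in a dry run.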

      Configuration at this point:

      • 10 node pool - Node1 through Node10
      • Node1 is the primary
      • Node2, Node3, and Node4 are likely still backup primaries, because backup primaries (replicas) are not changed unless a view change happens and a view change was not forced by stopping or demoting the primary (Node1). However, a view change could happen for other reasons. Therefore, the set of backup primaries could vary.
      • f_value = 3
      • Count_of_replicas = 4

      Repeating the above steps over and over again causes the pool to fall out of consensus within 6 to 8 iterations.
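
      The repeated experiment can be sketched as a loop over the steps above. This is an illustration only, not the test-automation code that produced the failure: PoolActions and run_experiment are hypothetical names, and the four hooks stand in for whatever ledger and service tooling is used. The timeouts are the ones given in the steps.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PoolActions:
    """Hypothetical hooks; the first three return True on success in time."""
    write_nym: Callable[[float], bool]     # steady-state check, timeout in s
    demote: Callable[[str, float], bool]   # node alias, timeout in s
    promote: Callable[[str, float], bool]  # node alias, timeout in s
    restart: Callable[[str], None]         # steps 5-8 on the given node


def run_experiment(actions: PoolActions,
                   non_primaries: Sequence[str],
                   max_iterations: int = 20,
                   rng: Optional[random.Random] = None) -> Optional[int]:
    """Return the 1-based iteration in which a step failed, or None."""
    rng = rng or random.Random()
    for iteration in range(1, max_iterations + 1):
        node = rng.choice(list(non_primaries))  # random non-primary node
        ok = (actions.write_nym(60)             # step 1: steady state
              and actions.demote(node, 120)     # step 2: demote
              and actions.write_nym(60)         # step 3: steady state
              and actions.promote(node, 20))    # step 4: promote (rollback)
        if not ok:
            return iteration
        actions.restart(node)                   # steps 5-8: restart service
    return None
```

      The return value is the iteration in which the first write, demotion, or promotion failed, which is the number this report describes as roughly 6 to 8.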

      nscapture archives from each of the 10 nodes are located here.

      If the log processor tool is used to analyze this issue, the logs from each node can be found in the log directory of that node's nscapture archive.

            People

              Unassigned Unassigned
              ckochenower Corin Kochenower
              Alexander Shcherbakov, Artem Obruchnikov, Corin Kochenower