Indy Node / INDY-1054

View Change on large pools of 19 or more nodes can cause pool to stop functioning


Details

    • Type: Bug
    • Status: Complete
    • Priority: Highest
    • Resolution: Done
    • Sprint: INDY 18.01: Stability+, Sprint 18.02 Stability

    Description

      It appears that view change on pools of 19 nodes or more can cause the pool to stop functioning.

      Setup
      I have a pool with 25 nodes.
      After adding all the nodes to the pool from a genesis of 7 nodes, I had 30 total transactions in the ledger.

      Test
      I set the load scripts to run transactions every 6 to 30 seconds for an hour.
      I have 25 clients across 5 machines, each sending bursts of 10 transactions every 6 to 30 seconds. Each client will send 1400 transactions in total, which should create activity on the pool for about an hour.
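As a rough check of the "about an hour" figure, here is a minimal sketch of the arithmetic for a single client. The numbers come from the test description above; the assumptions are mine: one burst per interval, the time spent sending a burst is ignored, and the average interval is taken as the midpoint of 6 and 30 seconds. The names are illustrative only and are not part of the load scripts.

```python
# Back-of-the-envelope run time for ONE load client.
# Assumption: one 10-transaction burst per interval, send time ignored.
TOTAL_TXNS_PER_CLIENT = 1400
BURST_SIZE = 10
MIN_GAP_S, MAX_GAP_S = 6, 30


def run_minutes(gap_seconds: float) -> float:
    """Minutes needed to send every burst with a fixed gap between bursts."""
    bursts = TOTAL_TXNS_PER_CLIENT // BURST_SIZE
    return bursts * gap_seconds / 60


bursts = TOTAL_TXNS_PER_CLIENT // BURST_SIZE          # bursts per client
shortest = run_minutes(MIN_GAP_S)                     # every gap is 6 s
longest = run_minutes(MAX_GAP_S)                      # every gap is 30 s
average = run_minutes((MIN_GAP_S + MAX_GAP_S) / 2)    # midpoint gap, 18 s

print(f"{bursts} bursts: {shortest:.0f} to {longest:.0f} min, "
      f"about {average:.0f} min at the midpoint")
```

Under these assumptions a client finishes in somewhere between 14 and 70 minutes, so "about an hour" is a plausible upper-range estimate rather than an exact duration.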

      Issue
      After 524 transactions had been sent, the pool stopped accepting new transactions. Every node matched the other nodes, and the log files did not contain any errors. I did see in some logs that a view change was being requested, so I believe this has to do with view change on a large pool.

      Logs and Screenshots

      1. Attached are the smaller logs and the current ones from when the nodes failed.
      2. Also attached is a screenshot of a search I did in the logs. I searched for "proposed_view_change". Only 12 of the 25 nodes have a proposed view change, and it looks like Node 7 was going to run a different view change, ID 991 rather than the ID 992 that the others show.
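The log search described above could be repeated with a small script instead of a screenshot. The following is only a sketch: the regular expression is modelled on the single `_schedule` line quoted from Node12 in this report, and the helper name `tally_view_change_ids` is hypothetical. Other builds may format the line differently, so the pattern would need adjusting.

```python
import re
from collections import defaultdict

# Pattern modelled on the Node12 "_schedule" line quoted in this report.
# Assumption: every node logs the action in the same format.
PATTERN = re.compile(
    r"(Node\d+) scheduling action propose_view_change with id (\d+)"
)


def tally_view_change_ids(lines):
    """Group node names by the id they scheduled propose_view_change with."""
    by_id = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            by_id[int(match.group(2))].add(match.group(1))
    return {action_id: sorted(nodes) for action_id, nodes in by_id.items()}


# Two lines taken from the excerpts in this report; only the first matches.
sample = [
    "| has_action_queue.py  (  36) | _schedule | Node12 scheduling action "
    "propose_view_change with id 992 to run in 2 seconds",
    "| lost_master_primary | Node14 scheduling a view change in 2 sec",
]
result = tally_view_change_ids(sample)
print(result)
```

Feeding it the lines of every node's log would show at a glance how many nodes scheduled the action and whether any node used a different id from the rest.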

      Some of what I see in the logs around the same timestamp is the following:
      Node 12

      | has_action_queue.py  (  36) | _schedule | Node12 scheduling action propose_view_change with id 992 to run in 2 seconds
      | checkInstances | Node12 choosing to start election on the basis of count 19 and nodes {'Node20', 'Node1', 'Node5', 'Node19', 'Node24', 'Node7', 'Node14', 'Node6', 'Node9', 'Node21', 'Node4', 'Node16', 'Node17', 'Node18', 'Node25', 'Node3', 'Node8', 'Node22'}
      

      Node 14

      | onConnsChanged | Node14 lost connection to primary of master
      | lost_master_primary | Node14 scheduling a view change in 2 sec
      

      Node 15

      | set_status | Node15 changing status from started to started_hungry
      | checkInstances | Node15 choosing to start election on the basis of count 24 and nodes {'Node23', 'Node5', 'Node7', 'Node6', 'Node14', 'Node24', 'Node18', 'Node12', 'Node2', 'Node20', 'Node10', 'Node22', 'Node4', 'Node21', 'Node8', 'Node25', 'Node9', 'Node17', 'Node19', 'Node16', 'Node3', 'Node11', 'Node1'}
      

      Node 17

      | set_status | Node17 changing status from started to started_hungry
      | checkInstances | Node17 choosing to start election on the basis of count 24 and nodes {'Node4', 'Node1', 'Node9', 'Node5', 'Node10', 'Node14', 'Node15', 'Node3', 'Node8', 'Node6', 'Node23', 'Node7', 'Node2', 'Node20', 'Node16', 'Node11', 'Node18', 'Node24', 'Node25', 'Node19', 'Node21', 'Node22', 'Node12'}
      

      Node 18

      | onConnsChanged | Node18 lost connection to primary of master
      | lost_master_primary | Node18 scheduling a view change in 2 sec
      

      Node 22

      | onConnsChanged | Node22 lost connection to primary of master
      | lost_master_primary | Node22 scheduling a view change in 2 sec
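The `checkInstances` lines above report a count and a set of node names, and it can help to see which peers are absent from that set. The sketch below parses such a line and lists the missing peers. It assumes the pool members are named Node1 through Node25, which matches the names seen in these logs but is not confirmed anywhere in this report, and the helper names are hypothetical.

```python
import re

# Pattern modelled on the "checkInstances" lines quoted in this report.
LINE = re.compile(
    r"(Node\d+) choosing to start election on the basis of "
    r"count (\d+) and nodes \{([^}]*)\}"
)


def parse_election_line(line):
    """Return (reporting node, logged count, set of listed nodes) or None."""
    match = LINE.search(line)
    if match is None:
        return None
    listed = set(re.findall(r"Node\d+", match.group(3)))
    return match.group(1), int(match.group(2)), listed


def missing_peers(reporter, listed, pool_size=25):
    """Peers not listed by the reporter. Assumes names Node1..Node<pool_size>."""
    everyone = {f"Node{i}" for i in range(1, pool_size + 1)}
    return sorted(everyone - listed - {reporter}, key=lambda name: int(name[4:]))


# The Node12 line from the excerpt above.
node12_line = (
    "| checkInstances | Node12 choosing to start election on the basis of "
    "count 19 and nodes {'Node20', 'Node1', 'Node5', 'Node19', 'Node24', "
    "'Node7', 'Node14', 'Node6', 'Node9', 'Node21', 'Node4', 'Node16', "
    "'Node17', 'Node18', 'Node25', 'Node3', 'Node8', 'Node22'}"
)
reporter, count, listed = parse_election_line(node12_line)
absent = missing_peers(reporter, listed)
print(reporter, count, len(listed), absent)
```

For the Node12 line the set holds 18 names while the logged count is 19. The difference may be the reporting node counting itself, but that is a guess and should be checked against the node code.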
      

      Attachments

        1. AWS_logs_1_2_272_master.7z
          1.27 MB
        2. DeadPool Logs.7z
          9.98 MB
        3. ViewChange25Node.JPG
          317 kB


            People

              Assignee: Unassigned
              Reporter: Kelly Wilson (krw910)
              Participants: Alexander Shcherbakov, Dmitry Surnin, Kelly Wilson, Nikita Spivachuk, Olga Zheregelya, Vladimir Shishkin
              Votes: 0
              Watchers: 6
