Details
Type: Bug
Status: Complete
Priority: Highest
Resolution: Done
INDY 18.01: Stability+, Sprint 18.02 Stability
Description
It appears that a view change on pools of 19 or more nodes can cause the pool to stop functioning.
Setup
I have a pool with 25 nodes.
After adding all the nodes to the pool from a genesis of 7 nodes, I had 30 total transactions in the ledger.
Test
I set the load scripts to run transactions every 6 - 30 seconds for an hour.
I have 25 clients across 5 machines, each sending bursts of 10 transactions every 6 - 30 seconds. Each client will send 1400 transactions in total, which should keep the pool active for about an hour.
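As a rough sanity check of the planned load, the figures above can be put into a short calculation. This is only a sketch: it assumes "10 transaction bursts" means bursts of 10 transactions, that the 1400 figure is per client, and that each client waits between 6 and 30 seconds between bursts. The constant and function names are illustrative and do not come from the load scripts.

```python
# Figures taken from the test description. Reading "10 transaction bursts"
# as bursts of 10 transactions, and 1400 as a per-client total, are assumptions.
CLIENTS = 25
TXNS_PER_CLIENT = 1400
BURST_SIZE = 10
MIN_GAP_S, MAX_GAP_S = 6, 30


def load_estimate():
    """Return the planned totals and the shortest/longest run time in minutes."""
    bursts = TXNS_PER_CLIENT // BURST_SIZE
    return {
        "total_txns": CLIENTS * TXNS_PER_CLIENT,
        "bursts_per_client": bursts,
        "min_minutes": bursts * MIN_GAP_S / 60,
        "max_minutes": bursts * MAX_GAP_S / 60,
    }


if __name__ == "__main__":
    for key, value in load_estimate().items():
        print(key, value)
```

Under these assumptions the run plans 35,000 transactions in total and lasts somewhere between 14 and 70 minutes, so a stall after 524 transactions happens very early in the run.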
Issue
After 524 transactions had been sent, the pool stopped accepting new transactions. Every node matched the other nodes, and the log files did not contain any errors. I did see in some logs that a view change was being requested, so I believe this is related to view change on a large pool.
Logs and Screenshots
- Attached are the smaller logs and the current ones from when the nodes failed.
- Also attached is a screenshot of a search I did in the logs. I searched for "proposed_view_change". Only 12 of the 25 nodes show a proposed view change, and it looks like Node 7 was going to run a different view change, ID 991, not the ID 992 that the others show.
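The comparison done by hand in the screenshot could also be scripted. The sketch below groups nodes by the id in their "scheduling action propose_view_change" log line. The regular expression is modelled on the Node12 excerpt quoted in this report, so it is an assumption that every node logs the same wording; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern modelled on the Node12 excerpt in this report; other nodes or
# versions may log this differently (assumption).
SCHEDULE_RE = re.compile(
    r"(\w+) scheduling action propose_view_change with id (\d+)"
)


def group_by_view_change_id(lines):
    """Map each scheduled id to the set of node names that logged it."""
    groups = defaultdict(set)
    for line in lines:
        match = SCHEDULE_RE.search(line)
        if match:
            groups[int(match.group(2))].add(match.group(1))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "| has_action_queue.py ( 36) | _schedule | Node12 scheduling action "
        "propose_view_change with id 992 to run in 2 seconds",
    ]
    print(group_by_view_change_id(sample))
```

Fed with the lines from all 25 node logs, the result would show at a glance which nodes scheduled a view change and whether any of them used a different id.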
Some of what I see in the logs around the same timestamp is the following:
Node 12
| has_action_queue.py ( 36) | _schedule | Node12 scheduling action propose_view_change with id 992 to run in 2 seconds | checkInstances | Node12 choosing to start election on the basis of count 19 and nodes {'Node20', 'Node1', 'Node5', 'Node19', 'Node24', 'Node7', 'Node14', 'Node6', 'Node9', 'Node21', 'Node4', 'Node16', 'Node17', 'Node18', 'Node25', 'Node3', 'Node8', 'Node22'}
Node 14
| onConnsChanged | Node14 lost connection to primary of master | lost_master_primary | Node14 scheduling a view change in 2 sec
Node 15
| set_status | Node15 changing status from started to started_hungry | checkInstances | Node15 choosing to start election on the basis of count 24 and nodes {'Node23', 'Node5', 'Node7', 'Node6', 'Node14', 'Node24', 'Node18', 'Node12', 'Node2', 'Node20', 'Node10', 'Node22', 'Node4', 'Node21', 'Node8', 'Node25', 'Node9', 'Node17', 'Node19', 'Node16', 'Node3', 'Node11', 'Node1'}
Node 17
| set_status | Node17 changing status from started to started_hungry | checkInstances | Node17 choosing to start election on the basis of count 24 and nodes {'Node4', 'Node1', 'Node9', 'Node5', 'Node10', 'Node14', 'Node15', 'Node3', 'Node8', 'Node6', 'Node23', 'Node7', 'Node2', 'Node20', 'Node16', 'Node11', 'Node18', 'Node24', 'Node25', 'Node19', 'Node21', 'Node22', 'Node12'}
Node 18
| onConnsChanged | Node18 lost connection to primary of master | lost_master_primary | Node18 scheduling a view change in 2 sec
Node 22
| onConnsChanged | Node22 lost connection to primary of master | lost_master_primary | Node22 scheduling a view change in 2 sec
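The checkInstances excerpts above can be compared mechanically. The following sketch parses one such line and reports which pool members are absent from the listed node set. It assumes the pool members are named Node1 through Node25, as the excerpts suggest, and that the log wording is the same on every node; the function names are hypothetical.

```python
import re

# Pattern modelled on the checkInstances excerpts quoted in this report.
CHECK_RE = re.compile(
    r"(\w+) choosing to start election on the basis of count (\d+) "
    r"and nodes \{([^}]*)\}"
)

# Assumption: the pool consists of Node1..Node25, matching the excerpts.
POOL = {f"Node{i}" for i in range(1, 26)}


def parse_check_instances(line):
    """Return (node, count, listed_nodes), or None if the line does not match."""
    match = CHECK_RE.search(line)
    if match is None:
        return None
    listed = set(re.findall(r"'(\w+)'", match.group(3)))
    return match.group(1), int(match.group(2)), listed


def missing_peers(node, listed, pool=POOL):
    """Pool members, other than the node itself, absent from the listed set."""
    return sorted(pool - listed - {node}, key=lambda name: int(name[4:]))


NODE12_LINE = (
    "Node12 choosing to start election on the basis of count 19 and nodes "
    "{'Node20', 'Node1', 'Node5', 'Node19', 'Node24', 'Node7', 'Node14', "
    "'Node6', 'Node9', 'Node21', 'Node4', 'Node16', 'Node17', 'Node18', "
    "'Node25', 'Node3', 'Node8', 'Node22'}"
)

if __name__ == "__main__":
    node, count, listed = parse_check_instances(NODE12_LINE)
    print(node, count, len(listed), missing_peers(node, listed))
```

Applied to the excerpts in this report, Node12 reports a count of 19 but lists 18 names, and Node15 and Node17 each report 24 but list 23, so the count may include the reporting node itself. Node13 is absent from all three lists, which may be worth checking against the identity of the lost primary.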