Affects Version/s: None
Fix Version/s: None
STN, running 1.1.43. All agents and clients used were 1.1.43 as well.
Sprint: INDY 17.22, INDY 17.23
The catch-up logic divides the transactions that need to be caught up by the number of nodes in the genesis file, requesting a portion of the missing transactions from each node. If one of those nodes is down, or even slow to respond, catch-up fails and the system goes into a failed state. This is true for agents attempting to catch up the pool ledger while connecting, and possibly for validators as well. The CLI client, on the other hand, appears to retry and fetch the missing transactions from another node.
Failure of, or a slow response from, any single node must not cause catch-up to fail, whether the catch-up request comes from another validator, an agent, or the CLI client.
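The desired behavior can be sketched as follows: when a node fails to return its assigned transaction range, re-request that range from another node instead of failing the whole catch-up. This is a minimal illustration only; all names here (`fetch_with_fallback`, `request_txns`) are hypothetical and do not correspond to actual indy-plenum APIs.

```python
import random

def fetch_with_fallback(txn_range, nodes, request_txns):
    """Request a transaction range, falling back to other nodes on failure.

    `request_txns(node, txn_range)` is a hypothetical callable that returns
    the transactions or raises on timeout/failure. Catch-up for the range
    fails only if *every* node fails, not if one node is down or slow.
    """
    candidates = list(nodes)
    random.shuffle(candidates)  # spread load; order is not significant
    last_err = None
    for node in candidates:
        try:
            return request_txns(node, txn_range)
        except Exception as err:
            last_err = err  # this node failed; try the next one
    raise RuntimeError(f"catch-up failed for range {txn_range}") from last_err
```

With logic like this, a single down or slow validator (e.g. singapore in the scenario below) would only delay catch-up, not break it.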
The network in use when this issue appeared is the STN. This network has 7 transactions in the pool genesis file and 18 transactions on the pool ledger, meaning that 11 transactions need to be caught up. The agent code requests 2 transactions from each of the first 5 nodes listed in the genesis file, and 1 transaction from the 6th (singapore). From some agent VMs, singapore was slow to respond, triggering the issue when the agent failed to catch up that last transaction. From other agent VMs, singapore responds to the request in time and the catch-up succeeds. To confirm the issue, we shut down singapore entirely; in this case, catchup was unsuccessful on all nodes.
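The per-node split described above is consistent with a simple even partition of the missing transactions. A minimal sketch of that arithmetic (the function name is hypothetical, not the actual indy-plenum code) reproduces the STN numbers:

```python
def partition_catchup(total_missing: int, num_nodes: int) -> list[int]:
    """Split `total_missing` transactions evenly across `num_nodes` nodes.

    Hypothetical sketch of the partitioning described in this report:
    each node is assigned a near-equal share, and the first `extra`
    nodes get one additional transaction.
    """
    base, extra = divmod(total_missing, num_nodes)
    return [base + (1 if i < extra else 0) for i in range(num_nodes)]

# STN case from this report: 18 ledger txns - 7 genesis txns = 11 missing,
# split across 6 nodes. The 6th node (singapore) is assigned only 1 txn,
# but if it is down or slow, that one transaction is never caught up.
print(partition_catchup(18 - 7, 6))  # -> [2, 2, 2, 2, 2, 1]
```

This makes the failure mode concrete: the whole catch-up hinges on every node in the partition answering, so the assignment to singapore is a single point of failure.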
Attached are the following logs:
- intrepid_working.txt - an agent log with the singapore validator node running, from a VM that connects to it relatively quickly. Update succeeds.
- intrepid_broken.txt - an agent log with the singapore validator node down. Update fails.
- cli.log - a log from a node running the CLI with the singapore validator node down. After a delay, the update succeeds when a different validator is used to fetch the transaction.