  1. Indy Node
  2. INDY-941

Unable to catch up agent if a validator is down

    Details

    • Type: Bug
    • Status: Complete
    • Priority: Highest
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Environment:

      STN, running 1.1.43.  All agents and clients used were 1.1.43 as well.

    • Sprint:
      INDY 17.22, INDY 17.23

      Description

      The catch-up logic divides the transactions that need to be caught up by the number of nodes in the genesis file, parceling the missing transactions out so that each node is asked for a share. If one of the nodes is down, or even just slow to respond, the catch-up does not complete and the system goes into a failed state. This is true for agents that are attempting to catch up the pool ledger while connecting, and possibly for validators as well. The CLI client, on the other hand, appears to retry and get the update from another node.

      Failure or slow response from any node must not result in a failure to catch up, whether the catch-up request comes from another validator, an agent, or a client CLI.
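
      The requirement above can be illustrated with a minimal sketch. This is not the Indy Node or agent implementation: `fetch_with_fallback`, the `fetch` callback, `fake_fetch`, and the node names other than singapore are all hypothetical, and the sketch assumes a slow or dead node surfaces as a `TimeoutError`. The idea is only that a range assigned to an unresponsive node is re-requested from the remaining nodes instead of failing the whole catch-up.

```python
def fetch_with_fallback(assignments, node_names, fetch):
    """Collect every assigned range, falling back to other nodes on timeout.

    assignments: list of (preferred_node, first_seq_no, last_seq_no)
    fetch:       hypothetical callable(node, first, last) -> {seq_no: txn},
                 assumed to raise TimeoutError if the node is down or slow
    """
    ledger = {}
    for preferred, first, last in assignments:
        wanted = set(range(first, last + 1))
        # Ask the preferred node first, then every other known node in order.
        candidates = [preferred] + [n for n in node_names if n != preferred]
        for node in candidates:
            try:
                txns = fetch(node, first, last)
            except TimeoutError:
                continue  # node is down or too slow; try the next one
            if set(txns) == wanted:  # accept only a complete answer
                ledger.update(txns)
                break
        else:
            # Only fail once every node has been tried for this range.
            raise RuntimeError(f"no node could serve transactions {first}-{last}")
    return ledger


def fake_fetch(node, first, last):
    """Stand-in transport for the example: singapore never answers."""
    if node == "singapore":
        raise TimeoutError(node)
    return {seq: f"txn-{seq} from {node}" for seq in range(first, last + 1)}


node_names = ["node1", "node2", "singapore"]
assignments = [("node1", 8, 11), ("node2", 12, 15), ("singapore", 16, 18)]
ledger = fetch_with_fallback(assignments, node_names, fake_fetch)
```

      In this sketch the range originally assigned to singapore is served by another node, which matches the behavior observed from the CLI client. It does not verify the returned transactions in any way.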

      The network in use when this issue appeared is the STN. This network has 7 transactions in the pool genesis file and 18 transactions on the pool ledger, meaning that 11 transactions need to be caught up. The agent code requests 2 transactions from each of the first 5 nodes listed in the genesis file, and 1 transaction from the 6th (singapore). From some agent VMs, singapore was slow to respond, and the issue was triggered when the agent did not catch up that last transaction. From other agent VMs, singapore responds in time and the catch-up succeeds. To confirm the issue, we shut down singapore entirely. In this case, catchup was unsuccessful on all nodes.
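
      The split described above can be reproduced with a small sketch. It assumes the share per node is the ceiling of (missing transactions / nodes in the genesis file), which yields the observed 2, 2, 2, 2, 2, 1 pattern. The actual agent code may compute it differently. `split_catchup_ranges` and all node names except singapore are hypothetical.

```python
import math


def split_catchup_ranges(first_seq_no, last_seq_no, node_names):
    """Assign contiguous sequence-number ranges to nodes in genesis order.

    Each node gets ceil(total / number_of_nodes) transactions until the
    missing range is exhausted; nodes at the end may get fewer or none.
    """
    total = last_seq_no - first_seq_no + 1
    if total <= 0 or not node_names:
        return []
    per_node = math.ceil(total / len(node_names))
    assignments = []
    next_seq = first_seq_no
    for name in node_names:
        if next_seq > last_seq_no:
            break  # nothing left to hand out
        last = min(next_seq + per_node - 1, last_seq_no)
        assignments.append((name, next_seq, last))
        next_seq = last + 1
    return assignments


# 7 genesis transactions, 18 on the ledger: transactions 8..18 are missing.
nodes = ["node1", "node2", "node3", "node4", "node5", "singapore", "node7"]
assignments = split_catchup_ranges(8, 18, nodes)
for name, first, last in assignments:
    print(name, first, last, last - first + 1)
```

      Under this assumption the 6th node is the only source for the final transaction, so a single slow response from it leaves the ledger one transaction short.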

      Attached are the following logs:

      1. intrepid_working.txt - an agent log with the singapore validator node running, from a VM that connects to it relatively quickly. Update succeeds.
      2. intrepid_broken.txt - an agent log with the singapore validator node down. Update fails.
      3. cli.log - a log from a node running the CLI with the singapore validator node down. After a delay, update succeeds when a different validator is used to fetch the transaction.

       

        Attachments

        1. cli.log
          50 kB
          Mike Bailey
        2. faber@live.PNG
          360 kB
          Vladimir Shishkin
        3. intrepid_broken.txt
          21 kB
          Mike Bailey
        4. intrepid_working.txt
          194 kB
          Mike Bailey
        5. intrepid.py
          6 kB
          Mike Bailey
        6. Screenshot.PNG
          470 kB
          Vladimir Shishkin

            People

            Assignee:
            VladimirWork Vladimir Shishkin
            Reporter:
            mgbailey Mike Bailey
            Watchers:
            Alexander Shcherbakov, Kelly Wilson, Mike Bailey, Vladimir Shishkin
            Votes:
            0

              Dates

              Created:
              Updated:
              Resolved: