Uploaded image for project: 'Indy Node'
  1. Indy Node
  2. INDY-806

Under load an earlier stopped node launch fails with KeyError in KeyValueStorageLeveldb

    XMLWordPrintable

Details

    • Bug
    • Status: Complete
    • High
    • Resolution: Done
    • None
    • None
    • None
    • 18.07 Stability & Monitoring

    Description

      Build Info:

      • indy-node master 1.0.113

      Overview:

      • Under load one node of a pool with 4 nodes was stopped and after some period started. This node successfully started up and performed catch-up. After some time this node raised suspicions on all other nodes in the pool about Pre-Prepare messages have incorrect state trie roots and sent Instance Change messages due to this. Later this node was again stopped and after some period started. This time its startup was failed with KeyError raised from KeyValueStorageLeveldb and its restart was scheduled. The next startup was again failed with the same error and the node restart was again scheduled and this process was repeated multiple times cyclically.

      Performed steps and results:

      1. Deploy a pool with 4 nodes in Docker.
      2. Start a CLI in Docker and execute: connect test
        [ CLI reports that it has been connected to all 4 nodes. ]
      3. Execute on Node4: read_ledger --type=domain --count
        [ The script reports the following count of transactions in the domain ledger: 15 ]
      4. Copy the directory indy-node/scrips to Node1 and Node2.
      5. Run and wait for completion on Node1: python3 add_keys.py Steward1 000000000000000000000000Steward1
      6. Execute on Node4: read_ledger --type=domain --count
        [ The script reports the following count of transactions in the domain ledger: 215 ]
      7. Run on Node1: python3 load_test.py -c 10 -r 1000
      8. Run on Node2: python3 load_test.py -c 10 -r 1000
      9. Check whether the count of transactions in the domain ledger is growing, using the following command on Node4: read_ledger --type=domain --count
        [ The count of transactions in the domain ledger is growing. ]
      10. Stop the node service on Node3: systemctl stop sovrin-node
        [ CLI reports that it has been disconnected from Node3. ]
      11. Check whether the count of transactions in the domain ledger is growing, using the following command on Node4: read_ledger --type=domain --count
        [ The count of transactions in the domain ledger is growing. ]
      12. After some period start the node service on Node3: systemctl start sovrin-node
        [ CLI reports that it has been connected to Node3. ]
      13. Wait for Node3 completes catch-up (check for corresponding messages in the log).
      14. Check whether the count of transactions in the domain ledger is growing, using the following command on Node4: read_ledger --type=domain --count
        [ The count of transactions in the domain ledger is growing. ]
      15. Check Node3 log for unexpected messages.
        Actual results:
        At some moment Node3 reports in its log for each other node in the pool:
        Node3 raised suspicion on node Node<X> for Pre-Prepare message has incorrect state trie root; suspicion code is 21
        VIEW CHANGE: Node3 sent instance change since suspicion code 21
        Expected results:
        No messages about raised suspicions appear in Node3 log.
      16. Stop the node service on Node3: systemctl stop sovrin-node
        [ CLI reports that it has been disconnected from Node3. ]
      17. Check whether the count of transactions in the domain ledger is growing, using the following command on Node4: read_ledger --type=domain --count
        [ The count of transactions in the domain ledger is growing. ]
      18. After some period start the node service on Node3: systemctl start sovrin-node
        Actual results:
        CLI does not report that it has been connected to Node3.
        In systemd journal on Node3 (view using journalctl -ex) it can be viewed that the node service fails with KeyError in KeyValueStorageLeveldb on startup and is restarted in cycles.
        Expected results:
        CLI reports that it has been connected to Node3.

      See the attachments for details.

      Attachments

        1. Node1_load_test_results.txt
          0.3 kB
        2. Node1_perf_results_10_1000.csv
          1.70 MB
        3. Node1.log
          1.46 MB
        4. Node2_load_test_results.txt
          0.3 kB
        5. Node2_perf_results_10_1000.csv
          1.70 MB
        6. Node2.log
          1.51 MB
        7. Node3_msgs_from_systemd_journal.txt
          4 kB
        8. Node3.log
          423 kB
        9. Node4.log
          3.40 MB

        Issue Links

          Activity

            People

              VladimirWork Vladimir Shishkin
              spivachuk Nikita Spivachuk
              Alexander Shcherbakov, Nikita Spivachuk, Sean Bohan, Vladimir Shishkin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: