Details
- Type: Bug
- Priority: High
- Status: Complete
- Resolution: Done
- Environment: STN running 1.3.57
- Epic: EV 18.11 Stability/ViewChange
Description
The STN currently has 11 nodes, 7 of which are owned by Sovrin. When one of our seven nodes is brought down, the network fails to post transactions, even though 10 of 11 nodes should be well above the consensus threshold. A further fact that confuses matters: when we attempt to connect to the pool using the legacy CLI, it shows connections to nodes that are not currently part of this pool but are part of the live pool; those nodes have all been demoted on this ledger.
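For context on why one node down "should be well above consensus": under the BFT-style assumption used by indy-node (n >= 3f + 1), the expected tolerance can be sketched as below. The function names here are illustrative, not part of any indy-node API.

```python
# Sketch of the n >= 3f + 1 fault-tolerance math assumed above.
def max_faulty(n: int) -> int:
    """Largest number of faulty nodes an n-node pool can tolerate."""
    return (n - 1) // 3

def has_consensus(n_total: int, n_up: int) -> bool:
    """Consensus requires at least n - f nodes participating."""
    return n_up >= n_total - max_faulty(n_total)

n = 11
print(max_faulty(n))          # 3: an 11-node pool should tolerate 3 faults
print(has_consensus(n, 10))   # True: one node down should still reach consensus
```

By this arithmetic the pool should keep writing with up to three nodes down, so losing a single node stopping writes points at something other than simple quorum loss.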
Validator-info shows the correct pool nodes:
Validator england is running
Current time: Friday, May 18, 2018 9:57:57 PM
Validator DID: DNuLANU7f1QvW1esN3Sv9Eap9j14QuLiPeYzf28Nub4W
Verification Key: 5PFZeZLWxaH8LxumLkLKq9LbfDNiCNb2xXR2TrGxSbrHeyu6Pfd8Kan
Node Port: 9701/tcp on 0.0.0.0/0
Client Port: 9702/tcp on 0.0.0.0/0
Metrics:
  Uptime: 1 minute, 0 seconds
  Total Config Transactions: 501
  Total Ledger Transactions: 593
  Total Pool Transactions: 35
  Read Transactions/Seconds: 0.00
  Write Transactions/Seconds: 0.00
Reachable Hosts: 11/11
  RFCU VeridiumIDC australia brazil canada england findentity ibm korea singapore virginia
Unreachable Hosts: 0/11
Software Versions:
  indy-node: 1.3.57
  sovrin: 1.1.9
The attached CLI log file shows erroneous connections to nodes such as TNO. The strange CLI behavior is not the thrust of this ticket; it is only a symptom. The investigation should focus on why a single node going down can prevent consensus.
This problem is repeatable on the STN: bring down any one node and the pool fails to achieve consensus. Korea was down at the time these logs were obtained. When all seven of the Sovrin-owned nodes are up, the pool is in consensus, and the CLI connects and behaves normally.
Logs for the Sovrin-owned validators are also included. Logs from our external stewards will be requested and attached as they are received.
Acceptance Criteria
- Diagnose the issue and create a Plan of Attack, including associated stories and epics that can be scheduled.
- If the problem proves to be a configuration issue, we can solve it immediately.