Affects Version/s: None
Fix Version/s: 1.6.79
As of now, there is a shared Requests queue for all Replicas. A request is removed from the queue only when it has been ordered on all Replicas.
So, if at least one Replica doesn't order a request, it will stay there forever.
Several strategies have been implemented that fix this in most cases (see, for example, INDY-1684 and INDY-1759). So, if one of the Backup Instances doesn't order, or orders too slowly, we can detect it and remove the replica (allowing requests to be cleared).
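The behaviour described above can be illustrated with a small toy model. This is only a sketch, not the actual Plenum code: the class `SharedRequestQueue` and its methods are hypothetical names chosen for illustration.

```python
class SharedRequestQueue:
    """Toy model (hypothetical, not the Plenum implementation):
    a request is freed only once every replica has ordered it."""

    def __init__(self, replica_ids):
        self.replica_ids = set(replica_ids)
        # request key -> set of replicas that still have to order it
        self.pending = {}

    def add(self, key):
        self.pending[key] = set(self.replica_ids)

    def mark_ordered(self, key, replica_id):
        waiting = self.pending.get(key)
        if waiting is None:
            return
        waiting.discard(replica_id)
        if not waiting:
            # Ordered on all replicas: only now is the request removed.
            del self.pending[key]

    def remove_replica(self, replica_id):
        """Dropping a stalled backup replica unblocks the requests
        that were waiting only for it."""
        self.replica_ids.discard(replica_id)
        for key, waiting in list(self.pending.items()):
            waiting.discard(replica_id)
            if not waiting:
                del self.pending[key]


queue = SharedRequestQueue(replica_ids=[0, 1, 2])
queue.add("req-1")
queue.mark_ordered("req-1", 0)
queue.mark_ordered("req-1", 1)
stuck = "req-1" in queue.pending       # replica 2 never ordered it
queue.remove_replica(2)
cleared = "req-1" not in queue.pending
```

In this model a single replica that never calls `mark_ordered` keeps the request in `pending` indefinitely, which is the situation that removing the replica resolves.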
However, there are cases where some requests may still not be cleared (until a view change happens):
1) The request queue differs between nodes, so a Node may have requests which are not present (for some reason) on one of the primaries; such a request is never ordered on that instance and hence never removed from the queue.
2) A malicious primary on one of the backup instances doesn't order some of the requests (in a way that doesn't change the backup instance's performance significantly; otherwise it would be detected by others because of the strategy from INDY-1684). So, some requests will not be cleared on all nodes.
- Define a strategy of removing outdated requests from the queue periodically.
- Assign a timestamp to each request when it enters the queue. Schedule a timer (say, every X minutes) which clears all requests that have stayed in the queue for more than X minutes.
- Clear all requests for a stable checkpoint on master (if a request was ordered by master some time ago, it's unlikely that it's still not ordered by backups; so this rather means that backups don't order or order maliciously).
- It may make sense to implement it as a strategy disabled by default in config, and enabled once it's validated by QA
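A minimal sketch of the two clearing options above, assuming requests are tracked by a string key. All names here (`PurgeableRequestQueue`, `purge_outdated`, `purge_up_to_checkpoint`, the `enabled` flag) are hypothetical and do not refer to existing Plenum APIs or config options; the clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Dict, List


class PurgeableRequestQueue:
    """Hypothetical sketch of the proposed clearing strategies."""

    def __init__(self, ttl_seconds: float, enabled: bool = False,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds   # the "X minutes" from the ticket
        self.enabled = enabled           # disabled by default, as proposed
        self._clock = clock
        self._added_at: Dict[str, float] = {}
        self._master_seq_no: Dict[str, int] = {}

    def __contains__(self, key: str) -> bool:
        return key in self._added_at

    def add(self, key: str) -> None:
        # Timestamp the request when it enters the queue.
        self._added_at.setdefault(key, self._clock())

    def mark_ordered_on_master(self, key: str, seq_no: int) -> None:
        if key in self._added_at:
            self._master_seq_no[key] = seq_no

    def purge_outdated(self) -> List[str]:
        """Option 1: drop requests queued for longer than the TTL."""
        if not self.enabled:
            return []
        deadline = self._clock() - self.ttl_seconds
        outdated = [k for k, t in self._added_at.items() if t < deadline]
        for key in outdated:
            self._drop(key)
        return outdated

    def purge_up_to_checkpoint(self, stable_seq_no: int) -> List[str]:
        """Option 2: drop requests that master ordered at or before
        the stable checkpoint, regardless of the backups."""
        if not self.enabled:
            return []
        done = [k for k, s in self._master_seq_no.items()
                if s <= stable_seq_no]
        for key in done:
            self._drop(key)
        return done

    def _drop(self, key: str) -> None:
        self._added_at.pop(key, None)
        self._master_seq_no.pop(key, None)
```

In such a design, `purge_outdated` would be called from whatever periodic timer the node already uses, and `purge_up_to_checkpoint` from the handler that runs when a checkpoint on master becomes stable. Both are no-ops unless `enabled` is set, which matches the idea of keeping the strategy off until it is validated.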