  Fabric / FAB-17736

race: gossip/comm TestCommSend generates thousands of go routines


Details

    • Type: Bug
    • Status: Unverified
    • Priority: Medium
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fabric-gossip
    • Environment: GOMAXPROCS=1 GORACE=atexit_sleep_ms=0 go test -p 1 -cpu 1 -race -count 1 -v -run TestCommSend ./gossip/comm

    Description

      This is related to failures when running gossip with the race detector enabled. With changes to enable it, we now see the following message in CI (and locally):

      race: limit on 8128 simultaneously alive goroutines is exceeded, dying
      

      What this is saying is that, while the test runs, more than 8128 goroutines are alive at the same time, which exceeds the race detector's hard limit. How can that be?

      The implementation of TestCommSend spawns two goroutines that send messages in a tight for loop. One goroutine loops until stopch1 is closed while the other loops until stopch2 is closed. These channels are not closed until the test has observed 80 messages on the associated connection.

      	go func() {
      		for {
      			emptyMsg := createGossipMsg()
      			select {
      			case <-stopch1:
      				return
      			default:
      				comm1.Send(emptyMsg, remotePeer(port2))
      			}
      		}
      	}()
      

      The implementation of Send is intended to be non-blocking. Looking at the implementation, we see that it does call the non-blocking version of send, but it does so from within a goroutine.

      func (c *commImpl) Send(msg *protoext.SignedGossipMessage, peers ...*RemotePeer) {
      	if c.isStopping() || len(peers) == 0 {
      		return
      	}
      	c.logger.Debug("Entering, sending", msg, "to ", len(peers), "peers")
      
      	for _, peer := range peers {
      		go func(peer *RemotePeer, msg *protoext.SignedGossipMessage) {
      			c.sendToEndpoint(peer, msg, nonBlockingSend)
      		}(peer, msg)
      	}
      }
      

      When you put the two together, you get tens of thousands of goroutines calling into sendToEndpoint. Since these are non-blocking sends, we wouldn't expect the goroutines to stick around for long, so why run sendToEndpoint in a goroutine at all? It doesn't seem to be needed or desired.
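
      For illustration, here is a minimal sketch (not a vetted change) of what Send would look like if it simply called sendToEndpoint synchronously; since the send is non-blocking, each iteration should return quickly:

      func (c *commImpl) Send(msg *protoext.SignedGossipMessage, peers ...*RemotePeer) {
          if c.isStopping() || len(peers) == 0 {
              return
          }
          c.logger.Debug("Entering, sending", msg, "to ", len(peers), "peers")

          for _, peer := range peers {
              // Call synchronously; no goroutine is left behind per message.
              c.sendToEndpoint(peer, msg, nonBlockingSend)
          }
      }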

      It turns out that those thousands of goroutines can end up blocked on locks when sendToEndpoint calls connStore.getConnection to get the connection to use for the send. And this is where things get interesting.

      The getConnection implementation always acquires an RLock and then a full Lock on the connection store. After the Lock is released, it grabs a per-destination mutex (destinationLock) and then takes the RLock yet again(!?).

      func (cs *connectionStore) getConnection(peer *RemotePeer) (*connection, error) {
          // (1) RLock, only to read the isClosing flag
          cs.RLock()
          isClosing := cs.isClosing
          cs.RUnlock()

          if isClosing {
              return nil, errors.Errorf("conn store is closing")
          }

          pkiID := peer.PKIID
          endpoint := peer.Endpoint

          // (2) full Lock to look up or create the per-destination mutex
          cs.Lock()
          destinationLock, hasConnected := cs.destinationLocks[string(pkiID)]
          if !hasConnected {
              destinationLock = &sync.Mutex{}
              cs.destinationLocks[string(pkiID)] = destinationLock
          }
          cs.Unlock()

          // (3) serialize on the destination, then RLock yet again to look up the connection
          destinationLock.Lock()

          cs.RLock()
          conn, exists := cs.pki2Conn[string(pkiID)]
          if exists {
              cs.RUnlock()
              destinationLock.Unlock()
              return conn, nil
          }
          cs.RUnlock()
          ...
      

      The rest of the method tries to create a connection while holding the destination lock. Once that process is done, we do more dancing with the RLock and the Lock.

      Once sendToEndpoint gets a usable connection, it calls send where, on the non-blocking path, the message is either placed in the outBuff channel or, when the channel is full, dropped on the floor.
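
      The non-blocking path is the usual select-with-default pattern. As a sketch (the names here are illustrative, not the verbatim source):

      // Illustrative only: try to enqueue msg on a bounded buffer, dropping it when full.
      func nonBlockingEnqueue(outBuff chan<- *protoext.SignedGossipMessage, msg *protoext.SignedGossipMessage) bool {
          select {
          case outBuff <- msg:
              return true // queued for the connection's writer
          default:
              return false // buffer full: the message is dropped on the floor
          }
      }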

      Given all of the above, the test keeps creating more and more goroutines that call into getConnection and wait for the RLock while some other goroutine is holding or waiting for the Lock. Each time the Lock is released, the goroutines waiting on the RLock become runnable, but new senders arrive faster than the backlog drains, so under the right conditions more than 8128 goroutines are alive at once.
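
      The pile-up can be reproduced in isolation with a small standalone program (an illustration of the pattern, not Fabric code): one goroutine holds a sync.RWMutex most of the time, the way getConnection takes the Lock, while a loop spawns one short-lived goroutine per "send", the way Send does. Without the race detector it prints how many goroutines are still alive; with -race it typically dies with the same 8128 message.

      package main

      import (
          "fmt"
          "runtime"
          "sync"
          "time"
      )

      func main() {
          var mu sync.RWMutex
          stop := make(chan struct{})

          // Writer: holds the write lock almost continuously, standing in for
          // getConnection taking cs.Lock().
          go func() {
              for {
                  select {
                  case <-stop:
                      return
                  default:
                      mu.Lock()
                      time.Sleep(100 * time.Microsecond)
                      mu.Unlock()
                  }
              }
          }()

          // Sender: one goroutine per message, each needing the read lock briefly,
          // standing in for Send spawning a goroutine per sendToEndpoint call.
          for i := 0; i < 20000; i++ {
              go func() {
                  mu.RLock() // blocks while the writer holds or is waiting for mu
                  mu.RUnlock()
              }()
          }

          fmt.Println("goroutines alive after spawning:", runtime.NumGoroutine())
          close(stop)
      }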


       
      Local reproduction seems to happen pretty reliably when using one CPU:

      GOMAXPROCS=1 GORACE=atexit_sleep_ms=0 go test -p 1 -cpu 1 -race -count 1 -v -run TestCommSend ./gossip/comm
      

          People

            Assignee: Unassigned
            Reporter: Matthew Sykes (sykesm)
