Bug #17300
Closed
The Fiber scheduler does not work with ConditionVariable
Description
While looking at replacing kernel_sleep with blocking, I found an independent bug.
ConditionVariable does not seem to work with the Fiber scheduler currently.
There is an existing test in https://github.com/ruby/ruby/blob/4f8d9b0db84c42c8d37f75de885de1c0a5cb542c/test/fiber/test_mutex.rb#L105-L140 on which I based this reproduction example.
The test should always end with signalled == 3, but it only checks signalled > 1.
The test is also racy, as ConditionVariable#signal has no effect if no other Thread/Fiber is in ConditionVariable#wait.
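The lost-signal behavior can be seen in isolation with a minimal sketch like the following. It assumes plain threads without the scheduler, and the 0.2 second timeout and the variable names are arbitrary choices for illustration only:

```ruby
mutex = Mutex.new
condition = ConditionVariable.new

# Signal while nobody is waiting: the notification is not stored anywhere.
mutex.synchronize { condition.signal }

# A later wait does not observe the earlier signal. It is expected to return
# only because the timeout expires (the timeout is an assumption of this
# sketch, used so the script does not block forever).
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
mutex.synchronize { condition.wait(mutex, 0.2) }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

# If the signal had been remembered, the wait would have returned immediately.
lost_signal = elapsed >= 0.15
puts "signal lost: #{lost_signal}"
```

This is why a signaller needs some way to know the other side is already inside ConditionVariable#wait before calling ConditionVariable#signal.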
Here is the reproduction. By default it runs without the scheduler; pass scheduler as an argument to use the test Scheduler.
I saved the script as condvar2.rb under test/fiber for convenience.
require_relative 'scheduler'
USE_SCHEDULER = ARGV.delete('scheduler')
mutex = Mutex.new
condition = ConditionVariable.new
signalled = 0
q = Queue.new
a = Thread.new do
  Thread.current.scheduler = Scheduler.new if USE_SCHEDULER
  
  body = -> do
    mutex.synchronize do
      3.times do |i|
        q << :ready
        p [:wait, i]
        condition.wait(mutex)
        raise unless mutex.owned?
        signalled += 1
      end
    end
  end
  
  USE_SCHEDULER ? Fiber.schedule(&body) : body.call
end
b = Thread.new do
  Thread.current.scheduler = Scheduler.new if USE_SCHEDULER
  
  body = -> do
    puts "Thread 2 starting"
    3.times do |i|
      q.pop # Only acquire Mutex once the other thread is in wait
      puts "Thread 2 locking Mutex"
      mutex.synchronize do
        p [:signal, i]
        condition.signal
      end
      sleep 1 # 0.1
    end
  end
  
  USE_SCHEDULER ? Fiber.schedule(&body) : body.call
end
a.join
b.join
p signalled
$ ruby condvar2.rb
Thread 2 starting
[:wait, 0]
Thread 2 locking Mutex
[:signal, 0]
[:wait, 1]
Thread 2 locking Mutex
[:signal, 1]
[:wait, 2]
Thread 2 locking Mutex
[:signal, 2]
3
$ ruby condvar2.rb scheduler
Thread 2 starting
Thread 2 locking Mutex
[:wait, 0]
[:signal, 0]
# hangs
        
Updated by ioquatix (Samuel Williams) almost 5 years ago
@Eregon (Benoit Daloze) Thanks for this report. I will investigate it.
        
Updated by ioquatix (Samuel Williams) almost 5 years ago
It looks like memory corruption:
th_mutex = 0x55613ba19018
-> th_mutex = 0x8
Somehow the mutex linked list ends up with Qnil... I'm not sure how yet.
        
Updated by ioquatix (Samuel Williams) almost 5 years ago
Okay, I found an unrelated bug, and also the root cause. PR incoming.
        
Updated by Eregon (Benoit Daloze) almost 5 years ago
- Status changed from Open to Closed
Thanks for the quick fix.