    iothread: fix iothread hang when stop too soon · 6c95363d
    Peter Xu authored
    Lukas reported a hard-to-reproduce QMP iothread hang on s390, where
    QEMU might hang at pthread_join() of the QMP monitor iothread before
    quitting:
    
      Thread 1
      #0  0x000003ffad10932c in pthread_join
      #1  0x0000000109e95750 in qemu_thread_join
          at /home/thuth/devel/qemu/util/qemu-thread-posix.c:570
      #2  0x0000000109c95a1c in iothread_stop
      #3  0x0000000109bb0874 in monitor_cleanup
      #4  0x0000000109b55042 in main
    
    While the iothread is still in the main loop:
    
      Thread 4
      #0  0x000003ffad0010e4 in ??
      #1  0x000003ffad553958 in g_main_context_iterate.isra.19
      #2  0x000003ffad553d90 in g_main_loop_run
      #3  0x0000000109c9585a in iothread_run
          at /home/thuth/devel/qemu/iothread.c:74
      #4  0x0000000109e94752 in qemu_thread_start
          at /home/thuth/devel/qemu/util/qemu-thread-posix.c:502
      #5  0x000003ffad10825a in start_thread
      #6  0x000003ffad00dcf2 in thread_start
    
    IMHO it's because there's a race between the main thread and the
    iothread when stopping the thread in the following sequence:
    
        main thread                       iothread
        ===========                       ==============
                                          aio_poll()
        iothread_get_g_main_context
          set iothread->worker_context
        iothread_stop
          schedule iothread_stop_bh
                                            execute iothread_stop_bh [1]
                                              set iothread->running=false
                                              (since main_loop==NULL, quitting
                                               the main loop is skipped.
                                               Note: main_loop is NULL, but
                                               worker_context is not!)
                                          atomic_read(&iothread->worker_context) [2]
                                            create main_loop object
                                            g_main_loop_run() [3]
        pthread_join() [4]
    
    We can see that when iothread_stop_bh() executes at [1], main_loop may
    still be NULL, because it is not created until the first check of
    worker_context later at [2].  The iothread will then hang in the main
    loop [3], which leaves the main thread stuck as well [4].
    
    The simple solution is to check the "running" variable again before
    checking worker_context.
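
    The sequence above can be modelled without threads.  The sketch below
    is not the QEMU code: struct iothread_model, stop_bh(), poll_once(),
    would_hang() and the recheck_running flag are hypothetical names made
    up for illustration, and the poll step is assumed to always run the
    pending stop BH.  It only shows why re-checking "running" before
    looking at worker_context avoids entering the main loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the IOThread fields named in this message. */
struct iothread_model {
    bool running;          /* cleared by the stop bottom half */
    void *worker_context;  /* set by iothread_get_g_main_context */
    void *main_loop;       /* created lazily by the iothread itself */
};

/* Models iothread_stop_bh at [1]: with main_loop == NULL nothing is quit. */
static void stop_bh(struct iothread_model *t)
{
    t->running = false;
    if (t->main_loop != NULL) {
        /* the real code would ask the main loop to quit here */
    }
}

/* Models one aio_poll() call that happens to run the pending stop BH. */
static void poll_once(struct iothread_model *t)
{
    stop_bh(t);
}

/*
 * Models the iothread loop.  Returns true if the thread would enter
 * g_main_loop_run() after it was told to stop (the hang at [3]),
 * false if it leaves the loop cleanly.
 */
static bool would_hang(struct iothread_model *t, bool recheck_running)
{
    assert(t != NULL);
    while (t->running) {
        poll_once(t);
        /* Proposed fix: look at "running" again before worker_context [2]. */
        if ((!recheck_running || t->running) && t->worker_context != NULL) {
            return true;
        }
    }
    return false;
}
```

    With worker_context set and main_loop still NULL, would_hang() reports
    the hang when recheck_running is false and a clean exit when it is
    true.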
    
    CC: Thomas Huth <thuth@redhat.com>
    CC: Dr. David Alan Gilbert <dgilbert@redhat.com>
    CC: Stefan Hajnoczi <stefanha@redhat.com>
    CC: Lukáš Doktor <ldoktor@redhat.com>
    CC: Markus Armbruster <armbru@redhat.com>
    CC: Eric Blake <eblake@redhat.com>
    CC: Paolo Bonzini <pbonzini@redhat.com>
    Reported-by: Lukáš Doktor <ldoktor@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Tested-by: Thomas Huth <thuth@redhat.com>
    Message-id: 20190129051432.22023-1-peterx@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>