Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.11.0
    • None
    • 3
    • 15940

    Description

      Sometimes lc_watchdogd disappears without any messages, and Lustre logs are not dumped after the watchdog is triggered.

      How the correct behaviour should look:

      LNet: Service thread pid 7096 was inactive for 10.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7096, comm: lctl
      
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0380df7>] proc_trigger_watchdog+0x67/0x80 [libcfs]
       [<ffffffff811fd8e7>] proc_sys_call_handler+0x97/0xd0
       [<ffffffff811fd934>] proc_sys_write+0x14/0x20
       [<ffffffff81188f68>] vfs_write+0xb8/0x1a0
       [<ffffffff81189861>] sys_write+0x51/0x90
       [<ffffffff8152b2be>] ? do_device_not_available+0xe/0x10
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      
      LustreError: dumping log to /tmp/lustre-log.1411548646.7096
      

      And how it may look in the kernel logs when Lustre logs are not dumped:

      Lustre: DEBUG MARKER: == sanity test 242: Check that watchdog causes kernel log dump == 09:19:38 (1411550378)
      LNet: Service thread pid 12742 stopped after 20.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: DEBUG MARKER: sanity test_242: @@@@@@ FAIL: Lustre log wasn't dumped
      Lustre: DEBUG MARKER: == sanity test complete, duration 29 sec == 09:20:01 (1411550401)
      

          Activity

            [LU-5695] watchdog dispatch thread disappears

            cliffw Cliff White (Inactive) added a comment - Still waiting for a patch

            simmonsja James A Simmons added a comment - Please reopen. I plan to update the patch but I was waiting until the port to sysfs happens for Lustre 2.10.

            cliffw Cliff White (Inactive) added a comment - Bug out of date, no patch update. Closing

            cliffw Cliff White (Inactive) added a comment - The patch has failed review, can you address the issues?

            cliffw Cliff White (Inactive) added a comment - I will monitor this issue
            zam Alexander Zarochentsev added a comment - patch http://review.whamcloud.com/#/c/12155/

            the fix is like:

            diff --git a/libcfs/libcfs/watchdog.c b/libcfs/libcfs/watchdog.c
            index ed1acf7..e71b48a 100644
            --- a/libcfs/libcfs/watchdog.c
            +++ b/libcfs/libcfs/watchdog.c
            @@ -330,6 +330,7 @@ static void lcw_dispatch_stop(void)
                    wake_up(&lcw_event_waitq);
             
                    wait_for_completion(&lcw_stop_completion);
            +       clear_bit(LCW_FLAG_STOP, &lcw_flags);
             
                    CDEBUG(D_INFO, "watchdog dispatcher has shut down.\n");
            

            a proper patch will be submitted soon

            zam Alexander Zarochentsev added a comment (shown above)

            After a closer look at libcfs/libcfs/watchdog.c, it was found that the LCW_FLAG_STOP flag in the lcw_flags variable is only ever set and never cleared.

            [ 17:33:12 ] $ git grep lcw_flags
            libcfs/libcfs/watchdog.c:static unsigned long lcw_flags = 0;
            libcfs/libcfs/watchdog.c:       if (test_bit(LCW_FLAG_STOP, &lcw_flags))
            libcfs/libcfs/watchdog.c:               if (test_bit(LCW_FLAG_STOP, &lcw_flags)) {
            libcfs/libcfs/watchdog.c:       set_bit(LCW_FLAG_STOP, &lcw_flags);
            

            So if lcw_refcount reaches zero and the watchdog thread is stopped by lcw_dispatch_stop(), it will never work again (the thread exits immediately after start) until the modules are reloaded or the system is restarted.

            zam Alexander Zarochentsev added a comment (shown above)
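
            The failure mode described in the analysis above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: lcw_flags and LCW_FLAG_STOP are the names from watchdog.c, but every model_* function is hypothetical, plain bit arithmetic stands in for the kernel's set_bit()/clear_bit()/test_bit(), and no real thread or completion is involved.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the dispatcher stop flag. The names lcw_flags and
 * LCW_FLAG_STOP come from watchdog.c; every model_* function is a
 * hypothetical stand-in, not the libcfs implementation.
 */
#define LCW_FLAG_STOP 0

static unsigned long lcw_flags;

/* Models the dispatcher's check: the thread keeps running only while STOP is clear. */
bool model_dispatcher_survives_start(void)
{
	return (lcw_flags & (1UL << LCW_FLAG_STOP)) == 0;
}

/* Models lcw_dispatch_stop(); clear_after_stop selects the patched behaviour. */
void model_dispatch_stop(bool clear_after_stop)
{
	/* Stand-in for set_bit(): ask the dispatcher thread to exit. */
	lcw_flags |= 1UL << LCW_FLAG_STOP;

	/* The real code waits here until the thread has finished. */

	/* Stand-in for the clear_bit() line added by the proposed fix. */
	if (clear_after_stop)
		lcw_flags &= ~(1UL << LCW_FLAG_STOP);
}

/* Stops the dispatcher once, then reports whether a restart would keep running. */
bool model_restart_works(bool clear_after_stop)
{
	lcw_flags = 0;
	if (!model_dispatcher_survives_start())
		return false;	/* cannot happen with a clean flag word */

	model_dispatch_stop(clear_after_stop);
	return model_dispatcher_survives_start();
}

/* Self-check: without the fix the restart fails, with the fix it succeeds. */
void model_self_check(void)
{
	assert(!model_restart_works(false));
	assert(model_restart_works(true));
}
```

            Calling model_restart_works(false) models the reported behaviour: the stale stop bit makes the restarted dispatcher exit at once. Calling model_restart_works(true) models the one-line fix, where the bit is cleared once the stop has completed.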

            People

              simmonsja James A Simmons
              zam Alexander Zarochentsev