Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.11.0
    • None
    • 3
    • 15940

    Description

      Sometimes the lc_watchdogd thread disappears without any messages, and the Lustre logs are not dumped after the watchdog is triggered.

      How the correct behaviour should look:

      LNet: Service thread pid 7096 was inactive for 10.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7096, comm: lctl
      
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0380df7>] proc_trigger_watchdog+0x67/0x80 [libcfs]
       [<ffffffff811fd8e7>] proc_sys_call_handler+0x97/0xd0
       [<ffffffff811fd934>] proc_sys_write+0x14/0x20
       [<ffffffff81188f68>] vfs_write+0xb8/0x1a0
       [<ffffffff81189861>] sys_write+0x51/0x90
       [<ffffffff8152b2be>] ? do_device_not_available+0xe/0x10
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      
      LustreError: dumping log to /tmp/lustre-log.1411548646.7096
      

      and how it may look in the kernel logs when lustre logs are not dumped:

      Lustre: DEBUG MARKER: == sanity test 242: Check that watchdog causes kernel log dump == 09:19:38 (1411550378)
      LNet: Service thread pid 12742 stopped after 20.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: DEBUG MARKER: sanity test_242: @@@@@@ FAIL: Lustre log wasn't dumped
      Lustre: DEBUG MARKER: == sanity test complete, duration 29 sec == 09:20:01 (1411550401)
      

          Activity

            [LU-5695] watchdog dispatch thread disappears

            simmonsja James A Simmons added a comment - Since Alex is okay with a one line fix I refreshed the patch. Very simple and should be landed soon.

            simmonsja James A Simmons added a comment - Is this fixed?

            cliffw Cliff White (Inactive) added a comment - Old issue, already fixed

            zam Alexander Zarochentsev added a comment - The procfs (or sysfs) part of the patch is only for testing; I think at least the actual fix from the patch can be landed.

            cliffw Cliff White (Inactive) added a comment - Still waiting for a patch

            simmonsja James A Simmons added a comment - Please reopen. I plan to update the patch but I was waiting until the port to sysfs happens for Lustre 2.10.

            cliffw Cliff White (Inactive) added a comment - Bug out of date, no patch update. Closing

            cliffw Cliff White (Inactive) added a comment - The patch has failed review, can you address the issues?

            cliffw Cliff White (Inactive) added a comment - I will monitor this issue
            zam Alexander Zarochentsev added a comment - patch http://review.whamcloud.com/#/c/12155/

            zam Alexander Zarochentsev added a comment - the fix is like:

            diff --git a/libcfs/libcfs/watchdog.c b/libcfs/libcfs/watchdog.c
            index ed1acf7..e71b48a 100644
            --- a/libcfs/libcfs/watchdog.c
            +++ b/libcfs/libcfs/watchdog.c
            @@ -330,6 +330,7 @@ static void lcw_dispatch_stop(void)
                    wake_up(&lcw_event_waitq);
             
                    wait_for_completion(&lcw_stop_completion);
            +       clear_bit(LCW_FLAG_STOP, &lcw_flags);
             
                    CDEBUG(D_INFO, "watchdog dispatcher has shut down.\n");

            a proper patch will be submitted soon

            People

              simmonsja James A Simmons
              zam Alexander Zarochentsev
              Votes: 0
              Watchers: 7
