[LU-5695] watchdog dispatch thread disappears Created: 01/Oct/14 Updated: 27/Feb/18 Resolved: 27/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15940 | ||||||||
| Description |
|
Sometimes lc_watchdogd disappears without any message, and Lustre logs are not dumped after the watchdog triggers.

Correct behaviour should look like this:

LNet: Service thread pid 7096 was inactive for 10.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7096, comm: lctl
Call Trace:
[<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
[<ffffffff81084220>] ? process_timeout+0x0/0x10
[<ffffffffa0380df7>] proc_trigger_watchdog+0x67/0x80 [libcfs]
[<ffffffff811fd8e7>] proc_sys_call_handler+0x97/0xd0
[<ffffffff811fd934>] proc_sys_write+0x14/0x20
[<ffffffff81188f68>] vfs_write+0xb8/0x1a0
[<ffffffff81189861>] sys_write+0x51/0x90
[<ffffffff8152b2be>] ? do_device_not_available+0xe/0x10
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LustreError: dumping log to /tmp/lustre-log.1411548646.7096

And this is how the kernel log may look when the Lustre logs are not dumped:

Lustre: DEBUG MARKER: == sanity test 242: Check that watchdog causes kernel log dump == 09:19:38 (1411550378)
LNet: Service thread pid 12742 stopped after 20.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: DEBUG MARKER: sanity test_242: @@@@@@ FAIL: Lustre log wasn't dumped
Lustre: DEBUG MARKER: == sanity test complete, duration 29 sec == 09:20:01 (1411550401) |
| Comments |
| Comment by Alexander Zarochentsev [ 01/Oct/14 ] |
|
A closer look at libcfs/libcfs/watchdog.c shows that the LCW_FLAG_STOP flag in the lcw_flags variable is only ever set and never cleared:
$ git grep lcw_flags
libcfs/libcfs/watchdog.c:static unsigned long lcw_flags = 0;
libcfs/libcfs/watchdog.c: if (test_bit(LCW_FLAG_STOP, &lcw_flags))
libcfs/libcfs/watchdog.c: if (test_bit(LCW_FLAG_STOP, &lcw_flags)) {
libcfs/libcfs/watchdog.c: set_bit(LCW_FLAG_STOP, &lcw_flags);
So once lcw_refcount reaches zero and the watchdog thread is stopped by lcw_dispatch_stop(), it never works again (it exits immediately after start) until the modules are reloaded or the system is restarted. |
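The failure mode described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: the `model_*` functions are hypothetical stand-ins for the kernel's bit operations and for the dispatcher loop, and the loop shape is a guess rather than the code in watchdog.c. Only `LCW_FLAG_STOP` and `lcw_flags` are names taken from the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Names from watchdog.c; everything prefixed model_ is hypothetical. */
enum { LCW_FLAG_STOP = 0 };
static unsigned long lcw_flags = 0;

/* Userspace stand-in for the kernel's set_bit(). */
static void model_set_bit(int nr, unsigned long *addr)
{
    assert(nr >= 0 && nr < (int)(sizeof(*addr) * 8));
    *addr |= 1UL << nr;
}

/* Userspace stand-in for the kernel's test_bit(). */
static bool model_test_bit(int nr, const unsigned long *addr)
{
    return (*addr >> nr) & 1UL;
}

/*
 * Models one lifetime of the dispatcher thread: it handles pending
 * events until it sees the stop flag. Returns how many were handled.
 */
int model_dispatch_run(int pending_events)
{
    int handled = 0;

    while (pending_events-- > 0) {
        if (model_test_bit(LCW_FLAG_STOP, &lcw_flags))
            break;
        handled++;
    }
    return handled;
}

/* Models the unpatched stop path: the flag is set and never cleared. */
void model_dispatch_stop(void)
{
    model_set_bit(LCW_FLAG_STOP, &lcw_flags);
}
```

In this model, `model_dispatch_run(3)` handles all three events on the first start, but after one `model_dispatch_stop()` every later call handles none, which mirrors a dispatcher that exits immediately after being restarted.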
| Comment by Alexander Zarochentsev [ 01/Oct/14 ] |
|
The fix looks like this:
diff --git a/libcfs/libcfs/watchdog.c b/libcfs/libcfs/watchdog.c
index ed1acf7..e71b48a 100644
--- a/libcfs/libcfs/watchdog.c
+++ b/libcfs/libcfs/watchdog.c
@@ -330,6 +330,7 @@ static void lcw_dispatch_stop(void)
        wake_up(&lcw_event_waitq);
        wait_for_completion(&lcw_stop_completion);
+       clear_bit(LCW_FLAG_STOP, &lcw_flags);
        CDEBUG(D_INFO, "watchdog dispatcher has shut down.\n");
A proper patch will be submitted soon. |
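To see why clearing the flag after the thread has finished is enough, here is a small userspace sketch of a stop/start cycle with and without the extra `clear_bit()`. It assumes the dispatcher checks the flag on entry. All `model_*` names and the `dispatcher_alive` variable are hypothetical, and the single assignment standing in for `wake_up()` plus `wait_for_completion()` is a simplification, not the kernel behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Names from watchdog.c; everything else here is hypothetical. */
enum { LCW_FLAG_STOP = 0 };
static unsigned long lcw_flags;
static bool dispatcher_alive;

/* Userspace stand-ins for the kernel bit operations. */
static void model_set_bit(int nr, unsigned long *addr)
{
    *addr |= 1UL << nr;
}

static void model_clear_bit(int nr, unsigned long *addr)
{
    *addr &= ~(1UL << nr);
}

static bool model_test_bit(int nr, const unsigned long *addr)
{
    return (*addr >> nr) & 1UL;
}

/* Models thread start: the thread exits at once if the stop flag is set. */
void model_dispatch_start(void)
{
    dispatcher_alive = !model_test_bit(LCW_FLAG_STOP, &lcw_flags);
}

/* Models the stop path; apply_fix selects the patched behaviour. */
void model_dispatch_stop(bool apply_fix)
{
    model_set_bit(LCW_FLAG_STOP, &lcw_flags);
    dispatcher_alive = false; /* stands in for wake_up + wait_for_completion */
    if (apply_fix)
        model_clear_bit(LCW_FLAG_STOP, &lcw_flags);
}

/* Returns true if the dispatcher is alive after one stop/start cycle. */
bool model_restart_works(bool apply_fix)
{
    lcw_flags = 0;
    model_dispatch_start();
    assert(dispatcher_alive); /* a first start always succeeds */
    model_dispatch_stop(apply_fix);
    model_dispatch_start();
    return dispatcher_alive;
}
```

Here `model_restart_works(true)` reports a live dispatcher after the restart, while `model_restart_works(false)` does not. Clearing the flag only after the wait matters in the real code, since clearing it earlier could let the old thread miss the stop request.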
| Comment by Alexander Zarochentsev [ 01/Oct/14 ] |
| Comment by Cliff White (Inactive) [ 02/Oct/14 ] |
|
I will monitor this issue |
| Comment by Cliff White (Inactive) [ 20/Nov/14 ] |
|
The patch has failed review, can you address the issues? |
| Comment by Cliff White (Inactive) [ 18/Aug/16 ] |
|
Bug out of date, no patch update. Closing |
| Comment by James A Simmons [ 18/Aug/16 ] |
|
Please reopen. I plan to update the patch but I was waiting until the port to sysfs happens for Lustre 2.10. |
| Comment by Cliff White (Inactive) [ 18/Aug/16 ] |
|
Still waiting for a patch |
| Comment by Alexander Zarochentsev [ 20/Aug/16 ] |
|
The procfs (or sysfs) part of the patch is only for testing; I think at least the actual fix from the patch can be landed. |
| Comment by Cliff White (Inactive) [ 09/Feb/18 ] |
|
Old issue, already fixed |
| Comment by James A Simmons [ 16/Feb/18 ] |
|
Is this fixed? |
| Comment by James A Simmons [ 16/Feb/18 ] |
|
Since Alex is okay with a one line fix I refreshed the patch. Very simple and should be landed soon. |
| Comment by Alexander Zarochentsev [ 17/Feb/18 ] |
|
Yes, let's go with the one-line fix. |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/12155/ |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Landed for 2.11 |