[LU-5695] watchdog dispatch thread disappears Created: 01/Oct/14  Updated: 27/Feb/18  Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
Severity: 3
Rank (Obsolete): 15940

 Description   

Sometimes the lc_watchdogd thread disappears without any message, and Lustre logs are not dumped after the watchdog triggers.

How the correct behaviour should look:

LNet: Service thread pid 7096 was inactive for 10.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7096, comm: lctl

Call Trace:
 [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81084220>] ? process_timeout+0x0/0x10
 [<ffffffffa0380df7>] proc_trigger_watchdog+0x67/0x80 [libcfs]
 [<ffffffff811fd8e7>] proc_sys_call_handler+0x97/0xd0
 [<ffffffff811fd934>] proc_sys_write+0x14/0x20
 [<ffffffff81188f68>] vfs_write+0xb8/0x1a0
 [<ffffffff81189861>] sys_write+0x51/0x90
 [<ffffffff8152b2be>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

LustreError: dumping log to /tmp/lustre-log.1411548646.7096

And how the kernel logs may look when the Lustre logs are not dumped:

Lustre: DEBUG MARKER: == sanity test 242: Check that watchdog causes kernel log dump == 09:19:38 (1411550378)
LNet: Service thread pid 12742 stopped after 20.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: DEBUG MARKER: sanity test_242: @@@@@@ FAIL: Lustre log wasn't dumped
Lustre: DEBUG MARKER: == sanity test complete, duration 29 sec == 09:20:01 (1411550401)


 Comments   
Comment by Alexander Zarochentsev [ 01/Oct/14 ]

A closer look at libcfs/libcfs/watchdog.c shows that the LCW_FLAG_STOP flag in the lcw_flags variable is only ever set and never cleared.

[ 17:33:12 ] $ git grep lcw_flags
libcfs/libcfs/watchdog.c:static unsigned long lcw_flags = 0;
libcfs/libcfs/watchdog.c:       if (test_bit(LCW_FLAG_STOP, &lcw_flags))
libcfs/libcfs/watchdog.c:               if (test_bit(LCW_FLAG_STOP, &lcw_flags)) {
libcfs/libcfs/watchdog.c:       set_bit(LCW_FLAG_STOP, &lcw_flags);

So if lcw_refcount reaches zero and the watchdog thread is stopped by lcw_dispatch_stop(), the thread never works again (it exits immediately after each start) until the modules are reloaded or the system is restarted.
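
The failure mode can be illustrated with a small user-space model. This is only a sketch and not the libcfs code: the names lcw_flags, lcw_refcount, LCW_FLAG_STOP and lcw_dispatch_stop come from this ticket, while every model_* function, the dispatcher_alive variable, the clear_on_stop parameter, and the assumption that the dispatcher is started when the first watchdog is added are hypothetical and exist only for illustration.

```c
/* User-space model of the sticky stop flag. NOT the libcfs implementation. */
#include <assert.h>
#include <stdbool.h>

#define LCW_FLAG_STOP 0x1UL          /* modelled as a bit mask here */

static unsigned long lcw_flags;      /* mirrors the static in watchdog.c */
static int lcw_refcount;             /* number of live watchdogs */
static bool dispatcher_alive;        /* hypothetical: true while the thread would run */

/* Model of the dispatch thread start: it exits at once if STOP is set. */
static void model_dispatch_start(void)
{
	dispatcher_alive = !(lcw_flags & LCW_FLAG_STOP);
}

/*
 * Model of lcw_dispatch_stop(). The hypothetical clear_on_stop parameter
 * selects the original behaviour (false) or the proposed fix (true).
 */
static void model_dispatch_stop(bool clear_on_stop)
{
	lcw_flags |= LCW_FLAG_STOP;      /* ask the thread to stop */
	dispatcher_alive = false;        /* thread sees the flag and exits */
	if (clear_on_stop)
		lcw_flags &= ~LCW_FLAG_STOP; /* allow a later restart */
}

/* Assumption: the first watchdog starts the dispatcher. */
static void model_watchdog_add(void)
{
	if (++lcw_refcount == 1)
		model_dispatch_start();
}

/* The last watchdog going away stops the dispatcher. */
static void model_watchdog_delete(bool clear_on_stop)
{
	assert(lcw_refcount > 0);
	if (--lcw_refcount == 0)
		model_dispatch_stop(clear_on_stop);
}

/* Reports whether the dispatcher runs after an add, delete, add cycle. */
bool dispatcher_survives_restart(bool clear_on_stop)
{
	lcw_flags = 0;
	lcw_refcount = 0;
	dispatcher_alive = false;

	model_watchdog_add();
	model_watchdog_delete(clear_on_stop);
	model_watchdog_add();
	return dispatcher_alive;
}
```

In this model, dispatcher_survives_restart(false) reports that the dispatcher is dead after the second start because the stop bit is still set, whereas dispatcher_survives_restart(true) reports that it is running again.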

Comment by Alexander Zarochentsev [ 01/Oct/14 ]

The fix is along these lines:

diff --git a/libcfs/libcfs/watchdog.c b/libcfs/libcfs/watchdog.c
index ed1acf7..e71b48a 100644
--- a/libcfs/libcfs/watchdog.c
+++ b/libcfs/libcfs/watchdog.c
@@ -330,6 +330,7 @@ static void lcw_dispatch_stop(void)
        wake_up(&lcw_event_waitq);
 
        wait_for_completion(&lcw_stop_completion);
+       clear_bit(LCW_FLAG_STOP, &lcw_flags);
 
        CDEBUG(D_INFO, "watchdog dispatcher has shut down.\n");

A proper patch will be submitted soon.

Comment by Alexander Zarochentsev [ 01/Oct/14 ]

patch http://review.whamcloud.com/#/c/12155/

Comment by Cliff White (Inactive) [ 02/Oct/14 ]

I will monitor this issue

Comment by Cliff White (Inactive) [ 20/Nov/14 ]

The patch has failed review, can you address the issues?

Comment by Cliff White (Inactive) [ 18/Aug/16 ]

Bug out of date, no patch update. Closing

Comment by James A Simmons [ 18/Aug/16 ]

Please reopen. I plan to update the patch but I was waiting until the port to sysfs happens for Lustre 2.10.

Comment by Cliff White (Inactive) [ 18/Aug/16 ]

Still waiting for a patch

Comment by Alexander Zarochentsev [ 20/Aug/16 ]

The procfs (or sysfs) part of the patch is only for testing; I think at least the actual fix from the patch can be landed.

Comment by Cliff White (Inactive) [ 09/Feb/18 ]

Old issue, already fixed

Comment by James A Simmons [ 16/Feb/18 ]

Is this fixed?

Comment by James A Simmons [ 16/Feb/18 ]

Since Alex is okay with a one line fix I refreshed the patch. Very simple and should be landed soon.

Comment by Alexander Zarochentsev [ 17/Feb/18 ]

Yes, let's go with the one-line fix.

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/12155/
Subject: LU-5695 libcfs: watchdog dispatch thread fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1947bc08c0709ad80611dc65785ccb8dbf7f7214

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11
