[LU-5428] LNet: Service thread pid completed after 0.00s (DDN SR34734) Created: 29/Jul/14  Updated: 21/Mar/17  Resolved: 28/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Minor
Reporter: Oz Rentas
Assignee: Liang Zhen (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment: New Installation - Lustre 2.4.3 servers, 1.8.9 Clients


Issue Links:
Related
Severity: 3
Rank (Obsolete): 15111

 Description   

This problem was reported against a newly installed system at NOAA (Boulder). The system was idle at the time:

Jul 17 04:53:57 lfs-mds-0-1 kernel: : LNet: Service thread pid 29363 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 17 05:26:52 lfs-mds-0-1 kernel: : LNet: Service thread pid 29363 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 20 04:10:08 lfs-mds-0-1 kernel: : LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=off. Opts:
Jul 21 01:20:12 lfs-mds-0-1 kernel: : LNet: Service thread pid 13603 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 21 14:35:12 lfs-mds-0-1 kernel: : LNet: Service thread pid 13829 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 23 05:55:12 lfs-mds-0-1 kernel: : LNet: Service thread pid 29363 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 24 11:19:47 lfs-mds-0-1 kernel: : LNet: Service thread pid 13672 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

The customer states that he is observing "LNet: Service thread pid completed after 0.00s" even when the system is idle (it is a pre-production testbed).

I also saw these same messages on another idle system that was newly installed (Harvard (HMU)).



 Comments   
Comment by Peter Jones [ 30/Jul/14 ]

Liang

Could you please advise on this one?

Thanks

Peter

Comment by Liang Zhen (Inactive) [ 31/Jul/14 ]

It is strange that we saw "Service thread pid completed after 0.00s": the watchdog should complain only if a service thread took too long to finish a request, but here it reported 0.00s. I think it could be a bug in our watchdog code; I will look into it.
Btw, I assume the system is otherwise still working fine apart from these false warnings?

Comment by Oz Rentas [ 01/Aug/14 ]

Yes, very strange. I agree.

>>Btw, I assume the system is otherwise still working fine apart from these false warnings?
Yes

Thanks,
Oz

Comment by Oz Rentas [ 11/Aug/14 ]

Any ideas on this one?

Comment by Liang Zhen (Inactive) [ 12/Aug/14 ]

Hi, sorry for the late response. I have worked out a patch: http://review.whamcloud.com/11415
Briefly, the cause of this issue is a race between lc_watchdog_touch and lcw_cb that can generate a false alarm. It should be harmless, but it is still good to fix it.
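
To illustrate the kind of race described above, here is a minimal sketch in C. It is a toy model, not the libcfs code: the names toy_watchdog, toy_timer_cb, toy_touch and toy_interleave are hypothetical, the interleaving is replayed by hand in a single thread, and the "recheck" variant is only one plausible way to close such a window, not necessarily what the patch does. The assumption is that a timer callback which has already been dispatched sets an "expired" flag after the service thread has touched the watchdog, so the next touch reports a completion with an elapsed time of 0.00s.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified watchdog state; not the libcfs structure. */
struct toy_watchdog {
    double last_touched; /* time of the most recent touch, in seconds */
    double timeout;      /* allowed time between touches, in seconds */
    bool   expired;      /* set by the timer callback */
};

/*
 * Timer callback. Without recheck it flags the watchdog unconditionally,
 * as a callback that was dispatched before the touch might do. With
 * recheck it first confirms that the deadline has really passed.
 */
void toy_timer_cb(struct toy_watchdog *w, double now, bool recheck)
{
    if (recheck && now - w->last_touched < w->timeout)
        return; /* the thread touched in time: nothing to report */
    w->expired = true;
}

/*
 * Touch, called by the service thread. Returns the elapsed time that a
 * "completed after" message would report, or -1.0 if nothing is logged.
 */
double toy_touch(struct toy_watchdog *w, double now)
{
    double reported = -1.0;

    assert(now >= w->last_touched);
    if (w->expired) {
        reported = now - w->last_touched;
        w->expired = false;
    }
    w->last_touched = now;
    return reported;
}

/* Replays the assumed interleaving and returns what the second touch reports. */
double toy_interleave(bool recheck)
{
    struct toy_watchdog w = { 0.0, 10.0, false };

    /* 1. The timer fires at t = 10.0 and the callback starts running. */
    /* 2. The service thread touches first; the flag is not yet set. */
    toy_touch(&w, 10.0);
    /* 3. The callback now finishes, acting on a stale deadline. */
    toy_timer_cb(&w, 10.0, recheck);
    /* 4. The next touch sees the flag and reports the elapsed time. */
    return toy_touch(&w, 10.0);
}
```

In this model, toy_interleave(false) returns 0.0, which corresponds to a "completed after 0.00s" message, while toy_interleave(true) returns -1.0, meaning no message would be logged.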

Comment by Liang Zhen (Inactive) [ 19/Aug/14 ]

Patch landed to master

Comment by Oz Rentas [ 28/Aug/14 ]

Thanks much. Go ahead and close this.

Comment by Peter Jones [ 28/Aug/14 ]

Thanks Oz

Comment by Rustem Bikboulatov [ 22/Feb/16 ]

Will this patch work with earlier versions of Lustre? For example, version 2.1.5?

In version 2.1.5, we are seeing the same symptoms:

Feb 21 07:29:59 mmp-2 kernel: Lustre: Service thread pid 5040 was inactive for 0.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 21 07:29:59 mmp-2 kernel: Pid: 5040, comm: ll_mgs_00
Feb 21 07:29:59 mmp-2 kernel:
Feb 21 07:29:59 mmp-2 kernel: Call Trace:
Feb 21 07:29:59 mmp-2 kernel: Lustre: Service thread pid 5040 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa0620af4>] ? target_send_reply_msg+0x54/0x190 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa03b360e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa065e9f9>] ptlrpc_wait_event+0x2b9/0x2c0 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa066657d>] ptlrpc_main+0x61d/0x1a40 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa0665f60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa0665f60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffffa0665f60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Feb 21 07:29:59 mmp-2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Feb 21 07:29:59 mmp-2 kernel:

Comment by Liang Zhen (Inactive) [ 29/Feb/16 ]

Yes, I think it should work for 2.1.5.

Comment by Gerrit Updater [ 14/Oct/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23162
Subject: LU-5428 libcfs: race in lc_watchdog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 637b75e8ebd2e17c151f32aecb341dbfa336264b

Comment by Peter Jones [ 21/Mar/17 ]

Patch will be tracked for landing under LU-9235
