[LU-2944] Client evictions - watchdog timeouts on MDT - iorfpp Created: 11/Mar/13  Updated: 04/Feb/14  Resolved: 13/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: MB
Environment:

Hyperion/LLNL RHEL6


Issue Links:
Related
is related to LU-4572 hung mdt threads Resolved
is related to LU-2419 mdt threads stuck in ldlm_expired_com... Closed
Severity: 3
Rank (Obsolete): 7066

 Description   

Running the parallel-scale IOR fpp test. At the end of the test, 60 clients report ENOTCONN and are then evicted from the MDT due to a lock callback timeout:

Mar  8 15:47:22 hyperion-rst6 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 477s: evicting client at 192.168.117.65@o2ib1  ns: mdt-ffff8802f7d21000 lock: ffff8801193bfe00/0xb4fc8ee670e8bb9a lrc: 3/0,0 mode: CR/CR res: 8589935754/45677 bits 0x9 rrc: 2 type: IBT flags: 0x200000000020 nid: 192.168.117.65@o2ib1 remote: 0xc3b8b1de83cb4606 expref: 81 pid: 11951 timeout: 4379843546 lvb_type: 0
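
For triage, the eviction messages can be pulled out of a syslog capture to confirm how many clients were evicted and how long each lock callback timer ran. The following is a minimal sketch only: the regular expression is modeled on the single log line quoted above and is an assumption, not a complete grammar for ldlm messages, and the names EVICT_RE and find_evictions are hypothetical helpers introduced here for illustration.

```python
import re

# Pattern modeled on the one eviction line quoted in this report.
# This is an assumption: other Lustre versions may word the message differently.
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)"
)


def find_evictions(lines):
    """Return a list of (client NID, timer seconds) for each eviction message."""
    found = []
    for line in lines:
        match = EVICT_RE.search(line)
        if match:
            found.append((match.group("nid"), int(match.group("secs"))))
    return found


# Shortened copy of the log line from the description.
sample = (
    "Mar  8 15:47:22 hyperion-rst6 kernel: LustreError: "
    "0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback "
    "timer expired after 477s: evicting client at 192.168.117.65@o2ib1  "
    "ns: mdt-ffff8802f7d21000"
)

print(find_evictions([sample]))
```

Feeding the full MDS syslog through find_evictions and counting distinct NIDs would show whether all 60 clients that reported ENOTCONN were in fact evicted.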

After a long delay, the system is idle and the MDT is now reporting watchdog timeouts:

Mar  8 17:03:57 hyperion-rst6 kernel: LNet: Service thread pid 15804 was inactive for 304.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar  8 17:03:57 hyperion-rst6 kernel: Pid: 15804, comm: mdt03_035
Mar  8 17:03:57 hyperion-rst6 kernel:
Mar  8 17:03:57 hyperion-rst6 kernel: Call Trace:  
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa06df6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096b22d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0966950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096a968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3d92b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3e1a4>] mdt_object_lock+0x14/0x20 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f5e5a9>] mdt_reint_unlink+0x5b9/0xdf0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f59781>] mdt_reint_rec+0x41/0xe0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f52de3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f53114>] mdt_reint+0x44/0xe0 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f44008>] mdt_handle_common+0x628/0x1620 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f7c6e5>] mds_regular_handle+0x15/0x20 [mdt]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a404c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa06df5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa099b799>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff81052223>] ? __wake_up+0x53/0x70
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a5596>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

After a period, the evicted clients are unable to reconnect; the MDT reports 'busy with 1 RPC'.



 Comments   
Comment by Oleg Drokin [ 11/Mar/13 ]

We really need more info than that, such as sysrq-t output or similar,
or a bigger log showing all of the hung threads.

Does this happen every time?

Comment by Cliff White (Inactive) [ 13/Mar/13 ]

No, I have repeated the parallel-scale tests and did not see evictions on the second run. You can close; I can re-open if I get a repeat.
