Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.4.0
Environment: Hyperion/LLNL RHEL6
Severity: 3
7066
Description
Running the parallel-scale IOR fpp test. At the end of the test, 60 clients report ENOTCONN and are then evicted from the MDT due to a lock callback timeout:
Mar 8 15:47:22 hyperion-rst6 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 477s: evicting client at 192.168.117.65@o2ib1 ns: mdt-ffff8802f7d21000 lock: ffff8801193bfe00/0xb4fc8ee670e8bb9a lrc: 3/0,0 mode: CR/CR res: 8589935754/45677 bits 0x9 rrc: 2 type: IBT flags: 0x200000000020 nid: 192.168.117.65@o2ib1 remote: 0xc3b8b1de83cb4606 expref: 81 pid: 11951 timeout: 4379843546 lvb_type: 0
After a long delay, the system is idle and the MDT is now watchdogging:
Mar 8 17:03:57 hyperion-rst6 kernel: LNet: Service thread pid 15804 was inactive for 304.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 15804, comm: mdt03_035

Call Trace:
[<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
[<ffffffff8107cb50>] ? process_timeout+0x0/0x10
[<ffffffffa06df6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa096b22d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa0966950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa096a968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
[<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa0f3d92b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0f3e1a4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0f5e5a9>] mdt_reint_unlink+0x5b9/0xdf0 [mdt]
[<ffffffffa0f59781>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0f52de3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
[<ffffffffa0f53114>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0f44008>] mdt_handle_common+0x628/0x1620 [mdt]
[<ffffffffa0f7c6e5>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa09a404c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
[<ffffffffa06df5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa099b799>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81052223>] ? __wake_up+0x53/0x70
[<ffffffffa09a5596>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
[<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
After a period, the evicted clients are unable to reconnect; the MDT reports 'busy with 1 RPC'.