Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Affects Version: Lustre 2.4.1
Environment: our Lustre source tree is at https://github.com/jlan/lustre-nas
Severity: 3
Rank: 12492
Description
MDT service threads hung, and we had to force a reboot of the MDS on two different occasions.
Uploading the following to the FTP site:
lustre-log.1391239242.7851.txt.gz
vmcore-dmesg.txt.gz
Lustre: MGS: haven't heard from client c546719d-1bcc-571f-a4e3-17f67dc35b50 (at 10.151.31.4@o2ib) in 199 seconds. I think it's dead, and I am evicting it. exp ffff880fd0587800, cur 1391236411 expire 1391236261 last 1391236212
LNet: Service thread pid 7851 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7851, comm: mdt01_055
Call Trace:
[<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
[<ffffffff81080610>] ? process_timeout+0x0/0x10
[<ffffffffa04156d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa06d201d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa06cd790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
[<ffffffff81063be0>] ? default_wake_function+0x0/0x20
[<ffffffffa06d1758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
[<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa0dddc0c>] mdt_object_lock0+0x28c/0xaf0 [mdt]
[<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0dde534>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0dde5a1>] mdt_object_find_lock+0x61/0x170 [mdt]
[<ffffffffa0e0c80c>] mdt_reint_open+0x8cc/0x20e0 [mdt]
[<ffffffffa043185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
[<ffffffffa06fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
[<ffffffffa05921b0>] ? lu_ucred+0x20/0x30 [obdclass]
[<ffffffffa0dd7015>] ? mdt_ucred+0x15/0x20 [mdt]
[<ffffffffa0df31cc>] ? mdt_root_squash+0x2c/0x410 [mdt]
[<ffffffffa0df7981>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0ddcb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
[<ffffffffa0ddd090>] mdt_intent_reint+0x1f0/0x530 [mdt]
[<ffffffffa0ddaf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
[<ffffffffa06b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
[<ffffffffa06d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[<ffffffffa0ddb3c6>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0de1ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa0e1b615>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa070b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa04155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa0426d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa0702729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81055813>] ? __wake_up+0x53/0x70
[<ffffffffa070c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1391239242.7851
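For reference, the eviction message at the top is simple timestamp arithmetic: cur - last = 1391236411 - 1391236212 = 199 seconds of silence from the client, and cur is already past expire, so the MGS evicts the export. Below is a minimal sketch of that check; export_is_expired is an illustrative name, not Lustre's actual function, and the timestamp values are copied from the log above.

    /* Minimal sketch (not Lustre's actual code) of the eviction-timing
     * check implied by the MGS message above. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef long long time_sec;

    /* A client export is considered dead once the current time passes
     * its expiry deadline. */
    static bool export_is_expired(time_sec cur, time_sec expire)
    {
        return cur > expire;
    }

    int main(void)
    {
        time_sec cur    = 1391236411;  /* "cur"    from the log */
        time_sec expire = 1391236261;  /* "expire" from the log */
        time_sec last   = 1391236212;  /* "last"   from the log */

        /* Prints 199, matching "haven't heard from client ... in 199 seconds". */
        printf("silent for %lld seconds\n", cur - last);
        printf("expired: %s\n", export_is_expired(cur, expire) ? "yes" : "no");
        return 0;
    }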
1. I am working through the logs to see if there were any IB fabric issues.
2. When the system was in that state, CPU utilization was minimal.
3. We have seen the clocksource messages in our logs, but we haven't found them to cause any issues. I will switch the clocksource to hpet anyway (see the sketch after this list).
4. This filesystem has 84 OSTs.
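Regarding item 3: on these kernels the active clocksource can be changed at runtime through the standard sysfs interface (from a shell, echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource). A minimal C sketch of the same write follows, assuming that standard sysfs path is present on the node:

    /* Minimal sketch: switch the running clocksource to hpet via sysfs.
     * Same effect as:
     *   echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path =
            "/sys/devices/system/clocksource/clocksource0/current_clocksource";
        FILE *f = fopen(path, "w");  /* requires root */

        if (f == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        if (fputs("hpet\n", f) == EOF) {
            perror("fputs");
            fclose(f);
            return EXIT_FAILURE;
        }
        if (fclose(f) == EOF) {
            perror("fclose");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }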