Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
3
-
5846
Description
I think this is a duplicate of LU-1640, but just in case it's not I'm opening this as a separate ticket.
I've found the following message on the console of our Grove MDS today after upgrading to 2.3.57-1chaos-1surya1 (2.3.57-1chaos + a couple changes for LU-2109) and rebooting a few times.
2012-12-13 13:25:25 LNet: Service thread pid 33078 was inactive for 450.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2012-12-13 13:25:25 Pid: 33078, comm: mdt01_020 2012-12-13 13:25:25 2012-12-13 13:25:25 Call Trace: 2012-12-13 13:25:25 [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs] 2012-12-13 13:25:25 [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass] 2012-12-13 13:25:25 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 2012-12-13 13:25:25 [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80 2012-12-13 13:25:25 [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass] 2012-12-13 13:25:25 [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd] 2012-12-13 13:25:25 [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd] 2012-12-13 13:25:25 [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] 2012-12-13 13:25:25 [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt] 2012-12-13 13:25:25 [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt] 2012-12-13 13:25:25 [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt] 2012-12-13 13:25:25 [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt] 2012-12-13 13:25:25 [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] 2012-12-13 13:25:25 [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs] 2012-12-13 13:25:25 [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] 2012-12-13 13:25:25 [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2012-12-13 13:25:25 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70 2012-12-13 13:25:25 [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc] 2012-12-13 13:25:25 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 13:25:25 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-12-13 13:25:25 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 13:25:25 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 13:25:25 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-12-13 13:25:25 2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078
Even nearly an hour later, it looks like the thread is still stuck. I continually see the following message every few minutes:
2012-12-13 14:22:12 INFO: task mdt01_020:33078 blocked for more than 120 seconds. 2012-12-13 14:22:12 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2012-12-13 14:22:12 mdt01_020 D 0000000000000004 0 33078 2 0x00000000 2012-12-13 14:22:12 ffff880f444d7ae0 0000000000000046 0000000000000000 ffff880f444d7b60 2012-12-13 14:22:12 ffff880f444d7a90 ffffc9007e22e02c 0000000000000246 0000000000000246 2012-12-13 14:22:12 ffff880f444d5ab8 ffff880f444d7fd8 000000000000f4e8 ffff880f444d5ab8 2012-12-13 14:22:12 Call Trace: 2012-12-13 14:22:12 [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs] 2012-12-13 14:22:12 [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass] 2012-12-13 14:22:12 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 2012-12-13 14:22:12 [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80 2012-12-13 14:22:12 [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass] 2012-12-13 14:22:12 [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd] 2012-12-13 14:22:12 [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd] 2012-12-13 14:22:12 [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] 2012-12-13 14:22:12 [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt] 2012-12-13 14:22:12 [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt] 2012-12-13 14:22:12 [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt] 2012-12-13 14:22:12 [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt] 2012-12-13 14:22:12 [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] 2012-12-13 14:22:12 [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs] 2012-12-13 14:22:12 [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] 2012-12-13 14:22:12 [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2012-12-13 14:22:12 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70 2012-12-13 14:22:12 [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc] 2012-12-13 14:22:12 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 14:22:12 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-12-13 14:22:12 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 14:22:12 [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc] 2012-12-13 14:22:12 [<ffffffff8100c140>] ? child_rip+0x0/0x20
I've attached the lustre log file dumped by the LNet message above:
2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078