Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None
Environment:
{noformat}
Single node test system (MGT, 2x MDT, 3x OST)
RHEL 6.5 kernel 2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
Lustre v2_7_50_0-3-g7ed514f

7ed514fc1c51 LU-5420 ptlrpc: revert ptlrpc_reconnect_import() changes
88a0262fdea8 LU-5912 build: Fix XeonPhi build
b851f20baba1 LU-6216 tests: compile fixes for PPC64, and for clang
f23e085ac937 Move master branch to 2.8 development
{noformat}
Severity: 3
Description
I was running sanity-lfsck on master (v2_7_50_0-3-g7ed514f) and hit a soft lockup in test_18f:
{noformat}
Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 awake
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:43 Lustre: DEBUG MARKER: == sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects ====================== 05:01:42 (1425038502)
Feb 27 05:01:43 Pid: 7574, comm: mdt01_002
Feb 27 05:01:43 Call Trace:
Feb 27 05:01:43 [<ffffffffa0f1e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 27 05:01:43 [<ffffffffa0ff462f>] osd_trans_start+0x63f/0x660 [osd_ldiskfs]
Feb 27 05:01:43 [<ffffffffa11eea1f>] lod_trans_start+0x9f/0x190 [lod]
Feb 27 05:01:43 [<ffffffffa10e0624>] mdd_trans_start+0x14/0x20 [mdd]
Feb 27 05:01:43 [<ffffffffa10cbd05>] mdd_create+0xbf5/0x1730 [mdd]
Feb 27 05:01:43 [<ffffffffa113ecb8>] mdo_create+0x18/0x50 [mdt]
Feb 27 05:01:43 [<ffffffffa1148e6f>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
Feb 27 05:01:43 [<ffffffffa0f3b84c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
Feb 27 05:01:43 [<ffffffffa11300cd>] mdt_reint_rec+0x5d/0x200 [mdt]
Feb 27 05:01:43 [<ffffffffa111423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
Feb 27 05:01:43 [<ffffffffa1114706>] mdt_intent_reint+0x1f6/0x430 [mdt]
Feb 27 05:01:43 [<ffffffffa1112cf4>] mdt_intent_policy+0x494/0xce0 [mdt]
Feb 27 05:01:43 [<ffffffffa0af94f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b2566b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6331>] tgt_enqueue+0x61/0x230 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6f7e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b569e1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b55ba0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffff8109abf6>] kthread+0x96/0xa0
Feb 27 05:01:43 [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 27 05:01:43 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Feb 27 05:01:43 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 27 05:01:43
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:44 Lustre: *** cfs_fail_loc=1616, val=0***
Feb 27 05:01:44 Lustre: Skipped 1 previous similar message
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:47 Lustre: *** cfs_fail_loc=161c, val=0***
Feb 27 05:01:54 Lustre: testfs-OST0002: Client testfs-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:01:54 Lustre: testfs-OST0002: deleting orphan objects from 0x280000400:102 to 0x280000400:193
Feb 27 05:01:54 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:929
Feb 27 05:02:01 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:01 Lustre: Skipped 1 previous similar message
Feb 27 05:02:01 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:961
Feb 27 05:02:08 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:08 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:993
Feb 27 05:02:15 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:15 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1025
Feb 27 05:02:22 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:22 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1057
Feb 27 05:02:29 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1089
Feb 27 05:02:36 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:36 Lustre: Skipped 1 previous similar message
Feb 27 05:02:36 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1121
Feb 27 05:02:43 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1153
Feb 27 05:02:50 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1185
Feb 27 05:02:57 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:57 Lustre: Skipped 2 previous similar messages
Feb 27 05:02:57 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1217
Feb 27 05:03:00 BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost00_015:7824]
{noformat}
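(For context on the cfs_fail_timeout messages: sanity-lfsck arms libcfs fault-injection points via fail_loc so that specific code paths stall or are skipped, which is where the "id 1602 sleeping"/"awake" pair and the later fail_loc=1616/161c lines come from. The snippet below is only a hypothetical userspace simplification of that idea, not the libcfs implementation; the helper name fail_timeout() and the hard-coded ids are invented for illustration.)
{noformat}
/* Hypothetical, simplified sketch of fail_loc-style timeout injection.
 * NOT the libcfs code; names and ids are made up. */
#include <stdio.h>
#include <unistd.h>

static unsigned long fail_loc;		/* armed by the test, e.g. 0x1602 */

/* If this call site matches the armed fail_loc, stall for "ms" milliseconds,
 * logging the same sleeping/awake pair the console shows above. */
static int fail_timeout(unsigned long id, int ms)
{
	if (fail_loc != id)
		return 0;
	fprintf(stderr, "fail_timeout id %lx sleeping for %dms\n", id, ms);
	usleep(ms * 1000);
	fprintf(stderr, "fail_timeout id %lx awake\n", id);
	return 1;
}

int main(void)
{
	fail_loc = 0x1602;		/* what the test would arm */
	fail_timeout(0x1602, 100);	/* matching site: injects a delay */
	fail_timeout(0x1616, 100);	/* non-matching site: no effect */
	return 0;
}
{noformat}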
The stack trace is a bit strange since it lists all of the addresses on the stack as "uncertain" (leading '?'):
{noformat}
Pid: 7824, comm: ll_ost00_015 Tainted: P --------------- 2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
RIP: 0010:[<ffffffff8152c10e>] [<ffffffff8152c10e>] _spin_lock+0x1e/0x30
Process ll_ost00_015 (pid: 7824, threadinfo ffff8800ca29a000, task ffff8800d91e0040)
Call Trace:
 ? lfsck_layout_slave_in_notify+0x210/0x15b0 [lfsck]
 ? null_alloc_rs+0xf3/0x390 [ptlrpc]
 ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
 ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
 ? lfsck_in_notify+0xef/0x330 [lfsck]
 ? tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
 ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
 ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
{noformat}
The test did not progress beyond this point, and the watchdog has been reporting the same process as stuck for many hours.
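The RIP in _spin_lock together with the watchdog firing suggests ll_ost00_015 is busy-waiting on a spinlock that never becomes free, presumably somewhere under lfsck_layout_slave_in_notify(). Purely as an illustration of the symptom (ordinary userspace pthreads, invented names, not the Lustre code path): a spinlock that is never released pins the next acquirer at 100% CPU, which is exactly what the soft-lockup watchdog flags.
{noformat}
/* Hypothetical miniature of a soft-lockup-style hang: a non-reentrant
 * spinlock that is never released leaves the next acquirer spinning on
 * the CPU indefinitely.  Not Lustre code; illustration only. */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	pthread_spin_lock(&lock);	/* first acquisition succeeds */
	printf("lock held; taking it again without unlocking...\n");
	pthread_spin_lock(&lock);	/* never returns: busy-spins at 100% CPU,
					 * analogous to the stuck ll_ost00_015 thread */
	return 0;			/* unreachable */
}
{noformat}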