[LU-6312] soft lockup in sanity-lfsck test_18f Created: 01/Mar/15 Updated: 03/Mar/15 Resolved: 03/Mar/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
Single node test system (MGT, 2x MDT, 3x OST)
7ed514fc1c51 LU-5420 ptlrpc: revert ptlrpc_reconnect_import() changes
88a0262fdea8 LU-5912 build: Fix XeonPhi build
b851f20baba1 LU-6216 tests: compile fixes for PPC64, and for clang
f23e085ac937 Move master branch to 2.8 development |
||
| Severity: | 3 |
| Rank (Obsolete): | 17667 |
| Description |
|
I was running sanity-lfsck on master (v2_7_50_0-3-g7ed514f) and hit a soft lockup in test_18f:

Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 awake
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:43 Lustre: DEBUG MARKER: == sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects ====================== 05:01:42 (1425038502)
Feb 27 05:01:43 Pid: 7574, comm: mdt01_002
Feb 27 05:01:43 Call Trace:
Feb 27 05:01:43 [<ffffffffa0f1e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 27 05:01:43 [<ffffffffa0ff462f>] osd_trans_start+0x63f/0x660 [osd_ldiskfs]
Feb 27 05:01:43 [<ffffffffa11eea1f>] lod_trans_start+0x9f/0x190 [lod]
Feb 27 05:01:43 [<ffffffffa10e0624>] mdd_trans_start+0x14/0x20 [mdd]
Feb 27 05:01:43 [<ffffffffa10cbd05>] mdd_create+0xbf5/0x1730 [mdd]
Feb 27 05:01:43 [<ffffffffa113ecb8>] mdo_create+0x18/0x50 [mdt]
Feb 27 05:01:43 [<ffffffffa1148e6f>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
Feb 27 05:01:43 [<ffffffffa0f3b84c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
Feb 27 05:01:43 [<ffffffffa11300cd>] mdt_reint_rec+0x5d/0x200 [mdt]
Feb 27 05:01:43 [<ffffffffa111423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
Feb 27 05:01:43 [<ffffffffa1114706>] mdt_intent_reint+0x1f6/0x430 [mdt]
Feb 27 05:01:43 [<ffffffffa1112cf4>] mdt_intent_policy+0x494/0xce0 [mdt]
Feb 27 05:01:43 [<ffffffffa0af94f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b2566b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6331>] tgt_enqueue+0x61/0x230 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6f7e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b569e1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b55ba0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffff8109abf6>] kthread+0x96/0xa0
Feb 27 05:01:43 [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 27 05:01:43 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Feb 27 05:01:43 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 27 05:01:43
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:44 Lustre: *** cfs_fail_loc=1616, val=0***
Feb 27 05:01:44 Lustre: Skipped 1 previous similar message
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:47 Lustre: *** cfs_fail_loc=161c, val=0***
Feb 27 05:01:54 Lustre: testfs-OST0002: Client testfs-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:01:54 Lustre: testfs-OST0002: deleting orphan objects from 0x280000400:102 to 0x280000400:193
Feb 27 05:01:54 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:929
Feb 27 05:02:01 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:01 Lustre: Skipped 1 previous similar message
Feb 27 05:02:01 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:961
Feb 27 05:02:08 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:08 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:993
Feb 27 05:02:15 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:15 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1025
Feb 27 05:02:22 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:22 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1057
Feb 27 05:02:29 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1089
Feb 27 05:02:36 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:36 Lustre: Skipped 1 previous similar message
Feb 27 05:02:36 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1121
Feb 27 05:02:43 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1153
Feb 27 05:02:50 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1185
Feb 27 05:02:57 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:57 Lustre: Skipped 2 previous similar messages
Feb 27 05:02:57 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1217
Feb 27 05:03:00 BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost00_015:7824]

The stack trace is a bit strange since it lists all of the addresses on the stack as "uncertain" (leading '?'):

Pid: 7824, comm: ll_ost00_015 Tainted: P --------------- 2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
RIP: 0010:[<ffffffff8152c10e>] [<ffffffff8152c10e>] _spin_lock+0x1e/0x30
Process ll_ost00_015 (pid: 7824, threadinfo ffff8800ca29a000, task ffff8800d91e0040)
Call Trace:
? lfsck_layout_slave_in_notify+0x210/0x15b0 [lfsck]
? null_alloc_rs+0xf3/0x390 [ptlrpc]
? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
? lustre_msg_buf+0x55/0x60 [ptlrpc]
? __req_capsule_get+0x162/0x6d0 [ptlrpc]
? lfsck_in_notify+0xef/0x330 [lfsck]
? tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
? tgt_request_handle+0x8be/0x1000 [ptlrpc]
? ptlrpc_main+0xe41/0x1960 [ptlrpc]

The test did not progress beyond this, and the same process has been stuck in the watchdog for many hours. |
| Comments |
| Comment by Andreas Dilger [ 01/Mar/15 ] |
|
PS: I don't know whether this is a new failure or one that has been around for a while, since I haven't run sanity-lfsck recently. |
| Comment by nasf (Inactive) [ 02/Mar/15 ] |
|
There is a spin_lock leak in the layout LFSCK code: one code path returns without releasing the lock. That can allow lfsck_layout_slave_data::llsd_master_list to be modified without the spin_lock held while other threads traverse the list under that lock. |
| Comment by Gerrit Updater [ 02/Mar/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13921 |
| Comment by Gerrit Updater [ 02/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13921/ |
| Comment by Gerrit Updater [ 03/Mar/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13944 |
| Comment by Gerrit Updater [ 03/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13944/ |
| Comment by Peter Jones [ 03/Mar/15 ] |
|
Landed for 2.7 and 2.8 |