[LU-6312] soft lockup in sanity-lfsck test_18f Created: 01/Mar/15  Updated: 03/Mar/15  Resolved: 03/Mar/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.7.0, Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Andreas Dilger Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Single node test system (MGT, 2x MDT, 3x OST)
RHEL 6.5 kernel 2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
Lustre v2_7_50_0-3-g7ed514f

7ed514fc1c51 LU-5420 ptlrpc: revert ptlrpc_reconnect_import() changes
88a0262fdea8 LU-5912 build: Fix XeonPhi build
b851f20baba1 LU-6216 tests: compile fixes for PPC64, and for clang
f23e085ac937 Move master branch to 2.8 development

Severity: 3
Rank (Obsolete): 17667

 Description   

I was running sanity-lfsck on master (v2_7_50_0-3-g7ed514f) and hit a soft lockup in test_18f:

Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 awake
Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) Skipped 1 previous similar message
Feb 27 05:01:43 Lustre: DEBUG MARKER: == sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects ====================== 05:01:42 (1425038502)
Feb 27 05:01:43 Pid: 7574, comm: mdt01_002
Feb 27 05:01:43 Call Trace:
Feb 27 05:01:43 [<ffffffffa0f1e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 27 05:01:43 [<ffffffffa0ff462f>] osd_trans_start+0x63f/0x660 [osd_ldiskfs]
Feb 27 05:01:43 [<ffffffffa11eea1f>] lod_trans_start+0x9f/0x190 [lod]
Feb 27 05:01:43 [<ffffffffa10e0624>] mdd_trans_start+0x14/0x20 [mdd]
Feb 27 05:01:43 [<ffffffffa10cbd05>] mdd_create+0xbf5/0x1730 [mdd]
Feb 27 05:01:43 [<ffffffffa113ecb8>] mdo_create+0x18/0x50 [mdt]
Feb 27 05:01:43 [<ffffffffa1148e6f>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
Feb 27 05:01:43 [<ffffffffa0f3b84c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
Feb 27 05:01:43 [<ffffffffa11300cd>] mdt_reint_rec+0x5d/0x200 [mdt]
Feb 27 05:01:43 [<ffffffffa111423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
Feb 27 05:01:43 [<ffffffffa1114706>] mdt_intent_reint+0x1f6/0x430 [mdt]
Feb 27 05:01:43 [<ffffffffa1112cf4>] mdt_intent_policy+0x494/0xce0 [mdt]
Feb 27 05:01:43 [<ffffffffa0af94f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b2566b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6331>] tgt_enqueue+0x61/0x230 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0ba6f7e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b569e1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffffa0b55ba0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
Feb 27 05:01:43 [<ffffffff8109abf6>] kthread+0x96/0xa0
Feb 27 05:01:43 [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 27 05:01:43 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Feb 27 05:01:43 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 27 05:01:43 
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:44 Lustre: *** cfs_fail_loc=1616, val=0***
Feb 27 05:01:44 Lustre: Skipped 1 previous similar message
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc start
Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Feb 27 05:01:47 Lustre: *** cfs_fail_loc=161c, val=0***
Feb 27 05:01:54 Lustre: testfs-OST0002: Client testfs-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:01:54 Lustre: testfs-OST0002: deleting orphan objects from 0x280000400:102 to 0x280000400:193
Feb 27 05:01:54 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:929
Feb 27 05:02:01 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:01 Lustre: Skipped 1 previous similar message
Feb 27 05:02:01 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:961
Feb 27 05:02:08 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:08 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:993
Feb 27 05:02:15 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:15 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1025
Feb 27 05:02:22 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:22 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1057
Feb 27 05:02:29 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1089
Feb 27 05:02:36 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:36 Lustre: Skipped 1 previous similar message
Feb 27 05:02:36 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1121
Feb 27 05:02:43 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1153
Feb 27 05:02:50 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1185
Feb 27 05:02:57 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
Feb 27 05:02:57 Lustre: Skipped 2 previous similar messages
Feb 27 05:02:57 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1217
Feb 27 05:03:00 BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost00_015:7824]

The stack trace is a bit strange, since it lists every address on the stack as "uncertain" (leading '?'):

Pid: 7824, comm: ll_ost00_015 Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
RIP: 0010:[<ffffffff8152c10e>]  [<ffffffff8152c10e>] _spin_lock+0x1e/0x30
Process ll_ost00_015 (pid: 7824, threadinfo ffff8800ca29a000, task ffff8800d91e0040)
Call Trace:
 ? lfsck_layout_slave_in_notify+0x210/0x15b0 [lfsck]
 ? null_alloc_rs+0xf3/0x390 [ptlrpc]
 ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
 ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
 ? lfsck_in_notify+0xef/0x330 [lfsck]
 ? tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
 ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
 ? ptlrpc_main+0xe41/0x1960 [ptlrpc]

The test did not progress beyond this point, and the same process has remained stuck in the watchdog for many hours.



 Comments   
Comment by Andreas Dilger [ 01/Mar/15 ]

PS: I don't know whether this is a new failure or one that has been around for a while, as I haven't run sanity-lfsck in some time.

Comment by nasf (Inactive) [ 02/Mar/15 ]

There is a spin_lock leak in the layout LFSCK: one code path can modify lfsck_layout_slave_data::llsd_master_list without holding the spin_lock, while other threads traverse that list under the spin_lock.

Comment by Gerrit Updater [ 02/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13921
Subject: LU-6312 lfsck: modify llsd_master_list with spin_lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a9a3bc335ccd5af33cfcbf2fc2a3f425727a1a6a

Comment by Gerrit Updater [ 02/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13921/
Subject: LU-6312 lfsck: modify llsd_master_list with spin_lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 094030bab406b0ea5d45f711549327829b68c9cd

Comment by Gerrit Updater [ 03/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13944
Subject: LU-6312 lfsck: modify llsd_master_list with spin_lock
Project: fs/lustre-release
Branch: b2_7
Current Patch Set: 1
Commit: b9fa93bb5cad33f9526a7bbb488ef359c176f1dd

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13944/
Subject: LU-6312 lfsck: modify llsd_master_list with spin_lock
Project: fs/lustre-release
Branch: b2_7
Current Patch Set:
Commit: 081a71a121f0bb42d73a7a90812fc3b2a1c02091

Comment by Peter Jones [ 03/Mar/15 ]

Landed for 2.7 and 2.8

Generated at Sat Feb 10 01:59:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.