Lustre / LU-6312

soft lockup in sanity-lfsck test_18f


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Fix Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 17667

    Description

      I was running sanity-lfsck on master (v2_7_50_0-3-g7ed514f) and hit a soft lockup in test_18f:

      Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
      Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) Skipped 1 previous similar message
      Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 awake
      Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) Skipped 1 previous similar message
      Feb 27 05:01:43 Lustre: DEBUG MARKER: == sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects ====================== 05:01:42 (1425038502)
      Feb 27 05:01:43 Pid: 7574, comm: mdt01_002
      Feb 27 05:01:43 Call Trace:
      Feb 27 05:01:43 [<ffffffffa0f1e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 27 05:01:43 [<ffffffffa0ff462f>] osd_trans_start+0x63f/0x660 [osd_ldiskfs]
      Feb 27 05:01:43 [<ffffffffa11eea1f>] lod_trans_start+0x9f/0x190 [lod]
      Feb 27 05:01:43 [<ffffffffa10e0624>] mdd_trans_start+0x14/0x20 [mdd]
      Feb 27 05:01:43 [<ffffffffa10cbd05>] mdd_create+0xbf5/0x1730 [mdd]
      Feb 27 05:01:43 [<ffffffffa113ecb8>] mdo_create+0x18/0x50 [mdt]
      Feb 27 05:01:43 [<ffffffffa1148e6f>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
      Feb 27 05:01:43 [<ffffffffa0f3b84c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
      Feb 27 05:01:43 [<ffffffffa11300cd>] mdt_reint_rec+0x5d/0x200 [mdt]
      Feb 27 05:01:43 [<ffffffffa111423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
      Feb 27 05:01:43 [<ffffffffa1114706>] mdt_intent_reint+0x1f6/0x430 [mdt]
      Feb 27 05:01:43 [<ffffffffa1112cf4>] mdt_intent_policy+0x494/0xce0 [mdt]
      Feb 27 05:01:43 [<ffffffffa0af94f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b2566b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0ba6331>] tgt_enqueue+0x61/0x230 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0ba6f7e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b569e1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b55ba0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      Feb 27 05:01:43 [<ffffffff8109abf6>] kthread+0x96/0xa0
      Feb 27 05:01:43 [<ffffffff8100c20a>] child_rip+0xa/0x20
      Feb 27 05:01:43 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      Feb 27 05:01:43 [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Feb 27 05:01:43 
      Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Feb 27 05:01:44 Lustre: *** cfs_fail_loc=1616, val=0***
      Feb 27 05:01:44 Lustre: Skipped 1 previous similar message
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Feb 27 05:01:47 Lustre: *** cfs_fail_loc=161c, val=0***
      Feb 27 05:01:54 Lustre: testfs-OST0002: Client testfs-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:01:54 Lustre: testfs-OST0002: deleting orphan objects from 0x280000400:102 to 0x280000400:193
      Feb 27 05:01:54 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:929
      Feb 27 05:02:01 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:01 Lustre: Skipped 1 previous similar message
      Feb 27 05:02:01 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:961
      Feb 27 05:02:08 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:08 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:993
      Feb 27 05:02:15 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:15 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1025
      Feb 27 05:02:22 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:22 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1057
      Feb 27 05:02:29 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1089
      Feb 27 05:02:36 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:36 Lustre: Skipped 1 previous similar message
      Feb 27 05:02:36 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1121
      Feb 27 05:02:43 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1153
      Feb 27 05:02:50 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1185
      Feb 27 05:02:57 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:57 Lustre: Skipped 2 previous similar messages
      Feb 27 05:02:57 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1217
      Feb 27 05:03:00 BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost00_015:7824]
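
      The cfs_fail_timeout lines at the top of the log come from libcfs fault injection: the test arms a fail point by setting fail_loc (via lctl set_param), and the matching code path in fail.c then sleeps for the configured interval, printing the "sleeping for"/"awake" messages. The following is a rough userspace mock of that pattern, illustrative only; fail_timeout/fail_ms here are simplified stand-ins, not the real libcfs API:

      /* build: cc -o faildemo faildemo.c
       * Rough mock of the fail-point pattern (NOT the real libcfs fail.c):
       * a code path sleeps only when the globally set fail_loc matches the
       * id compiled into that path. */
      #include <stdio.h>
      #include <time.h>

      static unsigned long fail_loc = 0x1602;  /* armed by the test via lctl */
      static unsigned int  fail_ms  = 10000;   /* delay in milliseconds */

      static int fail_timeout(unsigned long id)
      {
              struct timespec ts;

              if (fail_loc != id)
                      return 0;               /* fail point not armed */

              printf("cfs_fail_timeout id %lx sleeping for %ums\n", id, fail_ms);
              ts.tv_sec  = fail_ms / 1000;
              ts.tv_nsec = (fail_ms % 1000) * 1000000L;
              nanosleep(&ts, NULL);
              printf("cfs_fail_timeout id %lx awake\n", id);
              return 1;
      }

      int main(void)
      {
              fail_timeout(0x1602);   /* armed: sleeps 10s, as in the log */
              fail_timeout(0x1234);   /* not armed: returns immediately */
              return 0;
      }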
      

      The stack trace is a bit strange, since it lists all of the addresses on the stack as "uncertain" (the leading '?' marks frames that the unwinder found on the stack but could not verify as part of the active call chain):

      Pid: 7824, comm: ll_ost00_015 Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
      RIP: 0010:[<ffffffff8152c10e>]  [<ffffffff8152c10e>] _spin_lock+0x1e/0x30
      Process ll_ost00_015 (pid: 7824, threadinfo ffff8800ca29a000, task ffff8800d91e0040)
      Call Trace:
       ? lfsck_layout_slave_in_notify+0x210/0x15b0 [lfsck]
       ? null_alloc_rs+0xf3/0x390 [ptlrpc]
       ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
       ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
       ? lfsck_in_notify+0xef/0x330 [lfsck]
       ? tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
       ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
       ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
      

      The test did not progress beyond this point, and the watchdog kept reporting the same process as stuck for many hours.
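
      The RIP in _spin_lock together with lfsck_layout_slave_in_notify() at the top of the trace suggests the service thread is spinning on a spinlock that some earlier path acquired and never dropped. A minimal userspace sketch of that failure pattern follows (assumptions: pthread spinlocks standing in for kernel spinlocks, and a hypothetical leaked-lock path; this is not the actual lfsck code):

      /* build: cc -pthread -o lockdemo lockdemo.c
       * This program hangs by design: the main thread leaks the lock, so
       * the worker spins forever. In the kernel the same pattern pins the
       * CPU and triggers "BUG: soft lockup". */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_spinlock_t lock;

      static void *notify_handler(void *arg)
      {
              (void)arg;
              pthread_spin_lock(&lock);       /* spins forever: lock leaked */
              puts("handler ran");            /* never reached */
              pthread_spin_unlock(&lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t t;

              pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
              pthread_spin_lock(&lock);       /* error path "forgot" to unlock */
              pthread_create(&t, NULL, notify_handler, NULL);
              pthread_join(&t, NULL);         /* never returns */
              return 0;
      }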


          People

            Assignee: yong.fan (nasf, Inactive)
            Reporter: adilger (Andreas Dilger)
