Details
Bug
Resolution: Fixed
Minor
Lustre 2.12.3
None
3
9223372036854775807
Description
A network switch went down, taking a number of compute nodes with it.
A bunch of compute nodes were stuck, and the outage also caused some OSSes to hit
an LBUG (LU-12906) and crash.
The MDS went into soft lockups before crashing. When it came back, 3 out of 4 MDTs
mounted and recovered, but one MDT went into WAITING and stayed there.
lctl abort_recov had no effect on its status, so the MDS was rebooted.
When the MDT again went into the WAITING state, a pre-emptive abort_recov was issued
before it could time out, but it did not help and the MDT continued to try to recover.
2020-05-08 18:52:15 [ 1640.880984] Pid: 15922, comm: mdt02_020 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
2020-05-08 18:52:15 [ 1640.892505] Call Trace:
2020-05-08 18:52:15 [ 1640.895708] [<ffffffffc169bdc0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
2020-05-08 18:52:15 [ 1640.903498] [<ffffffffc169c153>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
2020-05-08 18:52:15 [ 1640.911383] [<ffffffffc1a9eaf3>] osp_remote_sync+0xd3/0x200 [osp]
2020-05-08 18:52:15 [ 1640.918760] [<ffffffffc1a84c63>] osp_attr_get+0x463/0x730 [osp]
2020-05-08 18:52:15 [ 1640.925917] [<ffffffffc1a818cd>] osp_object_init+0x16d/0x2d0 [osp]
2020-05-08 18:52:15 [ 1640.933361] [<ffffffffc141c59b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
2020-05-08 18:52:15 [ 1640.941977] [<ffffffffc1420471>] lu_object_find_at+0x1e1/0xa60 [obdclass]
2020-05-08 18:52:15 [ 1640.950100] [<ffffffffc1420d06>] lu_object_find+0x16/0x20 [obdclass]
2020-05-08 18:52:15 [ 1640.957737] [<ffffffffc194a01b>] mdt_object_find+0x4b/0x170 [mdt]
2020-05-08 18:52:15 [ 1640.965074] [<ffffffffc194cc38>] mdt_getattr_name_lock+0x848/0x1c30 [mdt]
2020-05-08 18:52:15 [ 1640.973194] [<ffffffffc1954d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2020-05-08 18:52:15 [ 1640.980928] [<ffffffffc1951bb5>] mdt_intent_policy+0x435/0xd80 [mdt]
2020-05-08 18:52:15 [ 1640.988552] [<ffffffffc1659d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
2020-05-08 18:52:15 [ 1640.996483] [<ffffffffc1682366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.004811] [<ffffffffc170ab02>] tgt_enqueue+0x62/0x210 [ptlrpc]
2020-05-08 18:52:15 [ 1641.012076] [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-05-08 18:52:15 [ 1641.020213] [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.029217] [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
2020-05-08 18:52:15 [ 1641.056176] Pid: 14912, comm: mdt03_006 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
2020-05-08 18:52:15 [ 1641.067667] Call Trace:
2020-05-08 18:52:15 [ 1641.070833] [<ffffffffc1672b96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc]
2020-05-08 18:52:15 [ 1641.078951] [<ffffffffc167492f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.087272] [<ffffffffc167751e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
2020-05-08 18:52:15 [ 1641.095108] [<ffffffffc1a997f2>] osp_md_object_lock+0x162/0x2d0 [osp]
2020-05-08 18:52:15 [ 1641.102832] [<ffffffffc10cb193>] lod_object_lock+0xf3/0x7b0 [lod]
2020-05-08 18:52:15 [ 1641.110179] [<ffffffffc1a2eeee>] mdd_object_lock+0x3e/0xe0 [mdd]
2020-05-08 18:52:15 [ 1641.117429] [<ffffffffc194a341>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
2020-05-08 18:52:15 [ 1641.125926] [<ffffffffc194a8da>] mdt_remote_object_lock+0x2a/0x30 [mdt]
2020-05-08 18:52:15 [ 1641.133847] [<ffffffffc195f2ae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
2020-05-08 18:52:15 [ 1641.141189] [<ffffffffc1961605>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
2020-05-08 18:52:15 [ 1641.148819] [<ffffffffc196a693>] mdt_reint_rec+0x83/0x210 [mdt]
2020-05-08 18:52:15 [ 1641.155965] [<ffffffffc19471b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
2020-05-08 18:52:15 [ 1641.163689] [<ffffffffc1952567>] mdt_reint+0x67/0x140 [mdt]
2020-05-08 18:52:15 [ 1641.170469] [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-05-08 18:52:15 [ 1641.178601] [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.187639] [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]