Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Labels:
None
Environment:
CentOS 7.6

Severity:
1
Rank (Obsolete):
9223372036854775807

Description

Using 2.12.0 + patches, after running a few days without major issues, we just had a serie of events leading to a filesystem down, it started with an OSS crash on fir-io4-s2:

[411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
[411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
[412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
[412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
[413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
[413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
[413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
[413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
[414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
[414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
[414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed: 
[414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
[414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[414687.766217] Call Trace:
[414687.768773]  [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[414687.775434]  [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[414687.781736]  [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
[414687.788910]  [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
[414687.795582]  [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
[414687.801980]  [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
[414687.809008]  [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[414687.815859]  [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[414687.822885]  [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[414687.830693]  [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[414687.837115]  [<ffffffffb74c1c31>] kthread+0xd1/0xe0
[414687.842115]  [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
[414687.848677]  [<ffffffffffffffff>] 0xffffffffffffffff
[414687.853780] Kernel panic - not syncing: LBUG

Shortly after restarting the OSS, an MDS, fir-md1-s2 (MDT1 + 3) crashed:

[2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
[2687144.076003] Lustre: Skipped 1 previous similar message
[2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed: 
[2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
[2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[2689300.281942] Call Trace:
[2689300.284583]  [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[2689300.291332]  [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2689300.297719]  [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
[2689300.304462]  [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
[2689300.311466]  [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
[2689300.318467]  [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[2689300.326511]  [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
[2689300.333948]  [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
[2689300.341211]  [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
[2689300.347964]  [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
[2689300.354636]  [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
[2689300.361292]  [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[2689300.367949]  [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[2689300.374893]  [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[2689300.382189]  [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2689300.388531]  [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2689300.395657]  [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2689300.403563]  [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[2689300.410077]  [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
[2689300.415164]  [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
[2689300.421813]  [<ffffffffffffffff>] 0xffffffffffffffff
[2689300.427016] Kernel panic - not syncing: LBUG

And now I am not able to restart the filesystem, even with -o abort_recov everywhere. OSTs restart but then the MDT0 doesn't want to restart. It keeps wanting to perform a recovery.

May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages





May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages



May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:04:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9bac1c31>] kthread+0xd1/0xe0
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:06:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27

mount commands are stuck for all 4 MDTs.

While I'm writing this, we're trying to restart everything with all lnet routers stopped to avoid any recovery with our compute nodes on Sherlock. My hope is that after the timeout of the imperative recovery window, the MDTs will actually mount. Any idea is welcome. Thanks!

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

fir-io4-s2-2019-05-29-12-51-55-foreach-bt.txt
1.05 MB
07/Jun/19 5:39 AM
fir-io4-s2-2019-05-29-12-51-55-vmcore-dmesg.txt
836 kB
07/Jun/19 5:39 AM
fir-md1-s1_noabort_rtrdown.log
218 kB
29/May/19 11:59 PM
fir-md1-s1.log
227 kB
29/May/19 11:31 PM
fir-md1-s2_noabort_rtrdown.log
211 kB
29/May/19 11:59 PM
fir-md1-s2.log
230 kB
29/May/19 11:31 PM
fir-md1-s2-20190910-endlessrecov.sysrq-t.log
1.46 MB
10/Aug/19 2:26 PM

Issue Links

is related to

LU-11761 blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads

Resolved

LU-12485 (osd_handler.c:2146:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) faile

Resolved

Activity

People

Assignee:: Peter Jones

Reporter:: Stephane Thiell

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 29/May/19 11:23 PM

Updated:: 25/Sep/19 9:39 AM