[LU-8992] osd_object_release() LBUG Created: 09/Jan/17 Updated: 04/Dec/20 Resolved: 04/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Just hit this at sanity umount:
[98256.613336] Lustre: DEBUG MARKER: == sanity test complete, duration 6442 sec =========================================================== 20:37:21 (1483753041)
[98335.790861] LustreError: 30777:0:(osd_handler.c:1932:osd_object_release()) LBUG
[98335.791970] Pid: 30777, comm: mdt00_003
[98335.792533] Call Trace:
[98335.802667] [<ffffffffa01ac7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[98335.803281] [<ffffffffa01ac841>] lbug_with_loc+0x41/0xb0 [libcfs]
[98335.803886] [<ffffffff8137676d>] ? _atomic_dec_and_lock+0x4d/0x70
[98335.804499] [<ffffffffa0aa1690>] osd_get_ldiskfs_dirent_param+0x0/0xf0 [osd_ldiskfs]
[98335.814248] [<ffffffffa033639d>] lu_object_put+0x17d/0x3d0 [obdclass]
[98335.814861] [<ffffffffa0bed89d>] mdt_getattr_name_lock+0x58d/0x18d0 [mdt]
[98335.815540] [<ffffffffa056ff4c>] ? lustre_msg_get_flags+0x2c/0xa0 [ptlrpc]
[98335.816248] [<ffffffffa0beee90>] mdt_intent_getattr+0x2b0/0x470 [mdt]
[98335.816924] [<ffffffffa0bf287c>] mdt_intent_policy+0x59c/0xb60 [mdt]
[98335.817583] [<ffffffffa052311a>] ldlm_lock_enqueue+0x35a/0x940 [ptlrpc]
[98335.818188] [<ffffffffa01bcb75>] ? cfs_hash_rw_unlock+0x15/0x20 [libcfs]
[98335.818825] [<ffffffffa054c5d3>] ldlm_handle_enqueue0+0x9e3/0x1670 [ptlrpc]
[98335.821917] [<ffffffffa05741f0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[98335.823195] [<ffffffffa05cbcb2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[98335.825573] [<ffffffffa05cfff5>] tgt_request_handle+0x925/0x1330 [ptlrpc]
[98335.826324] [<ffffffffa057d311>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc]
[98335.827656] [<ffffffffa057aec8>] ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
[98335.828403] [<ffffffffa0581128>] ptlrpc_main+0xa58/0x1de0 [ptlrpc]
[98335.829105] [<ffffffffa05806d0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
[98335.831441] [<ffffffff810a2eda>] kthread+0xea/0xf0
[98335.831909] [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[98335.833378] [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[98335.833852] [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[98335.834273]
[98335.834701] Kernel panic - not syncing: LBUG
This is pretty much current master. Crashdump and modules in /exports/crashdumps/192.168.10.224-2017-01-06-20:38:47 |
| Comments |
| Comment by Oleg Drokin [ 31/Jan/17 ] |
|
Hit this once again [112978.039383] Lustre: DEBUG MARKER: == sanity test 408: drop_caches should not hang due to page leaks ==================================== 22:09:24 (1485745764) [112978.061613] Lustre: *** cfs_fail_loc=40a, val=0*** [112978.062236] LustreError: 11788:0:(osc_request.c:1849:osc_build_rpc()) prep_req failed: -22 [112978.063681] LustreError: 11788:0:(osc_cache.c:2299:osc_check_rpcs()) Read request failed with -22 [112978.792538] LustreError: 12463:0:(osd_handler.c:1946:osd_object_release()) LBUG [112978.794273] Pid: 12463, comm: ldlm_cn00_001 [112978.795812] Call Trace: [112978.797257] [<ffffffffa027b7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] [112978.798160] [<ffffffffa027b841>] lbug_with_loc+0x41/0xb0 [libcfs] [112978.799006] [<ffffffff8137676d>] ? _atomic_dec_and_lock+0x4d/0x70 [112978.800177] [<ffffffffa0ac1890>] osd_get_ldiskfs_dirent_param+0x0/0xf0 [osd_ldiskfs] [112978.801801] [<ffffffffa052038d>] lu_object_put+0x17d/0x3d0 [obdclass] [112978.802632] [<ffffffffa0d5e092>] ofd_object_put+0x12/0x20 [ofd] [112978.803461] [<ffffffffa0d706bc>] ofd_lvbo_update+0x4dc/0xdf2 [ofd] [112978.804336] [<ffffffffa08e2352>] ldlm_request_cancel+0x132/0x710 [ptlrpc] [112978.805212] [<ffffffffa08e85ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc] [112978.806084] [<ffffffffa08e88c1>] ldlm_cancel_handler+0x141/0x490 [ptlrpc] [112978.806965] [<ffffffffa0918421>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc] [112978.808497] [<ffffffffa0915fd8>] ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc] [112978.809378] [<ffffffffa091c198>] ptlrpc_main+0xa58/0x1de0 [ptlrpc] [112978.810212] [<ffffffffa091b740>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc] [112978.810638] sh (22580): drop_caches: 2 [112978.811354] [<ffffffff810a2eda>] kthread+0xea/0xf0 [112978.812241] [<ffffffff810a2df0>] ? kthread+0x0/0xf0 [112978.812809] [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90 [112978.813360] [<ffffffff810a2df0>] ? kthread+0x0/0xf0 [112978.814117] [112978.814611] Kernel panic - not syncing: LBUG [112978.815492] CPU: 2 PID: 12463 Comm: ldlm_cn00_001 Tainted: G W OE ------------ 3.10.0-debug #1 [112978.817353] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [112978.818060] ffffffffa0299dd3 00000000e6acc995 ffff8800b07cfa58 ffffffff816fd400 [112978.819595] ffff8800b07cfad8 ffffffff816f8c74 ffffffff00000008 ffff8800b07cfae8 [112978.821254] ffff8800b07cfa88 00000000e6acc995 00000000e6acc995 ffff8800bc70d948 [112978.822781] Call Trace: [112978.823505] [<ffffffff816fd400>] dump_stack+0x19/0x1b [112978.824296] [<ffffffff816f8c74>] panic+0xd8/0x1e7 [112978.825116] [<ffffffffa027b859>] lbug_with_loc+0x59/0xb0 [libcfs] [112978.826254] [<ffffffff8137676d>] ? _atomic_dec_and_lock+0x4d/0x70 [112978.827221] [<ffffffffa0ac1890>] osd_object_release+0x60/0x60 [osd_ldiskfs] [112978.828212] [<ffffffffa052038d>] lu_object_put+0x17d/0x3d0 [obdclass] [112978.829202] [<ffffffffa0d5e092>] ofd_object_put+0x12/0x20 [ofd] [112978.830165] [<ffffffffa0d706bc>] ofd_lvbo_update+0x4dc/0xdf2 [ofd] [112978.831191] [<ffffffffa08e2352>] ldlm_request_cancel+0x132/0x710 [ptlrpc] [112978.832107] [<ffffffffa08e85ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc] [112978.832958] [<ffffffffa08e88c1>] ldlm_cancel_handler+0x141/0x490 [ptlrpc] [112978.833831] [<ffffffffa0918421>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc] [112978.835326] [<ffffffffa0915fd8>] ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc] [112978.836389] [<ffffffffa091c198>] ptlrpc_main+0xa58/0x1de0 [ptlrpc] [112978.837323] [<ffffffffa091b740>] ? 
ptlrpc_register_service+0xe70/0xe70 [ptlrpc] [112978.838859] [<ffffffff810a2eda>] kthread+0xea/0xf0 [112978.839423] [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140 [112978.840226] [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90 [112978.840957] [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140 |
| Comment by Stephane Thiell [ 29/Dec/18 ] |
|
on MDT, Lustre version 2.10.6: [477491.194082] Lustre: Lustre: Build Version: 2.10.6 [477491.331700] LNet: Using FMR for registration [477491.348439] LNetError: 3307:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 10.0.2.202@o2ib5 on NA (ib0:0:10.0.2.52): bad dst nid 10.0.2.52@o2ib5 [477491.357436] LNet: Added LNI 10.0.2.52@o2ib5 [8/256/0/180] [477494.640350] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc [477496.754449] Lustre: oak-MDT0000: Not available for connect from 10.9.105.64@o2ib4 (not set up) [477502.812863] Lustre: oak-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 [477503.123450] Lustre: oak-MDD0000: changelog on [477503.249407] Lustre: oak-MDT0000: in recovery but waiting for the first client to connect [477503.270134] Lustre: oak-MDT0000: Will be in recovery for at least 2:30, or until 1346 clients reconnect [477503.270274] Lustre: oak-MDT0000: Connection restored to a2a70c7a-93ab-9d6b-48c1-139a5aa710ac (at 10.8.8.35@o2ib6) [477503.782201] Lustre: oak-MDT0000: Connection restored to 071bed2c-a2cb-4641-c7cc-27d97c013d03 (at 10.9.114.1@o2ib4) [477503.793855] Lustre: Skipped 55 previous similar messages [477504.788188] Lustre: oak-MDT0000: Connection restored to 11390d04-e59a-a240-221a-bbc64a18625e (at 10.8.10.1@o2ib6) [477504.799743] Lustre: Skipped 87 previous similar messages [477505.181983] Lustre: oak-MDT0000: nosquash_nids set to 10.0.2.[1-3]@o2ib5 10.0.2.[51-58]@o2ib5 10.0.2.[66-67]@o2ib5 10.0.2.[101-120]@o2ib5 10.0.2.[221-223]@o2ib5 10.0.2.[226-227]@o2ib5 10.0.2.[228-229]@o2ib5 10.0.2.[232-236]@o2ib5 10.0.2.[240-252]@o2ib5 10.210.47.253@o2ib3 10.9.0.[1-2]@o2ib4 [477505.257277] Lustre: oak-MDT0000: root_squash is set to 99:99 [477506.789708] Lustre: oak-MDT0000: Connection restored to 47d3b4f3-9365-0e0c-3374-5b4bc33fc853 (at 10.9.105.49@o2ib4) [477506.801462] Lustre: Skipped 207 previous similar messages [477510.819236] Lustre: oak-MDT0000: Connection restored to a89d8d74-9631-091e-d940-4db0bd69e79b (at 10.9.101.44@o2ib4) [477510.830986] Lustre: Skipped 437 previous similar messages [477512.748245] Lustre: oak-MDT0000: Denying connection for new client 9c8f604b-0507-3ae7-a89f-58322359a019(at 10.8.2.23@o2ib6), waiting for 1346 known clients (871 recovered, 2 in progress, and 0 evicted) to recover in 2:35 [477514.683928] Lustre: oak-MDT0000: Denying connection for new client 9da67a8e-85ba-1cfc-ad1b-9f74e942a870(at 10.9.106.72@o2ib4), waiting for 1346 known clients (943 recovered, 27 in progress, and 0 evicted) to recover in 2:33 [477514.706155] Lustre: Skipped 17 previous similar messages [477516.074578] Lustre: oak-MDT0000: Denying connection for new client 7c2e1755-70e4-330b-c1fb-81f599445c46(at 10.9.106.52@o2ib4), waiting for 1346 known clients (1327 recovered, 12 in progress, and 0 evicted) to recover in 2:32 [477520.183201] Lustre: oak-MDT0000: Connection restored to 10.0.2.105@o2ib5 (at 10.0.2.105@o2ib5) [477520.183201] Lustre: oak-MDT0000: Connection restored to 10.0.2.105@o2ib5 (at 10.0.2.105@o2ib5) [477520.183203] Lustre: Skipped 568 previous similar messages [477520.208751] Lustre: Skipped 15 previous similar messages [477537.855661] Lustre: oak-MDT0000: Connection restored to oak-MDT0000-lwp-MDT0001_UUID (at 10.0.2.51@o2ib5) [477537.866462] Lustre: Skipped 33 previous similar messages [477539.684995] Lustre: oak-MDT0000: Denying connection for new client 4825ff92-a390-1e66-d676-f8ef412b4eae(at 10.9.106.10@o2ib4), waiting for 
1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 2:08 [477539.707250] Lustre: Skipped 1 previous similar message [477543.695367] Lustre: oak-MDT0000: Denying connection for new client e307fc73-7072-b2c5-0bfe-5cc8a908ab95(at 10.9.106.17@o2ib4), waiting for 1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 2:04 [477543.717588] Lustre: Skipped 3 previous similar messages [477549.345933] LustreError: 3439:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881d3aefb000 x1616898091808576/t0(0) o601->oak-MDT0000-lwp-OST005f_UUID@10.0.2.106@o2ib5:580/0 lens 336/0 e 0 to 0 dl 1546105435 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477549.345934] LustreError: 3396:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881d3aefe300 x1616898091808592/t0(0) o601->oak-MDT0000-lwp-OST005f_UUID@10.0.2.106@o2ib5:580/0 lens 336/0 e 0 to 0 dl 1546105435 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477549.403760] LustreError: 3439:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages [477554.828568] Lustre: oak-MDT0000: Denying connection for new client fd617f09-e764-0401-ac6e-ea7afbf315c0(at 10.9.106.27@o2ib4), waiting for 1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 1:53 [477554.850788] Lustre: Skipped 4 previous similar messages [477571.048820] Lustre: oak-MDT0000: Denying connection for new client 5a4d8b28-5680-b773-50d0-7f92dbdbf9b8(at 10.9.106.49@o2ib4), waiting for 1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 1:37 [477571.071043] Lustre: Skipped 21 previous similar messages [477598.989851] LustreError: 3395:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881778315a00 x1616898091817216/t0(0) o601->oak-MDT0000-lwp-OST004d_UUID@10.0.2.106@o2ib5:630/0 lens 336/0 e 0 to 0 dl 1546105485 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477598.989852] LustreError: 3396:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881d47e49e00 x1616898091817232/t0(0) o601->oak-MDT0000-lwp-OST004d_UUID@10.0.2.106@o2ib5:630/0 lens 336/0 e 0 to 0 dl 1546105485 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477598.989855] LustreError: 3396:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages [477603.437114] Lustre: oak-MDT0000: Denying connection for new client bbe1f40a-e606-1d75-0db4-6523723b799c(at 10.9.106.63@o2ib4), waiting for 1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 1:05 [477603.459344] Lustre: Skipped 33 previous similar messages [477609.347092] LustreError: 3395:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881778312a00 x1616898091819056/t0(0) o601->oak-MDT0000-lwp-OST005f_UUID@10.0.2.106@o2ib5:640/0 lens 336/0 e 0 to 0 dl 1546105495 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477609.376017] LustreError: 3395:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 1 previous similar message [477631.391971] LustreError: 3396:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881778310c00 x1616898091831552/t0(0) o601->oak-MDT0000-lwp-OST0051_UUID@10.0.2.106@o2ib5:662/0 lens 336/0 e 0 to 0 dl 1546105517 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477631.420882] LustreError: 3396:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 5 previous similar 
messages [477642.751348] LustreError: 3451:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff880c2e9f1800 x1616897512980832/t0(0) o601->oak-MDT0000-lwp-OST0052_UUID@10.0.2.105@o2ib5:674/0 lens 336/0 e 0 to 0 dl 1546105529 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477642.751350] LustreError: 3441:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff880e9b090300 x1616897512980848/t0(0) o601->oak-MDT0000-lwp-OST0052_UUID@10.0.2.105@o2ib5:674/0 lens 336/0 e 0 to 0 dl 1546105529 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [477642.751352] LustreError: 3441:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 6 previous similar messages [477667.961746] Lustre: oak-MDT0000: Denying connection for new client 73422625-1c0e-020e-24fa-f6f58463db79(at 10.9.106.46@o2ib4), waiting for 1346 known clients (1336 recovered, 4 in progress, and 0 evicted) to recover in 0:00 [477667.983989] Lustre: Skipped 81 previous similar messages [477668.536627] Lustre: oak-MDT0000: recovery is timed out, evict stale exports [477668.544664] Lustre: oak-MDT0000: disconnecting 6 stale clients [477668.835386] LustreError: 3241:0:(osd_handler.c:1982:osd_object_release()) LBUG [477668.843548] Pid: 3241, comm: tgt_recover_0 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 SMP Sat Sep 30 00:44:25 PDT 2017 [477668.855293] Call Trace: [477668.858122] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40 [477668.864926] [<ffffffffc080b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [477668.872323] [<ffffffffc080b87c>] lbug_with_loc+0x4c/0xa0 [libcfs] [477668.879326] [<ffffffffc0fb4660>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs] [477668.888274] [<ffffffffc0939c4d>] lu_object_put+0x17d/0x3d0 [obdclass] [477668.895689] [<ffffffffc0b90738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [477668.904389] [<ffffffffc0ad6987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc] [477668.913445] [<ffffffffc0ad7616>] target_recovery_thread+0x616/0x1360 [ptlrpc] [477668.921627] [<ffffffff810b098f>] kthread+0xcf/0xe0 [477668.927177] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 [477668.933308] [<ffffffffffffffff>] 0xffffffffffffffff [477668.938969] Kernel panic - not syncing: LBUG [477668.943826] CPU: 22 PID: 3241 Comm: tgt_recover_0 Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 [477668.956926] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017 [477668.965370] ffff88103bd30000 000000000489dbfb ffff880de5947bf0 ffffffff816a3db1 [477668.973757] ffff880de5947c70 ffffffff8169dc74 0000000000000008 ffff880de5947c80 [477668.982142] ffff880de5947c20 000000000489dbfb ffff88103bd30000 ffff88103d0cf8b8 [477668.990527] Call Trace: [477668.993352] [<ffffffff816a3db1>] dump_stack+0x19/0x1b [477668.999180] [<ffffffff8169dc74>] panic+0xe8/0x20d [477669.004621] [<ffffffffc080b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [477669.011614] [<ffffffffc0fb4660>] osd_object_release+0x60/0x60 [osd_ldiskfs] [477669.019586] [<ffffffffc0939c4d>] lu_object_put+0x17d/0x3d0 [obdclass] [477669.026987] [<ffffffffc0b90738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [477669.035641] [<ffffffffc0ad6987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc] [477669.044660] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90 [477669.051277] [<ffffffffc0ad7616>] target_recovery_thread+0x616/0x1360 [ptlrpc] [477669.059424] [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0 [477669.065653] [<ffffffffc0ad7000>] ? 
replay_request_or_update.isra.21+0x8c0/0x8c0 [ptlrpc] [477669.074874] [<ffffffff810b098f>] kthread+0xcf/0xe0 [477669.080410] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 [477669.087305] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 [477669.093422] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 |
| Comment by Stephane Thiell [ 29/Dec/18 ] |
|
Ok so this is the second time that we hit this issue with 2.10, the first with 2.10.4 and now today with 2.10.6 on Oak. When this happens, it is during recovery and it's impossible to mount the MDT without aborting the recovery. Because the issue usually follows a changelogs-is-full related crash, I tried to kill the changelogs reader client but no luck. The first time when it happened I did a -o abort_recov but this had a huge impact on production so this was quite bad for our users. This time, I was able to start the MDT, wait for some clients to recover, and then do an abort_recovery on the fly. MDT didn't crash and only 14 clients were evicted according to the logs: Dec 29 10:11:26 oak-md1-s2 kernel: Lustre: 3633:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery Dec 29 10:11:29 oak-md1-s2 kernel: LustreError: 3780:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8810139dd100 x1613424070104896/t0(0) o601->oak-MDT0000-lwp-OST0011_UUID@10.0.2.102@o2ib5:399/0 lens 336/0 e 0 to 0 dl 1546107519 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 29 10:11:29 oak-md1-s2 kernel: LustreError: 3780:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 1 previous similar message Dec 29 10:11:33 oak-md1-s2 kernel: LustreError: 3780:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff881013bfdd00 x1613424070104928/t0(0) o601->oak-MDT0000-lwp-OST000b_UUID@10.0.2.102@o2ib5:403/0 lens 336/0 e 0 to 0 dl 1546107523 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 29 10:11:33 oak-md1-s2 kernel: LustreError: 3780:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 1 previous similar message Dec 29 10:11:33 oak-md1-s2 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. 
Dec 29 10:11:38 oak-md1-s2 kernel: LustreError: 3808:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff88202aece300 x1616898091935616/t0(0) o601->oak-MDT0000-lwp-OST003f_UUID@10.0.2.106@o2ib5:193/0 lens 336/0 e 0 to 0 dl 1546107313 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 29 10:11:38 oak-md1-s2 kernel: LustreError: 3808:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages Dec 29 10:11:39 oak-md1-s2 kernel: Lustre: oak-MDT0000: Denying connection for new client 0e9724d2-14fa-28cb-a075-ae829322ce54(at 10.8.15.5@o2ib6), waiting for 1346 known clients (1332 recovered, 3 in progress, and 0 evicted) to recover in 7:50 Dec 29 10:11:39 oak-md1-s2 kernel: Lustre: Skipped 96 previous similar messages Dec 29 10:11:52 oak-md1-s2 kernel: LustreError: 3808:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff882023e08c50 x1616898091937392/t0(0) o601->oak-MDT0000-lwp-OST005d_UUID@10.0.2.106@o2ib5:207/0 lens 336/0 e 0 to 0 dl 1546107327 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 29 10:11:52 oak-md1-s2 kernel: LustreError: 3808:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages Dec 29 10:11:59 oak-md1-s2 kernel: LustreError: 3632:0:(lod_dev.c:419:lod_sub_recovery_thread()) oak-MDT0001-osp-MDT0000 getting update log failed: rc = -108 Dec 29 10:11:59 oak-md1-s2 kernel: Lustre: oak-MDT0000: disconnecting 14 stale clients Dec 29 10:11:59 oak-md1-s2 kernel: LustreError: 3633:0:(ldlm_lib.c:1640:abort_lock_replay_queue()) @@@ aborted: req@ffff881014cae600 x1612964722859424/t0(0) o101->7980f9e0-a876-9d0e-aaa3-7ea63f72dc7a@10.0.2.228@o2ib5:118/0 lens 328/0 e 0 to 0 dl 1546107238 ref 1 fl Complete:/40/ffffffff rc 0/-1 Dec 29 10:11:59 oak-md1-s2 kernel: Lustre: 3633:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery Dec 29 10:11:59 oak-md1-s2 kernel: Lustre: oak-MDT0000: Recovery over after 1:40, of 1346 clients 1332 recovered and 14 were evicted. <ok> |
| Comment by Stephane Thiell [ 23/Jan/19 ] |
|
Same occurrence tonight on Oak after a MDT0 restarted: [ 250.690265] LustreError: 3751:0:(osd_handler.c:1982:osd_object_release()) LBUG [ 250.698328] Pid: 3751, comm: tgt_recover_0 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 SMP Sat Sep 30 00:44:25 PDT 2017 [ 250.709975] Call Trace: [ 250.712707] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40 [ 250.719422] [<ffffffffc07837cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 250.726719] [<ffffffffc078387c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 250.733626] [<ffffffffc0fc4660>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs] [ 250.742478] [<ffffffffc08e4c4d>] lu_object_put+0x17d/0x3d0 [obdclass] [ 250.749804] [<ffffffffc0b88738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [ 250.758409] [<ffffffffc0ace987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc] [ 250.767364] [<ffffffffc0acf616>] target_recovery_thread+0x616/0x1360 [ptlrpc] [ 250.775447] [<ffffffff810b098f>] kthread+0xcf/0xe0 [ 250.780897] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 [ 250.786932] [<ffffffffffffffff>] 0xffffffffffffffff I've uploaded the vmcore files to the Whamcloud FTP server:
Lustre 2.10.6; the kernel debuginfo files are already there, I think. Thanks! |
| Comment by Peter Jones [ 23/Jan/19 ] |
|
Mike Could you please investigate? Peter |
| Comment by Mikhail Pershin [ 13/Feb/19 ] |
|
In fact this bug consists of two different issues; the stack trace on the MDT is the LU-10581 issue with LinkEA. That issue doesn't exist in master since the landing of commit https://review.whamcloud.com/#/c/31427/, so b2_12 is not affected either. I am updating the patch for LU-10581 for b2_10 now. |
| Comment by Stephane Thiell [ 13/Feb/19 ] |
|
Thank you, Mikhail, for the update, this is much appreciated! We definitely need your patch for 2.10, as the MDS crashes every time we restart the MDT unless I abort the recovery (the good thing is that our 2.10-based "Oak" filesystem is now stable and doesn't need to be restarted very often). |
| Comment by Marek Magrys [ 13/Feb/19 ] |
|
We hit the same issue on our 2.10.6 filesystem at Cyfronet. |
| Comment by Stephane Thiell [ 25/Mar/19 ] |
|
Still a major issue for us with 2.10.6; we hit this today on Oak. The workaround is painful and potentially impacts jobs. Could we increase the severity? MDS crash today, 2.10.6:
[5324499.387259] LustreError: 171003:0:(osd_handler.c:1982:osd_object_release()) LBUG [5324499.412171] Pid: 171003, comm: tgt_recover_0 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 SMP Sat Sep 30 00:44:25 PDT 2017 [5324499.447664] Call Trace: [5324499.456236] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40 [5324499.476575] [<ffffffffc07fd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [5324499.498646] [<ffffffffc07fd87c>] lbug_with_loc+0x4c/0xa0 [libcfs] [5324499.519561] [<ffffffffc0fa8660>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs] [5324499.546206] [<ffffffffc092bc4d>] lu_object_put+0x17d/0x3d0 [obdclass] [5324499.568300] [<ffffffffc0bcf738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [5324499.594126] [<ffffffffc0b15987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc] [5324499.621075] [<ffffffffc0b16616>] target_recovery_thread+0x616/0x1360 [ptlrpc] [5324499.645455] [<ffffffff810b098f>] kthread+0xcf/0xe0 [5324499.662091] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 [5324499.680444] [<ffffffffffffffff>] 0xffffffffffffffff [5324499.697369] Kernel panic - not syncing: LBUG [5324499.711966] CPU: 1 PID: 171003 Comm: tgt_recover_0 Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 [5324499.751171] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017 [5324499.776351] ffff88202ee3de00 00000000f667804a ffff881fea957bf0 ffffffff816a3db1 [5324499.801280] ffff881fea957c70 ffffffff8169dc74 0000000000000008 ffff881fea957c80 [5324499.826210] ffff881fea957c20 00000000f667804a ffff88202ee3dee0 ffff88203c60f8b8 [5324499.851140] Call Trace: [5324499.859734] [<ffffffff816a3db1>] dump_stack+0x19/0x1b [5324499.877192] [<ffffffff8169dc74>] panic+0xe8/0x20d [5324499.893509] [<ffffffffc07fd8cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [5324499.914399] [<ffffffffc0fa8660>] osd_object_release+0x60/0x60 [osd_ldiskfs] [5324499.938161] [<ffffffffc092bc4d>] lu_object_put+0x17d/0x3d0 [obdclass] [5324499.960219] [<ffffffffc0bcf738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [5324499.985986] [<ffffffffc0b15987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc] [5324500.012884] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90 [5324500.032644] [<ffffffffc0b16616>] target_recovery_thread+0x616/0x1360 [ptlrpc] [5324500.056968] [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0 [5324500.075583] [<ffffffffc0b16000>] ? replay_request_or_update.isra.21+0x8c0/0x8c0 [ptlrpc] [5324500.103052] [<ffffffff810b098f>] kthread+0xcf/0xe0 [5324500.119652] [<ffffffff810bf107>] ? finish_task_switch+0x57/0x160 [5324500.140256] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 [5324500.160859] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90 [5324500.179174] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 |
| Comment by Mikhail Pershin [ 27/Mar/19 ] |
|
FYI, it is not trivial to just update the patch for 2.10; it is incomplete and needs more time to finish. I am working on that right now. |
| Comment by Peter Jones [ 27/Mar/19 ] |
|
sthiell would you consider moving Oak to 2.12.x once 2.12.1 is out? |
| Comment by Stephane Thiell [ 30/Mar/19 ] |
|
@Mikhail Pershin I may be wrong, but I believe this issue started after playing a bit with lfs migrate (between MDTs...). Since then I have stopped playing with it in 2.10. It would be nice to at least avoid the LBUG and print an error instead, if possible. @Peter Jones Even though we won't use all of the 2.12 features on Oak (DOM, DNE2) like we do with Fir, 2.12.1 might still be a little bit early for Oak, which is our longer-term storage system and has to be super stable. It's getting tough to make changes on our systems right now due to all the recent problems with 2.12.0, so we're being very, very careful.
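To illustrate the "print an error instead of LBUG" suggestion, here is a purely hypothetical sketch (not an actual Lustre patch; object_state_is_unexpected() is a made-up placeholder for whatever condition the real assertion in osd_object_release() checks):
```c
/*
 * Hypothetical sketch only, NOT an actual Lustre patch: report the bad
 * object state instead of asserting.  object_state_is_unexpected() is a
 * made-up placeholder for whatever condition the real LASSERT checks.
 */
static void osd_object_release(const struct lu_env *env, struct lu_object *l)
{
	struct osd_object *o = osd_obj(l);

	if (object_state_is_unexpected(o))
		CERROR("object "DFID" released in unexpected state\n",
		       PFID(lu_object_fid(l)));
}
```
Whether simply logging and continuing would be safe is a separate question; the sketch only illustrates the shape of the change being requested. |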
| Comment by Stephane Thiell [ 03/Apr/19 ] |
|
We performed a rolling upgrade of all Oak servers to 2.10.7 last night. Just to let you know that this issue is still there in 2.10.7, MDT000 crashed during recovery: [ 90.777835] libcfs: loading out-of-tree module taints kernel. [ 90.786635] libcfs: module verification failed: signature and/or required key missing - tainting kernel [ 90.801597] LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2 [ 90.809614] alg: No test for adler32 (adler32-zlib) [ 91.590737] Lustre: Lustre: Build Version: 2.10.7 [ 91.686711] LNet: Using FMR for registration [ 91.698137] LNetError: 165:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 10.0.2.109@o2ib5 on NA (ib0:0:10.0.2.51): bad dst nid 10.0.2.51@o2ib5 [ 91.712588] LNet: Added LNI 10.0.2.51@o2ib5 [8/256/0/180] [ 92.848866] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc [ 93.110454] Lustre: MGS: Connection restored to 5a5298f6-111d-c8be-fa51-aba1e8acbddd (at 0@lo) [ 113.462988] Lustre: MGS: Connection restored to (at 10.9.113.13@o2ib4) [ 115.015200] Lustre: MGS: Connection restored to e853ed08-0e59-c1f0-46ec-bddf7e82efae (at 10.9.104.67@o2ib4) [ 117.020952] Lustre: MGS: Connection restored to f9de0af1-581e-3c57-4892-885381db22e1 (at 10.8.12.15@o2ib6) [ 117.031732] Lustre: Skipped 101 previous similar messages [ 121.029153] Lustre: MGS: Connection restored to 52dc5c23-0e5a-b147-6607-824396a7a5bc (at 10.8.13.5@o2ib6) [ 121.039839] Lustre: Skipped 349 previous similar messages [ 129.049808] Lustre: MGS: Connection restored to bd8677c5-3675-5e1d-caea-743b6e85d5b7 (at 10.9.107.51@o2ib4) [ 129.060688] Lustre: Skipped 332 previous similar messages [ 145.222869] Lustre: MGS: Connection restored to b241d3bb-50fb-b0ec-e874-28b5f819cd5f (at 10.9.102.47@o2ib4) [ 145.233747] Lustre: Skipped 498 previous similar messages [ 179.936861] Lustre: MGS: Connection restored to 5a252fac-dd19-be49-5427-6758e870f019 (at 10.9.101.34@o2ib4) [ 179.947744] Lustre: Skipped 39 previous similar messages [ 244.770361] Lustre: MGS: Connection restored to 45eed64b-8718-bf1a-d8cc-70f51513bbfa (at 10.12.4.64@o2ib) [ 244.781047] Lustre: Skipped 73 previous similar messages [ 287.809651] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc [ 289.726318] LustreError: 137-5: oak-MDT0000_UUID: not available for connect from 10.9.101.10@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 289.745753] LustreError: Skipped 2 previous similar messages [ 289.863527] Lustre: oak-MDT0000: Not available for connect from 10.8.29.4@o2ib6 (not set up) [ 291.229416] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. 
[ 291.247566] LustreError: Skipped 2 previous similar messages [ 291.424665] Lustre: oak-MDT0000: Imperative Recovery not enabled, recovery window 300-900 [ 292.337349] Lustre: oak-MDD0000: changelog on [ 292.443825] Lustre: oak-MDT0000: Will be in recovery for at least 5:00, or until 1405 clients reconnect [ 293.154869] Lustre: oak-MDT0000: nosquash_nids set to 10.0.2.[1-3]@o2ib5 10.0.2.[51-58]@o2ib5 10.0.2.[66-69]@o2ib5 10.0.2.71@o2ib5 10.0.2.[101-120]@o2ib5 10.0.2.[221-223]@o2ib5 10.0.2.[226-227]@o2ib5 10.0.2.[ [ 293.205283] Lustre: oak-MDT0000: root_squash is set to 99:99 [ 293.403991] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.109@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 293.423306] LustreError: Skipped 7 previous similar messages [ 296.423425] Lustre: 22004:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554299393/real 0] req@ffff926559b05100 x1629800831651440/t0(0) o8->oak-OST0071-os [ 299.444490] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.105@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 299.449179] LustreError: 22721:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff925537622100 x1629798651808976/t0(0) o601->oak-MDT0000-lwp-OST005e_UUID@10.0.2.105 [ 299.492733] LustreError: Skipped 15 previous similar messages [ 300.705804] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.106@o2ib5: 17 seconds [ 301.705919] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.110@o2ib5: 4294968 seconds [ 301.717765] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message [ 302.439010] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299393/real 1554299393] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 302.468600] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 32 previous similar messages [ 308.114494] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.101@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 308.133806] LustreError: Skipped 7 previous similar messages [ 308.230556] LustreError: 22737:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff926555e87800 x1629795618378928/t0(0) o601->oak-MDT0000-lwp-OST0012_UUID@10.0.2.101 [ 308.259468] LustreError: 22737:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages [ 313.480093] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299404/real 1554299404] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 313.509688] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 11 previous similar messages [ 313.706985] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.106@o2ib5: 4294980 seconds [ 318.406286] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.109@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. 
[ 318.425613] LustreError: Skipped 7 previous similar messages [ 324.450857] LustreError: 22781:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff92553e7c7c50 x1629798651812416/t0(0) o601->oak-MDT0000-lwp-OST0054_UUID@10.0.2.105 [ 324.521126] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299416/real 1554299416] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 324.550719] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 11 previous similar messages [ 326.708254] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.106@o2ib5: 43 seconds [ 326.719614] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages [ 329.317873] LustreError: 22737:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff92655a2cb600 x1629795618386208/t0(0) o601->oak-MDT0000-lwp-OST002c_UUID@10.0.2.101 [ 329.346793] LustreError: 22737:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages [ 333.501981] LustreError: 22777:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff925535f90f00 x1629762005152560/t0(0) o601->oak-MDT0000-lwp-OST0025_UUID@10.0.2.102 [ 333.530897] LustreError: 22777:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 13 previous similar messages [ 335.562147] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299427/real 1554299427] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 335.591737] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 11 previous similar messages [ 338.709384] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.106@o2ib5: 4295005 seconds [ 338.721226] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages [ 343.408615] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.109@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. 
[ 343.427930] LustreError: Skipped 31 previous similar messages [ 346.603126] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299438/real 1554299438] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 346.632716] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 11 previous similar messages [ 347.345895] LustreError: 22782:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff92553deb6c50 x1629762005168800/t0(0) o601->oak-MDT0000-lwp-OST002b_UUID@10.0.2.102 [ 351.710546] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.0.2.106@o2ib5: 11 seconds [ 351.721904] LNet: 21991:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message [ 368.235816] LustreError: 22738:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff9265610cc050 x1629795618411280/t0(0) o601->oak-MDT0000-lwp-OST0012_UUID@10.0.2.101 [ 368.264766] LustreError: 22738:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 25 previous similar messages [ 368.645150] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299460/real 1554299460] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 368.674740] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 23 previous similar messages [ 374.451347] Lustre: oak-MDT0000: Connection restored to d5274aad-c3ee-d395-ef4f-9ea80149e013 (at 10.0.2.105@o2ib5) [ 374.451348] Lustre: oak-MDT0000: Connection restored to d5274aad-c3ee-d395-ef4f-9ea80149e013 (at 10.0.2.105@o2ib5) [ 374.451351] Lustre: Skipped 1460 previous similar messages [ 374.480642] Lustre: Skipped 15 previous similar messages [ 383.121257] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.101@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 383.140573] LustreError: Skipped 83 previous similar messages [ 401.688386] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554299493/real 1554299493] req@ffff9265581aad00 x1629800831657520/t0(0) o104->MGS [ 401.717985] Lustre: 22129:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 35 previous similar messages [ 407.351522] LustreError: 22786:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff92557af0d850 x1629762005177968/t0(0) o601->oak-MDT0000-lwp-OST002b_UUID@10.0.2.102 [ 407.380442] LustreError: 22786:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 22 previous similar messages [ 445.732641] LustreError: 22129:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 10.8.13.7@o2ib6) failed to reply to blocking AST (req@ffff9265581aad00 x1629800831657520 status 0 rc -110), evict it [ 445.776303] LustreError: 138-a: MGS: A client on nid 10.8.13.7@o2ib6 was evicted due to a lock blocking callback time out: rc -110 [ 449.458361] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.105@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. 
[ 449.477686] LustreError: Skipped 168 previous similar messages [ 478.215062] LustreError: 22264:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff92656aaaaa00 x1629795618414336/t0(0) o601->oak-MDT0000-lwp-OST0028_UUID@10.0.2.101 [ 478.243981] LustreError: 22264:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 59 previous similar messages [ 583.140074] LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.2.101@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server. [ 583.159398] LustreError: Skipped 265 previous similar messages [ 593.556078] Lustre: oak-MDT0000: recovery is timed out, evict stale exports [ 593.563979] Lustre: oak-MDT0000: disconnecting 12 stale clients [ 593.900136] LustreError: 22549:0:(osd_handler.c:1982:osd_object_release()) LBUG [ 593.908297] Pid: 22549, comm: tgt_recover_0 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018 [ 593.919656] Call Trace: [ 593.922412] [<ffffffffc08487cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 593.929717] [<ffffffffc084887c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 593.936628] [<ffffffffc1075820>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs] [ 593.945485] [<ffffffffc09aa1e0>] lu_object_put+0x190/0x3e0 [obdclass] [ 593.952816] [<ffffffffc0c50748>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [ 593.961435] [<ffffffffc0b94abd>] replay_request_or_update.isra.21+0x24d/0x8c0 [ptlrpc] [ 593.970401] [<ffffffffc0b95746>] target_recovery_thread+0x616/0x1370 [ptlrpc] [ 593.978488] [<ffffffff946bdf21>] kthread+0xd1/0xe0 [ 593.983942] [<ffffffff94d255f7>] ret_from_fork_nospec_end+0x0/0x39 [ 593.990951] [<ffffffffffffffff>] 0xffffffffffffffff [ 593.996519] Kernel panic - not syncing: LBUG [ 594.001283] CPU: 15 PID: 22549 Comm: tgt_recover_0 Kdump: loaded Tainted: G OE ------------ 3.10.0-862.14.4.el7_lustre.x86_64 #1 [ 594.015451] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017 [ 594.023800] Call Trace: [ 594.026527] [<ffffffff94d13754>] dump_stack+0x19/0x1b [ 594.032260] [<ffffffff94d0d29f>] panic+0xe8/0x21f [ 594.037606] [<ffffffffc08488cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [ 594.044504] [<ffffffffc1075820>] osd_object_release+0x60/0x60 [osd_ldiskfs] [ 594.052380] [<ffffffffc09aa1e0>] lu_object_put+0x190/0x3e0 [obdclass] [ 594.059698] [<ffffffffc0c50748>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc] [ 594.068256] [<ffffffffc0b94abd>] replay_request_or_update.isra.21+0x24d/0x8c0 [ptlrpc] [ 594.077190] [<ffffffff946c7c4b>] ? __wake_up_common+0x5b/0x90 [ 594.083712] [<ffffffffc0b95746>] target_recovery_thread+0x616/0x1370 [ptlrpc] [ 594.091786] [<ffffffffc0b95130>] ? replay_request_or_update.isra.21+0x8c0/0x8c0 [ptlrpc] [ 594.100912] [<ffffffff946bdf21>] kthread+0xd1/0xe0 [ 594.106352] [<ffffffff946bde50>] ? insert_kthread_work+0x40/0x40 [ 594.113151] [<ffffffff94d255f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 594.120435] [<ffffffff946bde50>] ? insert_kthread_work+0x40/0x40 |
| Comment by Peter Jones [ 06/Apr/19 ] |
|
Stephane, I think that this is expected news, but if Mike's analysis is correct (and I have no reason to suspect otherwise) then this issue should not occur in the upcoming 2.12.1 release. Peter |
| Comment by Stephane Thiell [ 06/Apr/19 ] |
|
OK Peter, thanks for the update. That sounds good for 2.12.1, but what about fixing the LBUG in 2.10, or at least working around it? I get this LBUG at the end of recovery each time I mount MDT0000 on Oak, unless I do a late manual abort_recovery. That's not really convenient. |
| Comment by Mahmoud Hanafi [ 22/May/19 ] |
|
We hit this at NASA. Are there plans for a 2.10.6 patch? |
| Comment by Andreas Dilger [ 07/Jun/19 ] |
|
It looks like this one might be a duplicate of LU-10581, which has a patch. |
| Comment by Mikhail Pershin [ 31/Jul/19 ] |
|
Note that there are two different problems hitting the same assertion, one on the MDT and one on the OST. The first, MDT problem is fixed in 2.12 along with the massive DNE update, but it still exists in 2.10. The second problem happens on the OST, with a stack trace containing ofd_lvbo_update. Please be specific about which particular issue you have encountered, and include the stack trace when possible.
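For reference, the assertion that fires is the release hook in osd-ldiskfs. A paraphrased sketch of it follows; field names such as oo_destroyed and oo_inode are recalled from that era's code and should be treated as assumptions, not a verbatim copy of any particular release:
```c
/*
 * Paraphrased sketch of the osd-ldiskfs release hook that produces the
 * LBUG above; field names are an assumption, not a verbatim copy.
 */
static void osd_object_release(const struct lu_env *env, struct lu_object *l)
{
	struct osd_object *o = osd_obj(l);

	/*
	 * The object is being dropped from the LU object cache while its
	 * inode has nlink == 0 but was never marked destroyed by this OSD;
	 * releasing it in that state would make ldiskfs treat the inode as
	 * deleted, so the code asserts instead of continuing.
	 */
	LASSERT(!(o->oo_destroyed == 0 && o->oo_inode &&
		  o->oo_inode->i_nlink == 0));
}
```
In all of the traces above it is lu_object_put(), called from recovery (distribute_txn_replay_handle), lock cancellation (ofd_lvbo_update) or getattr handling, that releases such an object and trips the assertion. |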
| Comment by Mikhail Pershin [ 14/Dec/19 ] |
|
The problems with osd_object_release() on OST and MDT can be caused by |
| Comment by Andreas Dilger [ 04/Dec/20 ] |
|
This assertion was removed by patch https://review.whamcloud.com/40058 "LU-13980 osd: remove osd_object_release LASSERT". |