[LU-16052] conf-sanity test_106: crash after osp_sync_process_queues failed: -53 Created: 27/Jul/22  Updated: 02/Aug/23  Resolved: 12/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Gantt End to Start
Related
is related to LU-15938 MDT recovery did not finish due to co... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite runs:
https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12

https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f

https://testing.whamcloud.com/gerrit-janitor/24377/testresults/

test_106 crashed after hitting "osp_sync_process_queues() failed: -53". This has hit 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so it likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about 1/23 = 4% failure rate.

Each stack trace is slightly different, so this likely relates to random memory/stack corruption or a use-after-free.

[14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
[14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
[15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
[15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
[15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
[15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
[15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
[15453.773552] Call Trace:
[15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
[15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
[15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
[15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
[15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
[15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass]
[15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp]
[15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp]
[15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod]
[15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
[15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
[15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod]
[15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd]
[15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd]
[15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
[15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
[15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt]
[15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt]
[15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
[15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
[15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
[15453.799726] [<0>] kthread+0xd1/0xe0
[32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
[32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[32916.247614] general protection fault: 0000 [#1] SMP PTI
[32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
[32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
[32916.268987] Call Trace:
[32916.269576]  llog_cat_prep_log+0x4f/0x3c0 [obdclass]
[32916.270584]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
[32916.272788]  llog_declare_add+0x187/0x1d0 [obdclass]
[32916.273842]  osp_sync_declare_add+0x1c2/0x460 [osp]
[32916.274823]  osp_declare_destroy+0x15f/0x230 [osp]
[32916.275897]  lod_sub_declare_destroy+0x195/0x310 [lod]
[32916.276932]  lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
[32916.277968]  lod_declare_destroy+0x4f1/0x500 [lod]
[32916.279953]  mdd_declare_finish_unlink+0xa9/0x250 [mdd]
[32916.281003]  mdd_unlink+0x46b/0xbe0 [mdd]
[32916.281992]  mdt_reint_unlink+0xd43/0x1570 [mdt]
[32916.282939]  mdt_reint_rec+0x11f/0x250 [mdt]
[32916.283819]  mdt_reint_internal+0x498/0x780 [mdt]
[32916.284776]  mdt_reint+0x5e/0x100 [mdt]
[32916.286004]  tgt_request_handle+0xc90/0x1970 [ptlrpc]
[32916.288289]  ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
[32916.289495]  ptlrpc_main+0xba6/0x14a0 [ptlrpc]
[32916.292328]  kthread+0x116/0x130
[  335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[  335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
[  503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[  503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
[  660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[  660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
[  675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[  680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed: 
[  680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
[  680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
[  680.149556] Call Trace:
[  680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[  680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[  680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
[  680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass]
[  680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass]
[  680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass]
[  680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp]
[  680.161819] [<0>] osp_sync_add+0x47/0x50 [osp]
[  680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp]
[  680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod]
[  680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
[  680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod]
[  680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod]
[  680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd]
[  680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd]
[  680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
[  680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt]
[  680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
[  680.180533] [<0>] mdt_reint+0x67/0x150 [mdt]
[  680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
[  680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
[  680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
[  680.187608] [<0>] kthread+0xe4/0xf0

The common factor in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698 "LU-15938 lod: prevent endless retry in recovery thread", which landed on 2022-07-18 and added the -53 = -EBADR error return in llog_osd_next_block().
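
For reference, a hedged sketch of the suspected error propagation, written as an annotated call chain. The function names come from the stack traces and console messages above; the intermediate callers and exact behaviour are assumptions, not verified code:

/*
 * llog_osd_next_block()        returns -EBADR (-53) for a bad llog block
 *   -> llog_process_thread()   propagates the error instead of skipping
 *     -> llog_cat_process()    returns -EBADR to its caller
 *       -> osp_sync_thread()   prints "llog process with
 *                              osp_sync_process_queues failed: -53",
 *                              then tears down its llog context and exits
 *                              while other MDT threads are still using it
 */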

Relevant patches landed between 2022-07-15 and 2022-07-19:

dd670d968a LU-15984 o2iblnd: debug message is missing a newline
1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread  ****
5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
9153049bdc LU-15925 lnet: add debug messages for IB
1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
40daa59ac4 LU-15880 quota: fix issues in reserving quota
ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails
566edb8c43 LU-8582 target: send error reply on wrong opcode
b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure
66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
f238540c87 LU-15913 mdt: disable parallel rename for striped dirs

Patches landed between 2022-07-10..2022-07-14 (outside the probable window, given the test fails about every 4 days) don't show as much of interest:

210803a247 LU-15653 client: able to cleanup devices manually
a21ce928aa LU-15894 ofd: revert range locking in ofd
b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
98ba508190 LU-15779 ofd: don't hold read lock over bulk
0396310692 LU-15727 lod: honor append_pool with default composite layouts
71d63602c5 LU-15922 sec: new connect flag for name encryption
9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
6ab060e58e LU-14555 lnet: asym route inconsistency warning
aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
e7ce67de92 LU-15451 sec: read-only nodemap flag
8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
530861b344 LU-15926 nrs: fix tbf realtime rules
78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
192902851d LU-11695 som: disabling xattr cache for LSOM on client
1ac4b9598a LU-15720 dne: add crush2 hash type

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106



 Comments   
Comment by Gerrit Updater [ 27/Jul/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48054
Subject: LU-16052 tests: run conf-sanity test_106 for debug2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a1b65c93eba00247f54d211618d4bc48252df02

Comment by Gerrit Updater [ 27/Jul/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48055
Subject: LU-16052 tests: run conf-sanity test_106 for debug1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 617c188cb3fd33be3893093ef0d1826293d0907f

Comment by Andreas Dilger [ 28/Jul/22 ]

Hi Mike,
could you please take a look at this crash. The test patch passed 50/50 runs right before your LU-15938 patch, but crashed 2/41 runs on the tip of master. I've rebased the "after" patch to be right on top of the LU-15938 patch to confirm it isn't some bad interaction with another patch after that, though there are only 8 other patches afterward, and most are unrelated.

Comment by Gerrit Updater [ 28/Jul/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48070
Subject: LU-16052 osp: don't cleanup llog context in use
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fd497004b9815a5ae2dd41460ffd84f4dc3e6555

Comment by Mikhail Pershin [ 28/Jul/22 ]

I think the crashes are only triggered by the recent patch from LU-15938, but the problem existed before; the patch just creates the conditions for it to happen. The problem is that osp_sync_thread() uses llog_cleanup() instead of llog_ctxt_put(). That performs a complete llog context cleanup, which should normally only be done on device cleanup, not on an active device. As a result the llog_handle pointer in the llog_ctxt becomes invalid, and concurrent threads can see various corruptions like those in the traces above.
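
A minimal sketch of that difference, assuming the standard llog context APIs; this is illustrative only, not a verbatim excerpt of osp_sync.c or of the patch:

/* What osp_sync_thread() did on exit: a full context teardown.  Only safe
 * during device cleanup, because it invalidates the llog handle kept in
 * the context for every other thread still adding records through it. */
static void sync_thread_exit_old_sketch(const struct lu_env *env,
                                        struct llog_ctxt *ctxt)
{
        llog_cleanup(env, ctxt);
}

/* What https://review.whamcloud.com/48070 amounts to: only drop this
 * thread's reference, so the context and its llog handle remain valid for
 * concurrent users of the still-active device. */
static void sync_thread_exit_new_sketch(struct llog_ctxt *ctxt)
{
        llog_ctxt_put(ctxt);
}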

Comment by Mikhail Pershin [ 28/Jul/22 ]

So far I've pushed a patch to prove that; meanwhile I'll check why llog_cleanup() was used originally, and whether that was intentional. I'd also add -EBADR handling in llog_process_thread() instead of returning it to the caller. So the final patch is in progress.
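
For illustration, a hedged sketch of such -EBADR handling: treat a corrupt plain llog as "skip this log and keep going" rather than bubbling the error up to osp_sync_thread(). The wrapper function, its placement, and the variable names are assumptions for this ticket, not the final patch:

static int plain_llog_process_sketch(const struct lu_env *env,
                                     struct llog_handle *llh,
                                     llog_cb_t cb, void *data)
{
        int rc;

        rc = llog_process(env, llh, cb, data, NULL);
        if (rc == -EBADR) {
                /* One plain llog in the catalog is corrupt: warn and skip
                 * it so catalog processing (and the OSP sync thread) can
                 * continue instead of dying with -53. */
                CWARN("%s: skipping corrupt plain llog: rc = %d\n",
                      llh->lgh_ctxt->loc_obd->obd_name, rc);
                rc = 0;
        }
        return rc;
}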

Comment by Mikhail Pershin [ 28/Jul/22 ]

It is not as simple as it seems. Interestingly, osp_sync_thread() can now exit whenever llog_process_thread() returns an error. Previously it would assert if the exit wasn't due to the umount process, but now it just exits with an error message. The problem is that the thread is never restarted, so the server is left with no sync thread at all and is basically not operational after that. Originally the thread exited only on umount, which is why it calls llog_cleanup() there: it cleans up the context at the very end, and that was correct. But now the thread can exit while the server is active, causing the crashes we are seeing, so this is not much better than the assertion. I am not sure how to handle this better.

Considering that the server can't operate normally without the sync thread, it seems we shouldn't exit until umount in any case, or else the thread must be restarted. I don't see how a restart would be better, so we shouldn't exit. In that case we could just start reprocessing again (which will hit the same error); probably the best we can do is a full llog cleanup. See the sketch below for one possible shape of the "do not exit" option.
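
Purely for illustration, one possible shape of the "do not exit until umount" option. sync_thread_should_run() is a hypothetical stand-in for whatever check the sync thread uses to detect umount, and this is not the code that eventually landed:

        /* Inside the sync thread's main loop -- sketch only. */
        do {
                rc = llog_cat_process(env, llh, osp_sync_process_queues,
                                      d, 0, 0);
                if (rc < 0 && !sync_thread_should_run(d))
                        break;  /* umount in progress: normal exit */
                if (rc < 0)
                        /* Processing failed while the device is still
                         * active: wait and retry rather than leaving the
                         * MDT without a sync thread. */
                        schedule_timeout_interruptible(cfs_time_seconds(1));
        } while (sync_thread_should_run(d));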

Comment by Mikhail Pershin [ 29/Jul/22 ]

So far I have fixed the problem with the -EBADR handling causing crashes. An additional patch is needed to handle the error in osp_sync_thread() without exiting the thread.

Comment by Peter Jones [ 12/Sep/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 05/Oct/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48772
Subject: LU-16052 llog: handle -EBADR for catalog processing
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: ed025edc7045a1edd4ed0a528ada48306b4bf150

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48772/
Subject: LU-16052 llog: handle -EBADR for catalog processing
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: b996d1e0276fdf6c084410cd1dcfac0df13437fe
