conf-sanity test_106: crash after osp_sync_process_queues failed: -53


    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite runs:
      https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12

      https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f

      https://testing.whamcloud.com/gerrit-janitor/24377/testresults/

      test_106 crashed after hitting "osp_sync_process_queues() failed: -53". This has happened 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so it likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about a 1/23 = 4% failure rate.

      Each stack trace is slightly different, so this likely relates to random memory/stack corruption or a use-after-free.

      [14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
      [14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
      [15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
      [15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
      [15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
      [15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
      [15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
      [15453.773552] Call Trace:
      [15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
      [15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
      [15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
      [15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
      [15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
      [15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass]
      [15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp]
      [15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp]
      [15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod]
      [15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
      [15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
      [15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod]
      [15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd]
      [15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd]
      [15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
      [15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt]
      [15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt]
      [15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
      [15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
      [15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
      [15453.799726] [<0>] kthread+0xd1/0xe0
      
      [32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
      [32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [32916.247614] general protection fault: 0000 [#1] SMP PTI
      [32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
      [32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
      [32916.268987] Call Trace:
      [32916.269576]  llog_cat_prep_log+0x4f/0x3c0 [obdclass]
      [32916.270584]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
      [32916.272788]  llog_declare_add+0x187/0x1d0 [obdclass]
      [32916.273842]  osp_sync_declare_add+0x1c2/0x460 [osp]
      [32916.274823]  osp_declare_destroy+0x15f/0x230 [osp]
      [32916.275897]  lod_sub_declare_destroy+0x195/0x310 [lod]
      [32916.276932]  lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
      [32916.277968]  lod_declare_destroy+0x4f1/0x500 [lod]
      [32916.279953]  mdd_declare_finish_unlink+0xa9/0x250 [mdd]
      [32916.281003]  mdd_unlink+0x46b/0xbe0 [mdd]
      [32916.281992]  mdt_reint_unlink+0xd43/0x1570 [mdt]
      [32916.282939]  mdt_reint_rec+0x11f/0x250 [mdt]
      [32916.283819]  mdt_reint_internal+0x498/0x780 [mdt]
      [32916.284776]  mdt_reint+0x5e/0x100 [mdt]
      [32916.286004]  tgt_request_handle+0xc90/0x1970 [ptlrpc]
      [32916.288289]  ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
      [32916.289495]  ptlrpc_main+0xba6/0x14a0 [ptlrpc]
      [32916.292328]  kthread+0x116/0x130
      
      [  335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
      [  503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
      [  660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
      [  675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [  680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed: 
      [  680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
      [  680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
      [  680.149556] Call Trace:
      [  680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [  680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [  680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
      [  680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass]
      [  680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass]
      [  680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass]
      [  680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp]
      [  680.161819] [<0>] osp_sync_add+0x47/0x50 [osp]
      [  680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp]
      [  680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod]
      [  680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
      [  680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod]
      [  680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod]
      [  680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd]
      [  680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd]
      [  680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
      [  680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt]
      [  680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [  680.180533] [<0>] mdt_reint+0x67/0x150 [mdt]
      [  680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
      [  680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
      [  680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
      [  680.187608] [<0>] kthread+0xe4/0xf0
      

      The common factor in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698 "LU-15938 lod: prevent endless retry in recovery thread", which landed on 2022-07-18 and added the -53 = -EBADR error return in llog_osd_next_block().
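
      The shape of that change can be modeled with a minimal standalone sketch (stub types and a made-up progress check, not the actual llog_osd_next_block() code): once the scan can no longer advance to a valid next block, the function fails with -EBADR instead of retrying, and that -53 then surfaces via osp_sync_process_queues()/osp_sync_thread() as seen in the logs above.

      /* Illustrative sketch only -- stub types, not the real llog_osd_next_block(). */
      #include <errno.h>
      #include <stdio.h>

      #ifndef EBADR
      #define EBADR 53                /* -EBADR == -53, the error in the logs above */
      #endif

      struct block_iter_sketch {
              long long cur_offset;   /* offset of the block just examined */
              long long next_offset;  /* offset the scan managed to advance to */
      };

      /*
       * Modeled loosely on the LU-15938 change: if the scan cannot advance past
       * the current block, fail with -EBADR instead of retrying forever, so the
       * caller's llog processing loop gives up rather than spinning.
       */
      static int next_block_sketch(struct block_iter_sketch *it)
      {
              if (it->next_offset <= it->cur_offset)
                      return -EBADR;  /* no forward progress: bail out */
              it->cur_offset = it->next_offset;
              return 0;
      }

      int main(void)
      {
              struct block_iter_sketch it = { .cur_offset = 8192, .next_offset = 8192 };

              /* mirrors "llog process with osp_sync_process_queues failed: -53" */
              printf("llog processing stopped: %d\n", next_block_sketch(&it));
              return 0;
      }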

      Relevant patches landed between 2022-07-15 and 2022-07-19:

      dd670d968a LU-15984 o2iblnd: debug message is missing a newline
      1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread  ****
      5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
      9153049bdc LU-15925 lnet: add debug messages for IB
      1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
      40daa59ac4 LU-15880 quota: fix issues in reserving quota
      ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails
      566edb8c43 LU-8582 target: send error reply on wrong opcode
      b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
      54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure
      66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
      f238540c87 LU-15913 mdt: disable parallel rename for striped dirs
      

      Patches landed between 2022-07-10..2022-07-14 (less likely candidates, given the test fails roughly every 4 days) don't show as much of interest:

      210803a247 LU-15653 client: able to cleanup devices manually
      a21ce928aa LU-15894 ofd: revert range locking in ofd
      b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
      98ba508190 LU-15779 ofd: don't hold read lock over bulk
      0396310692 LU-15727 lod: honor append_pool with default composite layouts
      71d63602c5 LU-15922 sec: new connect flag for name encryption
      9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
      bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
      6ab060e58e LU-14555 lnet: asym route inconsistency warning
      aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
      e7ce67de92 LU-15451 sec: read-only nodemap flag
      8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
      530861b344 LU-15926 nrs: fix tbf realtime rules
      78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
      192902851d LU-11695 som: disabling xattr cache for LSOM on client
      1ac4b9598a LU-15720 dne: add crush2 hash type
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106

          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48772/
            Subject: LU-16052 llog: handle -EBADR for catalog processing
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: b996d1e0276fdf6c084410cd1dcfac0df13437fe


            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48772
            Subject: LU-16052 llog: handle -EBADR for catalog processing
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: ed025edc7045a1edd4ed0a528ada48306b4bf150

            pjones Peter Jones added a comment -

            Landed for 2.16


            tappro Mikhail Pershin added a comment -

            So far I have fixed the problem with -EBADR handling causing crashes. An additional patch is needed to handle errors in osp_sync_thread() without exiting the thread.
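
            As a rough illustration of that direction (a standalone sketch with stub functions, not the real osp_sync_thread()), the sync loop would report the error and keep going until umount rather than returning:

            /* Illustrative sketch only -- stub logic, not the real osp_sync_thread(). */
            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            #ifndef EBADR
            #define EBADR 53
            #endif

            /* stand-in for the umount flag checked by the real thread */
            static int umount_requested(int iteration)
            {
                    return iteration >= 3;  /* stop the demo after a few passes */
            }

            /* stand-in for osp_sync_process_queues()/llog processing */
            static int process_queues_sketch(void)
            {
                    return -EBADR;          /* simulate the failure from the logs */
            }

            int main(void)
            {
                    int i;

                    for (i = 0; !umount_requested(i); i++) {
                            int rc = process_queues_sketch();

                            if (rc == 0)
                                    continue;
                            /* Do not leave the thread on error: the MDT cannot operate
                             * without its sync thread, so report the failure, wait,
                             * and try again. */
                            fprintf(stderr, "llog process failed: %d, retrying\n", rc);
                            sleep(1);
                    }
                    /* context/handle teardown would happen here, on the umount path only */
                    return 0;
            }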

            tappro Mikhail Pershin added a comment - - edited

            It is not as simple as it seems. Interestingly, osp_sync_thread() can exit whenever llog_process_thread() returns an error. Previously it would assert if that wasn't due to the umount process, but now it just exits with an error message. The problem is that the thread is never restarted, so the server is left without a sync thread at all and is basically not operational after that. Originally the thread exited only on umount, which is why it calls llog_cleanup() here; it cleans up the context after everything else, and that was correct. But now the thread can exit while the server is active, causing the crashes we are seeing. So this is not much better than the assertion. I am not sure how to handle it better.

            Considering that the server can't operate normally without the sync thread, it seems we shouldn't exit in any case until umount, or else the thread must be restarted. I don't see how a restart would be better, so we shouldn't exit. In that case we would just start reprocessing again (which will hit the same error). Probably the best we can do is a full llog cleanup.

            tappro Mikhail Pershin added a comment - - edited

            So far I've pushed a patch to prove that; meanwhile I will check why llog_cleanup() was used there originally, whether that was intentional or not. I would also add -EBADR handling in llog_process_thread() instead of returning it to the caller. The final patch is in progress.
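
            A minimal sketch of what handling -EBADR inside the processing loop (rather than returning it to the caller) could look like, assuming it is treated as "no further valid blocks" -- stub functions only, not the actual llog_process_thread():

            /* Illustrative sketch only -- stubs, not the real llog_process_thread(). */
            #include <errno.h>
            #include <stdio.h>

            #ifndef EBADR
            #define EBADR 53
            #endif

            /* stand-in for llog_next_block(): the second chunk is no longer reachable */
            static int next_block_sketch(int chunk)
            {
                    return chunk < 1 ? 0 : -EBADR;
            }

            static int process_loop_sketch(void)
            {
                    int chunk;

                    for (chunk = 0; ; chunk++) {
                            int rc = next_block_sketch(chunk);

                            if (rc == -EBADR) {
                                    /* Assumed handling: treat "no further valid block"
                                     * as the end of this llog instead of propagating a
                                     * fatal error to the caller. */
                                    return 0;
                            }
                            if (rc < 0)
                                    return rc;
                            printf("processed chunk %d\n", chunk);
                    }
            }

            int main(void)
            {
                    printf("process loop rc = %d\n", process_loop_sketch());
                    return 0;
            }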


            tappro Mikhail Pershin added a comment -

            I think the crashes are merely triggered by the recent patch from LU-15938, but the problem existed before; the patch just creates the conditions for it to show up. The problem is that osp_sync_thread() uses llog_cleanup() instead of llog_ctxt_put(). That performs a complete llog context cleanup, which is normally done on device cleanup but not on an active device. As a result the llog_handle pointer in the llog_ctxt becomes invalid, and concurrent processes may see various corruptions like those in the traces above.
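
            The difference can be sketched with a toy refcount model (stub types, not the real obdclass code): dropping a reference leaves the context and its cached llog handle usable by other threads, while a full cleanup tears them down, so any concurrent llog user ends up working on freed or invalid state.

            /* Toy refcount model -- stub types, not the real obdclass llog code. */
            #include <stdio.h>
            #include <stdlib.h>

            struct handle_sketch { int in_use; };

            struct ctxt_sketch {
                    int refcount;
                    struct handle_sketch *loc_handle;   /* cached open llog handle */
            };

            /* like llog_ctxt_put(): drop one reference, context stays usable */
            static void ctxt_put_sketch(struct ctxt_sketch *ctxt)
            {
                    ctxt->refcount--;
            }

            /* like llog_cleanup(): full teardown, intended for device cleanup only */
            static void cleanup_sketch(struct ctxt_sketch *ctxt)
            {
                    free(ctxt->loc_handle);
                    ctxt->loc_handle = NULL;   /* concurrent users now hold stale state */
                    ctxt->refcount = 0;
            }

            int main(void)
            {
                    struct ctxt_sketch ctxt = { .refcount = 2 };

                    ctxt.loc_handle = calloc(1, sizeof(*ctxt.loc_handle));

                    ctxt_put_sketch(&ctxt);    /* what the sync thread should do on error */
                    printf("after put: refcount=%d, handle=%p still valid\n",
                           ctxt.refcount, (void *)ctxt.loc_handle);

                    cleanup_sketch(&ctxt);     /* what it actually did; other threads crash */
                    printf("after cleanup: refcount=%d, handle=%p\n",
                           ctxt.refcount, (void *)ctxt.loc_handle);
                    return 0;
            }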


            People

              tappro Mikhail Pershin
              maloo Maloo