[LU-16052] conf-sanity test_106: crash after osp_sync_process_queues failed: -53
| Created: | 27/Jul/22 | Updated: | 02/Aug/23 | Resolved: | 12/Sep/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite runs:
https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f
https://testing.whamcloud.com/gerrit-janitor/24377/testresults/

test_106 crashed after hitting "osp_sync_process_queues() failed: -53". It has hit this 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01 and 2022-07-18), so it likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19 and 2022-07-27, about 1/23 = 4% failure. Each stack trace is slightly different, so this likely relates to random memory/stack corruption or a use-after-free.

[14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
[14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
[15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
[15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
[15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
[15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
[15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
[15453.773552] Call Trace:
[15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
[15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
[15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
[15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
[15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
[15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass]
[15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp]
[15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp]
[15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod]
[15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
[15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
[15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod]
[15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd]
[15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd]
[15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
[15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
[15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt]
[15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt]
[15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
[15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
[15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
[15453.799726] [<0>] kthread+0xd1/0xe0

[32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
[32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[32916.247614] general protection fault: 0000 [#1] SMP PTI
[32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
[32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
[32916.268987] Call Trace:
[32916.269576] llog_cat_prep_log+0x4f/0x3c0 [obdclass]
[32916.270584] llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
[32916.272788] llog_declare_add+0x187/0x1d0 [obdclass]
[32916.273842] osp_sync_declare_add+0x1c2/0x460 [osp]
[32916.274823] osp_declare_destroy+0x15f/0x230 [osp]
[32916.275897] lod_sub_declare_destroy+0x195/0x310 [lod]
[32916.276932] lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
[32916.277968] lod_declare_destroy+0x4f1/0x500 [lod]
[32916.279953] mdd_declare_finish_unlink+0xa9/0x250 [mdd]
[32916.281003] mdd_unlink+0x46b/0xbe0 [mdd]
[32916.281992] mdt_reint_unlink+0xd43/0x1570 [mdt]
[32916.282939] mdt_reint_rec+0x11f/0x250 [mdt]
[32916.283819] mdt_reint_internal+0x498/0x780 [mdt]
[32916.284776] mdt_reint+0x5e/0x100 [mdt]
[32916.286004] tgt_request_handle+0xc90/0x1970 [ptlrpc]
[32916.288289] ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
[32916.289495] ptlrpc_main+0xba6/0x14a0 [ptlrpc]
[32916.292328] kthread+0x116/0x130

[ 335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[ 335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[ 335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
[ 503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[ 503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[ 503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
[ 660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[ 660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[ 660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
[ 675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[ 680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed:
[ 680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
[ 680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
[ 680.149556] Call Trace:
[ 680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[ 680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
[ 680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass]
[ 680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass]
[ 680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass]
[ 680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp]
[ 680.161819] [<0>] osp_sync_add+0x47/0x50 [osp]
[ 680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp]
[ 680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod]
[ 680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
[ 680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod]
[ 680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod]
[ 680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd]
[ 680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd]
[ 680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
[ 680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt]
[ 680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
[ 680.180533] [<0>] mdt_reint+0x67/0x150 [mdt]
[ 680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
[ 680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
[ 680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
[ 680.187608] [<0>] kthread+0xe4/0xf0

The common factor in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698

Relevant patches landed between 2022-07-15 and 2022-07-19:

dd670d968a LU-15984 o2iblnd: debug message is missing a newline
1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread ****
5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
9153049bdc LU-15925 lnet: add debug messages for IB
1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
40daa59ac4 LU-15880 quota: fix issues in reserving quota
ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails
566edb8c43 LU-8582 target: send error reply on wrong opcode
b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure
66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
f238540c87 LU-15913 mdt: disable parallel rename for striped dirs

The list landed between 2022-07-10..2022-07-14 (outside probability if the test fails every 4 days) doesn't show as much of interest:

210803a247 LU-15653 client: able to cleanup devices manually
a21ce928aa LU-15894 ofd: revert range locking in ofd
b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
98ba508190 LU-15779 ofd: don't hold read lock over bulk
0396310692 LU-15727 lod: honor append_pool with default composite layouts
71d63602c5 LU-15922 sec: new connect flag for name encryption
9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
6ab060e58e LU-14555 lnet: asym route inconsistency warning
aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
e7ce67de92 LU-15451 sec: read-only nodemap flag
8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
530861b344 LU-15926 nrs: fix tbf realtime rules
78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
192902851d LU-11695 som: disabling xattr cache for LSOM on client
1ac4b9598a LU-15720 dne: add crush2 hash type
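The lifetime bug these stacks suggest can be modeled outside the kernel. Below is a minimal userspace sketch (not Lustre code; sync_ctxt, sync_thread, and declare_destroy are hypothetical stand-ins) of a worker thread that exits on error and frees a shared context that unlink-path callers still dereference:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for the llog context that osp_sync_thread()
 * tears down via llog_cleanup() when it exits. */
struct sync_ctxt {
	int alive;
};

static struct sync_ctxt *ctxt;

/* Models the sync thread: on a processing error (-53 = -EBADR here) it
 * no longer asserts, it just logs and exits, freeing the shared context
 * even though the "server" is still running. */
static void *sync_thread(void *arg)
{
	int rc = -53;	/* -EBADR from llog processing */

	(void)arg;
	if (rc < 0)
		fprintf(stderr, "llog process failed: %d\n", rc);
	free(ctxt);	/* analogous to llog_cleanup() on early exit */
	return NULL;
}

/* Models the mdd_unlink()->osp_sync_declare_add() path, which assumes
 * the context stays valid until umount. */
static void declare_destroy(void)
{
	if (ctxt && ctxt->alive)	/* ctxt may be freed under us: UAF */
		printf("added sync record\n");
}

int main(void)
{
	pthread_t t;

	ctxt = calloc(1, sizeof(*ctxt));
	ctxt->alive = 1;
	pthread_create(&t, NULL, sync_thread, NULL);
	declare_destroy();	/* may run before, during, or after the free */
	pthread_join(t, NULL);
	return 0;
}
```

Such a race would explain why each crash signature differs: whatever reuses the freed memory determines where the LBUG or GPF fires.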
| Comments |
| Comment by Gerrit Updater [ 27/Jul/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48054 |
| Comment by Gerrit Updater [ 27/Jul/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48055 |
| Comment by Andreas Dilger [ 28/Jul/22 ] |
|
Hi Mike, |
| Comment by Gerrit Updater [ 28/Jul/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48070 |
| Comment by Mikhail Pershin [ 28/Jul/22 ] |
|
I think the crashes are caused by the recent patch from |
| Comment by Mikhail Pershin [ 28/Jul/22 ] |
|
So far I've pushed a patch to prove that; meanwhile I'll check why llog_cleanup() was used originally, and whether that was intentional. I'd also add -EBADR handling in llog_process_thread() instead of returning it to the caller. The final patch is in progress. |
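A minimal sketch of that idea, assuming -EBADR from the per-record callback means "skip/drop this plain llog" rather than a fatal condition. This is an illustrative model of the control flow (process_llog and process_rec are hypothetical), not the landed change:

```c
#include <errno.h>

/* Hypothetical model of the record loop in llog_process_thread():
 * process_rec() stands in for the per-record callback.  The point is
 * that -EBADR is consumed here instead of being propagated to
 * osp_sync_thread(), which would otherwise exit permanently. */
static int process_llog(int (*process_rec)(void *rec), void **recs, int nr)
{
	int i, rc = 0;

	for (i = 0; i < nr; i++) {
		rc = process_rec(recs[i]);
		if (rc == -EBADR) {
			/* "bad record/llog": skip the rest of this
			 * plain llog and report success to the caller */
			rc = 0;
			break;
		}
		if (rc < 0)
			break;	/* real errors still propagate */
	}
	return rc;
}
```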
| Comment by Mikhail Pershin [ 28/Jul/22 ] |
|
It is not as simple as it seems. Interestingly, osp_sync_thread() can exit just because llog_process_thread() returns an error. Previously it would assert if the exit wasn't due to the umount process, but now it just exits with an error message. The problem is that the thread is never restarted, so the server is left without a sync thread at all and is basically not operational after that. Originally the thread exited only on umount, which is why it calls llog_cleanup() here: it cleans up the context after everything, and that was correct. But now the thread can exit while the server is active, causing the crashes we are seeing, so it is not much better than the assertion. I am not sure how to handle that better. Considering that the server can't operate normally without the sync thread, it seems we shouldn't exit in any case until umount, OR the thread must be restarted. I don't see how a restart could be better, so we shouldn't exit. In that case we would just start reprocessing again (which will also result in the same error). Probably the best we can do is a full llog cleanup. |
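A sketch of the "don't exit until umount" alternative discussed above, assuming a flag the thread can poll. The helpers (process_queues, umount_requested, sync_sleep) are hypothetical stand-ins, not Lustre APIs, and this is illustrative only, not the landed patch:

```c
#include <stdbool.h>

/* Hypothetical stand-ins: process_queues() models one pass of
 * llog_cat_process()/osp_sync_process_queues(); umount_requested()
 * models the existing "is the device stopping" check. */
extern int process_queues(void);
extern bool umount_requested(void);
extern void sync_sleep(void);

/* Model of a sync-thread main loop that never exits on error: on
 * failure it logs nothing here, backs off, and retries, so the server
 * always has a sync thread until umount.  The weakness noted above
 * remains: if the llog itself is broken, every retry fails the same
 * way, so a full llog cleanup/recreate may be the only real recovery. */
static int sync_thread_loop(void)
{
	int rc = 0;

	while (!umount_requested()) {
		rc = process_queues();
		if (rc < 0)
			sync_sleep();	/* back off, retry instead of exiting */
	}
	return rc;	/* only reached on umount, where llog_cleanup() is safe */
}
```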
| Comment by Mikhail Pershin [ 29/Jul/22 ] |
|
So far I've fixed the problem with -EBADR handling that was causing the crashes. An additional patch is needed to handle errors in osp_sync_thread() without exiting the thread. |
| Comment by Peter Jones [ 12/Sep/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 05/Oct/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48772 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48772/ |