Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite runs:
https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12
https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f
https://testing.whamcloud.com/gerrit-janitor/24377/testresults/
test_106 crashed after hitting "osp_sync_process_queues() failed: -53" and has hit this 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about 1/23 = 4% failure.
Each stack trace is slightly different, so likely relates to random memory/stack corruption or use-after-free.
[14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5 [14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1 [15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1 [15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1 [15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53 [15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: [15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG [15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022 [15453.773552] Call Trace: [15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs] [15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs] [15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs] [15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs] [15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass] [15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass] [15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass] [15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass] [15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp] [15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp] [15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod] [15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod] [15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod] [15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod] [15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd] [15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd] [15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt] [15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt] [15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt] [15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt] [15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc] [15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc] [15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc] [15453.799726] [<0>] kthread+0xd1/0xe0
[32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1 [32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53 [32916.247614] general protection fault: 0000 [#1] SMP PTI [32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1 [32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass] [32916.268987] Call Trace: [32916.269576] llog_cat_prep_log+0x4f/0x3c0 [obdclass] [32916.270584] llog_cat_declare_add_rec+0xbe/0x220 [obdclass] [32916.272788] llog_declare_add+0x187/0x1d0 [obdclass] [32916.273842] osp_sync_declare_add+0x1c2/0x460 [osp] [32916.274823] osp_declare_destroy+0x15f/0x230 [osp] [32916.275897] lod_sub_declare_destroy+0x195/0x310 [lod] [32916.276932] lod_obj_for_each_stripe+0x11f/0x2b0 [lod] [32916.277968] lod_declare_destroy+0x4f1/0x500 [lod] [32916.279953] mdd_declare_finish_unlink+0xa9/0x250 [mdd] [32916.281003] mdd_unlink+0x46b/0xbe0 [mdd] [32916.281992] mdt_reint_unlink+0xd43/0x1570 [mdt] [32916.282939] mdt_reint_rec+0x11f/0x250 [mdt] [32916.283819] mdt_reint_internal+0x498/0x780 [mdt] [32916.284776] mdt_reint+0x5e/0x100 [mdt] [32916.286004] tgt_request_handle+0xc90/0x1970 [ptlrpc] [32916.288289] ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc] [32916.289495] ptlrpc_main+0xba6/0x14a0 [ptlrpc] [32916.292328] kthread+0x116/0x130
[ 335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28 [ 335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [ 335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1 [ 503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28 [ 503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [ 503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1 [ 660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28 [ 660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28 [ 660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1 [ 675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53 [ 680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed: [ 680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG [ 680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022 [ 680.149556] Call Trace: [ 680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs] [ 680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass] [ 680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass] [ 680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass] [ 680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass] [ 680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp] [ 680.161819] [<0>] osp_sync_add+0x47/0x50 [osp] [ 680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp] [ 680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod] [ 680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod] [ 680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod] [ 680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod] [ 680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd] [ 680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd] [ 680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt] [ 680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt] [ 680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt] [ 680.180533] [<0>] mdt_reint+0x67/0x150 [mdt] [ 680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc] [ 680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc] [ 680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc] [ 680.187608] [<0>] kthread+0xe4/0xf0
Common factors in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698 "LU-15938 lod: prevent endless retry in recovery thread", which landed on 2022-07-18 and added the -53 = -EBADR error return in llog_osd_next_block().
Relevant patches landed between 2022-07-15 and 2022-07-19:
dd670d968a LU-15984 o2iblnd: debug message is missing a newline 1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread **** 5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper 9153049bdc LU-15925 lnet: add debug messages for IB 1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists 40daa59ac4 LU-15880 quota: fix issues in reserving quota ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails 566edb8c43 LU-8582 target: send error reply on wrong opcode b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions 54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure 66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests f238540c87 LU-15913 mdt: disable parallel rename for striped dirs
between 2022-07-10..2022-07-14 (outside probability if test fails every 4 days) doesn't show as much of interest:
210803a247 LU-15653 client: able to cleanup devices manually a21ce928aa LU-15894 ofd: revert range locking in ofd b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner 98ba508190 LU-15779 ofd: don't hold read lock over bulk 0396310692 LU-15727 lod: honor append_pool with default composite layouts 71d63602c5 LU-15922 sec: new connect flag for name encryption 9bf968db56 LU-15942 utils: ofd_access_log_reader exit status bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error 6ab060e58e LU-14555 lnet: asym route inconsistency warning aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain e7ce67de92 LU-15451 sec: read-only nodemap flag 8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT 530861b344 LU-15926 nrs: fix tbf realtime rules 78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c 192902851d LU-11695 som: disabling xattr cache for LSOM on client 1ac4b9598a LU-15720 dne: add crush2 hash type
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106