Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16052

conf-sanity test_106: crash after osp_sync_process_queues failed: -53

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite runs:
      https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12

      https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f

      https://testing.whamcloud.com/gerrit-janitor/24377/testresults/

      test_106 crashed after hitting "osp_sync_process_queues() failed: -53" and has hit this 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about 1/23 = 4% failure.

      Each stack trace is slightly different, so likely relates to random memory/stack corruption or use-after-free.

      [14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
      [14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
      [15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
      [15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
      [15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
      [15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
      [15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
      [15453.773552] Call Trace:
      [15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
      [15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
      [15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
      [15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
      [15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
      [15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass]
      [15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp]
      [15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp]
      [15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod]
      [15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
      [15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
      [15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod]
      [15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd]
      [15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd]
      [15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
      [15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt]
      [15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt]
      [15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
      [15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
      [15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
      [15453.799726] [<0>] kthread+0xd1/0xe0
      
      [32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
      [32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [32916.247614] general protection fault: 0000 [#1] SMP PTI
      [32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
      [32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
      [32916.268987] Call Trace:
      [32916.269576]  llog_cat_prep_log+0x4f/0x3c0 [obdclass]
      [32916.270584]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
      [32916.272788]  llog_declare_add+0x187/0x1d0 [obdclass]
      [32916.273842]  osp_sync_declare_add+0x1c2/0x460 [osp]
      [32916.274823]  osp_declare_destroy+0x15f/0x230 [osp]
      [32916.275897]  lod_sub_declare_destroy+0x195/0x310 [lod]
      [32916.276932]  lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
      [32916.277968]  lod_declare_destroy+0x4f1/0x500 [lod]
      [32916.279953]  mdd_declare_finish_unlink+0xa9/0x250 [mdd]
      [32916.281003]  mdd_unlink+0x46b/0xbe0 [mdd]
      [32916.281992]  mdt_reint_unlink+0xd43/0x1570 [mdt]
      [32916.282939]  mdt_reint_rec+0x11f/0x250 [mdt]
      [32916.283819]  mdt_reint_internal+0x498/0x780 [mdt]
      [32916.284776]  mdt_reint+0x5e/0x100 [mdt]
      [32916.286004]  tgt_request_handle+0xc90/0x1970 [ptlrpc]
      [32916.288289]  ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
      [32916.289495]  ptlrpc_main+0xba6/0x14a0 [ptlrpc]
      [32916.292328]  kthread+0x116/0x130
      
      [  335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
      [  503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
      [  660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
      [  675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [  680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed: 
      [  680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
      [  680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
      [  680.149556] Call Trace:
      [  680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [  680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [  680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
      [  680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass]
      [  680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass]
      [  680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass]
      [  680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp]
      [  680.161819] [<0>] osp_sync_add+0x47/0x50 [osp]
      [  680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp]
      [  680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod]
      [  680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
      [  680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod]
      [  680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod]
      [  680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd]
      [  680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd]
      [  680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
      [  680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt]
      [  680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [  680.180533] [<0>] mdt_reint+0x67/0x150 [mdt]
      [  680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
      [  680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
      [  680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
      [  680.187608] [<0>] kthread+0xe4/0xf0
      

      Common factors in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698 "LU-15938 lod: prevent endless retry in recovery thread", which landed on 2022-07-18 and added the -53 = -EBADR error return in llog_osd_next_block().

      Relevant patches landed between 2022-07-15 and 2022-07-19:

      dd670d968a LU-15984 o2iblnd: debug message is missing a newline
      1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread  ****
      5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
      9153049bdc LU-15925 lnet: add debug messages for IB
      1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
      40daa59ac4 LU-15880 quota: fix issues in reserving quota
      ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails
      566edb8c43 LU-8582 target: send error reply on wrong opcode
      b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
      54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure
      66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
      f238540c87 LU-15913 mdt: disable parallel rename for striped dirs
      

      between 2022-07-10..2022-07-14 (outside probability if test fails every 4 days) doesn't show as much of interest:

      210803a247 LU-15653 client: able to cleanup devices manually
      a21ce928aa LU-15894 ofd: revert range locking in ofd
      b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
      98ba508190 LU-15779 ofd: don't hold read lock over bulk
      0396310692 LU-15727 lod: honor append_pool with default composite layouts
      71d63602c5 LU-15922 sec: new connect flag for name encryption
      9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
      bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
      6ab060e58e LU-14555 lnet: asym route inconsistency warning
      aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
      e7ce67de92 LU-15451 sec: read-only nodemap flag
      8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
      530861b344 LU-15926 nrs: fix tbf realtime rules
      78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
      192902851d LU-11695 som: disabling xattr cache for LSOM on client
      1ac4b9598a LU-15720 dne: add crush2 hash type
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106

      Attachments

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: