conf-sanity test_106: crash after osp_sync_process_queues failed: -53


    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite runs:
      https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12

      https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f

      https://testing.whamcloud.com/gerrit-janitor/24377/testresults/

      test_106 crashed after hitting "osp_sync_process_queues() failed: -53". This has happened 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so it likely relates to a recently-landed patch. This is a "SLOW" test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about a 1/23 = 4% failure rate.

      Each stack trace is slightly different, so this likely relates to random memory/stack corruption or a use-after-free.

      [14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
      [14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
      [15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
      [15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
      [15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
      [15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
      [15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
      [15453.773552] Call Trace:
      [15453.774160] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [15453.775185] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [15453.776274] [<0>] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
      [15453.777405] [<0>] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
      [15453.778769] [<0>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
      [15453.779994] [<0>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
      [15453.781161] [<0>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
      [15453.782340] [<0>] llog_declare_add+0x14f/0x1c0 [obdclass]
      [15453.783452] [<0>] osp_sync_declare_add+0x11a/0x490 [osp]
      [15453.784496] [<0>] osp_declare_destroy+0xfa/0x250 [osp]
      [15453.785607] [<0>] lod_sub_declare_destroy+0x106/0x320 [lod]
      [15453.786698] [<0>] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
      [15453.787806] [<0>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
      [15453.788898] [<0>] lod_declare_destroy+0x64a/0x700 [lod]
      [15453.789978] [<0>] mdd_declare_finish_unlink+0x83/0x260 [mdd]
      [15453.791100] [<0>] mdd_unlink+0x556/0xcb0 [mdd]
      [15453.792109] [<0>] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
      [15453.793113] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [15453.794033] [<0>] mdt_reint_internal+0x730/0xb00 [mdt]
      [15453.795047] [<0>] mdt_reint+0x67/0x140 [mdt]
      [15453.796342] [<0>] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
      [15453.797463] [<0>] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
      [15453.798712] [<0>] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
      [15453.799726] [<0>] kthread+0xd1/0xe0
      
      [32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can't lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
      [32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [32916.247614] general protection fault: 0000 [#1] SMP PTI
      [32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
      [32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
      [32916.268987] Call Trace:
      [32916.269576]  llog_cat_prep_log+0x4f/0x3c0 [obdclass]
      [32916.270584]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
      [32916.272788]  llog_declare_add+0x187/0x1d0 [obdclass]
      [32916.273842]  osp_sync_declare_add+0x1c2/0x460 [osp]
      [32916.274823]  osp_declare_destroy+0x15f/0x230 [osp]
      [32916.275897]  lod_sub_declare_destroy+0x195/0x310 [lod]
      [32916.276932]  lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
      [32916.277968]  lod_declare_destroy+0x4f1/0x500 [lod]
      [32916.279953]  mdd_declare_finish_unlink+0xa9/0x250 [mdd]
      [32916.281003]  mdd_unlink+0x46b/0xbe0 [mdd]
      [32916.281992]  mdt_reint_unlink+0xd43/0x1570 [mdt]
      [32916.282939]  mdt_reint_rec+0x11f/0x250 [mdt]
      [32916.283819]  mdt_reint_internal+0x498/0x780 [mdt]
      [32916.284776]  mdt_reint+0x5e/0x100 [mdt]
      [32916.286004]  tgt_request_handle+0xc90/0x1970 [ptlrpc]
      [32916.288289]  ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
      [32916.289495]  ptlrpc_main+0xba6/0x14a0 [ptlrpc]
      [32916.292328]  kthread+0x116/0x130
      
      [  335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
      [  503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
      [  660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
      [  660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
      [  660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can't lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
      [  675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
      [  680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh->llh_size == reclen ) failed: 
      [  680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
      [  680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
      [  680.149556] Call Trace:
      [  680.150356] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [  680.152054] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [  680.153679] [<0>] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
      [  680.155391] [<0>] llog_write_rec+0x290/0x590 [obdclass]
      [  680.156740] [<0>] llog_cat_add_rec+0x1e1/0x990 [obdclass]
      [  680.158225] [<0>] llog_add+0x17f/0x1f0 [obdclass]
      [  680.159875] [<0>] osp_sync_add_rec+0x177/0x780 [osp]
      [  680.161819] [<0>] osp_sync_add+0x47/0x50 [osp]
      [  680.163722] [<0>] osp_destroy+0x115/0x2e0 [osp]
      [  680.165193] [<0>] lod_sub_destroy+0x1bb/0x4e0 [lod]
      [  680.166828] [<0>] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
      [  680.168910] [<0>] lod_obj_for_each_stripe+0x11d/0x300 [lod]
      [  680.170641] [<0>] lod_destroy+0x7c9/0x9d0 [lod]
      [  680.172200] [<0>] mdd_finish_unlink+0x283/0x3c0 [mdd]
      [  680.174117] [<0>] mdd_unlink+0xb5c/0xdb0 [mdd]
      [  680.175944] [<0>] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
      [  680.177465] [<0>] mdt_reint_rec+0x87/0x240 [mdt]
      [  680.179030] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [  680.180533] [<0>] mdt_reint+0x67/0x150 [mdt]
      [  680.182054] [<0>] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
      [  680.183685] [<0>] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
      [  680.185863] [<0>] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
      [  680.187608] [<0>] kthread+0xe4/0xf0
      

      The common factor in the stacks is that they are all in mdd_unlink()->lod_sub_{declare,}_destroy()->llog_*. A likely culprit is patch https://review.whamcloud.com/47698 "LU-15938 lod: prevent endless retry in recovery thread", which landed on 2022-07-18 and added the -53 = -EBADR error return in llog_osd_next_block().
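
      The shape of that change can be modeled with a minimal standalone sketch (stub types and a made-up progress check, not the actual llog_osd_next_block() code): once the scan can no longer advance to a valid next block, the function fails with -EBADR instead of retrying, and that -53 then surfaces via osp_sync_process_queues()/osp_sync_thread() as seen in the logs above.

      /* Illustrative sketch only -- stub types, not the real llog_osd_next_block(). */
      #include <errno.h>
      #include <stdio.h>

      #ifndef EBADR
      #define EBADR 53                /* -EBADR == -53, the error in the logs above */
      #endif

      struct block_iter_sketch {
              long long cur_offset;   /* offset of the block just examined */
              long long next_offset;  /* offset the scan managed to advance to */
      };

      /*
       * Modeled loosely on the LU-15938 change: if the scan cannot advance past
       * the current block, fail with -EBADR instead of retrying forever, so the
       * caller's llog processing loop gives up rather than spinning.
       */
      static int next_block_sketch(struct block_iter_sketch *it)
      {
              if (it->next_offset <= it->cur_offset)
                      return -EBADR;  /* no forward progress: bail out */
              it->cur_offset = it->next_offset;
              return 0;
      }

      int main(void)
      {
              struct block_iter_sketch it = { .cur_offset = 8192, .next_offset = 8192 };

              /* mirrors "llog process with osp_sync_process_queues failed: -53" */
              printf("llog processing stopped: %d\n", next_block_sketch(&it));
              return 0;
      }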

      Relevant patches landed between 2022-07-15 and 2022-07-19:

      dd670d968a LU-15984 o2iblnd: debug message is missing a newline
      1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread  ****
      5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
      9153049bdc LU-15925 lnet: add debug messages for IB
      1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
      40daa59ac4 LU-15880 quota: fix issues in reserving quota
      ee4b50278e LU-15993 ofd: don't leak pages if nodemap fails
      566edb8c43 LU-8582 target: send error reply on wrong opcode
      b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
      54a2d4662b LU-15868 lfsck: don't crash upon dir migration failure
      66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
      f238540c87 LU-15913 mdt: disable parallel rename for striped dirs
      

      Patches landed between 2022-07-10..2022-07-14 (less likely candidates, given the test fails roughly every 4 days) don't show as much of interest:

      210803a247 LU-15653 client: able to cleanup devices manually
      a21ce928aa LU-15894 ofd: revert range locking in ofd
      b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
      98ba508190 LU-15779 ofd: don't hold read lock over bulk
      0396310692 LU-15727 lod: honor append_pool with default composite layouts
      71d63602c5 LU-15922 sec: new connect flag for name encryption
      9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
      bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
      6ab060e58e LU-14555 lnet: asym route inconsistency warning
      aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
      e7ce67de92 LU-15451 sec: read-only nodemap flag
      8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
      530861b344 LU-15926 nrs: fix tbf realtime rules
      78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
      192902851d LU-11695 som: disabling xattr cache for LSOM on client
      1ac4b9598a LU-15720 dne: add crush2 hash type
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106

          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48772/
            Subject: LU-16052 llog: handle -EBADR for catalog processing
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: b996d1e0276fdf6c084410cd1dcfac0df13437fe


            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48772
            Subject: LU-16052 llog: handle -EBADR for catalog processing
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: ed025edc7045a1edd4ed0a528ada48306b4bf150

            pjones Peter Jones added a comment -

            Landed for 2.16


            tappro Mikhail Pershin added a comment -

            So far I have fixed the problem with -EBADR handling causing crashes. An additional patch is needed to handle errors in osp_sync_thread() without exiting the thread.
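
            As a rough illustration of that direction (a standalone sketch with stub functions, not the real osp_sync_thread()), the sync loop would report the error and keep going until umount rather than returning:

            /* Illustrative sketch only -- stub logic, not the real osp_sync_thread(). */
            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            #ifndef EBADR
            #define EBADR 53
            #endif

            /* stand-in for the umount flag checked by the real thread */
            static int umount_requested(int iteration)
            {
                    return iteration >= 3;  /* stop the demo after a few passes */
            }

            /* stand-in for osp_sync_process_queues()/llog processing */
            static int process_queues_sketch(void)
            {
                    return -EBADR;          /* simulate the failure from the logs */
            }

            int main(void)
            {
                    int i;

                    for (i = 0; !umount_requested(i); i++) {
                            int rc = process_queues_sketch();

                            if (rc == 0)
                                    continue;
                            /* Do not leave the thread on error: the MDT cannot operate
                             * without its sync thread, so report the failure, wait,
                             * and try again. */
                            fprintf(stderr, "llog process failed: %d, retrying\n", rc);
                            sleep(1);
                    }
                    /* context/handle teardown would happen here, on the umount path only */
                    return 0;
            }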

            tappro Mikhail Pershin added a comment - - edited

            It is not as simple as it seems. Interestingly, osp_sync_thread() can exit whenever llog_process_thread() returns an error. Previously it would assert if that wasn't due to the umount process, but now it just exits with an error message. The problem is that the thread is never restarted, so the server is left without a sync thread at all and is basically not operational after that. Originally the thread exited only on umount, which is why it calls llog_cleanup() here; it cleans up the context after everything else, and that was correct. But now the thread can exit while the server is active, causing the crashes we are seeing. So this is not much better than the assertion. I am not sure how to handle it better.

            Considering that the server can't operate normally without the sync thread, it seems we shouldn't exit in any case until umount, or else the thread must be restarted. I don't see how a restart would be better, so we shouldn't exit. In that case we would just start reprocessing again (which will hit the same error). Probably the best we can do is a full llog cleanup.

            tappro Mikhail Pershin added a comment - - edited

            So far I've pushed a patch to prove that; meanwhile I will check why llog_cleanup() was used there originally, whether that was intentional or not. I would also add -EBADR handling in llog_process_thread() instead of returning it to the caller. The final patch is in progress.
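
            A minimal sketch of what handling -EBADR inside the processing loop (rather than returning it to the caller) could look like, assuming it is treated as "no further valid blocks" -- stub functions only, not the actual llog_process_thread():

            /* Illustrative sketch only -- stubs, not the real llog_process_thread(). */
            #include <errno.h>
            #include <stdio.h>

            #ifndef EBADR
            #define EBADR 53
            #endif

            /* stand-in for llog_next_block(): the second chunk is no longer reachable */
            static int next_block_sketch(int chunk)
            {
                    return chunk < 1 ? 0 : -EBADR;
            }

            static int process_loop_sketch(void)
            {
                    int chunk;

                    for (chunk = 0; ; chunk++) {
                            int rc = next_block_sketch(chunk);

                            if (rc == -EBADR) {
                                    /* Assumed handling: treat "no further valid block"
                                     * as the end of this llog instead of propagating a
                                     * fatal error to the caller. */
                                    return 0;
                            }
                            if (rc < 0)
                                    return rc;
                            printf("processed chunk %d\n", chunk);
                    }
            }

            int main(void)
            {
                    printf("process loop rc = %d\n", process_loop_sketch());
                    return 0;
            }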


            tappro Mikhail Pershin added a comment -

            I think the crashes are merely triggered by the recent patch from LU-15938, but the problem existed before; the patch just creates the conditions for it to show up. The problem is that osp_sync_thread() uses llog_cleanup() instead of llog_ctxt_put(). That performs a complete llog context cleanup, which is normally done on device cleanup but not on an active device. As a result the llog_handle pointer in the llog_ctxt becomes invalid, and concurrent processes may see various corruptions like those in the traces above.
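
            The difference can be sketched with a toy refcount model (stub types, not the real obdclass code): dropping a reference leaves the context and its cached llog handle usable by other threads, while a full cleanup tears them down, so any concurrent llog user ends up working on freed or invalid state.

            /* Toy refcount model -- stub types, not the real obdclass llog code. */
            #include <stdio.h>
            #include <stdlib.h>

            struct handle_sketch { int in_use; };

            struct ctxt_sketch {
                    int refcount;
                    struct handle_sketch *loc_handle;   /* cached open llog handle */
            };

            /* like llog_ctxt_put(): drop one reference, context stays usable */
            static void ctxt_put_sketch(struct ctxt_sketch *ctxt)
            {
                    ctxt->refcount--;
            }

            /* like llog_cleanup(): full teardown, intended for device cleanup only */
            static void cleanup_sketch(struct ctxt_sketch *ctxt)
            {
                    free(ctxt->loc_handle);
                    ctxt->loc_handle = NULL;   /* concurrent users now hold stale state */
                    ctxt->refcount = 0;
            }

            int main(void)
            {
                    struct ctxt_sketch ctxt = { .refcount = 2 };

                    ctxt.loc_handle = calloc(1, sizeof(*ctxt.loc_handle));

                    ctxt_put_sketch(&ctxt);    /* what the sync thread should do on error */
                    printf("after put: refcount=%d, handle=%p still valid\n",
                           ctxt.refcount, (void *)ctxt.loc_handle);

                    cleanup_sketch(&ctxt);     /* what it actually did; other threads crash */
                    printf("after cleanup: refcount=%d, handle=%p\n",
                           ctxt.refcount, (void *)ctxt.loc_handle);
                    return 0;
            }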


            People

              tappro Mikhail Pershin
              maloo Maloo