<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:23:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16052] conf-sanity test_106: crash after osp_sync_process_queues failed: -53</title>
                <link>https://jira.whamcloud.com/browse/LU-16052</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite runs:&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/7a5e3020-ab45-4e10-b38d-a4b75cf4bc12&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/47bbfb4a-1425-4d4c-b039-a9ebf527424f&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://testing.whamcloud.com/gerrit-janitor/24377/testresults/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/gerrit-janitor/24377/testresults/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;test_106 crashed after hitting &quot;&lt;tt&gt;osp_sync_process_queues() failed: -53&lt;/tt&gt;&quot;, and has hit this 3 times between 2022-07-19 and 2022-07-27 (and never between 2022-01-01..2022-07-18), so it likely relates to a recently-landed patch. This is a &quot;SLOW&quot; test, so there were only 69 test_106 runs between 2022-07-19..2022-07-27, about a 1/23 = 4% failure rate.&lt;/p&gt;

&lt;p&gt;Each stack trace is slightly different, so the crash likely involves random memory/stack corruption or a use-after-free.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[14872.316473] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1312 fail_val=5
[14994.249862] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[14994.369136] LustreError: 19690:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000401:0xc266:0x0]: have 0 want 1
[15209.861438] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[15210.268814] LustreError: 23314:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000401:0x184cb:0x0]: have 0 want 1
[15426.956281] LustreError: 22079:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[15427.417079] LustreError: 19691:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000402:0x4730:0x0]: have 0 want 1
[15448.744066] LustreError: 22077:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[15453.764018] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) ASSERTION( atomic_read(&amp;amp;o-&amp;gt;lo_header-&amp;gt;loh_ref) &amp;gt; 0 ) failed: 
[15453.770242] LustreError: 19690:0:(lu_object.h:708:lu_object_get()) LBUG
[15453.771540] Pid: 19690, comm: mdt00_001 3.10.0-1160.71.1.el7_lustre.ddn17.x86_64 #1 SMP Mon Jul 18 20:59:11 UTC 2022
[15453.773552] Call Trace:
[15453.774160] [&amp;lt;0&amp;gt;] libcfs_call_trace+0x90/0xf0 [libcfs]
[15453.775185] [&amp;lt;0&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[15453.776274] [&amp;lt;0&amp;gt;] osd_trunc_lock+0x242/0x250 [osd_ldiskfs]
[15453.777405] [&amp;lt;0&amp;gt;] osd_declare_write+0x360/0x4a0 [osd_ldiskfs]
[15453.778769] [&amp;lt;0&amp;gt;] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
[15453.779994] [&amp;lt;0&amp;gt;] llog_declare_write_rec+0x1e6/0x240 [obdclass]
[15453.781161] [&amp;lt;0&amp;gt;] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
[15453.782340] [&amp;lt;0&amp;gt;] llog_declare_add+0x14f/0x1c0 [obdclass]
[15453.783452] [&amp;lt;0&amp;gt;] osp_sync_declare_add+0x11a/0x490 [osp]
[15453.784496] [&amp;lt;0&amp;gt;] osp_declare_destroy+0xfa/0x250 [osp]
[15453.785607] [&amp;lt;0&amp;gt;] lod_sub_declare_destroy+0x106/0x320 [lod]
[15453.786698] [&amp;lt;0&amp;gt;] lod_obj_stripe_destroy_cb+0xfb/0x110 [lod]
[15453.787806] [&amp;lt;0&amp;gt;] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
[15453.788898] [&amp;lt;0&amp;gt;] lod_declare_destroy+0x64a/0x700 [lod]
[15453.789978] [&amp;lt;0&amp;gt;] mdd_declare_finish_unlink+0x83/0x260 [mdd]
[15453.791100] [&amp;lt;0&amp;gt;] mdd_unlink+0x556/0xcb0 [mdd]
[15453.792109] [&amp;lt;0&amp;gt;] mdt_reint_unlink+0xdb9/0x1fe0 [mdt]
[15453.793113] [&amp;lt;0&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[15453.794033] [&amp;lt;0&amp;gt;] mdt_reint_internal+0x730/0xb00 [mdt]
[15453.795047] [&amp;lt;0&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[15453.796342] [&amp;lt;0&amp;gt;] tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
[15453.797463] [&amp;lt;0&amp;gt;] ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
[15453.798712] [&amp;lt;0&amp;gt;] ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
[15453.799726] [&amp;lt;0&amp;gt;] kthread+0xd1/0xe0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[32881.988866] LustreError: 2111699:0:(osp_precreate.c:677:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[32882.011552] LustreError: 2105039:0:(lod_qos.c:1362:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000405:0x44de:0x0]: have 0 want 1
[32901.212021] LustreError: 2111698:0:(osp_sync.c:1268:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[32916.247614] general protection fault: 0000 [#1] SMP PTI
[32916.248831] CPU: 0 PID: 2105038 Comm: mdt00_000 4.18.0-348.23.1.el8_lustre.ddn17.x86_64 #1
[32916.251311] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32916.252690] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
[32916.268987] Call Trace:
[32916.269576]  llog_cat_prep_log+0x4f/0x3c0 [obdclass]
[32916.270584]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
[32916.272788]  llog_declare_add+0x187/0x1d0 [obdclass]
[32916.273842]  osp_sync_declare_add+0x1c2/0x460 [osp]
[32916.274823]  osp_declare_destroy+0x15f/0x230 [osp]
[32916.275897]  lod_sub_declare_destroy+0x195/0x310 [lod]
[32916.276932]  lod_obj_for_each_stripe+0x11f/0x2b0 [lod]
[32916.277968]  lod_declare_destroy+0x4f1/0x500 [lod]
[32916.279953]  mdd_declare_finish_unlink+0xa9/0x250 [mdd]
[32916.281003]  mdd_unlink+0x46b/0xbe0 [mdd]
[32916.281992]  mdt_reint_unlink+0xd43/0x1570 [mdt]
[32916.282939]  mdt_reint_rec+0x11f/0x250 [mdt]
[32916.283819]  mdt_reint_internal+0x498/0x780 [mdt]
[32916.284776]  mdt_reint+0x5e/0x100 [mdt]
[32916.286004]  tgt_request_handle+0xc90/0x1970 [ptlrpc]
[32916.288289]  ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
[32916.289495]  ptlrpc_main+0xba6/0x14a0 [ptlrpc]
[32916.292328]  kthread+0x116/0x130
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  335.344635] LustreError: 25107:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  335.349455] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[  335.391269] LustreError: 22802:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000402:0xc225:0x0]: have 0 want 1
[  503.875146] LustreError: 25865:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  503.880055] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[  503.914196] LustreError: 25561:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000402:0x18448:0x0]: have 0 want 1
[  660.811921] LustreError: 25853:0:(ofd_dev.c:1739:ofd_create_hdl()) lustre-OST0000: unable to precreate: rc = -28
[  660.816609] LustreError: 25106:0:(osp_precreate.c:679:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can&apos;t precreate: rc = -28
[  660.841116] LustreError: 25867:0:(lod_qos.c:1356:lod_ost_alloc_specific()) can&apos;t lstripe objid [0x200000403:0x466b:0x0]: have 0 want 1
[  675.126857] LustreError: 25105:0:(osp_sync.c:1294:osp_sync_thread()) lustre-OST0000-osc-MDT0000: llog process with osp_sync_process_queues failed: -53
[  680.141546] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) ASSERTION( llh-&amp;gt;llh_size == reclen ) failed: 
[  680.145147] LustreError: 22802:0:(llog_osd.c:417:llog_osd_write_rec()) LBUG
[  680.147035] Pid: 22802, comm: mdt00_002 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
[  680.149556] Call Trace:
[  680.150356] [&amp;lt;0&amp;gt;] libcfs_call_trace+0x90/0xf0 [libcfs]
[  680.152054] [&amp;lt;0&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[  680.153679] [&amp;lt;0&amp;gt;] llog_osd_write_rec+0x190f/0x1b60 [obdclass]
[  680.155391] [&amp;lt;0&amp;gt;] llog_write_rec+0x290/0x590 [obdclass]
[  680.156740] [&amp;lt;0&amp;gt;] llog_cat_add_rec+0x1e1/0x990 [obdclass]
[  680.158225] [&amp;lt;0&amp;gt;] llog_add+0x17f/0x1f0 [obdclass]
[  680.159875] [&amp;lt;0&amp;gt;] osp_sync_add_rec+0x177/0x780 [osp]
[  680.161819] [&amp;lt;0&amp;gt;] osp_sync_add+0x47/0x50 [osp]
[  680.163722] [&amp;lt;0&amp;gt;] osp_destroy+0x115/0x2e0 [osp]
[  680.165193] [&amp;lt;0&amp;gt;] lod_sub_destroy+0x1bb/0x4e0 [lod]
[  680.166828] [&amp;lt;0&amp;gt;] lod_obj_stripe_destroy_cb+0x3e/0x110 [lod]
[  680.168910] [&amp;lt;0&amp;gt;] lod_obj_for_each_stripe+0x11d/0x300 [lod]
[  680.170641] [&amp;lt;0&amp;gt;] lod_destroy+0x7c9/0x9d0 [lod]
[  680.172200] [&amp;lt;0&amp;gt;] mdd_finish_unlink+0x283/0x3c0 [mdd]
[  680.174117] [&amp;lt;0&amp;gt;] mdd_unlink+0xb5c/0xdb0 [mdd]
[  680.175944] [&amp;lt;0&amp;gt;] mdt_reint_unlink+0xe32/0x1dc0 [mdt]
[  680.177465] [&amp;lt;0&amp;gt;] mdt_reint_rec+0x87/0x240 [mdt]
[  680.179030] [&amp;lt;0&amp;gt;] mdt_reint_internal+0x76c/0xb50 [mdt]
[  680.180533] [&amp;lt;0&amp;gt;] mdt_reint+0x67/0x150 [mdt]
[  680.182054] [&amp;lt;0&amp;gt;] tgt_request_handle+0x93a/0x19c0 [ptlrpc]
[  680.183685] [&amp;lt;0&amp;gt;] ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
[  680.185863] [&amp;lt;0&amp;gt;] ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
[  680.187608] [&amp;lt;0&amp;gt;] kthread+0xe4/0xf0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The common factor in the stacks is that they are all in &lt;tt&gt;mdd&amp;#95;unlink()&amp;#45;&amp;gt;lod&amp;#95;sub&amp;#95;{declare,}&amp;#95;destroy()&amp;#45;&amp;gt;llog&amp;#95;&amp;#42;&lt;/tt&gt;. A likely culprit is patch &lt;a href=&quot;https://review.whamcloud.com/47698&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47698&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15938&quot; title=&quot;MDT recovery did not finish due to corrupt llog record&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15938&quot;&gt;&lt;del&gt;LU-15938&lt;/del&gt;&lt;/a&gt; lod: prevent endless retry in recovery thread&lt;/tt&gt;&quot;, which landed on 2022-07-18 and added the &lt;tt&gt;&amp;#45;53 = &amp;#45;EBADR&lt;/tt&gt; error return in &lt;tt&gt;llog&amp;#95;osd&amp;#95;next&amp;#95;block()&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;Relevant patches landed between 2022-07-15 and 2022-07-19:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;dd670d968a LU-15984 o2iblnd: debug message is missing a newline
1a24dcdce1 LU-15938 lod: prevent endless retry in recovery thread  ****
5038bf8db8 LU-10994 clio: Remove cl_2queue_add wrapper
9153049bdc LU-15925 lnet: add debug messages for IB
1ebc9ed460 LU-15902 obdclass: dt_try_as_dir() check dir exists
40daa59ac4 LU-15880 quota: fix issues in reserving quota
ee4b50278e LU-15993 ofd: don&apos;t leak pages if nodemap fails
566edb8c43 LU-8582 target: send error reply on wrong opcode
b52b52c2d1 LU-15886 lfsck: remove unreasonable assertions
54a2d4662b LU-15868 lfsck: don&apos;t crash upon dir migration failure
66b3e74bcc LU-15132 hsm: Protect against parallel HSM restore requests
f238540c87 LU-15913 mdt: disable parallel rename for striped dirs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The window 2022-07-10..2022-07-14 (outside the probable range if the test fails every ~4 days) doesn&apos;t show as much of interest:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;210803a247 LU-15653 client: able to cleanup devices manually
a21ce928aa LU-15894 ofd: revert range locking in ofd
b2dfb4457f LU-15759 libcfs: debugfs file_operation should have an owner
98ba508190 LU-15779 ofd: don&apos;t hold read lock over bulk
0396310692 LU-15727 lod: honor append_pool with default composite layouts
71d63602c5 LU-15922 sec: new connect flag for name encryption
9bf968db56 LU-15942 utils: ofd_access_log_reader exit status
bc69a8d058 LU-8621 utils: cmd help to stdout or short cmd error
6ab060e58e LU-14555 lnet: asym route inconsistency warning
aa22a6826e LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
e7ce67de92 LU-15451 sec: read-only nodemap flag
8db455c772 LU-15399 llite: dont restart directIO with IOCB_NOWAIT
530861b344 LU-15926 nrs: fix tbf realtime rules
78b04d8ee7 LU-6142 obdclass: checkpatch cleanup of obd_mount_server.c
192902851d LU-11695 som: disabling xattr cache for LSOM on client
1ac4b9598a LU-15720 dne: add crush2 hash type
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV&lt;br/&gt;
conf-sanity test_106 - onyx-120vm4 crashed during conf-sanity test_106&lt;/p&gt;</description>
                <environment></environment>
        <key id="71519">LU-16052</key>
            <summary>conf-sanity test_106: crash after osp_sync_process_queues failed: -53</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Wed, 27 Jul 2022 23:32:19 +0000</created>
                <updated>Wed, 2 Aug 2023 13:03:13 +0000</updated>
                            <resolved>Mon, 12 Sep 2022 15:13:06 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                    <fixVersion>Lustre 2.15.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="341786" author="gerrit" created="Wed, 27 Jul 2022 23:44:33 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48054&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48054&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16052&quot; title=&quot;conf-sanity test_106: crash after osp_sync_process_queues failed: -53&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16052&quot;&gt;&lt;del&gt;LU-16052&lt;/del&gt;&lt;/a&gt; tests: run conf-sanity test_106 for debug2&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3a1b65c93eba00247f54d211618d4bc48252df02&lt;/p&gt;</comment>
                            <comment id="341787" author="gerrit" created="Wed, 27 Jul 2022 23:46:50 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48055&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48055&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16052&quot; title=&quot;conf-sanity test_106: crash after osp_sync_process_queues failed: -53&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16052&quot;&gt;&lt;del&gt;LU-16052&lt;/del&gt;&lt;/a&gt; tests: run conf-sanity test_106 for debug1&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 617c188cb3fd33be3893093ef0d1826293d0907f&lt;/p&gt;</comment>
                            <comment id="341830" author="adilger" created="Thu, 28 Jul 2022 06:51:13 +0000"  >&lt;p&gt;Hi Mike,&lt;br/&gt;
could you please take a look at this crash.  The test patch passed 50/50 runs right before your &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15938&quot; title=&quot;MDT recovery did not finish due to corrupt llog record&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15938&quot;&gt;&lt;del&gt;LU-15938&lt;/del&gt;&lt;/a&gt; patch, but crashed 2/41 runs on the tip of master.  I&apos;ve rebased the &quot;after&quot; patch to be right on top of the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15938&quot; title=&quot;MDT recovery did not finish due to corrupt llog record&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15938&quot;&gt;&lt;del&gt;LU-15938&lt;/del&gt;&lt;/a&gt; patch to confirm it isn&apos;t some bad interaction with another patch after that, though there are only 8 other patches afterward, and most are unrelated.&lt;/p&gt;</comment>
                            <comment id="341844" author="gerrit" created="Thu, 28 Jul 2022 11:50:43 +0000"  >&lt;p&gt;&quot;Mike Pershin &amp;lt;mpershin@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48070&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48070&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16052&quot; title=&quot;conf-sanity test_106: crash after osp_sync_process_queues failed: -53&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16052&quot;&gt;&lt;del&gt;LU-16052&lt;/del&gt;&lt;/a&gt; osp: don&apos;t cleanup llog context in use&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: fd497004b9815a5ae2dd41460ffd84f4dc3e6555&lt;/p&gt;</comment>
                            <comment id="341846" author="tappro" created="Thu, 28 Jul 2022 11:57:03 +0000"  >&lt;p&gt;I think the crashes are just triggered by the recent patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15938&quot; title=&quot;MDT recovery did not finish due to corrupt llog record&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15938&quot;&gt;&lt;del&gt;LU-15938&lt;/del&gt;&lt;/a&gt;, but the problem existed before; the patch only creates the conditions for it to surface. The problem is that &lt;tt&gt;osp_sync_thread()&lt;/tt&gt; uses &lt;tt&gt;llog_cleanup()&lt;/tt&gt; instead of &lt;tt&gt;llog_ctxt_put()&lt;/tt&gt;. That performs a complete cleanup of the llog context, which should normally happen on device cleanup, not on an active device. As a result, the llog_handle pointer in the llog_ctxt becomes invalid, and concurrent processes might see various corruptions like those in the traces above.&lt;/p&gt;</comment>
                            <comment id="341847" author="tappro" created="Thu, 28 Jul 2022 11:59:41 +0000"  >&lt;p&gt;So far I&apos;ve pushed a patch to prove that; meanwhile I&apos;ll check why &lt;tt&gt;llog_cleanup()&lt;/tt&gt; was used originally, whether that was intentional or not. I&apos;ll also add &lt;tt&gt;-EBADR&lt;/tt&gt; handling in &lt;tt&gt;llog_process_thread()&lt;/tt&gt; instead of returning it to the caller. The final patch is in progress.&lt;/p&gt;</comment>
                            <comment id="341852" author="tappro" created="Thu, 28 Jul 2022 12:43:44 +0000"  >&lt;p&gt;It is not as simple as it seems. Interestingly, &lt;tt&gt;osp_sync_thread()&lt;/tt&gt; can exit whenever &lt;tt&gt;llog_process_thread()&lt;/tt&gt; returns an error. Previously it would assert if that wasn&apos;t due to the umount process, but now it just exits with an error message. The problem is that the thread is never restarted, so the server is left without a sync thread at all and is basically not operational after that. Originally the thread exited only on umount, which is why it calls llog_cleanup() there: it cleans up the context at the end, and that was correct. But now the thread can exit while the server is active, causing the crashes we are seeing, so it is not much better than the assertion. I am not sure how to handle that better.&lt;/p&gt;

&lt;p&gt;Considering that the server can&apos;t operate normally without the sync thread, it seems we shouldn&apos;t exit in any case until umount, or else the thread must be re-started. I don&apos;t see how a restart would be better, so we shouldn&apos;t exit. In that case we would just start reprocessing again (which will also result in the same error). Probably the best we can do is a full llog cleanup.&lt;/p&gt;</comment>
                            <comment id="341996" author="tappro" created="Fri, 29 Jul 2022 08:33:51 +0000"  >&lt;p&gt;So far I have fixed the problem with the -EBADR handling that was causing the crashes. An additional patch is needed to handle the error in &lt;tt&gt;osp_sync_thread()&lt;/tt&gt; without exiting the thread.&lt;/p&gt;</comment>
                            <comment id="346371" author="pjones" created="Mon, 12 Sep 2022 15:13:06 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                            <comment id="348745" author="gerrit" created="Wed, 5 Oct 2022 07:02:57 +0000"  >&lt;p&gt;&quot;Jian Yu &amp;lt;yujian@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/48772&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/48772&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16052&quot; title=&quot;conf-sanity test_106: crash after osp_sync_process_queues failed: -53&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16052&quot;&gt;&lt;del&gt;LU-16052&lt;/del&gt;&lt;/a&gt; llog: handle -EBADR for catalog processing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ed025edc7045a1edd4ed0a528ada48306b4bf150&lt;/p&gt;</comment>
                            <comment id="381022" author="gerrit" created="Wed, 2 Aug 2023 06:18:06 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/48772/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/48772/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16052&quot; title=&quot;conf-sanity test_106: crash after osp_sync_process_queues failed: -53&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16052&quot;&gt;&lt;del&gt;LU-16052&lt;/del&gt;&lt;/a&gt; llog: handle -EBADR for catalog processing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: b996d1e0276fdf6c084410cd1dcfac0df13437fe&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10322">
                    <name>Gantt End to Start</name>
                                            <outwardlinks description="has to be done before">
                                                        </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="70717">LU-15938</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02vlz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>