<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:48:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
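As a sketch (assuming JIRA's usual issue-xml view endpoint; the exact base path may differ per deployment), the restricted URL can be built like this:

```shell
# Build a field-restricted export URL for this issue. The endpoint path below is
# an assumption based on JIRA's standard issue-xml view; verify it against your
# deployment before use.
base='https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-12001/LU-12001.xml'
url="${base}?field=key&field=summary"
echo "$url"
```

Fetching that URL returns the same document as this one, limited to the key and summary fields.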
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12001] sanity-gss issues</title>
                <link>https://jira.whamcloud.com/browse/LU-12001</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;It looks like in my testing sanity-gss test 150 has a 100% crash failure rate like this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 3757.187493] Lustre: DEBUG MARKER: == sanity-gss test 150: secure mgs connection: client flavor setting ================================= 07:38:39 (1551011919)
[ 3759.003595] LustreError: 5651:0:(file.c:4273:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -1
[ 3759.126390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) page@ffff8800b2af7e00[2 ffff8800c2ee3558 2 1           (null)]

[ 3759.131722] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) vvp-page@ffff8800b2af7e50(0:0) vm@ffffea0002721f88 1fffff0000282c 2:0 ffff8800b2af7e00 6144 lru

[ 3759.136409] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) lov-page@ffff8800b2af7e90, comp index: 0, gen: 0

[ 3759.139344] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) osc-page@ffff8800b2af7ec8 6144: 1&amp;lt; 0x845fed 258 0 + + &amp;gt; 2&amp;lt; 25165824 0 4096 0x5 0x520 |           (null) ffff8800c4288798 ffff8800cff6de60 &amp;gt; 3&amp;lt; 1 25 0 &amp;gt; 4&amp;lt; 0 0 8 227942400 - | - - + - &amp;gt; 5&amp;lt; - - + - | 0 - | 0 - -&amp;gt;

[ 3759.145390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) end page@ffff8800b2af7e00

[ 3759.147726] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) uncovered page!
[ 3759.149580] LustreError: 555:0:(ldlm_resource.c:1725:ldlm_resource_dump()) --- Resource: [0xa:0x0:0x0].0x0 (ffff8800c33fb9c0) refcount = 2
[ 3759.152436] LustreError: 555:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
[ 3759.154733] LustreError: 555:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: lustre-OST0001-osc-ffff8800d231d800 lock: ffff8800da181d80/0x513d4886fdafe22b lrc: 2/0,0 mode: PW/PW res: [0xa:0x0:0x0].0x0 rrc: 3 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;4194303) flags: 0x20000020000 nid: local remote: 0x6960c454aded7432 expref: -99 pid: 2547 timeout: 0 lvb_type: 1
[ 3759.162297] Pid: 555, comm: kworker/u16:4 3.10.0-7.6-debug #2 SMP Wed Nov 7 02:25:20 EST 2018
[ 3759.164316] Call Trace:
[ 3759.165036]  [&amp;lt;ffffffffa01197dc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 3759.166673]  [&amp;lt;ffffffffa0119836&amp;gt;] libcfs_debug_dumpstack+0x26/0x30 [libcfs]
[ 3759.169212]  [&amp;lt;ffffffffa06b3cd0&amp;gt;] osc_req_attr_set+0x3e0/0x610 [osc]
[ 3759.171461]  [&amp;lt;ffffffffa02acdf0&amp;gt;] cl_req_attr_set+0x60/0x150 [obdclass]
[ 3759.173538]  [&amp;lt;ffffffffa06ae546&amp;gt;] osc_build_rpc+0x496/0xfe0 [osc]
[ 3759.175067]  [&amp;lt;ffffffffa06c8d27&amp;gt;] osc_io_unplug0+0xe17/0x1920 [osc]
[ 3759.176554]  [&amp;lt;ffffffffa06cd7e0&amp;gt;] osc_cache_writeback_range+0xc20/0x1260 [osc]
[ 3759.178457]  [&amp;lt;ffffffffa06bc2c5&amp;gt;] osc_io_fsync_start+0x85/0x1a0 [osc]
[ 3759.180085]  [&amp;lt;ffffffffa02acf45&amp;gt;] cl_io_start+0x65/0x130 [obdclass]
[ 3759.181571]  [&amp;lt;ffffffffa07b9d05&amp;gt;] lov_io_call.isra.7+0x85/0x140 [lov]
[ 3759.183270]  [&amp;lt;ffffffffa07b9ec6&amp;gt;] lov_io_start+0x56/0x150 [lov]
[ 3759.184661]  [&amp;lt;ffffffffa02acf45&amp;gt;] cl_io_start+0x65/0x130 [obdclass]
[ 3759.186366]  [&amp;lt;ffffffffa02af21c&amp;gt;] cl_io_loop+0xcc/0x1c0 [obdclass]
[ 3759.187837]  [&amp;lt;ffffffffa0c278ab&amp;gt;] cl_sync_file_range+0x2db/0x380 [lustre]
[ 3759.189677]  [&amp;lt;ffffffffa0c4519a&amp;gt;] ll_writepages+0x7a/0x200 [lustre]
[ 3759.191845]  [&amp;lt;ffffffff811b92a1&amp;gt;] do_writepages+0x21/0x50
[ 3759.193168]  [&amp;lt;ffffffff812654e0&amp;gt;] __writeback_single_inode+0x40/0x2a0
[ 3759.194940]  [&amp;lt;ffffffff812662a6&amp;gt;] writeback_sb_inodes+0x2a6/0x4b0
[ 3759.196596]  [&amp;lt;ffffffff81266c52&amp;gt;] wb_writeback+0x102/0x340
[ 3759.198126]  [&amp;lt;ffffffff81267511&amp;gt;] bdi_writeback_workfn+0x141/0x4e0
[ 3759.199598]  [&amp;lt;ffffffff810ad8ad&amp;gt;] process_one_work+0x18d/0x4a0
[ 3759.201118]  [&amp;lt;ffffffff810adce6&amp;gt;] worker_thread+0x126/0x3b0
[ 3759.202435]  [&amp;lt;ffffffff810b4ed4&amp;gt;] kthread+0xe4/0xf0
[ 3759.203719]  [&amp;lt;ffffffff817c4c77&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[ 3759.205294]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[ 3759.206488] LustreError: 555:0:(osc_object.c:401:osc_req_attr_set()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So I looked inside and it seems there are much deeper problems going on under the hood.&lt;/p&gt;

&lt;p&gt;It all starts in early init with this list corruption warning:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  113.445869] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009df5de40: user from 192.168.10.181@tcp authenticated as root
[  113.449010] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) Skipped 1 previous similar message
[  113.455283] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8800a8b93f00 idx 0x96480a9894da0293, expiry 1551008338(+60s)
[  113.460308] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) Skipped 2 previous similar messages
[  115.146368] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009ddffe40: user from 0@lo authenticated as mds
[  115.148646] ------------[ cut here ]------------
[  115.148653] WARNING: CPU: 2 PID: 15079 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
[  115.148654] list_del corruption. prev-&amp;gt;next should be ffff8800a6c5fba0, but was ffff8800d354af68
[  115.148683] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod crc_t10dif crct10dif_generic crct10dif_common squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy
[  115.148687] CPU: 2 PID: 15079 Comm: ll_ost00_001 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #2
[  115.148688] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  115.148689] Call Trace:
[  115.148696]  [&amp;lt;ffffffff817afbf2&amp;gt;] dump_stack+0x19/0x1b
[  115.148701]  [&amp;lt;ffffffff810895c8&amp;gt;] __warn+0xd8/0x100
[  115.148704]  [&amp;lt;ffffffff8108964f&amp;gt;] warn_slowpath_fmt+0x5f/0x80
[  115.148722]  [&amp;lt;ffffffff8109bfb0&amp;gt;] ? __internal_add_timer+0x130/0x130
[  115.148724]  [&amp;lt;ffffffff813ff301&amp;gt;] __list_del_entry+0xa1/0xd0
[  115.148726]  [&amp;lt;ffffffff813ff33d&amp;gt;] list_del+0xd/0x30
[  115.148730]  [&amp;lt;ffffffff810b5c66&amp;gt;] remove_wait_queue+0x26/0x40
[  115.148744]  [&amp;lt;ffffffffa0734937&amp;gt;] gss_svc_upcall_handle_init+0x267/0xf10 [ptlrpc_gss]
[  115.148748]  [&amp;lt;ffffffff810caae0&amp;gt;] ? wake_up_state+0x20/0x20
[  115.148766]  [&amp;lt;ffffffffa0726839&amp;gt;] gss_svc_handle_init+0x7e9/0xb70 [ptlrpc_gss]
[  115.148769]  [&amp;lt;ffffffff810ac981&amp;gt;] ? try_to_grab_pending+0xb1/0x180
[  115.148774]  [&amp;lt;ffffffffa072cffb&amp;gt;] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
[  115.148780]  [&amp;lt;ffffffffa0741fe8&amp;gt;] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
[  115.148836]  [&amp;lt;ffffffffa04fc30e&amp;gt;] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
[  115.148871]  [&amp;lt;ffffffffa04dd108&amp;gt;] ptlrpc_main+0xa28/0x2040 [ptlrpc]
[  115.148873]  [&amp;lt;ffffffff810c32ed&amp;gt;] ? finish_task_switch+0x5d/0x1b0
[  115.148919]  [&amp;lt;ffffffffa04dc6e0&amp;gt;] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
[  115.148922]  [&amp;lt;ffffffff810b4ed4&amp;gt;] kthread+0xe4/0xf0
[  115.148925]  [&amp;lt;ffffffff810b4df0&amp;gt;] ? kthread_create_on_node+0x140/0x140
[  115.148928]  [&amp;lt;ffffffff817c4c77&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
[  115.148931]  [&amp;lt;ffffffff810b4df0&amp;gt;] ? kthread_create_on_node+0x140/0x140
[  115.148933] ---[ end trace 3829a662976cf54a ]---
[  115.149029] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff88009cbe5f00 idx 0x88a4014f1934dac8, expiry 1551008339(+60s)
[  115.149417] Lustre: 17039:0:(sec_gss.c:377:gss_cli_ctx_uptodate()) client refreshed ctx ffff8800bb3a6f00 idx 0x88a4014f1934daca (0-&amp;gt;lustre-OST0000_UUID), expiry 1551008329(+50s)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and from there things go downhill, with complaints about authentication failures and other errors.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  711.023235] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err 0, gss err 0
[  711.027252] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) Skipped 629 previous similar messages
[ 1208.887079] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) authentication failed
[ 1208.891203] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) Skipped 2077 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and eventually it dies in that crash.&lt;/p&gt;

&lt;p&gt;So clearly something very wrong is going on.&lt;/p&gt;

&lt;p&gt;This is on a dual-VM setup where client and server are on separate VMs.&lt;/p&gt;

&lt;p&gt;You can see all related logs from a sample run here:&lt;br/&gt;
&lt;a href=&quot;http://testing.linuxhacker.ru:3333/lustre-reports/72/testresults/sanity-gss-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://testing.linuxhacker.ru:3333/lustre-reports/72/testresults/sanity-gss-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/&lt;/a&gt;&lt;/p&gt;
</description>
                <environment></environment>
        <key id="54972">LU-12001</key>
            <summary>sanity-gss issues</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="green">Oleg Drokin</reporter>
                        <labels>
                    </labels>
                <created>Mon, 25 Feb 2019 01:16:43 +0000</created>
                <updated>Wed, 9 Nov 2022 16:58:32 +0000</updated>
                                            <version>Lustre 2.13.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="320588" author="adilger" created="Sat, 11 Dec 2021 18:29:42 +0000"  >&lt;p&gt;+2 on master: &lt;a href=&quot;https://testing.whamcloud.com/gerrit-janitor/20448/results.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/gerrit-janitor/20448/results.html&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="349298" author="adilger" created="Tue, 11 Oct 2022 18:25:43 +0000"  >&lt;p&gt;Sebastien, could you please take a look to see if this problem is likely to hit in production, or is it just something in the test script?  It still fails repeatedly when tested (eg. for patch &lt;a href=&quot;https://review.whamcloud.com/46778&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46778&lt;/a&gt; that is mostly just changing white space in the test script).&lt;/p&gt;</comment>
                            <comment id="352197" author="gerrit" created="Tue, 8 Nov 2022 17:18:45 +0000"  >&lt;p&gt;&quot;Sebastien Buisson &amp;lt;sbuisson@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49071&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49071&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12001&quot; title=&quot;sanity-gss issues&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12001&quot;&gt;LU-12001&lt;/a&gt; tests: investigate &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12001&quot; title=&quot;sanity-gss issues&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12001&quot;&gt;LU-12001&lt;/a&gt;&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2e14474366cc62264e976fbd00b4a1554d9c7d26&lt;/p&gt;</comment>
                            <comment id="352318" author="sebastien" created="Wed, 9 Nov 2022 16:58:32 +0000"  >&lt;p&gt;According to the test results of &lt;a href=&quot;https://review.whamcloud.com/49071&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/49071&lt;/a&gt;, the crash occurs in sanity-gss test_150 when it is run with SHARED_KEY=true, leading to GSS flavor back-and-forth switches. I found that the problem stems from the fact that the &lt;tt&gt;lsvcgssd&lt;/tt&gt; daemon is killed then relaunched by the test, but lacking the &lt;tt&gt;-s&lt;/tt&gt; flag that enables SSK:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;usage: lsvcgssd [ -fnvmogk ]
-f      - Run in foreground
-n      - Don&apos;t establish kerberos credentials
-v      - Verbosity
-m      - Service MDS
-o      - Service OSS
-g      - Service MGS
-k      - Enable kerberos support
-s      - Enable shared secret key support
-z      - Enable gssnull support
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So the client ends up in an infinite loop, trying to authenticate against the servers, while the servers cannot complete the authentication process because the &lt;tt&gt;lsvcgssd&lt;/tt&gt; daemon is incorrectly set up. It finally crashes.&lt;/p&gt;

&lt;p&gt;While this is an actual problem, it is not something likely to be hit in the production lifecycle of a cluster. The standard procedure is to configure an SSK or Kerberos flavor on a file system (the &lt;tt&gt;gssnull&lt;/tt&gt; flavor tested in sanity-gss is not intended for production use, and is useless there), and never change it during the life of the file system, or only on rare occasions while the system is mostly offline.&lt;br/&gt;
And in case of server failover, the recommendation is to always start the &lt;tt&gt;lsvcgssd&lt;/tt&gt; daemon prior to mounting the Lustre targets.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00c6f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>