[LU-12001] sanity-gss issues Created: 25/Feb/19  Updated: 09/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like in my testing sanity-gss test 150 has a 100% crash failure rate like this:

[ 3757.187493] Lustre: DEBUG MARKER: == sanity-gss test 150: secure mgs connection: client flavor setting ================================= 07:38:39 (1551011919)
[ 3759.003595] LustreError: 5651:0:(file.c:4273:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -1
[ 3759.126390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) page@ffff8800b2af7e00[2 ffff8800c2ee3558 2 1           (null)]

[ 3759.131722] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) vvp-page@ffff8800b2af7e50(0:0) vm@ffffea0002721f88 1fffff0000282c 2:0 ffff8800b2af7e00 6144 lru

[ 3759.136409] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) lov-page@ffff8800b2af7e90, comp index: 0, gen: 0

[ 3759.139344] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) osc-page@ffff8800b2af7ec8 6144: 1< 0x845fed 258 0 + + > 2< 25165824 0 4096 0x5 0x520 |           (null) ffff8800c4288798 ffff8800cff6de60 > 3< 1 25 0 > 4< 0 0 8 227942400 - | - - + - > 5< - - + - | 0 - | 0 - ->

[ 3759.145390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) end page@ffff8800b2af7e00

[ 3759.147726] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) uncovered page!
[ 3759.149580] LustreError: 555:0:(ldlm_resource.c:1725:ldlm_resource_dump()) --- Resource: [0xa:0x0:0x0].0x0 (ffff8800c33fb9c0) refcount = 2
[ 3759.152436] LustreError: 555:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
[ 3759.154733] LustreError: 555:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: lustre-OST0001-osc-ffff8800d231d800 lock: ffff8800da181d80/0x513d4886fdafe22b lrc: 2/0,0 mode: PW/PW res: [0xa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x20000020000 nid: local remote: 0x6960c454aded7432 expref: -99 pid: 2547 timeout: 0 lvb_type: 1
[ 3759.162297] Pid: 555, comm: kworker/u16:4 3.10.0-7.6-debug #2 SMP Wed Nov 7 02:25:20 EST 2018
[ 3759.164316] Call Trace:
[ 3759.165036]  [<ffffffffa01197dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 3759.166673]  [<ffffffffa0119836>] libcfs_debug_dumpstack+0x26/0x30 [libcfs]
[ 3759.169212]  [<ffffffffa06b3cd0>] osc_req_attr_set+0x3e0/0x610 [osc]
[ 3759.171461]  [<ffffffffa02acdf0>] cl_req_attr_set+0x60/0x150 [obdclass]
[ 3759.173538]  [<ffffffffa06ae546>] osc_build_rpc+0x496/0xfe0 [osc]
[ 3759.175067]  [<ffffffffa06c8d27>] osc_io_unplug0+0xe17/0x1920 [osc]
[ 3759.176554]  [<ffffffffa06cd7e0>] osc_cache_writeback_range+0xc20/0x1260 [osc]
[ 3759.178457]  [<ffffffffa06bc2c5>] osc_io_fsync_start+0x85/0x1a0 [osc]
[ 3759.180085]  [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
[ 3759.181571]  [<ffffffffa07b9d05>] lov_io_call.isra.7+0x85/0x140 [lov]
[ 3759.183270]  [<ffffffffa07b9ec6>] lov_io_start+0x56/0x150 [lov]
[ 3759.184661]  [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
[ 3759.186366]  [<ffffffffa02af21c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[ 3759.187837]  [<ffffffffa0c278ab>] cl_sync_file_range+0x2db/0x380 [lustre]
[ 3759.189677]  [<ffffffffa0c4519a>] ll_writepages+0x7a/0x200 [lustre]
[ 3759.191845]  [<ffffffff811b92a1>] do_writepages+0x21/0x50
[ 3759.193168]  [<ffffffff812654e0>] __writeback_single_inode+0x40/0x2a0
[ 3759.194940]  [<ffffffff812662a6>] writeback_sb_inodes+0x2a6/0x4b0
[ 3759.196596]  [<ffffffff81266c52>] wb_writeback+0x102/0x340
[ 3759.198126]  [<ffffffff81267511>] bdi_writeback_workfn+0x141/0x4e0
[ 3759.199598]  [<ffffffff810ad8ad>] process_one_work+0x18d/0x4a0
[ 3759.201118]  [<ffffffff810adce6>] worker_thread+0x126/0x3b0
[ 3759.202435]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 3759.203719]  [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
[ 3759.205294]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 3759.206488] LustreError: 555:0:(osc_object.c:401:osc_req_attr_set()) LBUG

So I looked inside and it seems there are much deeper problems going on under the hood.

It all starts in early init with this list warning:

[  113.445869] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009df5de40: user from 192.168.10.181@tcp authenticated as root
[  113.449010] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) Skipped 1 previous similar message
[  113.455283] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8800a8b93f00 idx 0x96480a9894da0293, expiry 1551008338(+60s)
[  113.460308] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) Skipped 2 previous similar messages
[  115.146368] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009ddffe40: user from 0@lo authenticated as mds
[  115.148646] ------------[ cut here ]------------
[  115.148653] WARNING: CPU: 2 PID: 15079 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
[  115.148654] list_del corruption. prev->next should be ffff8800a6c5fba0, but was ffff8800d354af68
[  115.148683] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod crc_t10dif crct10dif_generic crct10dif_common squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy
[  115.148687] CPU: 2 PID: 15079 Comm: ll_ost00_001 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #2
[  115.148688] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  115.148689] Call Trace:
[  115.148696]  [<ffffffff817afbf2>] dump_stack+0x19/0x1b
[  115.148701]  [<ffffffff810895c8>] __warn+0xd8/0x100
[  115.148704]  [<ffffffff8108964f>] warn_slowpath_fmt+0x5f/0x80
[  115.148722]  [<ffffffff8109bfb0>] ? __internal_add_timer+0x130/0x130
[  115.148724]  [<ffffffff813ff301>] __list_del_entry+0xa1/0xd0
[  115.148726]  [<ffffffff813ff33d>] list_del+0xd/0x30
[  115.148730]  [<ffffffff810b5c66>] remove_wait_queue+0x26/0x40
[  115.148744]  [<ffffffffa0734937>] gss_svc_upcall_handle_init+0x267/0xf10 [ptlrpc_gss]
[  115.148748]  [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
[  115.148766]  [<ffffffffa0726839>] gss_svc_handle_init+0x7e9/0xb70 [ptlrpc_gss]
[  115.148769]  [<ffffffff810ac981>] ? try_to_grab_pending+0xb1/0x180
[  115.148774]  [<ffffffffa072cffb>] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
[  115.148780]  [<ffffffffa0741fe8>] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
[  115.148836]  [<ffffffffa04fc30e>] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
[  115.148871]  [<ffffffffa04dd108>] ptlrpc_main+0xa28/0x2040 [ptlrpc]
[  115.148873]  [<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[  115.148919]  [<ffffffffa04dc6e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
[  115.148922]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[  115.148925]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[  115.148928]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[  115.148931]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[  115.148933] ---[ end trace 3829a662976cf54a ]---
[  115.149029] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff88009cbe5f00 idx 0x88a4014f1934dac8, expiry 1551008339(+60s)
[  115.149417] Lustre: 17039:0:(sec_gss.c:377:gss_cli_ctx_uptodate()) client refreshed ctx ffff8800bb3a6f00 idx 0x88a4014f1934daca (0->lustre-OST0000_UUID), expiry 1551008329(+50s)

and then it kind of goes downhill from there, complaining about errors, authentication failures and such.

[  711.023235] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err 0, gss err 0
[  711.027252] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) Skipped 629 previous similar messages
[ 1208.887079] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) authentication failed
[ 1208.891203] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) Skipped 2077 previous similar messages

and eventually it dies in that crash.

So clearly something very wrong is going on.

This is on a dual-VM setup where client and server are on separate VMs.

You can see all related logs from a sample run here:
http://testing.linuxhacker.ru:3333/lustre-reports/72/testresults/sanity-gss-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/



 Comments   
Comment by Andreas Dilger [ 11/Dec/21 ]

+2 on master: https://testing.whamcloud.com/gerrit-janitor/20448/results.html

Comment by Andreas Dilger [ 11/Oct/22 ]

Sebastien, could you please take a look to see if this problem is likely to hit in production, or is it just something in the test script? It still fails repeatedly when tested (eg. for patch https://review.whamcloud.com/46778 that is mostly just changing white space in the test script).

Comment by Gerrit Updater [ 08/Nov/22 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49071
Subject: LU-12001 tests: investigate LU-12001
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e14474366cc62264e976fbd00b4a1554d9c7d26

Comment by Sebastien Buisson [ 09/Nov/22 ]

According to the test results of https://review.whamcloud.com/49071, the crash occurs in sanity-gss test_150 when it is run with SHARED_KEY=true, leading to GSS flavor back-and-forth switches. I found that the problem stems from the fact that the lsvcgssd daemon is killed and then relaunched by the test, but without the -s flag that enables SSK:

usage: lsvcgssd [ -fnvmogk ]
-f      - Run in foreground
-n      - Don't establish kerberos credentials
-v      - Verbosity
-m      - Service MDS
-o      - Service OSS
-g      - Service MGS
-k      - Enable kerberos support
-s      - Enable shared secret key support
-z      - Enable gssnull support

So the client ends up in an infinite loop, trying to authenticate against the servers, while the servers are unable to complete the authentication process because the lsvcgssd daemon is incorrectly set up. And finally the client crashes.

While it is an actual problem, it is not something likely to be hit during the production lifecycle of a cluster. The standard procedure is to configure an SSK or Kerberos flavor on a file system (the gssnull flavor tested in sanity-gss is not intended, and useless, for production use) and never change it during the life of the file system, or only on rare occasions while the system is mostly offline.
And in case of server failover, the recommendation is to always start the lsvcgssd daemon prior to mounting the Lustre targets.

Generated at Sat Feb 10 02:48:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.