[LU-9582] gssnull instability Created: 01/Jun/17  Updated: 06/Sep/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Sebastien Buisson
Resolution: Unresolved Votes: 0
Labels: gss

Issue Links:
Related
is related to LU-9073 SSK: lgss_sk generates keys with inva... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While trying to run sanity-gss, I found the gssnull flavor to be very unstable on the server side.
Once the lsvcgssd daemon is started with -z on the server and the flavor is set to gssnull (lctl conf_param <fsname>.srpc.flavor.default=gssnull), connections between nodes get authenticated. However, stack traces similar to the following are then dumped on the server side:

 [ 535.556541] WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
 [ 535.556885] list_del corruption. prev->next should be ffff8803fa1a3bd0, but was ffff880405b71f58
 [ 535.557043] Modules linked in: ptlrpc_gss(OF) sunrpc osp(OF) mdd(OF) lod(OF) mdt(OF) lfsck(OF) mgc(OF) osd_ldiskfs(OF) lquota(OF) fid(OF) fld(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) ldiskfs(OF) loop mbcache jbd2 sha512_generic ppdev pcspkr parport_pc parport i2c_piix4 i2c_core serio_raw virtio_balloon xfs libcrc32c sd_mod crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi 8139too ata_piix 8139cp libata mii virtio_pci virtio_ring virtio floppy [last unloaded: libcfs]
 [ 535.557043] CPU: 5 PID: 3378 Comm: mdt00_003 Tainted: GF O-------------- 3.10.0-229.20.1.el7.x86_64 #1
 [ 535.557043] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
 [ 535.557043] ffff8803fa1a3ac8 00000000ace999ec ffff8803fa1a3a80 ffffffff816045b6
 [ 535.557043] ffff8803fa1a3ab8 ffffffff8106e29b ffff8803fa1a3bd0 ffff8803fa1a3bb8
 [ 535.557043] 0000000000000246 0000000000000000 ffff8803eecf25a0 ffff8803fa1a3b20
 [ 535.557043] Call Trace:
 [ 535.557043] [<ffffffff816045b6>] dump_stack+0x19/0x1b
 [ 535.557043] [<ffffffff8106e29b>] warn_slowpath_common+0x6b/0xb0
 [ 535.557043] [<ffffffff8106e33c>] warn_slowpath_fmt+0x5c/0x80
 [ 535.557043] [<ffffffff8107eda0>] ? __internal_add_timer+0x130/0x130
 [ 535.557043] [<ffffffff812ed9f1>] __list_del_entry+0xa1/0xd0
 [ 535.557043] [<ffffffff812eda2d>] list_del+0xd/0x30
 [ 535.557043] [<ffffffff81098086>] remove_wait_queue+0x26/0x40
 [ 535.557043] [<ffffffffa0bde99f>] gss_svc_upcall_handle_init+0x25f/0xee0 [ptlrpc_gss]
 [ 535.557043] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
 [ 535.557043] [<ffffffffa0bd0c49>] gss_svc_handle_init+0x7e9/0xb60 [ptlrpc_gss]
 [ 535.557043] [<ffffffffa0bd70db>] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
 [ 535.557043] [<ffffffffa0bebf18>] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
 [ 535.557043] [<ffffffffa062f70e>] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
 [ 535.557043] [<ffffffffa060f594>] ptlrpc_main+0x964/0x1de0 [ptlrpc]
 [ 535.557043] [<ffffffffa060ec30>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [ 535.557043] [<ffffffff8109727f>] kthread+0xcf/0xe0
 [ 535.557043] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
 [ 535.557043] [<ffffffff81614358>] ret_from_fork+0x58/0x90
 [ 535.557043] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140

followed by this message:

[ 535.571130] Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
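
For triage, the warning trace above can be reduced to the frames that belong to ptlrpc_gss. The following is only a minimal Python sketch, assuming the console log has been saved as text; the helper name module_frames is hypothetical and not part of any Lustre tooling. Frames printed with a leading "?" are addresses found on the stack that the unwinder could not confirm, so the sketch marks them as unreliable.

```python
import re

# Matches one kernel call-trace frame, for example:
#   [<ffffffffa0bde99f>] gss_svc_upcall_handle_init+0x25f/0xee0 [ptlrpc_gss]
# Group 1: optional "?" marker, group 2: symbol, group 3: optional module.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\][ \t]+(\?[ \t]+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(\w+)\])?"
)

def module_frames(log_text, module):
    """Return (symbol, is_reliable) pairs for frames in the given module.

    Hypothetical helper for illustration only.
    """
    frames = []
    for match in FRAME_RE.finditer(log_text):
        unreliable, symbol, mod = match.groups()
        if mod == module:
            frames.append((symbol, unreliable is None))
    return frames

# Excerpt copied from the warning trace in this ticket.
sample = """\
 [ 535.557043] [<ffffffff81098086>] remove_wait_queue+0x26/0x40
 [ 535.557043] [<ffffffffa0bde99f>] gss_svc_upcall_handle_init+0x25f/0xee0 [ptlrpc_gss]
 [ 535.557043] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
 [ 535.557043] [<ffffffffa0bd0c49>] gss_svc_handle_init+0x7e9/0xb60 [ptlrpc_gss]
"""

for symbol, reliable in module_frames(sample, "ptlrpc_gss"):
    print(symbol, "reliable" if reliable else "unreliable")
```

Applied to the full trace, this leaves gss_svc_upcall_handle_init as the ptlrpc_gss frame that calls remove_wait_queue.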

The list_del warning and the "not able to keep up" message repeat several times, until a general protection fault (GPF) occurs:

 [ 996.052879] general protection fault: 0000 [#1] SMP
 [ 996.053003] Modules linked in: ptlrpc_gss(OF) sunrpc osp(OF) mdd(OF) lod(OF) mdt(OF) lfsck(OF) mgc(OF) osd_ldiskfs(OF) lquota(OF) fid(OF) fld(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) ldiskfs(OF) loop mbcache jbd2 sha512_generic ppdev pcspkr parport_pc parport i2c_piix4 i2c_core serio_raw virtio_balloon xfs libcrc32c sd_mod crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi 8139too ata_piix 8139cp libata mii virtio_pci virtio_ring virtio floppy [last unloaded: libcfs]
 [ 996.053003] CPU: 5 PID: 2951 Comm: mdt_out00_001 Tainted: GF W O-------------- 3.10.0-229.20.1.el7.x86_64 #1
 [ 996.053003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
 [ 996.053003] task: ffff8800da83a220 ti: ffff8803fc070000 task.ti: ffff8803fc070000
 [ 996.053003] RIP: 0010:[<ffffffff812e29e6>] [<ffffffff812e29e6>] memcpy+0x16/0x110
 [ 996.053003] RSP: 0018:ffff8803fc073998 EFLAGS: 00010202
 [ 996.053003] RAX: ffffc9000e8c3000 RBX: ffff8803fc0739f8 RCX: ffff880406762300
 [ 996.053003] RDX: 000000005a5a5a1a RSI: 5a5a5a5a5a5a5a5a RDI: ffffc9000e8c3000
 [ 996.053003] RBP: ffff8803fc0739b8 R08: 0000000000000000 R09: ffffea000e9cfc80
 [ 996.053003] R10: 0000000000004120 R11: fffffffffffffff8 R12: ffff8803ee020808
 [ 996.053003] R13: ffff8803fc05d050 R14: 0000000000000000 R15: ffff8803f0b18e40
 [ 996.053003] FS: 0000000000000000(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
 [ 996.053003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 [ 996.053003] CR2: 00007f6eb788aba0 CR3: 0000000036483000 CR4: 00000000000006e0
 [ 996.053003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 996.053003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 [ 996.053003] Stack:
 [ 996.053003] ffffffffa0be03dd ffff8803fc0739c8 ffff8803ee020780 ffff8803fc05d050
 [ 996.053003] ffff8803fc073b70 ffffffffa0bdcd90 0000000000000000 0000000000000000
 [ 996.053003] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 [ 996.053003] Call Trace:
 [ 996.053003] [<ffffffffa0be03dd>] ? rawobj_dup+0x15d/0x2e0 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa0bdcd90>] gss_svc_searchbyctx+0x40/0xa0 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa0bdc870>] ? rsc_alloc+0xc0/0xc0 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa0bdecc5>] gss_svc_upcall_handle_init+0x585/0xee0 [ptlrpc_gss]
 [ 996.053003] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
 [ 996.053003] [<ffffffffa0bd0c49>] gss_svc_handle_init+0x7e9/0xb60 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa0bd70db>] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa0bebf18>] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
 [ 996.053003] [<ffffffffa062f70e>] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
 [ 996.053003] [<ffffffffa060f594>] ptlrpc_main+0x964/0x1de0 [ptlrpc]
 [ 996.053003] [<ffffffffa060ec30>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [ 996.053003] [<ffffffff8109727f>] kthread+0xcf/0xe0
 [ 996.053003] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
 [ 996.053003] [<ffffffff81614358>] ret_from_fork+0x58/0x90
 [ 996.053003] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
 [ 996.053003] Code: 00 00 00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 <4c> 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 4c 89
 [ 996.053003] RIP [<ffffffff812e29e6>] memcpy+0x16/0x110
 [ 996.053003] RSP <ffff8803fc073998>
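
One detail in the register dump above may be worth noting: RSI, the memcpy source, is 5a5a5a5a5a5a5a5a, and RDX, the remaining length, is 000000005a5a5a1a. A repeating 0x5a byte is commonly a memory poison fill, so this could point at rawobj_dup copying from an object that was already freed or never initialized. That reading is an assumption, not a confirmed diagnosis. The Python sketch below only shows how such registers can be flagged in a saved log; the helper name poisoned_registers and the 3-byte threshold are hypothetical choices.

```python
import re

# General-purpose registers as printed in an x86_64 oops, e.g. "RSI: 5a5a...".
# Lines such as "RSP: 0018:..." or "CR2: ..." are deliberately not matched.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}):\s*([0-9a-f]{16})\b")

def poisoned_registers(log_text, poison=0x5A, min_bytes=3):
    """Return {register: value} for values with >= min_bytes poison bytes.

    Hypothetical helper; poison=0x5a is an assumption about the fill
    pattern in use on the affected node.
    """
    hits = {}
    for name, hexval in REG_RE.findall(log_text):
        value = int(hexval, 16)
        count = sum(1 for b in value.to_bytes(8, "big") if b == poison)
        if count >= min_bytes:
            hits[name] = value
    return hits

# Excerpt copied from the GPF register dump in this ticket.
sample = """\
 [ 996.053003] RAX: ffffc9000e8c3000 RBX: ffff8803fc0739f8 RCX: ffff880406762300
 [ 996.053003] RDX: 000000005a5a5a1a RSI: 5a5a5a5a5a5a5a5a RDI: ffffc9000e8c3000
"""

for name, value in poisoned_registers(sample).items():
    print(f"{name} = {value:#018x}")
```

On the excerpt this flags RDX and RSI. The Code: bytes appear to show two "sub $0x20,%rdx" instructions before the faulting load from (%rsi), which would make the original length 0x5a5a5a5a, and a non-canonical source address like the one in RSI is consistent with a GPF rather than a page fault.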


 Comments   
Comment by Peter Jones [ 03/Jun/17 ]

Sebastien

Is this something that you have the bandwidth to investigate further?

Peter

Comment by James A Simmons [ 03/Jun/17 ]

Sebastien, can you try https://review.whamcloud.com/#/c/25199 ?

Comment by Sebastien Buisson (Inactive) [ 07/Jun/17 ]

Hi James,

My previous tests were done on 2.9.
I have now tried the master branch with https://review.whamcloud.com/25199 applied in addition to https://review.whamcloud.com/27320, but the problem is still there: list_del corruptions, eventually ending in a GPF.

Comment by Sebastien Buisson (Inactive) [ 06/Sep/17 ]

Hi,

Just to report that the same problem still occurs with the current head of the master branch.

Comment by Peter Jones [ 21/Mar/18 ]

sbuisson, is this still an issue on master, or has it been fixed by LU-8602 (or something else)?

Comment by Peter Jones [ 06/Sep/18 ]

Assigning to Sebastien to determine whether this is still an issue.
