[LU-12896] recovery-small test_110k: (gss_keyring.c:152:ctx_upcall_timeout_kr()) ASSERTION( key ) failed
Created: 22/Oct/19  Updated: 19/Oct/23  Resolved: 16/Oct/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Sebastien Buisson
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13498 sanity test 56w fails with '/usr/bin/... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f10303a6-f4c7-11e9-add9-52540065bddc

test_110k failed and hit an assertion:

[ 5669.278804] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 10:29:44 (1571740184)
[ 5669.612623] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-35vm12.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 5669.809330] Lustre: DEBUG MARKER: onyx-35vm12.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 5704.525897] Lustre: 0:0:(gss_keyring.c:150:ctx_upcall_timeout_kr()) ctx ffff94b33fc03da0, key           (null)
[ 5704.527744] LustreError: 0:0:(gss_keyring.c:152:ctx_upcall_timeout_kr()) ASSERTION( key ) failed: 
[ 5704.529239] LustreError: 0:0:(gss_keyring.c:152:ctx_upcall_timeout_kr()) LBUG
[ 5704.530425] Kernel panic - not syncing: LBUG in interrupt.

[ 5704.531587] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
[ 5704.533438] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5704.534384] Call Trace:
[ 5704.534832]    [] dump_stack+0x19/0x1b
[ 5704.535867]  [] panic+0xe8/0x21f
[ 5704.536703]  [] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 5704.537877]  [] lbug_with_loc+0x8d/0xa0 [libcfs]
[ 5704.538922]  [] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 5704.540029]  [] ctx_upcall_timeout_kr+0xc3/0xd0 [ptlrpc_gss]
[ 5704.541244]  [] call_timer_fn+0x38/0x110
[ 5704.542162]  [] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 5704.543272]  [] run_timer_softirq+0x24d/0x300
[ 5704.544254]  [] __do_softirq+0xf5/0x280
[ 5704.545180]  [] call_softirq+0x1c/0x30
[ 5704.546095]  [] do_softirq+0x65/0xa0
[ 5704.546972]  [] irq_exit+0x105/0x110
[ 5704.547823]  [] smp_apic_timer_interrupt+0x48/0x60
[ 5704.548871]  [] apic_timer_interrupt+0x162/0x170
[ 5704.549893]    [] ? __cpuidle_text_start+0x8/0x8
[ 5704.551024]  [] ? native_safe_halt+0xb/0x20
[ 5704.551976]  [] default_idle+0x1e/0xc0
[ 5704.552881]  [] arch_cpu_idle+0x20/0xc0
[ 5704.553807]  [] cpu_startup_entry+0x14a/0x1e0
[ 5704.554795]  [] rest_init+0x77/0x80
[ 5704.555665]  [] start_kernel+0x44b/0x46c
[ 5704.556587]  [] ? repair_env_string+0x5c/0x5c
[ 5704.557586]  [] ? early_idt_handler_array+0x120/0x120
[ 5704.558683]  [] x86_64_start_reservations+0x24/0x26
[ 5704.559753]  [] x86_64_start_kernel+0x154/0x177
[ 5704.560776]  [] start_cpu+0x5/0x14


 Comments   
Comment by Chris Horn [ 22/Oct/19 ]

Looks like the same issue with a different signature: https://testing.whamcloud.com/test_sets/1a32dd98-f4d5-11e9-a197-52540065bddc

[ 7024.609940] BUG: unable to handle kernel paging request at ffffffff9d4aab37
[ 7024.611169] IP: [] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7024.612513] PGD 28814067 PUD 28815063 PMD 27c000e1 
[ 7024.613458] Oops: 0003 [#1] SMP 
[ 7024.614093] Modules linked in: ptlrpc_gss(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ucm rpcrdma rdma_ucm ib_uverbs ib_umad ib_iser rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm mlx4_ib ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev i2c_piix4 pcspkr virtio_balloon parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi mlx4_en ptp pps_core virtio_blk ata_piix mlx4_core libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel
[ 7024.628253]  serio_raw virtio_pci devlink virtio_ring virtio 8139cp mii floppy
[ 7024.629523] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
[ 7024.631366] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 7024.632322] task: ffffffff9e018480 ti: ffffffff9e000000 task.ti: ffffffff9e000000
[ 7024.633557] RIP: 0010:[]  [] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7024.635229] RSP: 0018:ffff8df03fc03e50  EFLAGS: 00010292
[ 7024.636120] RAX: 0000000000000000 RBX: ffffffff9d4aaac7 RCX: 000000000000083f
[ 7024.637298] RDX: 00000000ffffffff RSI: 0000000000000200 RDI: ffff8df03fc03da0
[ 7024.638464] RBP: ffff8df03fc03e60 R08: 0000000000000000 R09: ffff8df03d160f00
[ 7024.639638] R10: 000000000000082c R11: ffff8df03fc039ce R12: ffff8df0254611e0
[ 7024.640825] R13: 0000000000000100 R14: ffffffffc0fe38d0 R15: ffff8df0254611e0
[ 7024.642127] FS:  0000000000000000(0000) GS:ffff8df03fc00000(0000) knlGS:0000000000000000
[ 7024.643909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7024.644970] CR2: ffffffff9d4aab37 CR3: 000000007b1bc000 CR4: 00000000000606f0
[ 7024.646171] Call Trace:
[ 7024.646610]   
[ 7024.646999]  [] call_timer_fn+0x38/0x110
[ 7024.647991]  [] ? ctx_unlist_kr+0xc0/0xc0 [ptlrpc_gss]
[ 7024.649106]  [] run_timer_softirq+0x24d/0x300
[ 7024.650107]  [] __do_softirq+0xf5/0x280
[ 7024.651070]  [] call_softirq+0x1c/0x30
[ 7024.651999]  [] do_softirq+0x65/0xa0
[ 7024.652889]  [] irq_exit+0x105/0x110
[ 7024.653745]  [] smp_apic_timer_interrupt+0x48/0x60
[ 7024.654815]  [] apic_timer_interrupt+0x162/0x170
[ 7024.655843]   
[ 7024.656186]  [] ? __cpuidle_text_start+0x8/0x8
[ 7024.657272]  [] ? native_safe_halt+0xb/0x20
[ 7024.658235]  [] default_idle+0x1e/0xc0
[ 7024.659135]  [] arch_cpu_idle+0x20/0xc0
[ 7024.660066]  [] cpu_startup_entry+0x14a/0x1e0
[ 7024.661069]  [] rest_init+0x77/0x80
[ 7024.661946]  [] start_kernel+0x44b/0x46c
[ 7024.662860]  [] ? repair_env_string+0x5c/0x5c
[ 7024.663840]  [] ? early_idt_handler_array+0x120/0x120
[ 7024.664942]  [] x86_64_start_reservations+0x24/0x26
[ 7024.666009]  [] x86_64_start_kernel+0x154/0x177
[ 7024.667030]  [] start_cpu+0x5/0x14
[ 7024.667864] Code: c7 05 84 f2 01 00 00 04 00 00 48 c7 05 81 f2 01 00 90 2b 00 c1 e8 fc ad 8d ff 48 85 db 74 18 48 8d bd 40 ff ff ff e8 ab 50 fe ff  80 4b 70 04 48 83 c4 08 5b 5d c3 48 c7 c7 00 89 ff c0 48 c7 
[ 7024.673175] RIP  [] ctx_upcall_timeout_kr+0x85/0xd0 [ptlrpc_gss]
[ 7024.674479]  RSP 
[ 7024.675076] CR2: ffffffff9d4aab37
Comment by Alex Zhuravlev [ 31/Oct/19 ]

https://testing.whamcloud.com/test_sets/9fa3cc34-fb6b-11e9-a197-52540065bddc

Comment by Andreas Dilger [ 20/Oct/20 ]

The patch https://review.whamcloud.com/40161 "LU-13498 tests: remove tests from ALWAYS_EXCEPT with SSK" removes this subtest from ALWAYS_EXCEPT, so once it lands this issue can be resolved and the always_except label removed.

Comment by Gerrit Updater [ 22/Sep/23 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52473
Subject: LU-12896 gss: key can be unlinked when timeout expires
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6a456726aa57d34f51c0a1186e6c9b4bce60aeac
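
For readers following the patch subject, here is a minimal userspace sketch in C of the general pattern it describes: a timeout callback that tolerates an already-unlinked key instead of asserting on it. This is not the Lustre code. All names (demo_key, demo_ctx, demo_unlink_key, demo_upcall_timeout) are hypothetical, and the locking and reference counting a real kernel timer path would need are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel key and the GSS context. */
struct demo_key {
    bool revoked;
};

struct demo_ctx {
    struct demo_key *key;   /* may become NULL once the key is unlinked */
    bool timed_out;         /* set when the upcall timeout was handled */
};

/* Unlink the key from the context, as a concurrent teardown path might. */
static void demo_unlink_key(struct demo_ctx *ctx)
{
    ctx->key = NULL;
}

/*
 * Timer callback sketch: the context itself must exist, but the key is
 * allowed to be gone by the time the timer fires, so it is checked
 * rather than asserted. Returns true if a key was revoked.
 */
static bool demo_upcall_timeout(struct demo_ctx *ctx)
{
    struct demo_key *key;

    assert(ctx != NULL);
    ctx->timed_out = true;

    key = ctx->key;
    if (key == NULL)
        return false;       /* key already unlinked: nothing to revoke */

    key->revoked = true;
    return true;
}
```

In this sketch, calling demo_unlink_key() before demo_upcall_timeout() models the ordering that the logs above suggest (the timer firing with a NULL key); the callback then returns false instead of aborting.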

Comment by Gerrit Updater [ 16/Oct/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52473/
Subject: LU-12896 gss: key can be unlinked when timeout expires
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c6290087b3bf0838a00de8f8b1cfde86efbc409

Comment by Peter Jones [ 16/Oct/23 ]

Landed for 2.16

Generated at Sat Feb 10 02:56:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.