Details
- Bug
- Resolution: Unresolved
- Medium
- None
- Lustre 2.17.0, Lustre 2.16.1
- None
- 3
Description
It looks like recent LNet landings (I think LU-15135 and LU-18555) broke some things in interop.
Soft lockup (master node crashed with a soft lockup, 2.16 interop):
[22859.194693] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [lnet_discovery:609699]
[22859.194877] Modules linked in: osp(OE) ofd(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey tls dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common virtio_balloon pcspkr joydev i2c_piix4 fuse drm ext4 mbcache jbd2 ata_generic ata_piix libata crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel virtio_blk net_failover failover serio_raw [last unloaded: libcfs(OE)]
[22859.195057] CPU: 1 PID: 609699 Comm: lnet_discovery Kdump: loaded Tainted: G OE ------- --- 5.14.0-503.40.1_lustre.el9.x86_64 #1
[22859.195134] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[22859.195200] RIP: 0010:lnet_peerni_by_nid_locked+0x21/0x140 [lnet]
[22859.195506] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 83 3d 84 d6 01 00 01 74 0c 48 c7 c0 94 ff ff ff c3 cc cc cc cc 41 55 49 89 f5 41 54 <41> 89 d4 55 48 89 fd 48 83 ec 08 e8 6f c5 ff ff 48 85 c0 74 0e 48
[22859.195570] RSP: 0018:ffffa9fa409efd40 EFLAGS: 00000246
[22859.195638] RAX: 0000000000000000 RBX: ffffa9fa409efe48 RCX: ffff8cda13ed6dd0
[22859.195711] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8cda03c5c440
[22859.195770] RBP: ffff8cda03c5c420 R08: 0000000000000000 R09: ffffffffc0be0b70
[22859.195828] R10: 0000000000000000 R11: 0000000000000200 R12: ffff8cda03c5c440
[22859.195887] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8cda03c5c498
[22859.195945] FS: 0000000000000000(0000) GS:ffff8cdabfd00000(0000) knlGS:0000000000000000
[22859.196004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22859.196062] CR2: 00007fd84e001010 CR3: 0000000005920006 CR4: 00000000000606f0
[22859.196124] Call Trace:
[22859.196184] <IRQ>
[22859.196250] ? show_trace_log_lvl+0x1c4/0x2df
[22859.196364] ? show_trace_log_lvl+0x1c4/0x2df
[22859.196425] ? lnet_select_pathway+0x1b6/0x640 [lnet]
[22859.196533] ? watchdog_timer_fn+0x1ad/0x210
[22859.196630] ? __pfx_watchdog_timer_fn+0x10/0x10
[22859.196699] ? __hrtimer_run_queues+0x112/0x2b0
[22859.196778] ? hrtimer_interrupt+0xfc/0x210
[22859.196838] ? kvm_sched_clock_read+0xd/0x20
[22859.196935] ? __sysvec_apic_timer_interrupt+0x4e/0x100
[22859.197015] ? sysvec_apic_timer_interrupt+0x6d/0x90
[22859.197076] </IRQ>
[22859.197134] <TASK>
[22859.197191] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[22859.197278] ? lnet_peerni_by_nid_locked+0x21/0x140 [lnet]
[22859.197387] lnet_select_pathway+0x1b6/0x640 [lnet]
[22859.197500] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[22859.197565] lnet_send+0x6d/0x1e0 [lnet]
[22859.197676] lnet_peer_discovery_complete+0x21c/0x390 [lnet]
[22859.197786] lnet_peer_discovery+0x485/0xaf0 [lnet]
[22859.197894] ? __pfx_autoremove_wake_function+0x10/0x10
[22859.197975] ? __pfx_lnet_peer_discovery+0x10/0x10 [lnet]
[22859.198081] kthread+0xe0/0x100
[22859.198160] ? __pfx_kthread+0x10/0x10
[22859.198224] ret_from_fork+0x2c/0x50
[22859.198299] </TASK>
[22859.198367] Kernel panic - not syncing: softlockup: hung tasks
https://testing.whamcloud.com/test_sets/49c461ea-e4fd-40c9-8d23-cc9251bc3d47
A somewhat similar soft lockup in 2.15:
[ 5742.576727] Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
[ 5742.865727] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lnet_discovery:101584]
[ 5742.868982] Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) lnet_selftest(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 virtio_balloon pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata virtio_net serio_raw virtio_blk net_failover failover [last unloaded: libcfs]
[ 5742.876797] CPU: 1 PID: 101584 Comm: lnet_discovery Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.58.1.el8_10.x86_64 #1
[ 5742.878996] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5742.880036] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0x6/0x16
[ 5742.881310] Code: 51 52 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 cc cc cc cc 66 90 52 b8 01 00 00 00 <31> d2 f0 0f b0 17 3c 01 75 06 5a c3 cc cc cc cc 56 0f b6 f0 e8 bd
[ 5742.884537] RSP: 0018:ffffbdde01137bd8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 5742.885894] RAX: 0000000000000001 RBX: ffff96a4937a1000 RCX: 000000006e27f00a
[ 5742.887173] RDX: ffff96a4937a1020 RSI: ffff96a4937a1020 RDI: ffff96a4937a10b4
[ 5742.888450] RBP: ffff96a4937a10b4 R08: ffff96a4937a1020 R09: ffff96a4937a1100
[ 5742.889712] R10: 61c8864680b583eb R11: 0000000000000800 R12: ffff96a4937a1020
[ 5742.890982] R13: ffff96a4b5323ae0 R14: ffff96a4b5323ad0 R15: ffff96a4937a1020
[ 5742.892252] FS: 0000000000000000(0000) GS:ffff96a53fd00000(0000) knlGS:0000000000000000
[ 5742.893694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5742.894751] CR2: 000055dfb181a5f8 CR3: 000000006fa10006 CR4: 00000000000606e0
[ 5742.896023] Call Trace:
[ 5742.896540] <IRQ>
[ 5742.896976] ? watchdog_timer_fn.cold.10+0x46/0x9e
[ 5742.897878] ? watchdog+0x30/0x30
[ 5742.898519] ? __hrtimer_run_queues+0x101/0x280
[ 5742.899397] ? hrtimer_interrupt+0x100/0x220
[ 5742.900196] ? smp_apic_timer_interrupt+0x6a/0x130
[ 5742.901090] ? apic_timer_interrupt+0xf/0x20
[ 5742.901894] </IRQ>
[ 5742.902332] ? __raw_callee_save___pv_queued_spin_unlock+0x6/0x16
[ 5742.903424] lnet_peer_queue_message+0x6b/0x90 [lnet]
[ 5742.904556] lnet_initiate_peer_discovery+0x13b/0x2b0 [lnet]
[ 5742.905642] lnet_select_pathway+0x13f/0x1ab0 [lnet]
[ 5742.906597] ? ksocknal_launch_packet+0x2e1/0x510 [ksocklnd]
[ 5742.907669] ? _cond_resched+0x15/0x30
[ 5742.908395] ? apic_timer_interrupt+0xa/0x20
[ 5742.909192] ? apic_timer_interrupt+0xa/0x20
[ 5742.909986] ? apic_timer_interrupt+0xa/0x20
[ 5742.910786] lnet_send+0x6d/0x1e0 [lnet]
[ 5742.911559] lnet_peer_discovery_complete+0x21b/0x390 [lnet]
[ 5742.912632] lnet_peer_discovery+0x51f/0x1ba0 [lnet]
[ 5742.913584] ? finish_task_switch+0x86/0x2f0
[ 5742.914406] ? finish_wait+0x80/0x80
[ 5742.915086] ? lnet_peer_merge_data+0x10c0/0x10c0 [lnet]
[ 5742.916094] kthread+0x134/0x150
[ 5742.916744] ? set_kthread_struct+0x50/0x50
[ 5742.917532] ret_from_fork+0x35/0x40
[ 5742.918230] Kernel panic - not syncing: softlockup: hung tasks
[ 5742.919289] CPU: 1 PID: 101584 Comm: lnet_discovery Kdump: loaded Tainted: G OEL -------- - - 4.18.0-553.58.1.el8_10.x86_64 #1
[ 5742.921494] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5742.922532] Call Trace:
[ 5742.923043] <IRQ>
[ 5742.923479] dump_stack+0x41/0x60
[ 5742.924142] panic+0xe7/0x2ac
[ 5742.924744] ? syscall_return_via_sysret+0x6e/0x94
[ 5742.925637] watchdog_timer_fn.cold.10+0x85/0x9e
[ 5742.926507] ? watchdog+0x30/0x30
[ 5742.927152] __hrtimer_run_queues+0x101/0x280
[ 5742.927970] hrtimer_interrupt+0x100/0x220
[ 5742.928752] smp_apic_timer_interrupt+0x6a/0x130
[ 5742.929623] apic_timer_interrupt+0xf/0x20
[ 5742.930409] </IRQ>
[ 5742.930844] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0x6/0x16
[ 5742.932046] Code: 51 52 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 cc cc cc cc 66 90 52 b8 01 00 00 00 <31> d2 f0 0f b0 17 3c 01 75 06 5a c3 cc cc cc cc 56 0f b6 f0 e8 bd
https://testing.whamcloud.com/test_sets/be93cb17-bba4-4510-97a5-ada48573d38c
Issue Links
- is related to
- LU-19310 Interop crash in sanity-lnet test 256 lnet_return_rx_credits_locked()) ASSERTION( msg->msg_kiov != ((void *)0) ) failed
- Open