  1. Lustre
  2. LU-19309

LNet interop crashes / lockups in master


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.17.0, Lustre 2.16.1
    • Labels: None
    • Severity: 3

    Description

      It looks like recent LNet landings (I think LU-15135 and LU-18555) broke some things in interop.

      Soft lockup (master node crashed with a soft lockup, 2.16 interop):

      [22859.194693] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [lnet_discovery:609699]
      [22859.194877] Modules linked in: osp(OE) ofd(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey tls dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common virtio_balloon pcspkr joydev i2c_piix4 fuse drm ext4 mbcache jbd2 ata_generic ata_piix libata crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel virtio_blk net_failover failover serio_raw [last unloaded: libcfs(OE)]
      [22859.195057] CPU: 1 PID: 609699 Comm: lnet_discovery Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-503.40.1_lustre.el9.x86_64 #1
      [22859.195134] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [22859.195200] RIP: 0010:lnet_peerni_by_nid_locked+0x21/0x140 [lnet]
      [22859.195506] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 83 3d 84 d6 01 00 01 74 0c 48 c7 c0 94 ff ff ff c3 cc cc cc cc 41 55 49 89 f5 41 54 <41> 89 d4 55 48 89 fd 48 83 ec 08 e8 6f c5 ff ff 48 85 c0 74 0e 48
      [22859.195570] RSP: 0018:ffffa9fa409efd40 EFLAGS: 00000246
      [22859.195638] RAX: 0000000000000000 RBX: ffffa9fa409efe48 RCX: ffff8cda13ed6dd0
      [22859.195711] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8cda03c5c440
      [22859.195770] RBP: ffff8cda03c5c420 R08: 0000000000000000 R09: ffffffffc0be0b70
      [22859.195828] R10: 0000000000000000 R11: 0000000000000200 R12: ffff8cda03c5c440
      [22859.195887] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8cda03c5c498
      [22859.195945] FS:  0000000000000000(0000) GS:ffff8cdabfd00000(0000) knlGS:0000000000000000
      [22859.196004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [22859.196062] CR2: 00007fd84e001010 CR3: 0000000005920006 CR4: 00000000000606f0
      [22859.196124] Call Trace:
      [22859.196184]  <IRQ>
      [22859.196250]  ? show_trace_log_lvl+0x1c4/0x2df
      [22859.196364]  ? show_trace_log_lvl+0x1c4/0x2df
      [22859.196425]  ? lnet_select_pathway+0x1b6/0x640 [lnet]
      [22859.196533]  ? watchdog_timer_fn+0x1ad/0x210
      [22859.196630]  ? __pfx_watchdog_timer_fn+0x10/0x10
      [22859.196699]  ? __hrtimer_run_queues+0x112/0x2b0
      [22859.196778]  ? hrtimer_interrupt+0xfc/0x210
      [22859.196838]  ? kvm_sched_clock_read+0xd/0x20
      [22859.196935]  ? __sysvec_apic_timer_interrupt+0x4e/0x100
      [22859.197015]  ? sysvec_apic_timer_interrupt+0x6d/0x90
      [22859.197076]  </IRQ>
      [22859.197134]  <TASK>
      [22859.197191]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
      [22859.197278]  ? lnet_peerni_by_nid_locked+0x21/0x140 [lnet]
      [22859.197387]  lnet_select_pathway+0x1b6/0x640 [lnet]
      [22859.197500]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
      [22859.197565]  lnet_send+0x6d/0x1e0 [lnet]
      [22859.197676]  lnet_peer_discovery_complete+0x21c/0x390 [lnet]
      [22859.197786]  lnet_peer_discovery+0x485/0xaf0 [lnet]
      [22859.197894]  ? __pfx_autoremove_wake_function+0x10/0x10
      [22859.197975]  ? __pfx_lnet_peer_discovery+0x10/0x10 [lnet]
      [22859.198081]  kthread+0xe0/0x100
      [22859.198160]  ? __pfx_kthread+0x10/0x10
      [22859.198224]  ret_from_fork+0x2c/0x50
      [22859.198299]  </TASK>
      [22859.198367] Kernel panic - not syncing: softlockup: hung tasks

      https://testing.whamcloud.com/test_sets/49c461ea-e4fd-40c9-8d23-cc9251bc3d47

      A somewhat similar soft lockup in 2.15:

      [ 5742.576727] Lustre: DEBUG MARKER: os[cp].lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
      [ 5742.865727] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lnet_discovery:101584]
      [ 5742.868982] Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) lnet_selftest(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev i2c_piix4 virtio_balloon pcspkr sunrpc ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel libata virtio_net serio_raw virtio_blk net_failover failover [last unloaded: libcfs]
      [ 5742.876797] CPU: 1 PID: 101584 Comm: lnet_discovery Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.58.1.el8_10.x86_64 #1
      [ 5742.878996] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 5742.880036] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0x6/0x16
      [ 5742.881310] Code: 51 52 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 cc cc cc cc 66 90 52 b8 01 00 00 00 <31> d2 f0 0f b0 17 3c 01 75 06 5a c3 cc cc cc cc 56 0f b6 f0 e8 bd
      [ 5742.884537] RSP: 0018:ffffbdde01137bd8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
      [ 5742.885894] RAX: 0000000000000001 RBX: ffff96a4937a1000 RCX: 000000006e27f00a
      [ 5742.887173] RDX: ffff96a4937a1020 RSI: ffff96a4937a1020 RDI: ffff96a4937a10b4
      [ 5742.888450] RBP: ffff96a4937a10b4 R08: ffff96a4937a1020 R09: ffff96a4937a1100
      [ 5742.889712] R10: 61c8864680b583eb R11: 0000000000000800 R12: ffff96a4937a1020
      [ 5742.890982] R13: ffff96a4b5323ae0 R14: ffff96a4b5323ad0 R15: ffff96a4937a1020
      [ 5742.892252] FS:  0000000000000000(0000) GS:ffff96a53fd00000(0000) knlGS:0000000000000000
      [ 5742.893694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5742.894751] CR2: 000055dfb181a5f8 CR3: 000000006fa10006 CR4: 00000000000606e0
      [ 5742.896023] Call Trace:
      [ 5742.896540]  <IRQ>
      [ 5742.896976]  ? watchdog_timer_fn.cold.10+0x46/0x9e
      [ 5742.897878]  ? watchdog+0x30/0x30
      [ 5742.898519]  ? __hrtimer_run_queues+0x101/0x280
      [ 5742.899397]  ? hrtimer_interrupt+0x100/0x220
      [ 5742.900196]  ? smp_apic_timer_interrupt+0x6a/0x130
      [ 5742.901090]  ? apic_timer_interrupt+0xf/0x20
      [ 5742.901894]  </IRQ>
      [ 5742.902332]  ? __raw_callee_save___pv_queued_spin_unlock+0x6/0x16
      [ 5742.903424]  lnet_peer_queue_message+0x6b/0x90 [lnet]
      [ 5742.904556]  lnet_initiate_peer_discovery+0x13b/0x2b0 [lnet]
      [ 5742.905642]  lnet_select_pathway+0x13f/0x1ab0 [lnet]
      [ 5742.906597]  ? ksocknal_launch_packet+0x2e1/0x510 [ksocklnd]
      [ 5742.907669]  ? _cond_resched+0x15/0x30
      [ 5742.908395]  ? apic_timer_interrupt+0xa/0x20
      [ 5742.909192]  ? apic_timer_interrupt+0xa/0x20
      [ 5742.909986]  ? apic_timer_interrupt+0xa/0x20
      [ 5742.910786]  lnet_send+0x6d/0x1e0 [lnet]
      [ 5742.911559]  lnet_peer_discovery_complete+0x21b/0x390 [lnet]
      [ 5742.912632]  lnet_peer_discovery+0x51f/0x1ba0 [lnet]
      [ 5742.913584]  ? finish_task_switch+0x86/0x2f0
      [ 5742.914406]  ? finish_wait+0x80/0x80
      [ 5742.915086]  ? lnet_peer_merge_data+0x10c0/0x10c0 [lnet]
      [ 5742.916094]  kthread+0x134/0x150
      [ 5742.916744]  ? set_kthread_struct+0x50/0x50
      [ 5742.917532]  ret_from_fork+0x35/0x40
      [ 5742.918230] Kernel panic - not syncing: softlockup: hung tasks
      [ 5742.919289] CPU: 1 PID: 101584 Comm: lnet_discovery Kdump: loaded Tainted: G           OEL    -------- -  - 4.18.0-553.58.1.el8_10.x86_64 #1
      [ 5742.921494] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 5742.922532] Call Trace:
      [ 5742.923043]  <IRQ>
      [ 5742.923479]  dump_stack+0x41/0x60
      [ 5742.924142]  panic+0xe7/0x2ac
      [ 5742.924744]  ? syscall_return_via_sysret+0x6e/0x94
      [ 5742.925637]  watchdog_timer_fn.cold.10+0x85/0x9e
      [ 5742.926507]  ? watchdog+0x30/0x30
      [ 5742.927152]  __hrtimer_run_queues+0x101/0x280
      [ 5742.927970]  hrtimer_interrupt+0x100/0x220
      [ 5742.928752]  smp_apic_timer_interrupt+0x6a/0x130
      [ 5742.929623]  apic_timer_interrupt+0xf/0x20
      [ 5742.930409]  </IRQ>
      [ 5742.930844] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0x6/0x16
      [ 5742.932046] Code: 51 52 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 cc cc cc cc 66 90 52 b8 01 00 00 00 <31> d2 f0 0f b0 17 3c 01 75 06 5a c3 cc cc cc cc 56 0f b6 f0 e8 bd

      https://testing.whamcloud.com/test_sets/be93cb17-bba4-4510-97a5-ada48573d38c
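
      To compare the two traces, the reliable (non-"?") lnet frames can be pulled out and intersected. The following is only a rough sketch and not part of the original report: the helper name `lnet_frames` and the regular expression are illustrative assumptions, and the sample lines are copied from the two call traces above.

```python
import re

# Matches a call-trace line such as
#   "[22859.197565]  lnet_send+0x6d/0x1e0 [lnet]"
# Frames prefixed with "?" are unreliable stack leftovers and do not match.
FRAME_RE = re.compile(
    r"^\s*\[\s*\d+\.\d+\]\s+"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"\s+\[(?P<mod>\w+)\]\s*$"
)

def lnet_frames(log_text, module="lnet"):
    """Return reliable frame symbols from `module`, innermost first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m and m.group("mod") == module:
            frames.append(m.group("sym"))
    return frames

# Excerpt of the master (2.16 interop) trace.
trace_master = """\
[22859.197278]  ? lnet_peerni_by_nid_locked+0x21/0x140 [lnet]
[22859.197387]  lnet_select_pathway+0x1b6/0x640 [lnet]
[22859.197565]  lnet_send+0x6d/0x1e0 [lnet]
[22859.197676]  lnet_peer_discovery_complete+0x21c/0x390 [lnet]
[22859.197786]  lnet_peer_discovery+0x485/0xaf0 [lnet]
"""

# Excerpt of the 2.15 trace.
trace_2_15 = """\
[ 5742.903424]  lnet_peer_queue_message+0x6b/0x90 [lnet]
[ 5742.904556]  lnet_initiate_peer_discovery+0x13b/0x2b0 [lnet]
[ 5742.905642]  lnet_select_pathway+0x13f/0x1ab0 [lnet]
[ 5742.906597]  ? ksocknal_launch_packet+0x2e1/0x510 [ksocklnd]
[ 5742.910786]  lnet_send+0x6d/0x1e0 [lnet]
[ 5742.911559]  lnet_peer_discovery_complete+0x21b/0x390 [lnet]
[ 5742.912632]  lnet_peer_discovery+0x51f/0x1ba0 [lnet]
"""

frames_master = lnet_frames(trace_master)
frames_2_15 = lnet_frames(trace_2_15)

# Frames present in both traces, in call order (innermost first).
common = [f for f in frames_master if f in frames_2_15]
print(common)
```

      On these excerpts both traces share the path lnet_peer_discovery -> lnet_peer_discovery_complete -> lnet_send -> lnet_select_pathway; they differ only in where lnet_select_pathway was executing when the watchdog fired.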

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Oleg Drokin (green)