Lustre / LU-18405

client crash: Kernel panic - not syncing: Hard LOCKUP

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Labels: None
    • Environment: 2.16.0-rc4
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Two clients (wr-es-31, wr-es-29) crashed during soak testing, after soak had been running for about 4 days.
      Both servers and clients are running version=2.16.0_RC4.

      The vmcores can be found at:

      wr-es-29: 127.0.0.1-2024-10-27-21:29:21
      wr-es-31: 127.0.0.1-2024-10-28-13:32:16
      
      [351552.564424] Lustre: 24975:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:172.25.80.53@tcp:/sfa18k03/fid: [0x28003b737:0x2f97:0x0]// may get corrupted (rc -108)
      [351552.564435] Lustre: 24975:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:172.25.80.53@tcp:/sfa18k03/fid: [0x28003b737:0x10a0:0x0]// may get corrupted (rc -108)
      [351612.548777] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [351612.674645] Lustre: sfa18k03-MDT0001-mdc-ffff94d2b0acc000: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp)
      [351612.677153] rcu:    30-...!: (1 GPs behind) idle=a5e/1/0x4000000000000002 softirq=25673874/25673874 fqs=24 
      [351630.520318] NMI watchdog: Watchdog detected hard LOCKUP on cpu 61Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sunrpc bridge stp llc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm_amd rdma_ucm(OE) kvm rdma_cm(OE) iw_cm(OE) irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ib_ipoib(OE) rapl pcspkr sp5100_tco ib_cm(OE) ccp ptdma k10temp i2c_piix4 acpi_ipmi ipmi_si ib_umad(OE) ipmi_devintf ipmi_msghandler acpi_cpufreq ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg crc32c_intel mlx5_core(OE) mlxfw(OE) pci_hyperv_intf tls ahci libahci psample mlxdevm(OE) igb libata i2c_algo_bit mlx_compat(OE) dca
      [351630.520379] CPU: 61 PID: 14 Comm: rcu_sched Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_9.x86_64 #1
      [351630.520380] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
      [351630.520381] RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520383] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 e9 a9 68 ba 00 8b 37 81 fe 00
      [351630.520385] RSP: 0018:ffffb57ac01efe60 EFLAGS: 00000002
      [351630.520386] RAX: 0000000000000101 RBX: 0000000000000246 RCX: dead000000000200
      [351630.520387] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffacbe5100
      [351630.520389] RBP: ffffffffacc2e120 R08: ffff95106df73bb8 R09: 0000000000000384
      [351630.520390] R10: 0000000000000001 R11: ffff95106df71dc4 R12: 0000000000000000
      [351630.520391] R13: 0000000000000000 R14: ffffffffaaf800f0 R15: ffffb57ac01efec8
      [351630.520392] FS:  0000000000000000(0000) GS:ffff95106df40000(0000) knlGS:0000000000000000
      [351630.520393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [351630.520394] CR2: 0000148ce0023008 CR3: 0000002167410002 CR4: 0000000000770ee0
      [351630.520395] PKRU: 55555554
      [351630.520396] Call Trace:
      [351630.520396]  <NMI>
      [351630.520397]  ? watchdog_overflow_callback.cold.7+0x1e/0x70
      [351630.520398]  ? __perf_event_overflow+0x52/0x100
      [351630.520399]  ? x86_pmu_handle_irq+0x12f/0x190
      [351630.520400]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520401]  ? __set_pte_vaddr+0x32/0x50
      [351630.520402]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520403]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520404]  ? ghes_copy_tofrom_phys+0xf9/0x250
      [351630.520405]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520406]  ? amd_pmu_handle_irq+0x46/0xc0
      [351630.520407]  ? perf_event_nmi_handler+0x2d/0x50
      [351630.520408]  ? nmi_handle+0x63/0x110
      [351630.520409]  ? default_do_nmi+0x49/0x110
      [351630.520410]  ? do_nmi+0x1af/0x220
      [351630.520411]  ? end_repeat_nmi+0x16/0x69
      [351630.520412]  ? rcu_exp_handler+0x70/0x70
      [351630.520413]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520414]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520415]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520416]  </NMI>
      [351630.520417]  _raw_spin_lock_irqsave+0x34/0x40
      [351630.520418]  force_qs_rnp+0x87/0x1d0
      [351630.520419]  rcu_gp_kthread+0x66e/0x8a0
      [351630.520420]  ? rcu_gp_cleanup+0x3b0/0x3b0
      [351630.520421]  kthread+0x134/0x150
      [351630.520422]  ? set_kthread_struct+0x50/0x50
      [351630.520423]  ret_from_fork+0x35/0x40
      [351630.520424] Kernel panic - not syncing: Hard LOCKUP
      [351630.520425] CPU: 61 PID: 14 Comm: rcu_sched Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_9.x86_64 #1
      [351630.520426] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
      [351630.520428] Call Trace:
      [351630.520428]  <NMI>
      [351630.520429]  dump_stack+0x41/0x60
      [351630.520430]  panic+0xe7/0x2ac
      [351630.520431]  ? __switch_to_asm+0x11/0x80
      [351630.520432]  nmi_panic.cold.11+0xc/0xc
      [351630.520433]  watchdog_overflow_callback.cold.7+0x5c/0x70
      [351630.520434]  __perf_event_overflow+0x52/0x100
      [351630.520435]  x86_pmu_handle_irq+0x12f/0x190
      [351630.520436]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520437]  ? __set_pte_vaddr+0x32/0x50
      [351630.520438]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520439]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520440]  ? ghes_copy_tofrom_phys+0xf9/0x250
      [351630.520441]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351630.520442]  amd_pmu_handle_irq+0x46/0xc0
      [351630.520443]  perf_event_nmi_handler+0x2d/0x50
      [351630.520444]  nmi_handle+0x63/0x110
      [351630.520445]  default_do_nmi+0x49/0x110
      [351630.520446]  do_nmi+0x1af/0x220
      [351630.520447]  end_repeat_nmi+0x16/0x69
      [351630.520448] RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520449] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 e9 a9 68 ba 00 8b 37 81 fe 00
      [351630.520451] RSP: 0018:ffffb57ac01efe60 EFLAGS: 00000002
      [351630.520453] RAX: 0000000000000101 RBX: 0000000000000246 RCX: dead000000000200
      [351630.520454] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffacbe5100
      [351630.520455] RBP: ffffffffacc2e120 R08: ffff95106df73bb8 R09: 0000000000000384
      [351630.520456] R10: 0000000000000001 R11: ffff95106df71dc4 R12: 0000000000000000
      [351630.520457] R13: 0000000000000000 R14: ffffffffaaf800f0 R15: ffffb57ac01efec8
      [351630.520458]  ? rcu_exp_handler+0x70/0x70
      [351630.520459]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520460]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
      [351630.520461]  </NMI>
      [351630.520462]  _raw_spin_lock_irqsave+0x34/0x40
      [351630.520463]  force_qs_rnp+0x87/0x1d0
      [351630.520464]  rcu_gp_kthread+0x66e/0x8a0
      [351630.520465]  ? rcu_gp_cleanup+0x3b0/0x3b0
      [351630.520466]  kthread+0x134/0x150
      [351630.520467]  ? set_kthread_struct+0x50/0x50
      [351630.520468]  ret_from_fork+0x35/0x40
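
      The trace above can be reduced to its essentials with a small script. The following is a minimal sketch, not part of the original report: `summarize_lockup`, `LOCKUP_RE`, `FRAME_RE` and `SAMPLE` are hypothetical names, and the sample input is an abridged copy of the console log above. It reports the CPU named by the NMI watchdog and the reliable (non-`?`) frames that follow the first `</NMI>` marker, which is the stack of the task the NMI interrupted.

```python
import re

# "Watchdog detected hard LOCKUP on cpu 61" (the module list may follow directly).
LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
# A call-trace frame: "[ts]  symbol+0xOFF/0xLEN", optionally prefixed with "? ".
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockup(lines):
    """Return (cpu, frames) for the first hard-LOCKUP report in `lines`.

    `frames` lists the reliable frames printed after the first </NMI>
    marker, i.e. the stack of the task that the NMI interrupted.
    """
    cpu = None
    frames = []
    in_task_stack = False
    for raw in lines:
        line = raw.strip()
        if cpu is None:
            match = LOCKUP_RE.search(line)
            if match:
                cpu = int(match.group(1))
            continue
        if not in_task_stack:
            # Skip the NMI-context frames until the closing marker.
            in_task_stack = "</NMI>" in line
            continue
        frame = FRAME_RE.match(line)
        if frame is None:
            break  # first non-frame line ends this trace
        if frame.group(1) is None:  # drop unreliable "?" frames
            frames.append(frame.group(2))
    return cpu, frames


# Abridged excerpt of the console log in this ticket.
SAMPLE = """\
[351630.520318] NMI watchdog: Watchdog detected hard LOCKUP on cpu 61Modules linked in: mgc(OE) lustre(OE)
[351630.520396] Call Trace:
[351630.520396]  <NMI>
[351630.520397]  ? watchdog_overflow_callback.cold.7+0x1e/0x70
[351630.520415]  ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520416]  </NMI>
[351630.520417]  _raw_spin_lock_irqsave+0x34/0x40
[351630.520418]  force_qs_rnp+0x87/0x1d0
[351630.520419]  rcu_gp_kthread+0x66e/0x8a0
[351630.520420]  ? rcu_gp_cleanup+0x3b0/0x3b0
[351630.520421]  kthread+0x134/0x150
[351630.520422]  ? set_kthread_struct+0x50/0x50
[351630.520423]  ret_from_fork+0x35/0x40
[351630.520424] Kernel panic - not syncing: Hard LOCKUP
""".splitlines()

if __name__ == "__main__":
    cpu, frames = summarize_lockup(SAMPLE)
    print(f"hard LOCKUP on cpu {cpu}: " + " <- ".join(frames))
```

      On the excerpt this yields CPU 61 with `_raw_spin_lock_irqsave`, `force_qs_rnp`, `rcu_gp_kthread`, `kthread`, `ret_from_fork`, matching the rcu_sched kthread spinning on a lock taken from `force_qs_rnp`. It only shows where CPU 61 was stuck; identifying the lock holder would still need the vmcore.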
      

          People

            Assignee: Yang Sheng (ys)
            Reporter: Sarah Liu (sarah)