Lustre / LU-17790

BUG: unable to handle kernel paging request IP: qmt_lvbo_update [lquota]

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.15.4
    • None
    • 3.10.0-1160.108.1.el7_lustre.pl1.x86_64
    • 3

    Description

      After 47 days of uptime with 2.15.x, we hit an MDS kernel BUG on Oak last Friday evening.

      Kernel: CentOS 7.9 (3.10.0-1160.108.1.el7 + old patch for LU-10709)

      Our Lustre version is based on 2.15.4 plus the following patches:

      $ git log --oneline
      7e5cc33 LU-16771 llite: add statfs_project tunable
      cf4313e LU-16345 ofd: ofd_commitrw_read() with non-existing object
      70ada39 LU-7668 utils: add lctl del_ost
      b57592e LU-15117 ofd: no lock for dt_bufs_get() in read path
      158e06b LU-15117 ofd: don't take lock for dt_bufs_get()
      594254b LU-16044 osd: discard pagecache in truncate's declaration
      dbc74df LU-15880 quota: fix insane grant quota
      7a7a5d8 LU-15694 quota: keep grace time while setting default limits
      bc99828 LU-15880 quota: fix issues in reserving quota
      02a0ce1 LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
      cac870c New release 2.15.4
      

      Note: this is also available at https://github.com/stanford-rc/lustre/commits/b2_15_4_stanford-rc/

      MDS crash:

      [3922902.137858] BUG: unable to handle kernel paging request at ffff9040df5e80c8
      [3922902.145044] IP: [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
      [3922902.152063] PGD dc58a5067 PUD 3ffb259063 PMD 3fe02f3063 PTE 8000003fdf5e8061
      [3922902.159361] Oops: 0003 [#1] SMP 
      [3922902.162803] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) mpt2sas mptctl mptbase bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) sunrpc vfat fat dm_round_robin dm_service_time dm_multipath dm_mod dell_smbios dell_wmi_descriptor dcdbas kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr cdc_ether ses usbnet enclosure mii sg i2c_piix4 wmi ipmi_si ipmi_devintf ipmi_msghandler tpm_crb
      [3922902.234941]  acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt mlx5_core(OE) fb_sys_fops ttm mlxfw(OE) ahci mpt3sas(OE) bnxt_en tg3 libahci mlx_compat(OE) crct10dif_pclmul crct10dif_common raid_class drm crc32c_intel ptp libata scsi_transport_sas devlink drm_panel_orientation_quirks megaraid_sas pps_core
      [3922902.269349] CPU: 11 PID: 33277 Comm: qmt_reba_oak-QM Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-1160.108.1.el7_lustre.pl1.x86_64 #1
      [3922902.282893] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.14.1 12/17/2023
      [3922902.290710] task: ffff903882efc200 ti: ffff903885dbc000 task.ti: ffff903885dbc000
      [3922902.298356] RIP: 0010:[<ffffffffc1615081>]  [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
      [3922902.307767] RSP: 0018:ffff903885dbfaf0  EFLAGS: 00010216
      [3922902.313252] RAX: 00000000000020c0 RBX: ffff9010ea3ed290 RCX: ffff9040df5e6000
      [3922902.320549] RDX: ffff903ec1d0e420 RSI: 000000000000020c RDI: 0000000000000400
      [3922902.327854] RBP: ffff903885dbfb88 R08: 0000000000000001 R09: 0000000000000004
      [3922902.335152] R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000001
      [3922902.342449] R13: ffff9040e8e10488 R14: ffff9025ab640c00 R15: ffff901becc9d400
      [3922902.349747] FS:  0000000000000000(0000) GS:ffff9020fecc0000(0000) knlGS:0000000000000000
      [3922902.358007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3922902.363926] CR2: ffff9040df5e80c8 CR3: 0000003ff98fe000 CR4: 0000000000760ee0
      [3922902.371230] PKRU: 00000000
      [3922902.374108] Call Trace:
      [3922902.376743]  [<ffffffffc0a8903e>] ? ktime_get_real_seconds+0xe/0x20 [libcfs]
      [3922902.384083]  [<ffffffffc117553e>] ? at_measured+0x5e/0x370 [ptlrpc]
      [3922902.390564]  [<ffffffffc180019b>] mdt_lvbo_update+0xbb/0x140 [mdt]
      [3922902.396925]  [<ffffffffc1157eb7>] ? lustre_msg_set_transno+0x27/0xb0 [ptlrpc]
      [3922902.404236]  [<ffffffffc112dd42>] ldlm_cb_interpret+0x122/0x740 [ptlrpc]
      [3922902.411118]  [<ffffffffc1149898>] ptlrpc_check_set+0x3f8/0x2290 [ptlrpc]
      [3922902.417998]  [<ffffffffc114b94b>] ptlrpc_set_wait+0x21b/0x840 [ptlrpc]
      [3922902.424698]  [<ffffffff8fccc790>] ? wake_up_atomic_t+0x40/0x40
      [3922902.430711]  [<ffffffffc1105335>] ldlm_run_ast_work+0xd5/0x3e0 [ptlrpc]
      [3922902.437508]  [<ffffffffc1129518>] ldlm_glimpse_locks+0x38/0x110 [ptlrpc]
      [3922902.444378]  [<ffffffffc161357f>] qmt_glimpse_lock.isra.15+0x37f/0xa60 [lquota]
      [3922902.451857]  [<ffffffffc1610610>] ? qmt_dqacq+0x880/0x880 [lquota]
      [3922902.458210]  [<ffffffffc1614104>] qmt_reba_thread+0x4a4/0xa80 [lquota]
      [3922902.464909]  [<ffffffffc1613c60>] ? qmt_glimpse_lock.isra.15+0xa60/0xa60 [lquota]
      [3922902.472559]  [<ffffffff8fccb621>] kthread+0xd1/0xe0
      [3922902.477603]  [<ffffffff8fccb550>] ? insert_kthread_work+0x40/0x40
      [3922902.483864]  [<ffffffff903c51dd>] ret_from_fork_nospec_begin+0x7/0x21
      [3922902.490474]  [<ffffffff8fccb550>] ? insert_kthread_work+0x40/0x40
      [3922902.496737] Code: 84 a9 0c 00 00 48 8b 40 18 48 8b 93 00 01 00 00 83 78 10 02 48 8b b8 08 01 00 00 0f 84 a9 04 00 00 48 8b 0a 48 63 c6 48 c1 e0 04 <80> 64 01 08 fb 48 8b 8b 00 01 00 00 48 8b 09 80 64 01 08 f7 48 
      [3922902.516965] RIP  [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
      [3922902.524029]  RSP <ffff903885dbfaf0>
      [3922902.527687] CR2: ffff9040df5e80c8
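
      A quick sanity check of the faulting access, offered as a sketch rather than a diagnosis. Assuming the bytes around the <80> marker in the Code: line decode as movslq %esi,%rax; shl $0x4,%rax; andb $0xfb,0x8(%rcx,%rax,1) (a reading of the opcodes that has not been checked against the lquota source), the store targets RCX + (RSI << 4) + 8. The Python below compares that with CR2 and decodes the Oops error code and the PTE flags using the standard x86-64 bit positions. All constant and function names are illustrative only.

```python
# Register and page-table values copied from the Oops output above.
RCX = 0xFFFF9040DF5E6000  # base pointer used by the faulting instruction
RAX = 0x00000000000020C0  # scaled index
RSI = 0x000000000000020C  # unscaled index
CR2 = 0xFFFF9040DF5E80C8  # faulting virtual address
PTE = 0x8000003FDF5E8061  # page-table entry printed for CR2
OOPS_ERROR_CODE = 0x0003  # from "Oops: 0003"

# Assumption: operands as decoded by hand from the Code: bytes
# (shl $0x4 and an 8-byte displacement); not confirmed from source.
ENTRY_SHIFT = 4
FIELD_OFFSET = 8


def decode_error_code(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 0x1),  # 1: page was present
        "write": bool(code & 0x2),                 # 1: access was a write
        "user_mode": bool(code & 0x4),             # 1: fault came from user mode
    }


def decode_pte(pte):
    """Decode a few flag bits of an x86-64 page-table entry."""
    return {
        "present": bool(pte & (1 << 0)),
        "writable": bool(pte & (1 << 1)),
        "user": bool(pte & (1 << 2)),
        "accessed": bool(pte & (1 << 5)),
        "dirty": bool(pte & (1 << 6)),
        "no_execute": bool(pte & (1 << 63)),
    }


scaled_index = RSI << ENTRY_SHIFT
fault_addr = RCX + scaled_index + FIELD_OFFSET
offset_from_base = CR2 - RCX
error_flags = decode_error_code(OOPS_ERROR_CODE)
pte_flags = decode_pte(PTE)

if __name__ == "__main__":
    print("scaled index matches RAX:", scaled_index == RAX)
    print("computed address matches CR2:", fault_addr == CR2)
    print("offset from RCX:", hex(offset_from_base))
    print("error code:", error_flags)
    print("PTE flags:", pte_flags)
```

      If the decoding holds, the store lands 0x20c8 bytes past the base in RCX, on a page whose PTE is present but not writable. That would be consistent with an index (0x20c) running past the end of an array of 16-byte entries, but it remains a hypothesis until checked against the vmcore.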
      

      The full vmcore-dmesg.txt is attached as vmcore-dmesg-oak-md1-s2_2024-04-26-20-06-13.txt. A vmcore is also available.

      Any ideas or suggestions?
      Thanks!
      Stéphane

            People

              wc-triage WC Triage
              sthiell Stephane Thiell