Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.4
- Labels: None
- Environment: 3.10.0-1160.108.1.el7_lustre.pl1.x86_64
- Severity: 3
- Rank: 9223372036854775807
Description
After 47 days of uptime with 2.15.x, we hit an MDS kernel BUG on Oak last Friday evening.
Kernel: CentOS 7.9 (3.10.0-1160.108.1.el7 + old patch for LU-10709)
Our Lustre version is based on 2.15.4 plus the following patches:
$ git log --oneline
7e5cc33 LU-16771 llite: add statfs_project tunable
cf4313e LU-16345 ofd: ofd_commitrw_read() with non-existing object
70ada39 LU-7668 utils: add lctl del_ost
b57592e LU-15117 ofd: no lock for dt_bufs_get() in read path
158e06b LU-15117 ofd: don't take lock for dt_bufs_get()
594254b LU-16044 osd: discard pagecache in truncate's declaration
dbc74df LU-15880 quota: fix insane grant quota
7a7a5d8 LU-15694 quota: keep grace time while setting default limits
bc99828 LU-15880 quota: fix issues in reserving quota
02a0ce1 LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
cac870c New release 2.15.4
Note: this is also available at https://github.com/stanford-rc/lustre/commits/b2_15_4_stanford-rc/
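If it is useful to see only what we carry on top of upstream, the local patches can be listed with git log against the upstream 2.15.4 tag. A minimal sketch, assuming the upstream tag is named 2.15.4 in the clone (the branch name comes from the link above; the tag name is an assumption):

$ git clone https://github.com/stanford-rc/lustre.git && cd lustre
$ # commits on the local branch that are not in upstream 2.15.4
$ git log --oneline 2.15.4..origin/b2_15_4_stanford-rc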
MDS crash:
[3922902.137858] BUG: unable to handle kernel paging request at ffff9040df5e80c8
[3922902.145044] IP: [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
[3922902.152063] PGD dc58a5067 PUD 3ffb259063 PMD 3fe02f3063 PTE 8000003fdf5e8061
[3922902.159361] Oops: 0003 [#1] SMP
[3922902.162803] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) mpt2sas mptctl mptbase bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) sunrpc vfat fat dm_round_robin dm_service_time dm_multipath dm_mod dell_smbios dell_wmi_descriptor dcdbas kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr cdc_ether ses usbnet enclosure mii sg i2c_piix4 wmi ipmi_si ipmi_devintf ipmi_msghandler tpm_crb
[3922902.234941] acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt mlx5_core(OE) fb_sys_fops ttm mlxfw(OE) ahci mpt3sas(OE) bnxt_en tg3 libahci mlx_compat(OE) crct10dif_pclmul crct10dif_common raid_class drm crc32c_intel ptp libata scsi_transport_sas devlink drm_panel_orientation_quirks megaraid_sas pps_core
[3922902.269349] CPU: 11 PID: 33277 Comm: qmt_reba_oak-QM Kdump: loaded Tainted: G W OE ------------ 3.10.0-1160.108.1.el7_lustre.pl1.x86_64 #1
[3922902.282893] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.14.1 12/17/2023
[3922902.290710] task: ffff903882efc200 ti: ffff903885dbc000 task.ti: ffff903885dbc000
[3922902.298356] RIP: 0010:[<ffffffffc1615081>] [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
[3922902.307767] RSP: 0018:ffff903885dbfaf0 EFLAGS: 00010216
[3922902.313252] RAX: 00000000000020c0 RBX: ffff9010ea3ed290 RCX: ffff9040df5e6000
[3922902.320549] RDX: ffff903ec1d0e420 RSI: 000000000000020c RDI: 0000000000000400
[3922902.327854] RBP: ffff903885dbfb88 R08: 0000000000000001 R09: 0000000000000004
[3922902.335152] R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000001
[3922902.342449] R13: ffff9040e8e10488 R14: ffff9025ab640c00 R15: ffff901becc9d400
[3922902.349747] FS: 0000000000000000(0000) GS:ffff9020fecc0000(0000) knlGS:0000000000000000
[3922902.358007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3922902.363926] CR2: ffff9040df5e80c8 CR3: 0000003ff98fe000 CR4: 0000000000760ee0
[3922902.371230] PKRU: 00000000
[3922902.374108] Call Trace:
[3922902.376743] [<ffffffffc0a8903e>] ? ktime_get_real_seconds+0xe/0x20 [libcfs]
[3922902.384083] [<ffffffffc117553e>] ? at_measured+0x5e/0x370 [ptlrpc]
[3922902.390564] [<ffffffffc180019b>] mdt_lvbo_update+0xbb/0x140 [mdt]
[3922902.396925] [<ffffffffc1157eb7>] ? lustre_msg_set_transno+0x27/0xb0 [ptlrpc]
[3922902.404236] [<ffffffffc112dd42>] ldlm_cb_interpret+0x122/0x740 [ptlrpc]
[3922902.411118] [<ffffffffc1149898>] ptlrpc_check_set+0x3f8/0x2290 [ptlrpc]
[3922902.417998] [<ffffffffc114b94b>] ptlrpc_set_wait+0x21b/0x840 [ptlrpc]
[3922902.424698] [<ffffffff8fccc790>] ? wake_up_atomic_t+0x40/0x40
[3922902.430711] [<ffffffffc1105335>] ldlm_run_ast_work+0xd5/0x3e0 [ptlrpc]
[3922902.437508] [<ffffffffc1129518>] ldlm_glimpse_locks+0x38/0x110 [ptlrpc]
[3922902.444378] [<ffffffffc161357f>] qmt_glimpse_lock.isra.15+0x37f/0xa60 [lquota]
[3922902.451857] [<ffffffffc1610610>] ? qmt_dqacq+0x880/0x880 [lquota]
[3922902.458210] [<ffffffffc1614104>] qmt_reba_thread+0x4a4/0xa80 [lquota]
[3922902.464909] [<ffffffffc1613c60>] ? qmt_glimpse_lock.isra.15+0xa60/0xa60 [lquota]
[3922902.472559] [<ffffffff8fccb621>] kthread+0xd1/0xe0
[3922902.477603] [<ffffffff8fccb550>] ? insert_kthread_work+0x40/0x40
[3922902.483864] [<ffffffff903c51dd>] ret_from_fork_nospec_begin+0x7/0x21
[3922902.490474] [<ffffffff8fccb550>] ? insert_kthread_work+0x40/0x40
[3922902.496737] Code: 84 a9 0c 00 00 48 8b 40 18 48 8b 93 00 01 00 00 83 78 10 02 48 8b b8 08 01 00 00 0f 84 a9 04 00 00 48 8b 0a 48 63 c6 48 c1 e0 04 <80> 64 01 08 fb 48 8b 8b 00 01 00 00 48 8b 09 80 64 01 08 f7 48
[3922902.516965] RIP [<ffffffffc1615081>] qmt_lvbo_update+0x271/0xf00 [lquota]
[3922902.524029] RSP <ffff903885dbfaf0>
[3922902.527687] CR2: ffff9040df5e80c8
Attaching the full vmcore-dmesg.txt as vmcore-dmesg-oak-md1-s2_2024-04-26-20-06-13.txt. The vmcore itself is also available.
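In case it helps with triage, the faulting instruction can be mapped back to a source line from the vmcore with the crash utility. A minimal sketch, assuming the matching kernel-debuginfo and Lustre module debug symbols are installed (the vmlinux path below is illustrative):

$ crash /usr/lib/debug/lib/modules/3.10.0-1160.108.1.el7_lustre.pl1.x86_64/vmlinux vmcore
crash> bt                            # backtrace of the crashed task (qmt_reba_oak-QM)
crash> mod -s lquota                 # load debug symbols for the lquota module
crash> dis -l qmt_lvbo_update+0x271  # resolve the RIP offset to a source file and line
crash> kmem ffff9040df5e80c8         # show what the faulting address (CR2) maps to, if anything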
Any ideas or suggestions?
Thanks!
Stéphane
Attachments
Issue Links
- duplicates: LU-17034 memory corruption caused by bug in qmt_seed_glbe_all (Resolved)