[LU-14434] parallel-scale-nfsv4 test compilebench crashes in qmt_id_lock_cb Created: 15/Feb/21  Updated: 15/Jun/23  Resolved: 15/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Sergey Cheremencev
Resolution: Duplicate Votes: 0
Labels: quota

Issue Links:
Related
is related to LU-16772 Protect lqe_glbl_data in qmt_site_rec... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We’ve seen this crash only twice, both times in parallel-scale-nfsv4 test compilebench:
2021-01-22: x86_64 clients - https://testing.whamcloud.com/test_sets/68ed07f9-eb1d-459c-b327-269bd996d449
2021-02-11: ARM clients - https://testing.whamcloud.com/test_sets/6aae8467-c5e3-4547-aefa-04f220cf4042

Looking at the first failure above, we see the following in the kernel crash log:

[47446.646161] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 19:11:27 (1611342687)
[47447.194275] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.1394887 -i 2         -r 2 --makej
[47447.620295] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.1394887 -i 2 -r 2 --makej
[48153.651976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[48153.653435] PGD 0 P4D 0 
[48153.653864] Oops: 0000 [#1] SMP PTI
[48153.654527] CPU: 0 PID: 1485996 Comm: qmt_reba_lustre Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[48153.656584] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[48153.657600] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]
[48153.658462] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 74 89 c1 48 8b 42 18 83 78 10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33
[48153.661475] RSP: 0018:ffffbf43c0c5bde8 EFLAGS: 00010246
[48153.662317] RAX: 0000000000000000 RBX: ffff9fbe4b55e000 RCX: 0000000000000000
[48153.663454] RDX: ffff9fbe71e8f7a0 RSI: 0000000000000000 RDI: ffff9fbe47c2e862
[48153.664587] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
[48153.665717] R10: 0000000000000010 R11: f000000000000000 R12: ffff9fbe4b55e000
[48153.666855] R13: ffff9fbe3133be60 R14: ffff9fbe4ebacb98 R15: ffff9fbe4ebacb40
[48153.667999] FS:  0000000000000000(0000) GS:ffff9fbe7fc00000(0000) knlGS:0000000000000000
[48153.669283] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[48153.670208] CR2: 0000000000000000 CR3: 0000000059c0a002 CR4: 00000000000606f0
[48153.671348] Call Trace:
[48153.671787]  ? cfs_cdebug_show.part.2.constprop.22+0x20/0x20 [lquota]
[48153.672831]  qmt_glimpse_lock.isra.19+0x27e/0xfb0 [lquota]
[48153.673726]  qmt_reba_thread+0x5da/0x9b0 [lquota]
[48153.674503]  ? qmt_glimpse_lock.isra.19+0xfb0/0xfb0 [lquota]
[48153.675454]  kthread+0x112/0x130
[48153.676009]  ? kthread_flush_work_fn+0x10/0x10
[48153.676745]  ret_from_fork+0x35/0x40
[48153.677349] Modules linked in: nfsd nfs_acl lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic 8139too ata_piix crc32c_intel libata serio_raw 8139cp mii virtio_blk [last unloaded: dm_flakey]
[48153.688461] CR2: 0000000000000000


 Comments   
Comment by Sergey Cheremencev [ 18/Apr/23 ]

+1 on master https://testing.whamcloud.com/test_sessions/b92d9d88-7878-447a-8d28-0f7e6e0a8c9b

This problem regularly appears during testing of https://review.whamcloud.com/c/fs/lustre-release/+/49239

I've added "+quota+trace" debug to replay-single.sh to gather more logs and will move on.
If that does not help and the problem disappears, I am going to add a fix with an assertion in qmt_revalidate_lqes:

diff --git a/lustre/quota/qmt_entry.c b/lustre/quota/qmt_entry.c
index de3c891a1b..d776efa3ab 100644
--- a/lustre/quota/qmt_entry.c
+++ b/lustre/quota/qmt_entry.c
@@ -927,6 +927,7 @@ void qmt_revalidate_lqes(const struct lu_env *env,
                return;
        }
 
+       LASSERT(lqe_gl->lqe_glbl_data);
        mutex_lock(&lqe_gl->lqe_glbl_data_lock);
        if (lqe_gl->lqe_glbl_data)
                qmt_seed_glbe(env, lqe_gl->lqe_glbl_data);
 

I suspect the issue originates here.

Comment by Sergey Cheremencev [ 13/Jun/23 ]

https://review.whamcloud.com/c/fs/lustre-release/+/50748 (patch set 8 and later) should fix the current failure.

Comment by Peter Jones [ 15/Jun/23 ]

Duplicate of LU-16772

Generated at Sat Feb 10 03:09:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.