[LU-15503] Crash in qsd_upd_thread trying to print a debug message. Created: 29/Jan/22  Updated: 26/Feb/22  Resolved: 07/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Critical
Reporter: Oleg Drokin
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15283 The quota reint thread maybe dead loc... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Maloo is hitting this crash:

[125926.340052] general protection fault: 0000 [#1] SMP PTI
[125926.341277] CPU: 1 PID: 825968 Comm: lquota_wb_lustr Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
[125926.343699] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[125926.372056] RIP: 0010:string_nocheck+0x12/0x70
[125926.373076] Code: 00 00 4c 89 e2 be 20 00 00 00 48 89 ef e8 86 93 00 00 4c 01 e3 eb 81 90 49 89 f2 48 89 ce 48 89 f8 48 c1 fe 30 66 85 f6 74 4f <44> 0f b6 0a 45 84 c9 74 46 83 ee 01 41 b8 01 00 00 00 48 8d 7c 37
[125926.376578] RSP: 0018:ffffab2105dd3cb8 EFLAGS: 00010286
[125926.377625] RAX: ffff9afe29483d9f RBX: ffff9afe29484000 RCX: ffff0a00ffffff04
[125926.379021] RDX: 247c894800000028 RSI: ffffffffffffffff RDI: ffff9afe29483d9f
[125926.380429] RBP: 247c894800000028 R08: 0000000000000055 R09: 0000000000000001
[125926.381821] R10: ffff9afe29484000 R11: ffff9afe29483d4f R12: ffff0a00ffffff04
[125926.383218] R13: ffffffffc159a59a R14: 0000000000000261 R15: ffffffffc159a59a
[125926.384612] FS:  0000000000000000(0000) GS:ffff9afebfd00000(0000) knlGS:0000000000000000
[125926.386200] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[125926.387345] CR2: 00007f172b497000 CR3: 000000009ac0a003 CR4: 00000000001606e0
[125926.388746] Call Trace:
[125926.394666]  string+0x40/0x50
[125926.403630]  vsnprintf+0x33c/0x520
[125926.404461]  libcfs_debug_msg+0x83d/0xb00 [libcfs]
[125926.412242]  ? try_to_del_timer_sync+0x4d/0x80
[125926.413177]  ? __next_timer_interrupt+0xf0/0xf0
[125926.414185]  ? qsd_upd_thread+0x86e/0xd20 [lquota]
[125926.415176]  qsd_upd_thread+0x86e/0xd20 [lquota]
[125926.416136]  ? qsd_upd_add+0x100/0x100 [lquota]
[125926.417086]  kthread+0x112/0x130
[125926.417784]  ? kthread_flush_work_fn+0x10/0x10
[125926.418703]  ret_from_fork+0x35/0x40
[125926.419472] Modules linked in: dm_flakey nfsd nfs_acl obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey] 

There is only one print in that function, so I can only assume that list_entry returns garbage:

                        if (count % 7 == 0) {
                                n = list_entry(&queue, struct qsd_upd_rec,
                                               qur_link);
                                CWARN("%s: The reintegration thread [%d] "
                                      "blocked more than %ld seconds\n",
                                      n->qur_qqi->qqi_qsd->qsd_svname,
                                      n->qur_qqi->qqi_qtype, count *
                                      cfs_time_seconds(QSD_WB_INTERVAL) / 10);
                        } 
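
For illustration only, here is a minimal userspace sketch of why applying list_entry to the list head itself, as the snippet above does with &queue, cannot yield a valid record. The list helpers are simplified stand-ins for the kernel ones, and struct demo_rec and demo_list_entry() are hypothetical names, not the real struct qsd_upd_rec or the code from the merged patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's list helpers (simplified). */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical record type; NOT the real struct qsd_upd_rec layout. */
struct demo_rec {
	struct list_head link;
	int              qtype;
};

/* Returns 1 when both observations below hold. */
static int demo_list_entry(void)
{
	struct list_head queue;
	struct demo_rec rec = { .qtype = 2 };
	struct demo_rec *wrong, *right;

	list_init(&queue);
	list_add_tail(&rec.link, &queue);
	assert(queue.next == &rec.link);

	/* Pattern from the report: the head itself is treated as an entry. */
	wrong = list_entry(&queue, struct demo_rec, link);
	/* Probable intent: the first element linked after the head. */
	right = list_entry(queue.next, struct demo_rec, link);

	/* Compare addresses only; 'wrong' is never dereferenced. */
	return (void *)wrong == (void *)&queue && right == &rec &&
	       right->qtype == 2;
}
```

In this sketch the "wrong" pointer aliases the head, which is not embedded in any record, so any field read through it would come from unrelated memory. In kernel code the second form is usually written as list_first_entry(&queue, ...); whether the merged patch does exactly that is not shown in this ticket.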

Example reports:

https://testing.whamcloud.com/test_sets/785c0e7b-cd04-422a-8bc3-9eaacc47d4b0

https://testing.whamcloud.com/test_sets/43f81877-2c6c-411a-990a-911905b85a7f

https://testing.whamcloud.com/test_sets/44640986-5ef4-48cc-a468-beefa26fcd3a

 

So far this has been observed only in RHEL8 testing.



 Comments   
Comment by Gerrit Updater [ 29/Jan/22 ]

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46380
Subject: LU-15503 quota: fix list entry usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 45dfceff5f472f65cd85bb07f61ee0c50ada0fec

Comment by Gerrit Updater [ 07/Feb/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46380/
Subject: LU-15503 quota: fix list entry usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a501deaf3930e9999ebbf476008a08ac6d5da1ec

Comment by Peter Jones [ 07/Feb/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:18:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.