Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.15.0
- None
- 3
Description
maloo is hitting this crash:
[125926.340052] general protection fault: 0000 [#1] SMP PTI
[125926.341277] CPU: 1 PID: 825968 Comm: lquota_wb_lustr Kdump: loaded Tainted: G W OE --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1
[125926.343699] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[125926.372056] RIP: 0010:string_nocheck+0x12/0x70
[125926.373076] Code: 00 00 4c 89 e2 be 20 00 00 00 48 89 ef e8 86 93 00 00 4c 01 e3 eb 81 90 49 89 f2 48 89 ce 48 89 f8 48 c1 fe 30 66 85 f6 74 4f <44> 0f b6 0a 45 84 c9 74 46 83 ee 01 41 b8 01 00 00 00 48 8d 7c 37
[125926.376578] RSP: 0018:ffffab2105dd3cb8 EFLAGS: 00010286
[125926.377625] RAX: ffff9afe29483d9f RBX: ffff9afe29484000 RCX: ffff0a00ffffff04
[125926.379021] RDX: 247c894800000028 RSI: ffffffffffffffff RDI: ffff9afe29483d9f
[125926.380429] RBP: 247c894800000028 R08: 0000000000000055 R09: 0000000000000001
[125926.381821] R10: ffff9afe29484000 R11: ffff9afe29483d4f R12: ffff0a00ffffff04
[125926.383218] R13: ffffffffc159a59a R14: 0000000000000261 R15: ffffffffc159a59a
[125926.384612] FS: 0000000000000000(0000) GS:ffff9afebfd00000(0000) knlGS:0000000000000000
[125926.386200] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[125926.387345] CR2: 00007f172b497000 CR3: 000000009ac0a003 CR4: 00000000001606e0
[125926.388746] Call Trace:
[125926.394666] string+0x40/0x50
[125926.403630] vsnprintf+0x33c/0x520
[125926.404461] libcfs_debug_msg+0x83d/0xb00 [libcfs]
[125926.412242] ? try_to_del_timer_sync+0x4d/0x80
[125926.413177] ? __next_timer_interrupt+0xf0/0xf0
[125926.414185] ? qsd_upd_thread+0x86e/0xd20 [lquota]
[125926.415176] qsd_upd_thread+0x86e/0xd20 [lquota]
[125926.416136] ? qsd_upd_add+0x100/0x100 [lquota]
[125926.417086] kthread+0x112/0x130
[125926.417784] ? kthread_flush_work_fn+0x10/0x10
[125926.418703] ret_from_fork+0x35/0x40
[125926.419472] Modules linked in: dm_flakey nfsd nfs_acl obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey]
There is only a single print in that function, so I can only assume that list_entry returns garbage:
if (count % 7 == 0) {
	n = list_entry(&queue, struct qsd_upd_rec, qur_link);
	CWARN("%s: The reintegration thread [%d] "
	      "blocked more than %ld seconds\n",
	      n->qur_qqi->qqi_qsd->qsd_svname,
	      n->qur_qqi->qqi_qtype,
	      count * cfs_time_seconds(QSD_WB_INTERVAL) / 10);
}
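To illustrate the suspicion about list_entry: the macro subtracts the member offset from whatever pointer it is given, so passing the address of the list head itself (rather than head.next) produces a pointer into the memory in front of the head, not a queued record. The following is a minimal userspace sketch under stated assumptions: the list helpers are simplified re-implementations of the kernel ones, and struct rec, entry_from_head(), first_record() and demo() are hypothetical names standing in for struct qsd_upd_rec and the lookup in qsd_upd_thread. It is not the patch that resolved this ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

#define list_entry(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical record type standing in for struct qsd_upd_rec. */
struct rec {
	const char *name;
	struct list_head link;
};

/*
 * Pattern from the report: the list head is treated as if it were
 * embedded in a record. The result is never a queued record and must
 * not be dereferenced.
 */
struct rec *entry_from_head(struct list_head *queue)
{
	return list_entry(queue, struct rec, link);
}

/* Safer lookup: check for an empty list, then follow head->next. */
struct rec *first_record(struct list_head *queue)
{
	if (list_empty(queue))
		return NULL;
	return list_first_entry(queue, struct rec, link);
}

/* Builds a one-element queue and compares both lookups. Returns 0. */
int demo(void)
{
	struct list_head queue;
	struct rec r = { "example-target", { NULL, NULL } };

	init_list_head(&queue);
	list_add_tail(&r.link, &queue);

	assert(first_record(&queue) == &r);    /* the real record */
	assert(entry_from_head(&queue) != &r); /* bogus pointer   */
	return 0;
}
```

In the quoted snippet, dereferencing such a pointer through n->qur_qqi->qqi_qsd->qsd_svname would hand an arbitrary address to the "%s" conversion, which would be consistent with the fault in string_nocheck shown in the trace above.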
Example reports:
https://testing.whamcloud.com/test_sets/785c0e7b-cd04-422a-8bc3-9eaacc47d4b0
https://testing.whamcloud.com/test_sets/43f81877-2c6c-411a-990a-911905b85a7f
https://testing.whamcloud.com/test_sets/44640986-5ef4-48cc-a468-beefa26fcd3a
So far this has been observed only in RHEL8 testing.
Issue Links
- is related to LU-15283 The quota reint thread maybe dead lock with lquota_wb thread (Resolved)