[LU-15326] sanity-quota test 40 crashing in llog_cat_cancel_arr_rec() Created: 06/Dec/21 Updated: 06/Dec/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We’re seeing several sanity-quota tests crash with the following trace:

[29469.651591] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[29469.652967] PGD 0 P4D 0
[29469.653393] Oops: 0000 [#1] SMP PTI
[29469.653951] CPU: 0 PID: 603214 Comm: dist_txn-1 Kdump: loaded Tainted: P OE --------- - - 4.18.0-305.19.1.el8_lustre.x86_64 #1
[29469.655910] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[29469.656976] RIP: 0010:llog_cat_cancel_arr_rec+0xcb/0x460 [obdclass]
[29469.657976] Code: c3 48 b8 40 00 00 00 cb 02 00 00 c7 05 62 ad 0d 00 00 00 08 00 48 c7 c6 a8 3c 01 c1 48 c7 c7 20 44 06 c1 48 89 05 45 ad 0d 00 <49> 8b 84 24 b0 00 00 00 48 c7 05 42 ad 0d 00 00 00 00 00 48 c7 05
[29469.660898] RSP: 0018:ffffbb91053e3de8 EFLAGS: 00010202
[29469.661731] RAX: 000002cb00000040 RBX: 00000000fffffff7 RCX: ffff9cca74dcf880
[29469.662856] RDX: ffffbb91053e3de8 RSI: ffffffffc1013ca8 RDI: ffffffffc1064420
[29469.663978] RBP: ffff9cca8e182750 R08: ffff9cca74dcf89c R09: abcc77118461cefd
[29469.665101] R10: ffffbb91053e3e70 R11: ffff9cca64c0596e R12: 0000000000000000
[29469.666235] R13: ffff9cca74dcf880 R14: 0000000000000001 R15: ffff9cca74dcf89c
[29469.667360] FS: 0000000000000000(0000) GS:ffff9ccabfc00000(0000) knlGS:0000000000000000
[29469.668631] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[29469.669536] CR2: 00000000000000b0 CR3: 0000000016010006 CR4: 00000000001706f0
[29469.670666] Call Trace:
[29469.671103] llog_cat_cancel_records+0x61/0x190 [obdclass]
[29469.672388] distribute_txn_commit_thread+0x3f5/0xa80 [ptlrpc]
[29469.673385] ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc]
[29469.674453] kthread+0x116/0x130
[29469.674970] ? kthread_flush_work_fn+0x10/0x10
[29469.675680] ret_from_fork+0x35/0x40
[29469.676256] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) dm_mod intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic virtio_net ata_piix crc32c_intel libata serio_raw virtio_blk net_failover failover
[29469.685081] CR2: 00000000000000b0

We first saw this crash in replay-dual test_22d, for patch https://review.whamcloud.com/41424, on March 5, 2021.

The following crash may be related:

[76868.809616] BUG: unable to handle kernel paging request at 00000060000000f0
[76868.810911] PGD 0 P4D 0
[76868.811380] Oops: 0000 [#1] SMP PTI
[76868.811999] CPU: 1 PID: 3223309 Comm: dist_txn-2 Kdump: loaded Tainted: P OE --------- - - 4.18.0-305.19.1.el8_lustre.x86_64 #1
[76868.814127] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[76868.815154] RIP: 0010:rwsem_down_write_slowpath+0x3e9/0x610
[76868.816134] Code: f8 01 0f 85 78 fe ff ff 48 8b 7c 24 18 41 be 02 00 00 00 e8 e9 1f 81 00 e9 55 fd ff ff 48 89 c2 48 83 e2 fc 74 21 a8 01 75 1d <8b> 42 38 85 c0 0f 84 75 fc ff ff 8b 7a 3c e8 f4 c5 f2 ff 66 90 84
[76868.819295] RSP: 0018:ffffb4944b173ce0 EFLAGS: 00010246
[76868.820190] RAX: 00000060000000b8 RBX: ffff8ecca011e950 RCX: 0000000000000000
[76868.821399] RDX: 00000060000000b8 RSI: 0000000000000002 RDI: ffff8ecc7615ec00
[76868.822607] RBP: ffffb4944b173d80 R08: 0000000000000bf8 R09: ffffb4944b173a84
[76868.823818] R10: ffffb4944b173d40 R11: ffff8eccb80e1bf6 R12: 0000000000000000
[76868.825026] R13: ffffffff00000000 R14: ffff8ecc7615ecb8 R15: ffff8ecc7615ec00
[76868.826288] FS: 0000000000000000(0000) GS:ffff8eccffd00000(0000) knlGS:0000000000000000
[76868.827708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[76868.828717] CR2: 00000060000000f0 CR3: 0000000087410004 CR4: 00000000000606e0
[76868.829937] Call Trace:
[76868.830663] ? llog_init_handle+0x15d/0x9d0 [obdclass]
[76868.831577] ? llog_init_handle+0x15d/0x9d0 [obdclass]
[76868.832482] llog_cat_id2handle+0x3ce/0x6d0 [obdclass]
[76868.833398] llog_cat_cancel_arr_rec+0x59/0x460 [obdclass]
[76868.834365] llog_cat_cancel_records+0x61/0x190 [obdclass]
[76868.835743] distribute_txn_commit_thread+0x3f5/0xa80 [ptlrpc]
[76868.836814] ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc]
[76868.837989] kthread+0x116/0x130
[76868.838581] ? kthread_flush_work_fn+0x10/0x10
[76868.839378] ret_from_fork+0x35/0x40
[76868.840040] Modules linked in: nfsd nfs_acl obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel virtio_net libata serio_raw virtio_blk net_failover failover [last unloaded: llog_test]
[76868.850790] CR2: 00000060000000f0

August 6, 2021 - sanity-quota test_40d at https://testing.whamcloud.com/test_sets/ae2e9fb4-49cc-479b-a7a3-5d5651edcac7 |