[LU-14964] recovery-small: GPF in llog_exist after tests finished Created: 25/Aug/21 Updated: 05/May/22 Resolved: 01/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There is a relatively new crash being observed in Maloo testing on RHEL8 for the past several days, during cleanup of recovery-small in review-dne-part-5:

[ 753.814277] Lustre: DEBUG MARKER: == recovery-small test complete, duration 6075 sec ======= 10:32:41 (1629887561)
[ 783.518588] general protection fault: 0000 [#1] SMP PTI
[ 783.519513] CPU: 0 PID: 3045 Comm: mdt_rdpg00_000 Kdump: loaded Tainted: G OE --------- - - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[ 783.521414] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 783.522494] RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
[ 783.523265] Code: c7 05 7f 0f 0c 00 01 00 00 00 e8 a2 53 ee ff 5b c3 48 85 ff 0f 84 aa 00 00 00 48 8b 87 08 01 00 00 48 85 c0 0f 84 9a 00 00 00 <48> 8b 40 50 48 85 c0 74 53 48 89 df e8 b6 b3 77 fb f6 05 2b f6 f0
[ 783.526007] RSP: 0018:ffffae2500ea3ae8 EFLAGS: 00010206
[ 783.526776] RAX: 5a5a5a5a5a5a5a5a RBX: ffff9f95b14bd000 RCX: 0000000000000000
[ 783.527839] RDX: 0000000000000ba5 RSI: 0000000000000000 RDI: ffff9f959e040900
[ 783.528890] RBP: ffff9f959091f0d0 R08: 000000d823bc83f1 R09: 0000000000000bc0
[ 783.530782] R10: ffffae2500ea3ae8 R11: ffff9f95a7d08b6c R12: ffff9f95b04c2080
[ 783.532083] R13: ffff9f95b10d7ec0 R14: ffff9f959091f0d0 R15: ffff9f959132c000
[ 783.533175] FS: 0000000000000000(0000) GS:ffff9f95bfc00000(0000) knlGS:0000000000000000
[ 783.534423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 783.535314] CR2: 00007fd080ad6000 CR3: 000000008960a005 CR4: 00000000003606f0
[ 783.536412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 783.537505] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 783.538587] Call Trace:
[ 783.539040] llog_cat_prep_log+0x4f/0x3c0 [obdclass]
[ 783.539833] llog_cat_declare_add_rec+0x56/0x220 [obdclass]
[ 783.540700] llog_declare_add+0x187/0x1d0 [obdclass]
[ 783.541925] top_trans_start+0x212/0x940 [ptlrpc]
[ 783.542820] mdd_attr_set+0x657/0xfe0 [mdd]
[ 783.543538] ? panic_notifier+0x20/0x20 [libcfs]
[ 783.544400] mdt_mfd_close+0x56c/0x8c0 [mdt]
[ 783.545087] mdt_close_internal+0xc4/0x240 [mdt]
[ 783.545820] mdt_close+0x47d/0x8b0 [mdt]
[ 783.546470] tgt_request_handle+0xc90/0x1940 [ptlrpc]
[ 783.547300] ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
[ 783.548246] ptlrpc_main+0xba2/0x1490 [ptlrpc]
[ 783.548964] ? __schedule+0x2cc/0x700
[ 783.549562] ? ptlrpc_wait_event+0x500/0x500 [ptlrpc]
[ 783.550380] kthread+0x112/0x130
[ 783.550894] ? kthread_flush_work_fn+0x10/0x10
[ 783.551577] ret_from_fork+0x35/0x40
[ 783.552136] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) dm_flakey ptlrpc_gss(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm sunrpc ib_core intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix 8139too libata 8139cp crc32c_intel serio_raw virtio_blk mii

The 3 observed failures are:
https://testing.whamcloud.com/test_sessions/d7bfd1a1-7ebf-40df-b80b-7e104a524f67
https://testing.whamcloud.com/test_sets/f93ef5ed-4963-4851-ac10-3e0477e9543c
https://testing.whamcloud.com/test_sets/1976e1d2-dc91-45b3-b29f-691f921ddeff

The RAX value is suspicious, so this is probably accessing freed memory? |
| Comments |
| Comment by Andreas Dilger [ 25/Aug/21 ] |
|
This might also relate to |
| Comment by Chris Horn [ 31/Aug/21 ] |
|
+1 on master - https://testing.whamcloud.com/test_sets/4824cef4-96bd-4fb8-9cbe-1e068d07016d |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
Seems very likely this is the same problem as |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
Closing this as a duplicate of |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
Reopen this. While |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
The first recent crash like this was on 2021-08-13, on a patch whose parent was v2_14_53-23-g29eabeb34c, during cleanup of files after the end of the test. The failing patch https://review.whamcloud.com/44541 has not yet landed, so it cannot be the source of the problem. This is similar to:
29eabeb34c LU-14798 lustre: Support RDMA only pages
a7a889f77c LU-14798 lnet: add LNet GPU Direct Support
644cb83921 LU-14893 lctl: check user for changelog_deregister
bbd9646f91 LU-14881 libcfs: Complete testing for tcp_sock_set_*
4e1f9c4bd1 LU-14413 test: test for overstriping for sanity 27M
d6a3e06cb0 LU-14740 quota: reject invalid project id on server side
6b31918565 LU-8066 obdclass: move lu_ref to debugfs
d77e95cc6d LU-14790 lnet: Reflect ni_fatal in NI status
0b94a058fe LU-14694 mdt: do not remove orphans at umount
0a6beb2a50 LU-9859 libcfs: discard cfs_cap_t, use kernel_cap_t
ba1fa08a0f LU-10973 lnet: LUTF Python infra
a55b6dafea LU-10973 lnet: LUTF infrastructure updates
8c166f6bf4 LU-6142 lustre: use list_first_entry() in lustre subdirectory.
163870abfb LU-14382 mdt: implement fallocate in MDC/MDT
dfeb63f2ee LU-14844 tests: make sure mgc_requeue_timeout_min exist. |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
The test crashed in 24/970 = 1/40 of review-dne[-zfs]-part-5 sessions, all of them on master/master-next. |
| Comment by Andreas Dilger [ 30/Sep/21 ] |
|
May be fixed by patch: https://review.whamcloud.com/44998 |
| Comment by Andreas Dilger [ 01/Nov/21 ] |
|
All recent failures are due to patches with an old parent that does not contain the |