[LU-15988] conf-sanity test_38 crash Created: 04/Jul/22 Updated: 26/Sep/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/03680f4b-7efd-458f-ae96-3af76e0f1355 test_38 failed with the following error: trevis-78vm4 crashed during conf-sanity test_38 [11297.066787] LustreError: 166-1: MGC10.240.42.139@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail [11297.069837] LustreError: 582898:0:(osp_object.c:629:osp_attr_get()) lustre-MDT0001-osp-MDT0002: osp_attr_get update error [0x24000cf22:0x1:0x0]: rc = -5 [11297.072504] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 [11297.074000] PGD 0 P4D 0 [11297.074544] Oops: 0000 [#1] SMP PTI [11297.075256] CPU: 1 PID: 582898 Comm: dist_txn-2 Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.23.1.el8_lustre.x86_64 #1 [11297.077597] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [11297.078838] RIP: 0010:osp_send_update_req+0x25b/0x6d0 [osp] [11297.079924] Code: 00 00 00 68 04 00 00 c7 05 b6 02 02 00 00 00 08 00 48 89 05 a7 02 02 00 48 8b 85 e0 01 00 00 48 c7 05 a5 02 02 00 00 00 00 00 <4c> 8b 60 40 e8 4c 15 11 ff 48 8d 7b 1c ba 40 00 00 00 48 89 c6 e8 [11297.083371] RSP: 0018:ffffbc0445aefd10 EFLAGS: 00010202 [11297.084397] RAX: 0000000000000000 RBX: ffff976a76bac960 RCX: 0000000000000000 [11297.085762] RDX: ffff976a43d985b0 RSI: 00000000ffffff8c RDI: ffff976a43d98a00 [11297.087117] RBP: ffff976a6ffb8000 R08: 00000000000005bb R09: ffffbc0445aefb78 [11297.088478] R10: ffffbc0445aefd10 R11: ffff976a4458155a R12: ffff976a43d98580 [11297.089842] R13: 00000000ffffff8c R14: 0000000000000000 R15: 0000000000000001 [11297.091207] FS: 0000000000000000(0000) GS:ffff976affd00000(0000) knlGS:0000000000000000 [11297.092744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11297.093860] CR2: 0000000000000040 CR3: 000000006e210006 CR4: 00000000000606e0 [11297.095221] Call Trace: [11297.095783] osp_trans_stop+0x37c/0x434 [osp] [11297.096936] llog_cancel_arr_rec+0x7f0/0xc00 [obdclass] [11297.098013] llog_cat_cancel_arr_rec+0x1e3/0x460 [obdclass] [11297.099118] llog_cat_cancel_records+0x61/0x190 [obdclass] [11297.100640] distribute_txn_commit_thread+0x3cb/0xb50 [ptlrpc] [11297.101835] ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc] [11297.103202] kthread+0x116/0x130 [11297.103887] ? kthread_flush_work_fn+0x10/0x10 [11297.104792] ret_from_fork+0x35/0x40 [11297.105531] Modules linked in: dm_flakey ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk [last unloaded: dm_flakey] [11297.116029] CR2: 0000000000000040 VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 25/Sep/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/7ab15c82-0b92-44f7-82b4-6e4f316d35c9 [ 6860.952778] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds2 [ 6861.121512] LustreError: 529868:0:(osp_object.c:637:osp_attr_get()) lustre-MDT0000-osp-MDT0001: osp_attr_get update error [0x20000ee62:0x1:0x0]: rc = -5 [ 6861.124105] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 [ 6861.126734] CPU: 0 PID: 529868 Comm: dist_txn-1 4.18.0-425.10.1.el8_lustre.x86_64 [ 6861.128934] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 6861.129982] RIP: 0010:osp_send_update_req+0x265/0x6e0 [osp] [ 6861.145557] Call Trace: [ 6861.146084] osp_trans_stop+0x383/0x43b [osp] [ 6861.146938] llog_cancel_arr_rec+0x7f2/0xc30 [obdclass] [ 6861.148189] llog_cat_cancel_arr_rec+0x1e0/0x450 [obdclass] [ 6861.149262] llog_cat_cancel_records+0x61/0x1a0 [obdclass] [ 6861.150318] distribute_txn_commit_thread+0x40c/0xb10 [ptlrpc] [ 6861.153283] kthread+0x10b/0x130 |
| Comment by Andreas Dilger [ 25/Sep/23 ] |
|
Looks like a NULL pointer dereference in osp_send_update_req(). Checking the address on a different build (so may not be exactly the same): (gdb) list *(osp_send_update_req+0x258) 0x27e18 is in osp_send_update_req (/usr/src/lustre-exa/lnet/include/uapi/linux/lnet/nidstr.h:85). 80 LNET_NIDSTR_SIZE); 81 } 82 char *libcfs_nid2str_r(lnet_nid_t nid, char *buf, __kernel_size_t buf_size); 83 static inline char *libcfs_nid2str(lnet_nid_t nid) 84 { 85 return libcfs_nid2str_r(nid, libcfs_next_nidstring(), 86 LNET_NIDSTR_SIZE); 87 } 88 __u32 libcfs_str2net(const char *str); 89 lnet_nid_t libcfs_str2nid(const char *str); (gdb) list *(osp_send_update_req+0x248) 0x27e08 is in osp_send_update_req (/usr/src/lustre-exa/lustre/osp/osp_trans.c:1124). 1119 if (ou && ou->ou_generation != our->our_generation) { 1120 const struct lnet_process_id *peer = 1121 &osp->opd_obd->u.cli.cl_import->imp_connection->c_peer; 1122 rc = -ESTALE; 1123 osp_trans_callback(env, oth, rc); 1124 CDEBUG(D_HA, "%s: stale tx to %s: gen %llu != %llu: rc = %d\n", 1125 osp->opd_obd->obd_name, libcfs_nid2str(peer->nid), 1126 osp->opd_update->ou_generation, our->our_generation, rc); 1127 RETURN(rc); 1128 } There are unfortunately a few different structure offsets at 0x40 that may apply here - opd_obd->obd_name and our->our_generation. Minor nit - osp->opd_update->ou_generation in the debug message can be replaced by ou->ou_generation, which is already validated before the error is printed, so could not be the culprit. Crash has only been hit 4x since ticket was originally filed on 2022-07-22. It might make sense to push a patch to validate these pointers before access, but the problem is hit so rarely that it wouldn't even be clear to know when the debugging was hit, so not a priority to investigate. |
| Comment by Andreas Dilger [ 26/Sep/23 ] |
|
May also relate to LU-17146. |