[LU-15988] conf-sanity test_38 crash Created: 04/Jul/22  Updated: 26/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17146 sanity-lfsck test_38: read should fail Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/03680f4b-7efd-458f-ae96-3af76e0f1355

test_38 failed with the following error:

trevis-78vm4 crashed during conf-sanity test_38
[11297.066787] LustreError: 166-1: MGC10.240.42.139@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[11297.069837] LustreError: 582898:0:(osp_object.c:629:osp_attr_get()) lustre-MDT0001-osp-MDT0002: osp_attr_get update error [0x24000cf22:0x1:0x0]: rc = -5
[11297.072504] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[11297.074000] PGD 0 P4D 0 
[11297.074544] Oops: 0000 [#1] SMP PTI
[11297.075256] CPU: 1 PID: 582898 Comm: dist_txn-2 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-348.23.1.el8_lustre.x86_64 #1
[11297.077597] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[11297.078838] RIP: 0010:osp_send_update_req+0x25b/0x6d0 [osp]
[11297.079924] Code: 00 00 00 68 04 00 00 c7 05 b6 02 02 00 00 00 08 00 48 89 05 a7 02 02 00 48 8b 85 e0 01 00 00 48 c7 05 a5 02 02 00 00 00 00 00 <4c> 8b 60 40 e8 4c 15 11 ff 48 8d 7b 1c ba 40 00 00 00 48 89 c6 e8
[11297.083371] RSP: 0018:ffffbc0445aefd10 EFLAGS: 00010202
[11297.084397] RAX: 0000000000000000 RBX: ffff976a76bac960 RCX: 0000000000000000
[11297.085762] RDX: ffff976a43d985b0 RSI: 00000000ffffff8c RDI: ffff976a43d98a00
[11297.087117] RBP: ffff976a6ffb8000 R08: 00000000000005bb R09: ffffbc0445aefb78
[11297.088478] R10: ffffbc0445aefd10 R11: ffff976a4458155a R12: ffff976a43d98580
[11297.089842] R13: 00000000ffffff8c R14: 0000000000000000 R15: 0000000000000001
[11297.091207] FS:  0000000000000000(0000) GS:ffff976affd00000(0000) knlGS:0000000000000000
[11297.092744] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11297.093860] CR2: 0000000000000040 CR3: 000000006e210006 CR4: 00000000000606e0
[11297.095221] Call Trace:
[11297.095783]  osp_trans_stop+0x37c/0x434 [osp]
[11297.096936]  llog_cancel_arr_rec+0x7f0/0xc00 [obdclass]
[11297.098013]  llog_cat_cancel_arr_rec+0x1e3/0x460 [obdclass]
[11297.099118]  llog_cat_cancel_records+0x61/0x190 [obdclass]
[11297.100640]  distribute_txn_commit_thread+0x3cb/0xb50 [ptlrpc]
[11297.101835]  ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc]
[11297.103202]  kthread+0x116/0x130
[11297.103887]  ? kthread_flush_work_fn+0x10/0x10
[11297.104792]  ret_from_fork+0x35/0x40
[11297.105531] Modules linked in: dm_flakey ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel virtio_net net_failover failover serio_raw virtio_blk [last unloaded: dm_flakey]
[11297.116029] CR2: 0000000000000040

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_38 - trevis-78vm4 crashed during conf-sanity test_38



 Comments   
Comment by Andreas Dilger [ 25/Sep/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/7ab15c82-0b92-44f7-82b4-6e4f316d35c9

[ 6860.952778] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds2
[ 6861.121512] LustreError: 529868:0:(osp_object.c:637:osp_attr_get()) lustre-MDT0000-osp-MDT0001: osp_attr_get update error [0x20000ee62:0x1:0x0]: rc = -5
[ 6861.124105] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 6861.126734] CPU: 0 PID: 529868 Comm: dist_txn-1 4.18.0-425.10.1.el8_lustre.x86_64
[ 6861.128934] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6861.129982] RIP: 0010:osp_send_update_req+0x265/0x6e0 [osp]
[ 6861.145557] Call Trace:
[ 6861.146084]  osp_trans_stop+0x383/0x43b [osp]
[ 6861.146938]  llog_cancel_arr_rec+0x7f2/0xc30 [obdclass]
[ 6861.148189]  llog_cat_cancel_arr_rec+0x1e0/0x450 [obdclass]
[ 6861.149262]  llog_cat_cancel_records+0x61/0x1a0 [obdclass]
[ 6861.150318]  distribute_txn_commit_thread+0x40c/0xb10 [ptlrpc]
[ 6861.153283]  kthread+0x10b/0x130
Comment by Andreas Dilger [ 25/Sep/23 ]

Looks like a NULL pointer dereference in osp_send_update_req(). Checking the address on a different build (so may not be exactly the same):

(gdb) list *(osp_send_update_req+0x258)
0x27e18 is in osp_send_update_req (/usr/src/lustre-exa/lnet/include/uapi/linux/lnet/nidstr.h:85).
80                                      LNET_NIDSTR_SIZE);
81      }
82      char *libcfs_nid2str_r(lnet_nid_t nid, char *buf, __kernel_size_t buf_size);
83      static inline char *libcfs_nid2str(lnet_nid_t nid)
84      {
85              return libcfs_nid2str_r(nid, libcfs_next_nidstring(),
86                                      LNET_NIDSTR_SIZE);
87      }
88      __u32 libcfs_str2net(const char *str);
89      lnet_nid_t libcfs_str2nid(const char *str);
(gdb) list *(osp_send_update_req+0x248)
0x27e08 is in osp_send_update_req (/usr/src/lustre-exa/lustre/osp/osp_trans.c:1124).
1119            if (ou && ou->ou_generation != our->our_generation) {
1120                    const struct lnet_process_id *peer =
1121                            &osp->opd_obd->u.cli.cl_import->imp_connection->c_peer;
1122                    rc = -ESTALE;
1123                    osp_trans_callback(env, oth, rc);
1124                    CDEBUG(D_HA, "%s: stale tx to %s: gen %llu != %llu: rc = %d\n",
1125                           osp->opd_obd->obd_name, libcfs_nid2str(peer->nid),
1126                           osp->opd_update->ou_generation, our->our_generation, rc);
1127                    RETURN(rc);
1128            }

There are unfortunately a few different structure offsets at 0x40 that may apply here - opd_obd->obd_name and our->our_generation.

Minor nit - osp->opd_update->ou_generation in the debug message can be replaced by ou->ou_generation, which is already validated before the error is printed, so could not be the culprit.

Crash has only been hit 4x since ticket was originally filed on 2022-07-22. It might make sense to push a patch to validate these pointers before access, but the problem is hit so rarely that it wouldn't even be clear to know when the debugging was hit, so not a priority to investigate.

Comment by Andreas Dilger [ 26/Sep/23 ]

May also relate to LU-17146.

Generated at Sat Feb 10 03:23:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.