[LU-10451] sptlrpc_ctxs_lprocfs_seq_show crash in recovery-small test 57 Created: 03/Jan/18  Updated: 04/Jan/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After LU-10224 landed, it seemed to have fixed all of the crashes in that test.

Well, I just had a very similar crash in a different place:

[128715.132365] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 10:05:20 (1514387120)
[128717.357108] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[128717.358256] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate jbd2 syscopyarea sysfillrect sysimgblt ttm ata_generic drm_kms_helper pata_acpi drm floppy i2c_piix4 virtio_console pcspkr virtio_balloon serio_raw virtio_blk ata_piix i2c_core libata nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[128717.371008] CPU: 3 PID: 20280 Comm: lctl Tainted: P           OE  ------------   3.10.0-debug #2
[128717.372286] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[128717.372932] task: ffff8802cd528a80 ti: ffff8800a6838000 task.ti: ffff8800a6838000
[128717.382186] RIP: 0010:[<ffffffffa05b8a47>]  [<ffffffffa05b8a47>] sptlrpc_ctxs_lprocfs_seq_show+0x27/0x100 [ptlrpc]
[128717.383566] RSP: 0018:ffff8800a683be78  EFLAGS: 00010203
[128717.384213] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8802a1586700 RCX: 0000000000000004
[128717.385411] RDX: fffffffffffffff4 RSI: 0000000000000001 RDI: ffffffffa0636425
[128717.389241] RBP: ffff8800a683be90 R08: 0000000000000001 R09: ffff8802f092f000
[128717.390434] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8800a284ef00
[128717.391627] R13: 0000000000000001 R14: ffff8800a683bf48 R15: ffff8800a284ef00
[128717.394109] FS:  00007f93a5430740(0000) GS:ffff88033e460000(0000) knlGS:0000000000000000
[128717.395374] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[128717.396187] CR2: 00007f93a4aa7000 CR3: 0000000095e38000 CR4: 00000000000006e0
[128717.403348] Lustre: Unmounted lustre-client
[128717.413399] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[128717.414405] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[128717.415645] Stack:
[128717.416340]  0000000000000000 ffff880096912e00 0000000000000001 ffff8800a683bf00
[128717.417877]  ffffffff81212c85 0000000000001000 0000000001e19af0 ffff8800a284ef38
[128717.419405]  0000000000001000 0000000000000000 ffff880096912e00 0000000001ff0c13
[128717.420901] Call Trace:
[128717.421503]  [<ffffffff81212c85>] seq_read+0x105/0x3e0
[128717.422165]  [<ffffffff811ed1dc>] vfs_read+0x9c/0x170
[128717.422623]  [<ffffffff811edd44>] SyS_read+0x84/0xf0
[128717.423149]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[128717.423877] Code: 1f 44 00 00 0f 1f 44 00 00 55 b9 04 00 00 00 48 89 e5 41 55 41 54 49 89 fc 53 48 8b 9f d8 00 00 00 48 c7 c7 25 64 63 a0 48 8b 03 <4c> 8b 68 40 4c 89 ee f3 a6 75 42 48 8b bb 58 08 00 00 48 85 ff 
[128717.425876] RIP  [<ffffffffa05b8a47>] sptlrpc_ctxs_lprocfs_seq_show+0x27/0x100 [ptlrpc]
(gdb) l *(sptlrpc_ctxs_lprocfs_seq_show+0x27)
0x8ca47 is in sptlrpc_ctxs_lprocfs_seq_show (/home/green/git/lustre-release/lustre/ptlrpc/sec_lproc.c:122).
117	{
118	        struct obd_device *dev = seq->private;
119	        struct client_obd *cli = &dev->u.cli;
120	        struct ptlrpc_sec *sec = NULL;
121
122		LASSERT(strcmp(dev->obd_type->typ_name, LUSTRE_OSC_NAME) == 0 ||
123			strcmp(dev->obd_type->typ_name, LUSTRE_MDC_NAME) == 0 ||
124			strcmp(dev->obd_type->typ_name, LUSTRE_MGC_NAME) == 0 ||
125			strcmp(dev->obd_type->typ_name, LUSTRE_LWP_NAME) == 0 ||
126			strcmp(dev->obd_type->typ_name, LUSTRE_OSP_NAME) == 0);

It's not as frequent as all those other failures, but it still needs to be looked at, I guess.
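
A note on reading the register dump above: RAX holds 6b6b6b6b6b6b6b6b, and 0x6b is the byte the Linux slab debug code writes into freed objects (POISON_FREE in include/linux/poison.h). That value is also a non-canonical address, which would explain a general protection fault rather than a page fault. This suggests, but does not prove, that dev->obd_type was read from an obd_device that had already been freed. Below is a minimal sketch for checking oops text for such values; the helper name find_poisoned_registers and the regular expression are assumptions made for illustration, not part of any Lustre or kernel tooling.

```python
import re

# Byte patterns written by the Linux slab debug code (include/linux/poison.h).
POISON_BYTES = {
    0x6B: "POISON_FREE (freed object)",
    0x5A: "POISON_INUSE (allocated, uninitialized)",
}

# Matches general-purpose registers such as "RAX: 6b6b6b6b6b6b6b6b".
# Hypothetical pattern: it only covers the 16-hex-digit x86-64 oops layout.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b")


def find_poisoned_registers(oops_text):
    """Return {register: description} for registers made of one poison byte."""
    hits = {}
    for name, value in REG_RE.findall(oops_text):
        raw = bytes.fromhex(value)
        if len(set(raw)) == 1 and raw[0] in POISON_BYTES:
            hits[name] = POISON_BYTES[raw[0]]
    return hits


sample = """
[128717.384213] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8802a1586700 RCX: 0000000000000004
[128717.385411] RDX: fffffffffffffff4 RSI: 0000000000000001 RDI: ffffffffa0636425
"""
print(find_poisoned_registers(sample))
```

Only registers whose eight bytes are all the same poison byte are reported, so ordinary pointers and small integers are ignored.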



 Comments   
Comment by Oleg Drokin [ 04/Jan/18 ]

Hm, fresh on the heels of this crash, I just had another one in the same area:

[635642.285323] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 18:39:59 (1515022799)
[635644.361797] BUG: unable to handle kernel paging request at ffff8802d00b1fd0
[635644.363434] IP: [<ffffffffa07fa405>] osc_stats_seq_show+0x65/0xb0 [osc]
[635644.364258] PGD 2e75067 PUD 33e9f9067 PMD 33e978067 PTE 80000002d00b1060
[635644.365053] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[635644.365787] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate jbd2 syscopyarea sysfillrect sysimgblt ttm drm_kms_helper serio_raw virtio_blk ata_generic pcspkr floppy virtio_balloon virtio_console pata_acpi drm ata_piix i2c_piix4 i2c_core libata nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[635644.372330] CPU: 10 PID: 7201 Comm: lctl Tainted: P           OE  ------------   3.10.0-debug #2
[635644.373754] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[635644.375666] task: ffff88008cf2e580 ti: ffff88006f360000 task.ti: ffff88006f360000
[635644.377090] RIP: 0010:[<ffffffffa07fa405>]  [<ffffffffa07fa405>] osc_stats_seq_show+0x65/0xb0 [osc]
[635644.378526] RSP: 0018:ffff88006f363e68  EFLAGS: 00010246
[635644.380324] Lustre: Unmounted lustre-client
[635644.382857] RAX: 0000000000000000 RBX: ffff8802a5459f00 RCX: 0000000000000000
[635644.383690] RDX: 0000000000001000 RSI: ffffffffa082113c RDI: 0000000000000000
[635644.384543] RBP: ffff88006f363e90 R08: 000000000000000a R09: 000000000000fffe
[635644.385353] R10: 0000000000000000 R11: ffff88006f363cfe R12: ffff8802d00b1f80
[635644.386173] R13: 0000000000000001 R14: ffff88006f363f48 R15: ffff8802a5459f00
[635644.387021] FS:  00007f35a45bf740(0000) GS:ffff88033e540000(0000) knlGS:0000000000000000
[635644.387869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[635644.388306] CR2: ffff8802d00b1fd0 CR3: 000000009d8e9000 CR4: 00000000000006e0
[635644.389378] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[635644.390328] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[635644.391156] Stack:
[635644.391554]  000000005a4d69d1 000000002a30b8e8 00000000399a6e25 0000000000000000
[635644.392407]  ffff88008fa0ce00 ffff88006f363f00 ffffffff81212c85 0000000000001000
[635644.393615]  00000000009f6010 ffff8802a5459f38 0000000000001000 0000000000000000
[635644.395054] Call Trace:
[635644.395743]  [<ffffffff81212c85>] seq_read+0x105/0x3e0
[635644.396544]  [<ffffffff811ed1dc>] vfs_read+0x9c/0x170
[635644.397203]  [<ffffffff811edd44>] SyS_read+0x84/0xf0
[635644.397850]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[635644.398509] Code: 48 8b 55 d8 48 c7 c6 c8 32 82 a0 48 89 df 31 c0 e8 c1 8d a1 e0 49 8b 54 24 48 48 c7 c6 21 11 82 a0 48 89 df 31 c0 e8 ab 8d a1 e0 <49> 8b 54 24 50 48 c7 c6 3d 11 82 a0 48 89 df 31 c0 e8 95 8d a1 
[635644.401054] RIP  [<ffffffffa07fa405>] osc_stats_seq_show+0x65/0xb0 [osc]
[635644.401766]  RSP <ffff88006f363e68>
[635644.402351] CR2: ffff8802d00b1fd0
(gdb) l *(osc_stats_seq_show+0x65)
0xe405 is in osc_stats_seq_show (/home/green/git/lustre-release/lustre/osc/lproc_osc.c:798).
793
794		seq_printf(seq, "snapshot_time:         %lld.%09lu (secs.nsecs)\n",
795			   (s64)now.tv_sec, now.tv_nsec);
796		seq_printf(seq, "lockless_write_bytes\t\t%llu\n",
797			   stats->os_lockless_writes);
798		seq_printf(seq, "lockless_read_bytes\t\t%llu\n",
799			   stats->os_lockless_reads);
800		seq_printf(seq, "lockless_truncate\t\t%llu\n",
801			   stats->os_lockless_truncates);
802		return 0;

So it looks like the problem became less severe, but it is still there.
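
A note on the second oops: the faulting address ffff8802d00b1fd0 equals R12 + 0x50, and the reported PTE 80000002d00b1060 has its present bit clear. On a DEBUG_PAGEALLOC kernel, freed pages are unmapped from the kernel mapping, so this would be consistent with the stats structure living in a page that had already been freed, although that is an inference from the dump and not a confirmed diagnosis. The sketch below decodes the PTE flag bits using the standard x86-64 layout for 4 KiB pages; the function name decode_pte is made up for this example.

```python
# x86-64 page-table entry flag bits for 4 KiB pages (Intel SDM layout).
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    3: "write-through",
    4: "cache-disable",
    5: "accessed",
    6: "dirty",
    8: "global",
    63: "no-execute",
}


def decode_pte(pte):
    """Return the names of the flag bits set in a page-table entry."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]


# Values copied from the oops above.
pte = 0x80000002D00B1060
r12 = 0xFFFF8802D00B1F80
cr2 = 0xFFFF8802D00B1FD0

flags = decode_pte(pte)
print("PTE flags:", flags)
print("page mapped:", "present" in flags)
print("fault offset from R12:", hex(cr2 - r12))
```

The offset from R12 matches the 0x50 displacement in the faulting instruction bytes (49 8b 54 24 50), which fits a load of a field through the stats pointer.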
