[LU-12013] Crashes in sanity-lfsck test 13 Created: 25/Feb/19  Updated: 07/Jul/21  Resolved: 25/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13535 Files truncated/corruption due to lfsck Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I hit this with master now too.

There are crashes for ldiskfs and zfs.

[  895.544205] Lustre: DEBUG MARKER: == sanity-lfsck test 13: LFSCK can repair crashed lmm_oi ============================================= 22:43:00 (1551066180)
[  895.793345] Lustre: *** cfs_fail_loc=160f, val=0***
[  896.187391] BUG: unable to handle kernel paging request at ffff8801118bf000
[  896.188949] IP: [<ffffffff813f045d>] memcpy+0xd/0x110
[  896.189624] PGD 241b067 PUD 241e067 PMD 11ed5e067 PTE 80000001118bf060
[  896.190647] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  896.191421] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) dm_flakey dm_mod libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common squashfs i2c_piix4 pcspkr i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy [last unloaded: libcfs]
[  896.199259] CPU: 3 PID: 14106 Comm: lfsck Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #2
[  896.200466] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  896.201149] task: ffff8800b30d8280 ti: ffff880099650000 task.ti: ffff880099650000
[  896.202024] RIP: 0010:[<ffffffff813f045d>]  [<ffffffff813f045d>] memcpy+0xd/0x110
[  896.202958] RSP: 0018:ffff880099653878  EFLAGS: 00010246
[  896.203598] RAX: ffff8800bc11f2cc RBX: ffff8800996539c8 RCX: 000000000000000e
[  896.204454] RDX: 0000000000000000 RSI: ffff8801118bf000 RDI: ffff8800bc11f34c
[  896.205310] RBP: ffff880099653938 R08: 00000000000000f0 R09: 00000000000000f0
[  896.206166] R10: 0000000000000228 R11: 00000000000000f0 R12: ffff880098af4b78
[  896.206842] R13: 0000000000000210 R14: ffff880099653978 R15: 0000000000000000
[  896.207485] FS:  0000000000000000(0000) GS:ffff88011e2c0000(0000) knlGS:0000000000000000
[  896.208190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  896.208653] CR2: ffff8801118bf000 CR3: 0000000001c10000 CR4: 00000000000006e0
[  896.209260] Call Trace:
[  896.209517]  [<ffffffffa0787d8b>] ? ldiskfs_xattr_set_entry+0x21b/0x8e0 [ldiskfs]
[  896.210165]  [<ffffffffa07867bf>] ? ldiskfs_xattr_find_entry+0x9f/0x130 [ldiskfs]
[  896.211227]  [<ffffffffa0788ef5>] ldiskfs_xattr_ibody_set+0x35/0x90 [ldiskfs]
[  896.212044]  [<ffffffffa0789257>] ldiskfs_xattr_set_handle+0x1a7/0x510 [ldiskfs]
[  896.212715]  [<ffffffffa078970e>] ldiskfs_xattr_set+0x14e/0x1c0 [ldiskfs]
[  896.213396]  [<ffffffffa07c0e5d>] ldiskfs_xattr_trusted_set+0x2d/0x30 [ldiskfs]
[  896.214042]  [<ffffffff8125eaa5>] generic_setxattr+0x65/0x80
[  896.214540]  [<ffffffffa0845f6b>] osd_xattr_set+0x51b/0x1440 [osd_ldiskfs]
[  896.215150]  [<ffffffffa083f253>] ? osd_xattr_get+0x5d3/0x800 [osd_ldiskfs]
[  896.215751]  [<ffffffffa066f566>] dt_xattr_set+0xa6/0x120 [lfsck]
[  896.216348]  [<ffffffffa067a1e8>] ? lfsck_layout_get_lovea+0xa8/0x240 [lfsck]
[  896.217021]  [<ffffffffa067e075>] lfsck_layout_master_exec_oit+0x995/0xef0 [lfsck]
[  896.217651]  [<ffffffffa064c01f>] lfsck_master_oit_engine+0x7ff/0x14d0 [lfsck]
[  896.218277]  [<ffffffff8102a59d>] ? __switch_to+0xcd/0x4e0
[  896.218745]  [<ffffffffa064d676>] lfsck_master_engine+0x986/0x1390 [lfsck]
[  896.219351]  [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
[  896.219817]  [<ffffffffa064ccf0>] ? lfsck_master_oit_engine+0x14d0/0x14d0 [lfsck]
[  896.220427]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[  896.221063]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[  896.221836]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[  896.222386]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140

Full report: http://testing.linuxhacker.ru:3333/lustre-reports/76/testresults/sanity-lfsck-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

ZFS:

[  683.952824] Lustre: DEBUG MARKER: == sanity-lfsck test 13: LFSCK can repair crashed lmm_oi ============================================= 00:40:02 (1551246002)
[  684.270083] Lustre: *** cfs_fail_loc=160f, val=0***
[  685.583055] BUG: unable to handle kernel paging request at ffff8800873d9000
[  685.584785] IP: [<ffffffff813f05f7>] memmove+0x37/0x1a0
[  685.585967] PGD 241b067 PUD 11e90a067 PMD 11e8d0067 PTE 80000000873d9060
[  685.587513] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  685.588561] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common zfs(PO) zunicode(PO) zavl(PO) icp(PO) squashfs zcommon(PO) znvpair(PO) spl(O) i2c_piix4 pcspkr i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy [last unloaded: libcfs]
[  685.600265] CPU: 4 PID: 29012 Comm: lfsck Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #2
[  685.603647] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  685.604986] task: ffff8800ce75adc0 ti: ffff880099f38000 task.ti: ffff880099f38000
[  685.606917] RIP: 0010:[<ffffffff813f05f7>]  [<ffffffff813f05f7>] memmove+0x37/0x1a0
[  685.608845] RSP: 0018:ffff880099f3ba20  EFLAGS: 00010206
[  685.610316] RAX: ffff88008aeeee30 RBX: 0000000000000000 RCX: 00000000000000f0
[  685.612151] RDX: 0000000000000030 RSI: ffff8800873d9000 RDI: ffff88008aeeeeb0
[  685.613733] RBP: ffff880099f3ba88 R08: 0000000000000000 R09: 0000000000000000
[  685.615370] R10: ffffffff00000000 R11: 0000000000000000 R12: ffff8800b177d440
[  685.617138] R13: ffff88008aeeee00 R14: 000000000000000a R15: ffffffffa0dd8494
[  685.618774] FS:  0000000000000000(0000) GS:ffff88011e300000(0000) knlGS:0000000000000000
[  685.620546] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  685.621817] CR2: ffff8800873d9000 CR3: 00000000dac78000 CR4: 00000000000006e0
[  685.623362] Call Trace:
[  685.623906]  [<ffffffffa01355fc>] ? nvlist_add_common.part.51+0x2cc/0x430 [znvpair]
[  685.625574]  [<ffffffffa0135d06>] nvlist_add_byte_array+0x26/0x30 [znvpair]
[  685.627059]  [<ffffffffa0eb65fb>] __osd_sa_xattr_set+0xbb/0x370 [osd_zfs]
[  685.628556]  [<ffffffffa0eb739a>] osd_xattr_set+0x50a/0x880 [osd_zfs]
[  685.630123]  [<ffffffffa01364b6>] ? nvlist_lookup_byte_array+0x26/0x30 [znvpair]
[  685.631909]  [<ffffffffa0eb4e59>] ? osd_xattr_get_internal+0xa9/0x210 [osd_zfs]
[  685.633947]  [<ffffffffa0eb5195>] ? osd_xattr_get+0x1d5/0x5e0 [osd_zfs]
[  685.635821]  [<ffffffffa0da9566>] dt_xattr_set+0xa6/0x120 [lfsck]
[  685.637365]  [<ffffffffa0db41e8>] ? lfsck_layout_get_lovea+0xa8/0x240 [lfsck]
[  685.638938]  [<ffffffffa0db8075>] lfsck_layout_master_exec_oit+0x995/0xef0 [lfsck]
[  685.640757]  [<ffffffffa0d8601f>] lfsck_master_oit_engine+0x7ff/0x14d0 [lfsck]
[  685.642397]  [<ffffffff8102a59d>] ? __switch_to+0xcd/0x4e0
[  685.644138]  [<ffffffffa0d87676>] lfsck_master_engine+0x986/0x1390 [lfsck]
[  685.646189]  [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
[  685.647686]  [<ffffffffa0d86cf0>] ? lfsck_master_oit_engine+0x14d0/0x14d0 [lfsck]
[  685.649597]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[  685.650977]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[  685.652596]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[  685.654416]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140

Full report: http://testing.linuxhacker.ru:3333/lustre-reports/110/testresults/sanity-lfsck-zfs-centos7_x86_64-centos7_x86_64/



 Comments   
Comment by Oleg Drokin [ 25/Feb/19 ]

OK, so this is actually caused by some bad patch in master-next. zfs also crashes, and it's a 100% failure now.

Comment by Oleg Drokin [ 27/Feb/19 ]

OK, I was wrong about that: I hit it again without the EA patch I was suspecting.

A very similar thing crashes in zfs, so it's likely some common Lustre/lfsck issue instead?

Comment by Alex Zhuravlev [ 15/May/19 ]

BUG: unable to handle kernel paging request at ffff88016f73e000
IP: [<ffffffff8137392d>] memcpy+0xd/0x110
PGD 2b05067 PUD 19f5fe067 PMD 19f482067 PTE 800000016f73e060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Call Trace:
[<ffffffffa0331f79>] ? ldiskfs_xattr_set_entry+0x849/0x880 [ldiskfs]
[<ffffffffa0330b8f>] ? ldiskfs_xattr_find_entry.isra.4+0x8f/0x140 [ldiskfs]
[<ffffffffa033294b>] ldiskfs_xattr_ibody_set+0x2b/0x90 [ldiskfs]
[<ffffffffa0332d04>] ldiskfs_xattr_set_handle+0x134/0x4d0 [ldiskfs]
[<ffffffffa03331ae>] ldiskfs_xattr_set+0x10e/0x1a0 [ldiskfs]
[<ffffffffa037b803>] ldiskfs_xattr_trusted_set+0x23/0x30 [ldiskfs]
[<ffffffff811dd84c>] generic_setxattr+0x5c/0x70
[<ffffffffa03f928c>] osd_xattr_set+0x44c/0x1650 [osd_ldiskfs]
[<ffffffffa0234819>] ? lfsck_layout_get_lovea.part.16+0x89/0x360 [lfsck]
[<ffffffffa0236d47>] lfsck_layout_master_exec_oit+0xd87/0x11b0 [lfsck]
[<ffffffffa03ec9f6>] ? osd_attr_get+0x96/0x1d0 [osd_ldiskfs]
[<ffffffffa01fb650>] lfsck_master_oit_engine+0x1070/0x2190 [lfsck]
[<ffffffff810ce54a>] ? finish_task_switch+0x3a/0x120
[<ffffffffa01fd415>] lfsck_master_engine+0xca5/0x1560 [lfsck]

I'm hitting this very frequently.

Comment by Alex Zhuravlev [ 17/May/19 ]

I think the root cause is the following lines in lfsck_layout_master_exec_oit():

	size = rc;
	lmm = buf->lb_buf;
...
		for (i = 0; i < count; i++) {
			lcme = &lcm->lcm_entries[i];
			lmm = buf->lb_buf + le32_to_cpu(lcme->lcme_offset);
			if (memcmp(oi, &lmm->lmm_oi, sizeof(*oi)) != 0)
				goto fix;
		}
...
		lfsck_buf_init(&ea_buf, lmm, size);

That is, we move lmm but do not adjust size to accommodate that, so a subsequent call into ldiskfs/zfs may access data outside of the original buffer (memory which may not be allocated).
I guess in this case LFSCK can also fill the LOVEA with garbage.
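
The pointer/size mismatch described above can be sketched in standalone C. This is only an illustration of the pattern, not the lfsck code: struct demo_buf and all demo_* functions are hypothetical names invented here, and the two "consistent" variants are generic ways to keep a pointer and its length in agreement, not a description of what the landed patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical (pointer, length) descriptor, loosely modelled on the
 * buffer that gets passed down to the xattr-set call. */
struct demo_buf {
	const char *db_buf;
	size_t      db_len;
};

static struct demo_buf demo_buf_init(const char *ptr, size_t len)
{
	struct demo_buf b = { ptr, len };
	return b;
}

/* Buggy pattern: the pointer is advanced into the buffer, but the
 * length still describes the whole buffer. */
static struct demo_buf demo_window_buggy(const char *base, size_t size,
					 size_t offset)
{
	return demo_buf_init(base + offset, size);
}

/* Consistent variant A: describe the whole buffer from its start. */
static struct demo_buf demo_window_whole(const char *base, size_t size)
{
	return demo_buf_init(base, size);
}

/* Consistent variant B: advance the pointer and shrink the length. */
static struct demo_buf demo_window_tail(const char *base, size_t size,
					size_t offset)
{
	assert(offset <= size);
	return demo_buf_init(base + offset, size - offset);
}

/* Number of bytes by which a window reaches past the end of the
 * original buffer; 0 means the window is fully in bounds. */
static size_t demo_overrun(const char *base, size_t size, struct demo_buf w)
{
	size_t start = (size_t)(w.db_buf - base);
	size_t end = start + w.db_len;

	return end > size ? end - size : 0;
}
```

With a 512-byte buffer and an offset of 64, demo_overrun() reports 64 bytes past the end for the buggy window and 0 for either consistent variant. A consumer that copies db_len bytes from db_buf, as memcpy()/memmove() do in the traces above, would read those extra bytes from whatever follows the buffer.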

Comment by Gerrit Updater [ 18/May/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34901
Subject: LU-12013 lfsck: use correct buffer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bcc1212c8e427e6560b37917e21ff0c0efe14eb8

Comment by Gerrit Updater [ 25/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34901/
Subject: LU-12013 lfsck: use correct buffer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e803c18a9e6018b75f027e8f1f01c0dc07bbccc6

Comment by Peter Jones [ 25/May/19 ]

Landed for 2.13

Generated at Sat Feb 10 08:56:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.