[LU-11737] LustreError: 11060:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed: Created: 06/Dec/18 Updated: 19/Mar/19 Resolved: 10/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This is related to
Hit LBUG. [ 1781.324368] LustreError: 11060:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed:
I tried mounting with noscrub but still hit the LBUG.
Filesystem keeps crashing with lbug.
|
| Comments |
| Comment by Alex Zhuravlev [ 06/Dec/18 ] |
|
please, show the full stack trace. |
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
I was running lfsck in dry-run mode: lctl lfsck_start -r -n -o -t layout
PID: 11060 TASK: ffff882f2ed10fd0 CPU: 5 COMMAND: "lfsck_layout"
#0 [ffff882bcca3f868] machine_kexec at ffffffff8105b64b
#1 [ffff882bcca3f8c8] __crash_kexec at ffffffff81105342
#2 [ffff882bcca3f998] panic at ffffffff81689aad
#3 [ffff882bcca3fa18] lbug_with_loc at ffffffffa06e08cb [libcfs]
#4 [ffff882bcca3fa38] osd_xattr_set at ffffffffa10dfcdf [osd_ldiskfs]
#5 [ffff882bcca3fab8] lfsck_layout_refill_lovea at ffffffffa11de3c2 [lfsck]
#6 [ffff882bcca3fb40] lfsck_layout_recreate_lovea at ffffffffa11f20db [lfsck]
#7 [ffff882bcca3fc88] lfsck_layout_assistant_handler_p2 at ffffffffa11f4ada [lfsck]
#8 [ffff882bcca3fd80] lfsck_assistant_engine at ffffffffa11b77ce [lfsck]
#9 [ffff882bcca3fec8] kthread at ffffffff810b1131
#10 [ffff882bcca3ff50] ret_from_fork at ffffffff816a14dd
|
| Comment by Alex Zhuravlev [ 06/Dec/18 ] |
|
ok, thanks. working on this.. |
| Comment by Alex Zhuravlev [ 06/Dec/18 ] |
|
please, try to mount MDS with skip_lfsck mount option. |
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
Do you want me to try re-running lfsck after remounting? |
| Comment by Alex Zhuravlev [ 06/Dec/18 ] |
|
my understanding is that we want to get the filesystem back first? in the mean time I'll try to get a fix for this issue. |
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
ok will do. some additional info. Not sure if the delete record is relevant.
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.906822] Lustre: 16159:0:(lfsck_layout.c:1618:lfsck_layout_del_dangling_rec()) nbp13-MDT0000-osd: delete the dangling record for [0x20000205c:0x15dc1:0x0], comp_id = 4, ea_off = 15 from the trace file: rc = -2
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.906827] LustreError: 16159:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed:
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.933775] LustreError: 16159:0:(osd_handler.c:3985:osd_xattr_set()) LBUG
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954442] Pid: 16159, comm: lfsck_layout 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954442] Call Trace:
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954458] [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954467] [<ffffffffa06d17cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954470] [<ffffffffa06d187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954479] [<ffffffffa110acdf>] osd_xattr_set+0x95f/0xc10 [osd_ldiskfs]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954492] [<ffffffffa12093c2>] lfsck_layout_refill_lovea.isra.63+0x1c2/0x480 [lfsck]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954499] [<ffffffffa121d0db>] lfsck_layout_recreate_lovea+0xd8b/0x2370 [lfsck]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954505] [<ffffffffa121fada>] lfsck_layout_assistant_handler_p2+0x141a/0x1650 [lfsck]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954509] [<ffffffffa11e27ce>] lfsck_assistant_engine+0xfce/0x20b0 [lfsck]
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954511] [<ffffffff810b1131>] kthread+0xd1/0xe0
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954513] [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
Dec 6 12:09:17 nbp13-srv1 kernel: [ 5780.954528] [<ffffffffffffffff>] 0xffffffffffffffff
|
| Comment by Alex Zhuravlev [ 06/Dec/18 ] |
|
this is definitely useful, thanks. |
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
Here is the object info
Inode: 1762289 Type: regular Mode: 07666 Flags: 0x80000
Generation: 2205332265 Version: 0x00000000:00000000
User: 30757 Group: 41548 Project: 0 Size: 0
File ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5bd00854:00000000 -- Tue Oct 23 22:51:16 2018
atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
mtime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
crtime: 0x5bd00837:b8168050 -- Tue Oct 23 22:50:47 2018
Size of extra inode fields: 32
Extended attributes:
trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 14 00 01 00 00 00 60 1b 21 00 00 00 00 00
lma: fid=[0x100140000:0x211b60:0x0] compat=8 incompat=0
trusted.fid (44)
fid: parent=[0x20000205c:0x15dc1:0x0] stripe=0 stripe_size=1048576 stripe_count=8 component_id=3 component_start=17179869184 component_end=68719476736
EXTENTS:
I scanned the MDT; the parent FID 0x20000205c:0x15dc1:0x0 doesn't exist there. |
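For readers cross-checking the object info above, the raw trusted.lma bytes can be decoded by hand. The following is a minimal sketch, not Lustre code: the function name `decode_lma` is hypothetical, and the field layout (two little-endian 32-bit flag words followed by a 64-bit sequence, 32-bit object ID and 32-bit version) is an assumption inferred from the decoded `lma:` line in the same debugfs output.

```python
import struct

# Raw trusted.lma bytes copied from the debugfs output in this ticket.
LMA_HEX = "08 00 00 00 00 00 00 00 00 00 14 00 01 00 00 00 60 1b 21 00 00 00 00 00"


def decode_lma(hex_bytes):
    """Decode a 24-byte trusted.lma value (hypothetical helper).

    Assumed layout, little-endian: compat (u32), incompat (u32),
    FID sequence (u64), FID object ID (u32), FID version (u32).
    """
    raw = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is ignored
    compat, incompat, seq, oid, ver = struct.unpack("<IIQII", raw)
    return {
        "compat": compat,
        "incompat": incompat,
        "fid": f"[{seq:#x}:{oid:#x}:{ver:#x}]",
    }


if __name__ == "__main__":
    print(decode_lma(LMA_HEX))
```

Under that assumed layout the result agrees with the `lma: fid=[0x100140000:0x211b60:0x0] compat=8 incompat=0` line that debugfs printed.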
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
We can lower the prio down to 2. Found 0x20000205c:0x15dc1:0x0
tpfe2 /nobackupp13/.lustre/lost+found/MDT0000 # stat '[0x20000205c:0x15dc1:0x0]-R-0'
  File: '[0x20000205c:0x15dc1:0x0]-R-0'
  Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: d7d58d2eh/3621096750d Inode: 144115327058402753 Links: 1
Access: (0400/-r--------) Uid: (30757/ spocops) Gid: (41548/ s1548)
Access: 2018-12-06 12:09:17.000000000 -0800
Modify: 2018-12-06 12:09:17.000000000 -0800
Change: 2018-12-06 12:09:17.000000000 -0800
 Birth: - |
| Comment by Andreas Dilger [ 06/Dec/18 ] |
|
Mahmoud, I also filed LUDOC-421 to add documentation for the files in .lustre/lost+found. In this case, the "-R-" means "The orphan OST-object knows its parent MDT-object FID, but does not know the position (the file name) in the layout." Just to confirm that the file is what we expect, you could do "lfs getstripe -F /nobackupp13/.lustre/lost+found/MDT0000/[0x20000205c:0x15dc1:0x0]-R-0" to get the FID from that file (it should probably be [0x20000205c:0x15dc1:0x0]). Given that this object has no data in it (size=0 and blocks=0) and no "filename" exists for it, you can probably just delete it. |
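The naming convention described in the comment above can be sketched as a small parser, which may help when scripting over many lost+found entries. This is an illustrative sketch only: `parse_lost_found_name` and `INFIX_MEANING` are hypothetical names, only the "R" meaning comes from this ticket, and the trailing number is kept as an opaque suffix because its meaning is not stated here.

```python
import re

# Pattern for names such as "[0x20000205c:0x15dc1:0x0]-R-0", as seen in this ticket.
NAME_RE = re.compile(
    r"^\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]-([A-Z])-(\d+)$"
)

# Only the "R" infix is explained in this ticket; other infixes are left unmapped.
INFIX_MEANING = {
    "R": "orphan OST-object knows its parent MDT-object FID, "
         "but not its position in the layout",
}


def parse_lost_found_name(name):
    """Split a lost+found entry name into FID, infix, and trailing suffix.

    Returns None if the name does not match the assumed pattern.
    """
    match = NAME_RE.match(name)
    if match is None:
        return None
    seq, oid, ver, infix, suffix = match.groups()
    return {
        "fid": (int(seq, 16), int(oid, 16), int(ver, 16)),
        "infix": infix,
        "meaning": INFIX_MEANING.get(infix, "not described in this ticket"),
        "suffix": int(suffix),  # opaque here; its meaning is not stated in the ticket
    }


if __name__ == "__main__":
    print(parse_lost_found_name("[0x20000205c:0x15dc1:0x0]-R-0"))
```

The parsed FID can then be compared against the output of the `lfs getstripe -F` check suggested above before deciding whether to delete the entry.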
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
These are probably the objects of the corrupted files that I deleted. |
| Comment by Mahmoud Hanafi [ 11/Dec/18 ] |
|
Any update on getting a patch to deal with the LBUG? |
| Comment by Alex Zhuravlev [ 11/Dec/18 ] |
|
I found one possible cause.. trying to reproduce locally. |
| Comment by Gerrit Updater [ 11/Dec/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33826 |
| Comment by Gerrit Updater [ 11/Dec/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33827 |
| Comment by Gerrit Updater [ 10/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33826/ |
| Comment by Peter Jones [ 10/Jan/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 19/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33827/ |
| Comment by Gerrit Updater [ 25/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34312 |
| Comment by Gerrit Updater [ 19/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34312/ |