[LU-11737] LustreError: 11060:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed:
Created: 06/Dec/18  Updated: 19/Mar/19  Resolved: 10/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Alex Zhuravlev
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11584 kernel BUG at ldiskfs.h:1907! Resolved
Severity: 1

 Description   

This is related to LU-11584.

  1. Unmounted the targets
  2. Mounted the MDT as ldiskfs
  3. Deleted all the quarantined files
  4. Rebooted and remounted the targets

Hit LBUG.

[ 1781.324368] LustreError: 11060:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed:
[ 1781.351312] LustreError: 11060:0:(osd_handler.c:3985:osd_xattr_set()) LBUG

 

I tried to mount with noscrub but still hit the LBUG.

The filesystem keeps crashing with the LBUG.

 Comments   
Comment by Alex Zhuravlev [ 06/Dec/18 ]

please, show the full stack trace.

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

I was running lfsck in dryrun mode:

lctl lfsck_start -r -n -o -t layout

 PID: 11060  TASK: ffff882f2ed10fd0  CPU: 5   COMMAND: "lfsck_layout"
 #0 [ffff882bcca3f868] machine_kexec at ffffffff8105b64b
 #1 [ffff882bcca3f8c8] __crash_kexec at ffffffff81105342
 #2 [ffff882bcca3f998] panic at ffffffff81689aad
 #3 [ffff882bcca3fa18] lbug_with_loc at ffffffffa06e08cb [libcfs]
 #4 [ffff882bcca3fa38] osd_xattr_set at ffffffffa10dfcdf [osd_ldiskfs]
 #5 [ffff882bcca3fab8] lfsck_layout_refill_lovea at ffffffffa11de3c2 [lfsck]
 #6 [ffff882bcca3fb40] lfsck_layout_recreate_lovea at ffffffffa11f20db [lfsck]
 #7 [ffff882bcca3fc88] lfsck_layout_assistant_handler_p2 at ffffffffa11f4ada [lfsck]
 #8 [ffff882bcca3fd80] lfsck_assistant_engine at ffffffffa11b77ce [lfsck]
 #9 [ffff882bcca3fec8] kthread at ffffffff810b1131
#10 [ffff882bcca3ff50] ret_from_fork at ffffffff816a14dd

 

Comment by Alex Zhuravlev [ 06/Dec/18 ]

ok, thanks. working on this..

Comment by Alex Zhuravlev [ 06/Dec/18 ]

Please try to mount the MDS with the skip_lfsck mount option.

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

Do you want me to try re-running lfsck after remounting?

Comment by Alex Zhuravlev [ 06/Dec/18 ]

My understanding is that we want to get the filesystem back first? In the meantime I'll try to get a fix for this issue.
It seems that part of the code ignores the read-only option.

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

OK, will do.

Some additional info. I'm not sure whether the "delete the dangling record" message is relevant.

Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.906822] Lustre: 16159:0:(lfsck_layout.c:1618:lfsck_layout_del_dangling_rec()) nbp13-MDT0000-osd: delete the dangling record for [0x20000205c:0x15dc1:0x0], comp_id = 4, ea_off = 15 from the trace file: rc = -2
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.906827] LustreError: 16159:0:(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed: 
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.933775] LustreError: 16159:0:(osd_handler.c:3985:osd_xattr_set()) LBUG
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954442] Pid: 16159, comm: lfsck_layout 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954442] Call Trace:
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954458]  [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954467]  [<ffffffffa06d17cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954470]  [<ffffffffa06d187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954479]  [<ffffffffa110acdf>] osd_xattr_set+0x95f/0xc10 [osd_ldiskfs]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954492]  [<ffffffffa12093c2>] lfsck_layout_refill_lovea.isra.63+0x1c2/0x480 [lfsck]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954499]  [<ffffffffa121d0db>] lfsck_layout_recreate_lovea+0xd8b/0x2370 [lfsck]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954505]  [<ffffffffa121fada>] lfsck_layout_assistant_handler_p2+0x141a/0x1650 [lfsck]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954509]  [<ffffffffa11e27ce>] lfsck_assistant_engine+0xfce/0x20b0 [lfsck]
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954511]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954513]  [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
Dec  6 12:09:17 nbp13-srv1 kernel: [ 5780.954528]  [<ffffffffffffffff>] 0xffffffffffffffff
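
When comparing several crash logs like the one above, it can help to pull the source file, line number and function out of each Lustre console message. The following is only a sketch: the regular expression is inferred from the lines quoted in this ticket, not from any official format description, and the name `field2` is a placeholder for the second number, whose meaning is not stated here.

```python
import re

# Pattern inferred from the console lines in this ticket (an assumption,
# not an official grammar for Lustre messages).
LUSTRE_MSG = re.compile(
    r"(?P<level>Lustre(?:Error)?): "
    r"(?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def parse_lustre_line(text):
    """Return a dict of the message fields, or None if the line does not match."""
    m = LUSTRE_MSG.search(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    rec["msg"] = rec["msg"].strip()
    return rec


# Sample copied from the log above.
sample = (
    "[ 5780.906827] LustreError: 16159:0:"
    "(osd_handler.c:3985:osd_xattr_set()) ASSERTION( handle ) failed: "
)
rec = parse_lustre_line(sample)
print(rec["file"], rec["line"], rec["func"], "|", rec["msg"])
```

Lines that do not carry a Lustre message, such as the stack frames, simply return None.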

Comment by Alex Zhuravlev [ 06/Dec/18 ]

this is definitely useful, thanks.

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

Here is the object info

Inode: 1762289   Type: regular    Mode:  07666   Flags: 0x80000
Generation: 2205332265    Version: 0x00000000:00000000
User: 30757   Group: 41548   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bd00854:00000000 -- Tue Oct 23 22:51:16 2018
 atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
 mtime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
crtime: 0x5bd00837:b8168050 -- Tue Oct 23 22:50:47 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 14 00 01 00 00 00 60 1b 21 00 00 00 00 00 
  lma: fid=[0x100140000:0x211b60:0x0] compat=8 incompat=0
  trusted.fid (44)
  fid: parent=[0x20000205c:0x15dc1:0x0] stripe=0 stripe_size=1048576 stripe_count=8 component_id=3 component_start=17179869184 component_end=68719476736
EXTENTS:

I scanned the MDT, and 0x20000205c:0x15dc1:0x0 doesn't exist.
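
For reference, the raw trusted.lma bytes in the debugfs output above can be decoded by hand to cross-check the "lma:" line. This is a minimal sketch that assumes the 24 bytes are laid out little-endian as two 32-bit flag words (compat, incompat) followed by a FID made of a 64-bit sequence, a 32-bit object ID and a 32-bit version. The helper name `decode_lma` is made up for illustration.

```python
import struct

# Bytes copied from the "trusted.lma (24)" line in the debugfs output.
LMA_HEX = (
    "08 00 00 00 00 00 00 00 00 00 14 00 01 00 00 00 "
    "60 1b 21 00 00 00 00 00"
)


def decode_lma(raw):
    """Decode a 24-byte LMA blob under the assumed layout described above."""
    # "<" = little-endian, no padding: u32, u32, u64, u32, u32 = 24 bytes.
    compat, incompat, seq, oid, ver = struct.unpack("<IIQII", raw)
    return {
        "compat": compat,
        "incompat": incompat,
        "fid": "[{:#x}:{:#x}:{:#x}]".format(seq, oid, ver),
    }


lma = decode_lma(bytes.fromhex(LMA_HEX))
print(lma)
```

Under that assumed layout, the decoded values agree with the compat, incompat and fid values debugfs prints on the "lma:" line.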

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

We can lower the priority to 2.

Found 0x20000205c:0x15dc1:0x0.
It was in lost+found:
.lustre/lost+found/MDT0000/'[0x20000205c:0x15dc1:0x0]-R-0'

tpfe2 /nobackupp13/.lustre/lost+found/MDT0000 # stat '[0x20000205c:0x15dc1:0x0]-R-0' 
  File: '[0x20000205c:0x15dc1:0x0]-R-0'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: d7d58d2eh/3621096750d   Inode: 144115327058402753  Links: 1
Access: (0400/-r--------)  Uid: (30757/ spocops)   Gid: (41548/   s1548)
Access: 2018-12-06 12:09:17.000000000 -0800
Modify: 2018-12-06 12:09:17.000000000 -0800
Change: 2018-12-06 12:09:17.000000000 -0800
 Birth: -

Comment by Andreas Dilger [ 06/Dec/18 ]

Mahmoud, I also filed LUDOC-421 to add documentation for the files in .lustre/lost+found. In this case, the "-R-" means "The orphan OST-object knows its parent MDT-object FID, but does not know the position (the file name) in the layout."

Just to confirm that the file is what we expect, you could do "lfs getstripe -F /nobackupp13/.lustre/lost+found/MDT0000/[0x20000205c:0x15dc1:0x0]-R-0" to get the FID from that file (it should probably be [0x20000205c:0x15dc1:0x0]). Given that this object has no data in it (size=0 and blocks=0) and no "filename" exists for it, you can probably just delete it.
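
If there are many such entries to triage, the names can be split into their parts with a short script. This is a sketch only: the name shape "[seq:oid:ver]-TYPE-INDEX" is assumed from the single example in this ticket, only the "R" meaning is taken from the explanation above, and `parse_orphan_name` is a hypothetical helper rather than part of any Lustre tool.

```python
import re

# Name shape assumed from the one example in this ticket:
#   [<seq>:<oid>:<ver>]-<type letter>-<index>
ORPHAN_NAME = re.compile(
    r"^\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]"
    r"-([A-Za-z])-(\d+)$"
)

# Only "R" is described in this ticket; other letters are left unknown.
TYPE_NOTES = {
    "R": "orphan OST-object knows its parent MDT-object FID, "
         "but not its position in the layout",
}


def parse_orphan_name(name):
    """Split a lost+found entry name into FID, type letter and index."""
    m = ORPHAN_NAME.match(name)
    if m is None:
        raise ValueError("unrecognized lost+found name: %r" % name)
    seq, oid, ver, kind, index = m.groups()
    return {
        "fid": (int(seq, 16), int(oid, 16), int(ver, 16)),
        "type": kind,
        "index": int(index),
        "note": TYPE_NOTES.get(kind, "unknown (not described in this ticket)"),
    }


info = parse_orphan_name("[0x20000205c:0x15dc1:0x0]-R-0")
print(info["type"], info["index"], [hex(x) for x in info["fid"]])
```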

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

These are probably the objects of the corrupted files that I deleted.

Comment by Mahmoud Hanafi [ 11/Dec/18 ]

Any update on getting a patch to deal with the LBUG?

Comment by Alex Zhuravlev [ 11/Dec/18 ]

I found one possible cause.. trying to reproduce locally.

Comment by Gerrit Updater [ 11/Dec/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33826
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e7a7d9836e8b3d51b917e84e425195130db7d2c5

Comment by Gerrit Updater [ 11/Dec/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33827
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 5ed85a4a03eb39f04d7b7379acc8b52f046eeb28

Comment by Gerrit Updater [ 10/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33826/
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 875f3fc03aa15049892fe19d6a4fc1132848fced

Comment by Peter Jones [ 10/Jan/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33827/
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 231649a4e93e5a630af8c0b5715da51a92cfb679

Comment by Gerrit Updater [ 25/Feb/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34312
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: cf75e5bfd2661e3162355637e5ca1b2100b3a0ea

Comment by Gerrit Updater [ 19/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34312/
Subject: LU-11737 lfsck: do not ignore dryrun
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0b012bcab5426e9e1fd9f3570dd7704c8129d032
