[LU-11584] kernel BUG at ldiskfs.h:1907! Created: 29/Oct/18  Updated: 08/Feb/20  Resolved: 25/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: File debug-lfsck-nbp15-MDT0000.gz     File dumpe2fs.out     File nbp13.debug.gz     File nbp13.lfsck.debug.out1.gz     File nbp13.lfsck.debug.out2.gz     File oi_scrub.out    
Issue Links:
Cloners
is cloned by LU-11589 kernel BUG at ldiskfs.h:1907! Open
Related
is related to LU-9836 Issues with 2.10 upgrade and files mi... Resolved
is related to LU-11588 ofd_create_hdl() shouldn't overwrite ... Resolved
is related to LU-11583 unsupported incompat LMA feature(s) 0... Reopened
is related to LU-11737 LustreError: 11060:0:(osd_handler.c:3... Resolved
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

The server keeps crashing with the following error:

[  981.957669] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[  981.989579] Lustre: Skipped 11 previous similar messages
[ 1045.404615] ------------[ cut here ]------------
[ 1045.418484] kernel BUG at /tmp/rpmbuild-lustre-jlan-ItUrr9b3/BUILD/lustre-2.10.5/ldiskfs/ldiskfs.h:1907!
[ 1045.446989] invalid opcode: 0000 [#1] SMP 
[ 1045.459302] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) dm_service_time ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) lpfc ib_iser(OE) libiscsi scsi_transport_iscsi crct10dif_generic scsi_transport_fc scsi_tgt rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) bonding ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) sunrpc dm_mirror dm_region_hash dm_log mlx5_ib(OE) ib_core(OE) intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit ttm dm_multipath aesni_intel drm_kms_helper lrw syscopyarea gf128mul sysfillrect sysimgblt glue_helper fb_sys_fops ablk_helper mlx5_core(OE) mlxfw(OE) tg3 ses cryptd mlx_compat(OE) drm ptp ipmi_si enclosure mei_me i2c_core pps_core hpwdt hpilo ipmi_devintf lpc_ich dm_mod mfd_core mei shpchp pcspkr wmi ipmi_msghandler acpi_power_meter binfmt_misc tcp_bic ip_tables virtio_scsi virtio_ring virtio xfs libcrc32c ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common sg usb_storage smartpqi(E) crc32c_intel scsi_transport_sas [last unloaded: pps_core]
[ 1045.776428] CPU: 5 PID: 11348 Comm: lfsck Tainted: G           OE  ------------   3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[ 1045.811992] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/15/2018
[ 1045.837624] task: ffff882ddca23f40 ti: ffff882bd280c000 task.ti: ffff882bd280c000
[ 1045.860117] RIP: 0010:[<ffffffffa10fbd04>]  [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
[ 1045.891259] RSP: 0018:ffff882bd280f980  EFLAGS: 00010207
[ 1045.907218] RAX: 0000000000000000 RBX: ffff882bd280fb58 RCX: ffff882bd280f994
[ 1045.928666] RDX: 00000000ffffffac RSI: ffffffffffffff81 RDI: 00000000ffffff81
[ 1045.950113] RBP: ffff882bd280f980 R08: 00000000ffffff81 R09: ffffffffa10fded0
[ 1045.971560] R10: ffff88303f803b00 R11: 0000000000ffffff R12: 000000000000003c
[ 1045.993006] R13: ffff881e2eae7708 R14: ffff881e2eae7690 R15: 0000000000000000
[ 1046.014452] FS:  0000000000000000(0000) GS:ffff882f7ef40000(0000) knlGS:0000000000000000
[ 1046.038775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1046.056039] CR2: 00007ffff20df034 CR3: 0000002ef4268000 CR4: 00000000003607e0
[ 1046.077485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1046.098932] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1046.120378] Call Trace:
[ 1046.127717]  [<ffffffffa10fe245>] htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]
[ 1046.149690]  [<ffffffff8123002e>] ? __generic_file_splice_read+0x4ee/0x5e0
[ 1046.170356]  [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.186052]  [<ffffffff81234c4c>] ? __find_get_block+0xbc/0x120
[ 1046.203841]  [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.219541]  [<ffffffffa10cdfa0>] ? __ldiskfs_get_inode_loc+0x110/0x3e0 [ldiskfs]
[ 1046.242039]  [<ffffffffa10c89ef>] ? ldiskfs_xattr_find_entry+0x9f/0x130 [ldiskfs]
[ 1046.264536]  [<ffffffffa10c0277>] ldiskfs_htree_fill_tree+0x137/0x2f0 [ldiskfs]
[ 1046.286507]  [<ffffffff811df826>] ? kmem_cache_alloc_trace+0x1d6/0x200
[ 1046.306126]  [<ffffffffa10ae5ec>] ldiskfs_readdir+0x61c/0x850 [ldiskfs]
[ 1046.326012]  [<ffffffffa1147640>] ? osd_declare_ref_del+0x130/0x130 [osd_ldiskfs]
[ 1046.348507]  [<ffffffff812256b2>] ? generic_getxattr+0x52/0x70
[ 1046.366036]  [<ffffffffa1145cde>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[ 1046.387747]  [<ffffffffa1145eb7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[ 1046.408158]  [<ffffffffa122808c>] lfsck_open_dir+0x11c/0x3a0 [lfsck]
[ 1046.427257]  [<ffffffffa1228cb2>] lfsck_master_oit_engine+0x9a2/0x1190 [lfsck]
[ 1046.448969]  [<ffffffff816946f7>] ? __schedule+0x477/0xa30
[ 1046.465453]  [<ffffffffa1229d96>] lfsck_master_engine+0x8f6/0x1360 [lfsck]
[ 1046.486120]  [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
[ 1046.502865]  [<ffffffffa12294a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[ 1046.525360]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[ 1046.540011]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.558323]  [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
[ 1046.574540]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.592852] Code: 44 04 02 48 8d 44 03 c8 48 01 c7 e8 b7 f6 22 e0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 1f 40 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 85 f6 48 
[ 1046.650192] RIP  [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]



 Comments   
Comment by Mahmoud Hanafi [ 29/Oct/18 ]

Some logs before the crash:

[  373.561429] Lustre: 9389:0:(osd_handler.c:7051:osd_mount()) MGS-osd: device /dev/mapper/nbp13_1-MGS0 was upgraded from Lustre-1.x without enabling the dirdata feature. If you do not want to downgrade to Lustre-1.x again, you can enable it via 'tune2fs -O dirdata device'
[  374.897846] Lustre: 9489:0:(osd_handler.c:371:osd_get_lma()) dm-1: unsupported incompat LMA feature(s) 0xffffffe1 for fid = [0x0:0x20af:0x2], ino = 153397641
[  401.375821] Lustre: nbp13-OST0004: Will be in recovery for at least 5:00, or until 25 clients reconnect
[  473.539046] Lustre: nbp13-MDT0000: Will be in recovery for at least 5:00, or until 24 clients reconnect
[  473.567385] Lustre: Skipped 3 previous similar messages
[  478.625631] Lustre: nbp13-OST0005: Will be in recovery for at least 5:00, or until 25 clients reconnect
[  519.958976] LNet: 4020:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.151.26.154@o2ib: 96 seconds
[  519.989838] LNet: 4020:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 5 previous similar messages
[  530.053761] Lustre: 7860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1540855068/real 1540855068]  req@ffff882da135b600 x1615703345988272/t0(0) o8->nbp13-OST0004-osc-MDT0000@0@lo:28/4 lens 520/544 e 0 to 1 dl 1540855223 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[  667.563723] LustreError: 10029:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  667.566809] Lustre: 10692:0:(osd_handler.c:759:osd_check_lma()) nbp13-MDT0000: unsupported incompat LMA feature(s) 0xffffffe1 for fid = [0x200001db5:0x19764:0x0], ino = 162675645
[  667.642235] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[  667.695067] LustreError: 9617:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[  677.552789] LustreError: 10453:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  677.583422] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[  677.636261] LustreError: 9617:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[  687.545335] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[  687.577251] LustreError: 10029:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  687.607875] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
Comment by Peter Jones [ 29/Oct/18 ]

Dongyang is looking into this

Comment by Mahmoud Hanafi [ 29/Oct/18 ]

Got the crash dump also, in case they need to pull something from it.

crash> bt
PID: 10665  TASK: ffff882f0e410fd0  CPU: 5   COMMAND: "lfsck"
 #0 [ffff882909ccf630] machine_kexec at ffffffff8105b64b
 #1 [ffff882909ccf690] __crash_kexec at ffffffff81105342
 #2 [ffff882909ccf760] crash_kexec at ffffffff81105430
 #3 [ffff882909ccf778] oops_end at ffffffff81699778
 #4 [ffff882909ccf7a0] die at ffffffff8102e8ab
 #5 [ffff882909ccf7d0] do_trap at ffffffff81698ec0
 #6 [ffff882909ccf820] do_invalid_op at ffffffff8102b124
 #7 [ffff882909ccf8d0] invalid_op at ffffffff816a487e
    [exception RIP: ldiskfs_rec_len_to_disk+4]
    RIP: ffffffffa1167d04  RSP: ffff882909ccf980  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff882909ccfb58  RCX: ffff882909ccf994
    RDX: 00000000ffffffac  RSI: ffffffffffffff81  RDI: 00000000ffffff81
    RBP: ffff882909ccf980   R8: 00000000ffffff81   R9: ffffffffa1169ed0
    R10: ffff88303f803b00  R11: 0000000000ffffff  R12: 000000000000003c
    R13: ffff882387ee3388  R14: ffff882387ee3310  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff882909ccf988] htree_inlinedir_to_tree at ffffffffa116a245 [ldiskfs]
 #9 [ffff882909ccfb28] ldiskfs_htree_fill_tree at ffffffffa112c277 [ldiskfs]
#10 [ffff882909ccfbf0] ldiskfs_readdir at ffffffffa111a5ec [ldiskfs]
#11 [ffff882909ccfca0] osd_ldiskfs_it_fill at ffffffffa11b1cde [osd_ldiskfs]
#12 [ffff882909ccfce8] osd_it_ea_load at ffffffffa11b1eb7 [osd_ldiskfs]
#13 [ffff882909ccfd10] lfsck_open_dir at ffffffffa123f08c [lfsck]
#14 [ffff882909ccfd50] lfsck_master_oit_engine at ffffffffa123fcb2 [lfsck]
#15 [ffff882909ccfdf0] lfsck_master_engine at ffffffffa1240d96 [lfsck]
#16 [ffff882909ccfec8] kthread at ffffffff810b1131
#17 [ffff882909ccff50] ret_from_fork at ffffffff816a14dd
Comment by Peter Jones [ 29/Oct/18 ]

Could you please supply details of the Lustre version?

Comment by Dongyang Li [ 29/Oct/18 ]

I can see inline_data is enabled for the OST:

htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]

Currently we don't support inline_data on the targets, and mkfs.lustre should not enable it.

How was the OST created?

Comment by Jay Lan (Inactive) [ 29/Oct/18 ]

I have these LU patches on top of 2.10.5:
LU-10055 mdt: use max_mdsize in reply for layout intent
LU-11187 ldiskfs: don't mark mmp buffer head dirty
LU-9230 ldlm: speed up preparation for list of lock cancel
LU-10830 utils: fix create mode for lfs setstripe
LU-10003 lnet: clarify lctl deprecation message
LU-10003 lnet: deprecate lctl net commands
LU-9810 lnd: use less CQ entries for each connection
LU-9810 lnet: fix build with M-OFED 4.1

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Normal ldiskfs format operation.

Here is a typical lustre.csv line:

service432-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp13_1-OST22,/mnt/lustre/nbp13_1-OST22,ost,nbp13,"10.151.26.183@o2ib:10.151.26.185@o2ib",22,,"-m 0 -i 10485760 -G 64 -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",10.151.26.185@o2ib:10.151.26.183@o2ib
nbp13_1-MGS0: Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
nbp13_1-MDT0000: Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata inline_data sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0003: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0005: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0006: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0008: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST000A: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0000: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0001: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0002: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0004: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0007: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0009: Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
Comment by Dongyang Li [ 30/Oct/18 ]

Just saw your updated comment. It looks like nbp13_1-MDT0000 has inline_data enabled.

If it was created with e2fsprogs-1.44.3.wc1, then mke2fs will stop and give an error saying dirdata and inline_data cannot be enabled at the same time.

If it was created with an earlier version of e2fsprogs, it doesn't even know about the inline_data feature.

Was inline_data enabled by tune2fs at some point after the target was created?

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

FYI, we had a hardware issue on this filesystem on Friday and had to run fsck on all targets. It found/fixed issues. This could be a side effect of that.

dumpe2fs.out

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

It was created with e2fsprogs-1.42.13.wc6-7.el7.x86_64. Then fsck from e2fsprogs-1.44.3.wc1 was run this weekend.

During fsck there was an issue with the quota file, so I disabled and re-enabled it:

tune2fs -O^quota

tune2fs -Oquota

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Should I remove the inline_data feature?

Comment by Dongyang Li [ 30/Oct/18 ]

Do we still have the output of the e2fsck?

I think there is a bug in e2fsck where a corrupted inode flag made e2fsck set the inline_data feature in the superblock.

If that's the case, then we need to clear the inline_data feature bit and rerun e2fsck with a patch to fix the inode.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Don't have the fsck output.

I can run:

tune2fs -O^inline_data

What do you mean by 'e2fsck with a patch'?

Comment by Andreas Dilger [ 30/Oct/18 ]

Yes, the inline_data feature is not currently supported with Lustre.

As you wrote, "tune2fs -O ^inline_data" will disable the feature, but e2fsck will automatically enable the feature if it finds an inode with the EXT4_INLINE_DATA_FL set. If there is only a handful of inodes with this flag set, you could run e2fsck -f /dev/XXX (note no 'y' option) and then when it asks to enable the inline_data feature answer 'n' and 'y' to clearing the inode. This would erase the whole inode, but it is also likely that these inodes just contain garbage anyway.

If these are critical files, instead of e2fsck clearing the whole inode, it is also possible to run e2fsck -fn /dev/XXX after disabling the inline_data feature to get a list of inodes affected by this issue, and then use debugfs -w /dev/XXX on the unmounted filesystem, and then stat <inum>|/ROOT/path/to/inode to print the flags on each inode and set_inode_field <inum>|/ROOT/path/to/inode to clear the EXT4_INLINE_DATA_FL = 0x10000000 flag. Unfortunately, there is no debugfs interface to just clear a single flag from an inode, so the existing value is needed to know what to set.
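
A minimal sketch of that debugfs sequence (the inode number and flag values here are illustrative; read the real Flags: value from stat first, then write it back with the 0x10000000 bit removed):

debugfs -w /dev/mapper/nbp13_1-MDT0
debugfs:  stat <140572827>                        # note the current Flags: value, e.g. 0x10000000
debugfs:  set_inode_field <140572827> flags 0x0   # the original flags minus EXT4_INLINE_DATA_FL
debugfs:  quit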

Comment by Dongyang Li [ 30/Oct/18 ]

I agree with Andreas. Just want to mention that "tune2fs -O ^inline_data" won't work;

to disable inline_data, we need to "debugfs -w /dev/XXX" and then "feature -inline_data".

The patch I mentioned is to make e2fsck clear the inode rather than enabling the inline_data feature;

e2fsck currently trusts the inode flag if it has the inline_data flag set, however for us that inode is highly likely to contain garbage.

You can disable inline_data and clear the inode, or clear the EXT4_INLINE_DATA_FL flag on the inode as Andreas said above, without the patch. The patch is just to prevent this from happening again.

DY

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

tune2fs -O^inline_data /dev/mapper/nbp13_1-MDT0
tune2fs 1.44.3.wc1 (23-July-2018)
Clearing filesystem feature 'inline_data' not supported.

1. I will run the debugfs command.

2. Run fsck -fn to get the list of files.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]
[root@nbp13-srv1 ~]# e2fsck -fn /dev/mapper/nbp13_1-MDT0 | tee /tmp/fsck.out
e2fsck 1.44.3.wc1 (23-July-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 140572827 has inline data, but superblock is missing INLINE_DATA feature
Clear? no
Inode 140572827 has INLINE_DATA_FL flag on filesystem without inline data support.
Clear? no
Inode 140572828 has inline data, but superblock is missing INLINE_DATA feature
Clear? no
Inode 140572828 has INLINE_DATA_FL flag on filesystem without inline data support.
Clear? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
'..' in /ROOT/pkolano (140572827) is <The NULL inode> (0), should be /ROOT (140569473).
Fix? no
Unconnected directory inode 140572828 (/ROOT/pkolano/tmp)
Connect to /lost+found? no
Unconnected directory inode 140572829 (/ROOT/pkolano/tmp/64.3)
Connect to /lost+found? no
'..' in ... (140572829) is /ROOT/pkolano/tmp (140572828), should be <The NULL inode> (0).
Fix? no
Unconnected directory inode 140572894 (/ROOT/pkolano/tmp/64.2)
Connect to /lost+found? no
'..' in ... (140572894) is /ROOT/pkolano/tmp (140572828), should be <The NULL inode> (0).
Fix? no
Pass 4: Checking reference counts
Inode 140569473 ref count is 9, should be 8.  Fix? no
Inode 140572827 ref count is 3, should be 1.  Fix? no
Inode 140572828 ref count is 4, should be 2.  Fix? no
Inode 140572829 ref count is 2, should be 1.  Fix? no
Inode 140572894 ref count is 2, should be 1.  Fix? no
Pass 5: Checking group summary information
nbp13-MDT0000: ********** WARNING: Filesystem still has errors **********
nbp13-MDT0000: 28251917/317769600 files (0.1% non-contiguous), 83952122/3106406400 blocks

Both inodes can be deleted:
debugfs: ncheck 140572828
Inode Pathname
140572828 /ROOT/pkolano/tmp
debugfs: ncheck 140572827
Inode Pathname
140572827 /ROOT/pkolano

Comment by Andreas Dilger [ 30/Oct/18 ]

You should be able to disable the inline_data feature via debugfs -w -R "feature -inline_data" /dev/XXX to bypass the tune2fs checks.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

I did disable the feature via debugfs. How do I clear the INLINE_DATA_FL flag from the inodes?

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

I got past the two inodes and mounted the filesystem. I see these errors:

[17342.023159] LustreError: 26378:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[17342.053760] LustreError: 26378:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 59 previous similar messages
[17342.082037] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[17342.135124] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 59 previous similar messages
[17342.165732] LustreError: 25151:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116

Comment by Andreas Dilger [ 30/Oct/18 ]

This looks like it only affects creating files on the one OST0008, the rest of the filesystem should be usable at this point, including reading data on the affected OSTs. If there are multiple OSTs similarly affected then that could be problematic over time, but not immediately except for reduced performance. It should be possible to restart use of the OSTs by deleting the file lov_objids and lov_objseq on the MDT.
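
A hedged sketch of that procedure, using the device naming from this ticket (the mountpoints are hypothetical; verify the exact file names on the ldiskfs-mounted MDT before removing anything):

umount /mnt/lustre/nbp13_1-MDT0000                 # stop the MDT
mount -t ldiskfs /dev/mapper/nbp13_1-MDT0 /mnt/mdt
rm /mnt/mdt/lov_objids /mnt/mdt/lov_objseq         # the files named above
umount /mnt/mdt
mount -t lustre /dev/mapper/nbp13_1-MDT0 /mnt/lustre/nbp13_1-MDT0000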

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

I unmounted the MDT, remounted it as ldiskfs, removed the two files, and remounted using Lustre. Still seeing the errors. Do I need to remount all OSTs?

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

This filesystem is having additional issues.

ls -l is hanging on some directories, and some directories' owner and group show up as "?":

 tpfe2 /nobackupp13/spocops/git/sector/spoc/code/dist/logs # ls
metrics-dump-0.txt  metrics-dump-0.txt.old  tmq.wrapper.log  tmq.wrapper.log.1  tmq.wrapper.log.2  worker.wrapper.log
tpfe2 /nobackupp13/spocops/git/sector/spoc/code/dist/logs # ls -l
ls: cannot access 'tmq.wrapper.log.1': No such file or directory
ls: cannot access 'metrics-dump-0.txt': No such file or directory

Comment by Andreas Dilger [ 30/Oct/18 ]

This typically indicates that the OST objects for those files are missing. OI Scrub on the OSTs should have already moved any objects from the OST's local lost+found directory back into the right place, but it wouldn't hurt to take a look (you could run "debugfs -c -R 'ls -l lost+found' /dev/XXXX" on the respective OSTs, there should only be "." and ".." and a few empty directory blocks reported).

Other than that, if the OST objects are lost due to hardware corruption, then there isn't much that can be done for those files beyond deleting them (with "unlink" instead of "rm") and restoring them from backup.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

How do we clear the

[17342.082037] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116

issue?

Deleting lov_objids and lov_objseq didn't work.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

There are files listed in lost+found, but they look like empty directory blocks.

debugfs: ls -l
 11 40700 (2) 0 0 139264 7-Aug-2018 21:08 .
 2 40755 (2) 0 0 4096 7-Aug-2018 21:09 ..
 0 0 (1) 0 0 0 #75852
 0 0 (1) 0 0 0 #113934
 0 0 (1) 0 0 0 #184111
 0 0 (1) 0 0 0 #266679
 0 0 (1) 0 0 0 #331827
 0 0 (1) 0 0 0 #385401
 0 0 (1) 0 0 0 #444954
 0 0 (1) 0 0 0 #496838
 0 0 (1) 0 0 0 #567511
 0 0 (1) 0 0 0 #605846
 0 0 (1) 0 0 0 #649369
 0 0 (1) 0 0 0 #687206
 0 0 (1) 0 0 0 #732707
 0 0 (1) 0 0 0 #769520
 0 0 (1) 0 0 0 #815218
 0 0 (1) 0 0 0 #875528
 0 0 (1) 0 0 0 #915005
 0 0 (1) 0 0 0 #955684
 0 0 (1) 0 0 0 #993221
 0 0 (1) 0 0 0 #1028775
 0 0 (1) 0 0 0 #1073199
 0 0 (1) 0 0 0 #1111095
 0 0 (1) 0 0 0 #1148688
 0 0 (1) 0 0 0 #1191718
 0 0 (1) 0 0 0 #1230579
 0 0 (1) 0 0 0 #1273743
 0 0 (1) 0 0 0 #1312334
 0 0 (1) 0 0 0 #1353029
 0 0 (1) 0 0 0 #1431710
 0 0 (1) 0 0 0 #1472117
 0 0 (1) 0 0 0 #1524449
 0 0 (1) 0 0 0 #1605063
 0 0 (1) 0 0 0 #1666014
Comment by Alex Zhuravlev [ 30/Oct/18 ]

Can you please try to mount again with full debug enabled and attach the logs from the MDS and that OST?

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Do you want me to remount the OST and MDT or just the MDT?

Comment by Alex Zhuravlev [ 30/Oct/18 ]

Ideally both, please: MDS first, then OST.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Both the OST and MDT are on the same host.

nbp13.debug.gz

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Filesystem level issues are:

1. files with ? for user and gid
2. directories where ls -l hangs.
3. No such file or directory
4. unsupported incompat LMA feature(s) 0xffffffe1 (https://jira.whamcloud.com/browse/LU-11583); I tried setfattr -x trusted.lma /mnt/XXX/ROOT/path/to/file, but it didn't work.

How do we find and clear all these?

Comment by Alex Zhuravlev [ 30/Oct/18 ]

Thanks, it will take some time to study the logs. Can you please also check the OI scrub status:

lctl get_param osd*.*OST*.oi_scrub
Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Attaching oi_scrub.out

oi_scrub.out

Comment by Andreas Dilger [ 30/Oct/18 ]

It looks like OST0008 is currently running an OI Scrub triggered by the object precreate from the MDS:

1540934094.152104:0:1589:0:(ofd_dev.c:1588:ofd_create_hdl()) ofd_create(0x0:2195196)
1540934094.152114:0:1589:0:(ofd_dev.c:1750:ofd_create_hdl()) nbp13-OST0008: reserve 32 objects in group 0x0 at 2195165
1540934094.152122:0:1589:0:(osd_handler.c:1003:osd_fid_lookup()) Process entered
1540934094.165749:0:1589:0:(osd_handler.c:728:osd_check_lma()) Process entered
1540934094.165750:0:1589:0:(osd_handler.c:793:osd_check_lma()) Process leaving (rc=-78)                <************  -78 = -EREMCHG
1540934094.165757:0:1589:0:(osd_scrub.c:2654:osd_scrub_start()) Process entered
1540934094.165790:0:1589:0:(osd_scrub.c:2661:osd_scrub_start()) Process leaving (rc=0 : 0 : 0)
1540934094.165791:0:1589:0:(osd_handler.c:1139:osd_fid_lookup()) nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
1540934094.213780:0:1589:0:(ofd_dev.c:446:ofd_object_free()) object free, fid = [0x100080000:0x217edd:0x0]
1540934094.213783:0:1589:0:(ofd_objects.c:253:ofd_precreate_objects()) Process leaving via out (rc=-115)
1540934094.213785:0:1589:0:(ofd_objects.c:402:ofd_precreate_objects()) created 0/32 objects: -115
1540934094.213785:0:1589:0:(ofd_objects.c:405:ofd_precreate_objects()) Process leaving (rc=-115)
1540934094.213786:0:1589:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
1540934094.272318:0:11192:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116

Based on the speed of the scrub of the other OSTs, this process should only take about 15s and should have completed already for OST0008, but it looks like it is either stuck or restarting the scrub repeatedly due to some inconsistency it is finding with the OST objects.

osd-ldiskfs.nbp13-OST0008.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: scanning
flags: auto
param:
time_since_last_completed: 9 seconds
time_since_latest_start: 8 seconds
time_since_last_checkpoint: 8 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 1170405
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 11061
run_time: 8 seconds
average_speed: 146300 objects/sec
real-time_speed: 155205 objects/sec
current_position: 1457233
lf_scanned: 0
lf_repaired: 0
lf_failed: 0
inodes_per_group: 16
current_iit_group: 91077
current_iit_base: 1457233
current_iit_offset: 1
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no

As for resolving the outstanding issues:
1. files with ? are missing OST objects: unless there is some expectation that these files can be recovered by some other means, they should probably be deleted. This could either be done by manually scanning the filesystem with e.g. "find" or by running a full layout LFSCK. However, until the OI Scrub issue on OST0008 is resolved, the full LFSCK will likely also not complete.
2. directories where "ls -l" hangs: may be caused by the OI Scrub ongoing on OST0008. You could check whether "lfs getstripe" on the hanging files includes only objects on OST0008 (see the check sketched after this list).
3. no such file or directory: the same cause as #1 - OST objects are missing and stat() on those objects returns -ENOENT.
4. unsupported incompat LMA feature(s) 0xffffffe1: did you do the setfattr -x trusted.lma on the ldiskfs-mounted MDT filesystem and the correct file? That should have removed the LMA xattr to clear the flag. According to LU-11583 you deleted that file already?
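
For item 2, a quick check along those lines (the file path is hypothetical; OST index 8 corresponds to nbp13-OST0008):

lfs getstripe /nobackupp13/path/to/hanging_file | grep -E 'lmm_stripe_offset|l_ost_idx'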

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

So... we need to resolve the nbp13-OST0008 issue first. The OI scrub keeps restarting due to the same FID.

[  766.323537] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[  766.355463] Lustre: Skipped 3 previous similar messages
[  766.371175] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  766.401518] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 3 previous similar messages
[  766.401539] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[  766.401540] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 3 previous similar messages
[  766.401543] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[  766.401544] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 3 previous similar messages
[  836.271099] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[  836.303036] Lustre: Skipped 6 previous similar messages
[  836.318743] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  836.349088] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 6 previous similar messages
[  836.349107] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[  836.349108] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 6 previous similar messages
[  836.349111] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[  836.349112] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 6 previous similar messages
[  867.763998] LNet: 3774:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.151.26.144@o2ib: 36 seconds
[  867.794860] LNet: 3774:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 4 previous similar messages
[  966.173700] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[  966.205625] Lustre: Skipped 12 previous similar messages
[  966.221594] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[  966.251939] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 12 previous similar messages
[  966.251958] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[  966.251960] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 12 previous similar messages
[  966.251962] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[  966.251963] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 12 previous similar messages
[ 1225.994890] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 1226.026820] Lustre: Skipped 25 previous similar messages
[ 1226.042790] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 1226.073134] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 25 previous similar messages
[ 1226.073159] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 1226.073161] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 25 previous similar messages
[ 1226.073164] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 1226.073165] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 25 previous similar messages

How do I find this inode?

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

For #4: yes, I deleted one of the files, but there are more, and they do contain user data.

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

I located [0x100080000:0x217edd:0x0] on the OST; it is just an empty inode.

# debugfs -c -R "stat <2195165>" /dev/mapper/nbp13_1-OST8
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST8: catastrophic mode - not reading inode or group bitmaps
Inode: 2195165   Type: bad type    Mode:  0000   Flags: 0x0
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0
Links: 0   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
Size of extra inode fields: 0
BLOCKS:
Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Any updates?

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Some info:
files with ? for uid and gid are the ones that get called out on the MDT as

 unsupported incompat LMA feature(s) 0x70687320 for fid = [0x0:0x2bae:0x2], ino = 100026353
Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Here is an example of an inode on which ls will hang.


 [18623.900347] Lustre: 31365:0:(osd_handler.c:371:osd_get_lma()) dm-1: unsupported incompat LMA feature(s) 0x73746960 for fid = [0x0:0x13af:0x2], ino = 236893545
[18623.942973] Lustre: 31365:0:(osd_handler.c:371:osd_get_lma()) Skipped 138971 previous similar messages


nbp15-srv1 ~ # debugfs -c -R 'stat <236893545> ' /dev/mapper/nbp15_1-MDT0
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp15_1-MDT0: catastrophic mode - not reading inode or group bitmaps
Inode: 236893545   Type: regular    Mode:  0640   Flags: 0x0
Generation: 20109448    Version: 0x00000003:10887bb2
User: 522602360   Group:  1179   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bbf9603:00000000 -- Thu Oct 11 11:27:15 2018
 atime: 0x5b9814ba:00000000 -- Tue Sep 11 12:17:14 2018
 mtime: 0x565df0af:00000000 -- Tue Dec  1 11:10:39 2015
crtime: 0x5b99a181:a7491e2c -- Wed Sep 12 16:30:09 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 6c 6c 63 2e 66 69 74 73 00 00 00 00 00 00 00 00 af 13 00 00 02 00 00 00 
  lma: fid=[0:0x13af:0x2] compat=2e636c6c incompat=73746966
  trusted.link (80)
  trusted.lov (128)
BLOCKS: 

We need a way to find and clear these errors.

Comment by Andreas Dilger [ 31/Oct/18 ]

The information in the "lma" xattr looks to be total garbage. The compat=2e636c6c and incompat=73746966 flags are full of unknown values - only a small number of values are defined. It looks like the trusted.fid has been clobbered by ASCII text, which includes "6c 6c 63 2e 66 69 74 73 == llc.fit", "2e636c6c = .cll", and "73746966 = stif" (or the reverse, depending on byte ordering). One option is clearing the "lma" xattr, in case the "lov" xattr still contains a valid LOV_MAGIC value and a valid layout. The "trusted.lma" xattr can be rebuilt by OI Scrub if needed.

To delete the trusted.lma xattr, the MDT needs to be mounted as type ldiskfs, since the MDS blocks direct access/modification to this xattr. Then "setfattr -x trusted.lma /path/to/file" to delete the xattr.
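
A short sketch of that sequence, using the device name from the debugfs output above (the mountpoint and file path are hypothetical):

umount /mnt/lustre/nbp15_1-MDT0000                  # MDT must not be mounted as lustre
mount -t ldiskfs /dev/mapper/nbp15_1-MDT0 /mnt/mdt
setfattr -x trusted.lma /mnt/mdt/ROOT/path/to/file
umount /mnt/mdt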

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

But it looks like there are thousands of these inodes. How can we easily find them?
What about the nbp13-OST8 issue? We have a second filesystem with the same problem.

Comment by Andreas Dilger [ 31/Oct/18 ]

I located [0x100080000:0x217edd:0x0] on the OST, it is just a empty inode.

Is this the object O/0/d29/2195165 or how did you map this FID to that inode number? If it is, then that would imply directory corruption on the OST, since the directory entry shouldn't be pointing at an unused inode. Ah, to clarify, the 0x217edd part of the FID does not map directly to the inode number, it is just the OID part of the FID, an arbitrary sequential number. If O/0/d29/2195165 exists on OST0008, what does "stat" report for it?

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

RE: [0x100080000:0x217edd:0x0]: OK, I did that mapping incorrectly. Is there a way to find out which inode that object is?

Are you saying [0x100080000:0x217edd:0x0] maps to O/0/d29/2195165?

debugfs:  stat O/0/d29/2195165
Inode: 1762634   Type: regular    Mode:  07666   Flags: 0x80000
Generation: 3301012751    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 2   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
 atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
 mtime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
crtime: 0x5bd254c2:a90f833c -- Thu Oct 25 16:41:54 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 08 00 01 00 00 00 9d 7e 21 00 00 00 00 00 
  lma: fid=[0x100080000:0x217e9d:0x0] compat=8 incompat=0
EXTENTS:
Comment by Andreas Dilger [ 31/Oct/18 ]

Correct. The 0x100080000 part of the FID identifies it as an OST FID (0x1 part) on OST0008. The second part is the Object ID, which (in decimal) is the filename, and modulo 32 is the subdirectory.
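
As a worked example of that mapping for the FID in question (shell arithmetic only):

# [0x100080000:0x217edd:0x0]: seq 0x100080000 identifies OST0008, OID is 0x217edd
objid=$(printf "%d" 0x217edd)         # 2195165
echo "O/0/d$((objid % 32))/$objid"    # prints O/0/d29/2195165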

Comment by Andreas Dilger [ 31/Oct/18 ]

So it looks like there is a hard link to this object, likely from O/0/d29/2195101, which is probably the correct object for that inode due to the FID in the lma xattr, and O/0/d29/2195165 should be removed.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Should I delete it via an ldiskfs mount or via debugfs -w (mi)?

How should we scan for bad lma xattrs?

Comment by Andreas Dilger [ 31/Oct/18 ]

The OST object should be deleted via ldiskfs.

As for the bad lma xattr, I don't think that LFSCK can fix that problem right now, since the incompat flag is specifically intended to block old Lustre versions that don't understand particular feature flags from modifying the inode. For finding the objects, probably the easiest way is to run a namespace walk to find inodes that show errors when accessed. It may be that "lfs find <mountpoint>" is enough to generate an error message for a file with the bad LMA. Unfortunately, we can't use e.g. "lfs fid2path" on the FIDs reported in the error message since they are not valid FIDs.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

ls -l will find these, but it will hang on some.
I have a modified version of the Lester backend scan tool; or can e2scan be used to get the bad inodes?

Comment by Andreas Dilger [ 31/Oct/18 ]

Probably Lester would be fastest. If it is already able to decode the LMA (which it probably is, since that is how it finds the FID), it shouldn't be too hard to check the compat and incompat flags at the same time. The current known compat and incompat flags are:

enum lma_compat {
        LMAC_HSM         = 0x00000001,
/*      LMAC_SOM         = 0x00000002, obsolete since 2.8.0 */
        LMAC_NOT_IN_OI   = 0x00000004, /* the object does NOT need OI mapping */
        LMAC_FID_ON_OST  = 0x00000008, /* For OST-object, its OI mapping is
                                       * under /O/<seq>/d<x>. */
        LMAC_STRIPE_INFO = 0x00000010, /* stripe info in the LMA EA. */
        LMAC_COMP_INFO   = 0x00000020, /* Component info in the LMA EA. */
        LMAC_IDX_BACKUP  = 0x00000040, /* Has index backup. */
};

/**
 * Masks for all features that should be supported by a Lustre version to
 * access a specific file.
 * This information is stored in lustre_mdt_attrs::lma_incompat.
 */
enum lma_incompat {
        LMAI_RELEASED           = 0x00000001, /* file is released */
        LMAI_AGENT              = 0x00000002, /* agent inode */
        LMAI_REMOTE_PARENT      = 0x00000004, /* the parent of the object
                                                 is on the remote MDT */
        LMAI_STRIPED            = 0x00000008, /* striped directory inode */
        LMAI_ORPHAN             = 0x00000010, /* inode is orphan */
        LMA_INCOMPAT_SUPP       = (LMAI_AGENT | LMAI_REMOTE_PARENT | \
                                   LMAI_STRIPED | LMAI_ORPHAN)
};
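
For reference, a rough shell sketch of such a check against an ldiskfs-mounted MDT (this is not the Lester change itself; it assumes trusted.lma starts with lma_compat and lma_incompat as little-endian u32s, which matches the hex dumps earlier in this ticket, and that getfattr can read the xattr on an ldiskfs mount):

#!/bin/bash
# Scan $1 (an ldiskfs-mounted MDT) for inodes whose LMA compat/incompat words
# contain bits outside the known masks listed above.
KNOWN_COMPAT=$((0x7f))     # union of the LMAC_* values (incl. obsolete LMAC_SOM)
KNOWN_INCOMPAT=$((0x1f))   # union of the LMAI_* values

le32() {   # hex bytes in file order ("aabbccdd") -> decimal little-endian u32
    local h=$1
    echo $(( 16#${h:6:2}${h:4:2}${h:2:2}${h:0:2} ))
}

find "$1/ROOT" -xdev -print0 | while IFS= read -r -d '' f; do
    v=$(getfattr --absolute-names -n trusted.lma -e hex "$f" 2>/dev/null |
        awk -F= '/^trusted.lma=/ { print substr($2, 3) }')
    [ -n "$v" ] || continue
    compat=$(le32 "${v:0:8}"); incompat=$(le32 "${v:8:8}")
    if (( (compat & ~KNOWN_COMPAT) || (incompat & ~KNOWN_INCOMPAT) )); then
        printf 'bad LMA: compat=%#x incompat=%#x %s\n' "$compat" "$incompat" "$f"
    fi
done

On the corrupted inode shown earlier this would report compat=0x2e636c6c incompat=0x73746966, while healthy inodes (compat/incompat zero or small known bits) pass.
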
Comment by Mahmoud Hanafi [ 31/Oct/18 ]

How should I delete O/0/d29/2195101?

Comment by Andreas Dilger [ 31/Oct/18 ]

Strictly speaking, if O/0/d29/2195101 exists, then deleting the O/0/d29/2195165 link should make things OK again. That said, this object has no data and has not been used by an MDT inode yet (or it would report a "parent" FID as well), so there is probably no huge risk in deleting it as well, but I also don't think it is totally necessary.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Any chance we can get a tool to scan the MDT for the bad lma? I think that is our only chance; find or ls -l just hangs.

Comment by Andreas Dilger [ 31/Oct/18 ]

Do you have an idea of how many bad objects exist in the filesystem? Have you been able to access file data for some files, with only some relatively small fraction (e.g. 1% or 5%) of the files exhibiting the bad lma problem? Is this problem only happening on the MDT, or also on the OSTs?

The "right" tool for this would be to modify LFSCK to be able to detect an "obviously" corrupt LMA and erase and rebuild it, for some definition of "obviously correct", while preserving the original meaning of the incompat flag. However, that is not something that should be rushed, as we would need to test it fairly well to ensure it does not quickly and automatically do the wrong thing for the filesystem and cause more problems.

Have you tried using something like "lfs find -uid 0 /mnt/XXX" to scan the mounted filesystem? It does not try to instantiate the file inodes on the client (to avoid cache pollution), but rather just fetches the inode attributes to the client and returns them to userspace. However, it does need to access the directory inodes, so there would still be some chance of the client hanging.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Don't know for sure how many; I am guessing 5000 or more. If you run lfs find it will not find any; just like a plain ls, it won't work. You need to do at least an ls -l.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

If we can clear the lma xattr, can we not read all the bad xattrs with the filesystem mounted as ldiskfs?

Comment by Andreas Dilger [ 31/Oct/18 ]

If it is mounted as ldiskfs, then there would need to be a userspace tool written to decode the lma xattr from disk, since it is a binary structure. The debugfs utility decodes this for us for debugging purposes.

Comment by Andreas Dilger [ 31/Oct/18 ]

Alex is investigating a change to LFSCK to rewrite the LMA and clear the bad flags and incorrect FID.

In the meantime, if possible, I would suggest making a device-level backup of the MDT filesystem in case there are any problems. This should be possible in a few hours if there is a suitable device available to hold it.
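
A device-level copy can be as simple as the following, with the MDT offline (the backup destination is hypothetical and must be at least as large as the MDT device):

umount /mnt/lustre/nbp13_1-MDT0000
dd if=/dev/mapper/nbp13_1-MDT0 of=/backup/nbp13_1-MDT0.img bs=4M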

Comment by Alex Zhuravlev [ 31/Oct/18 ]

Yes, I've been working on a patch for OI scrub to fix wrong names in /O/.., which seems to be the blocking point.
Also, I've got a test simulating the problem - essentially a single extra hardlink in /O/.. exposes the problem with endless precreate.

Comment by Alex Zhuravlev [ 31/Oct/18 ]

As for duplicated hardlinks (have you tried to remove O/0/d29/2195165 manually?), I think you can use the following command on a directly mounted OST filesystem (see the sketch after this paragraph for running it across all OSTs):

find O -type f ! -links 1

as that is the object index and it's not supposed to have hardlinks at all. This way you can estimate how many objects may need recovery.
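
For example, something along these lines on each OST, mounted read-only (the device glob follows this system's naming; paths are illustrative):

for dev in /dev/mapper/nbp13_1-OST*; do
    mnt=$(mktemp -d)
    mount -t ldiskfs -o ro "$dev" "$mnt"
    echo "== $dev =="
    (cd "$mnt" && find O -type f ! -links 1)
    umount "$mnt" && rmdir "$mnt"
done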

Comment by Andreas Dilger [ 31/Oct/18 ]

mhanafi were you able to clear the bad (hard-linked) inode(s) on OST0008 to get beyond the precreate problem?

For the LMA issue, Alex is still working on a patch. It would be useful to also dump the "trusted.lov" xattr on one of the inodes that have the LMA error to see if it still contains a valid layout. This would need to be done via debugfs. If the LOV does not contain a valid layout then it needs to be removed as well.

My understanding is that beyond the files impacted by the LMA issue, the filesystem should be usable at this point. Peter was mentioning that there were several filesystems affected at this time? Are they all hitting the same problems? How did multiple filesystems become corrupted at the same time?

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

Yes, I deleted O/0/d29/2195165 and that got us past that.

We had 5 filesystems on similar RAID backends. They experienced the same issue during a firmware update: the RAID t10pi setting got turned off, which caused errors on the host side.

I was able to do a backend scan of all MDT inodes and dump out the lma_compat and lma_incompat. It looks like they are zero except for the bad inodes.

On nbp10 I ran setfattr -x trusted.lma on the list of bad inodes, then mounted via Lustre and started lfsck. It is still running.

This is an example of a bad inode. I'll try to get you some more examples.

nbp15-srv1 ~ # debugfs -c -R 'stat <236893545> ' /dev/mapper/nbp15_1-MDT0
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp15_1-MDT0: catastrophic mode - not reading inode or group bitmaps
Inode: 236893545   Type: regular    Mode:  0640   Flags: 0x0
Generation: 20109448    Version: 0x00000003:10887bb2
User: 522602360   Group:  1179   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bbf9603:00000000 -- Thu Oct 11 11:27:15 2018
 atime: 0x5b9814ba:00000000 -- Tue Sep 11 12:17:14 2018
 mtime: 0x565df0af:00000000 -- Tue Dec  1 11:10:39 2015
crtime: 0x5b99a181:a7491e2c -- Wed Sep 12 16:30:09 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 6c 6c 63 2e 66 69 74 73 00 00 00 00 00 00 00 00 af 13 00 00 02 00 00 00 
  lma: fid=[0:0x13af:0x2] compat=2e636c6c incompat=73746966
  trusted.link (80)
  trusted.lov (128)
BLOCKS: 

How do I figure out which file this is?

Lustre: nbp10-MDT0000: trigger OI scrub by RPC for the [0x2000033ce:0x2f:0x0] with flags 0x4a, rc = 0

fid2path is hanging

tpfe2 ~ # lfs fid2path /nobackupp10 0x2000033ce:0x2f:0x0

The filesystems may be usable for the most part, but we have taken them offline to make sure all issues are resolved before releasing them back to the users.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

nbp10 current lfsck status:

name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: scanning-phase1
flags: inconsistent,incomplete
param: all_targets,orphan,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1541009812
time_since_latest_start: 2281 seconds
last_checkpoint_time: 1541012058
time_since_last_checkpoint: 35 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 116707678, N/A, N/A
first_failure_position: 81903790, [0x200003393:0x149:0x0], 0x21ece37a
checked_phase1: 72944086
checked_phase2: 0
updated_phase1: 1020
updated_phase2: 0
failed_phase1: 38
failed_phase2: 0
directories: 615982
dirent_repaired: 217
linkea_repaired: 802
nlinks_repaired: 0
multiple_linked_checked: 35712
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 1
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 0
run_time_phase1: 4136 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17636 items/sec
average_speed_phase2: N/A
average_speed_total: 17636 items/sec
real_time_speed_phase1: 274 items/sec
real_time_speed_phase2: N/A
current_position: 116925445, N/A, N/A
 
tpfe2 ~ # lfs fid2path /nobackupp10 0x200003393:0x149:0x0
/nobackupp10/hhashimo/data/GDM/data

This is the directory where ls -l is hanging.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

More examples of bad inodes:


 debugfs:  stat cmst_file
Inode: 229115909   Type: regular    Mode:  0664   Flags: 0x0
Generation: 2422773946    Version: 0x00000001:00000028
User: 10376   Group:  1987   Project:     0   Size: 0
File ACL: 2239771779
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5b75bb84:00000000 -- Thu Aug 16 10:59:32 2018
 atime: 0x5bd267f4:00000000 -- Thu Oct 25 18:03:48 2018
 mtime: 0x5b75bb73:3e55a3d4 -- Thu Aug 16 10:59:15 2018
crtime: 0x5b75bb73:3e55a3d4 -- Thu Aug 16 10:59:15 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 73 74 5f 66 69 6c 65 00 00 00 00 00 00 00 00 00 06 04 00 00 02 00 00 00 
  lma: fid=[0:0x406:0x2] compat=665f7473 incompat=656c69
  trusted.link (51)
  system.posix_acl_access (28) = 00 00 00 00 00 00 00 00 01 00 00 00 01 00 06 00 02 00 06 00 fb 28 00 00 04 00 04 00 
  trusted.lov (1688)
BLOCKS:


Comment by Andreas Dilger [ 31/Oct/18 ]

It is possible that fid2path is hanging because LFSCK is still running and rebuilding the OI files, so the client is getting a return code of -EINPROGRESS, for which it will wait indefinitely until the MDS completes LFSCK and either locates the respective FID or returns an error. That said, given that the FID is corrupt in the LMA, it is possible that the requested FID will no longer exist.

Typically, LFSCK will trust the FID stored in the inode LMA over a fid in the directory entry, since the chance of the LMA FID xattr being corrupted without actually corrupting the xattr structure itself (which are stored within a few bytes of each other) was considered to be extremely unlikely, though I guess we may have to reconsider this assumption. I'd need to check the LFSCK code to see if it does a validity check on the dirent FID vs. the LMA FID and excludes one if it is not valid.

For the files where you removed the LMA xattr, are those files now accessible?

It is water under the bridge at this point, but in the future I'd suggest a staged rollout of changes like this so that any issues seem during the upgrade are contained to a single filesystem.

Comment by Mahmoud Hanafi [ 31/Oct/18 ]

After removing the lma xattr, an ls -l will still hang and trigger an OI scrub.

Comment by Mahmoud Hanafi [ 01/Nov/18 ]

More examples:

ls -l output

 -????????? ? ? ? ? ? PrfToolParametersTest.class


debugfs:  stat PrfToolParametersTest.class
Inode: 168298272   Type: regular    Mode:  0640   Flags: 0x0
Generation: 1296031430    Version: 0x00000003:3e688f5d
User: 30757   Group: 41548   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
 atime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
 mtime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
crtime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 01 21 00 00 02 00 00 00 b3 6f 00 00 00 00 00 00 
  lma: fid=[0x200002101:0x6fb3:0x0] compat=0 incompat=0
  trusted.lov (448)
  trusted.link (69)
BLOCKS:


ls -l output 

ls: cannot access './spocops/git/sector/spoc/code/commissioning-tools/build/src/main/matlab/write_LsqParameters.m': No such file or directory
debugfs:  stat /ROOT/./spocops/git/sector/spoc/code/commissioning-tools/build/src/main/matlab/write_LsqParameters.m
Inode: 168297875   Type: regular    Mode:  0640   Flags: 0x0
Generation: 1296029279    Version: 0x00000003:3e6662b8
User: 30757   Group: 41548   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bd11cf9:00000000 -- Wed Oct 24 18:31:37 2018
 atime: 0x5bd11d80:00000000 -- Wed Oct 24 18:33:52 2018
 mtime: 0x5bd11cf9:00000000 -- Wed Oct 24 18:31:37 2018
crtime: 0x5bd11cf9:212fdac8 -- Wed Oct 24 18:31:37 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 01 21 00 00 02 00 00 00 3c 67 00 00 00 00 00 00 
  lma: fid=[0x200002101:0x673c:0x0] compat=0 incompat=0
  trusted.lov (448)
  trusted.link (63)
BLOCKS:
Comment by Andreas Dilger [ 01/Nov/18 ]

These files look like the LMA xattr is valid. Can you check "lfs getstripe" for the files to get the objects, then on the respective OSTs use "objid=NNNN; debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/XXX" to see if the object is missing or maybe broken (wrong parent FID)?

Comment by Mahmoud Hanafi [ 01/Nov/18 ]

lfs getstripe write_LsqParameters.m
write_LsqParameters.m
  lcm_layout_gen:  4
  lcm_entry_count: 4
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   8388608
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 21
      lmm_objects:
      - 0: { l_ost_idx: 21, l_fid: [0x100150000:0x215567:0x0] }

    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 8388608
    lcme_extent.e_end:   17179869184
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1
    lcme_id:             3
    lcme_flags:          0
    lcme_extent.e_start: 17179869184
    lcme_extent.e_end:   68719476736
      lmm_stripe_count:  8
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1
    lcme_id:             4
    lcme_flags:          0
    lcme_extent.e_start: 68719476736
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  16
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1

=========================================
nbp13-srv2 ~ # objid=`printf "%i\n" 0x215567`
nbp13-srv2 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST21
 debugfs 1.44.3.wc1 (23-July-2018)
 /dev/mapper/nbp13_1-OST21: catastrophic mode - not reading inode or group bitmaps
 O/0/d7/2184551: File not found by ext2_lookup


 lfs getstripe PrfToolParametersTest.class
PrfToolParametersTest.class
  lcm_layout_gen:  4
  lcm_entry_count: 4
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   8388608
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 13
      lmm_objects:
      - 0: { l_ost_idx: 13, l_fid: [0x1000d0000:0x2156d1:0x0] }

    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 8388608
    lcme_extent.e_end:   17179869184
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1
    lcme_id:             3
    lcme_flags:          0
    lcme_extent.e_start: 17179869184
    lcme_extent.e_end:   68719476736
      lmm_stripe_count:  8
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1
    lcme_id:             4
    lcme_flags:          0
    lcme_extent.e_start: 68719476736
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  16
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1
=================================
objid=`printf "%i\n" 0x2156d1`
 debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST13 
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST13: catastrophic mode - not reading inode or group bitmaps
O/0/d17/2184913: File not found by ext2_lookup 

So the objects are missing.

Comment by Gerrit Updater [ 01/Nov/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33546
Subject: LU-11584 osd-ldiskfs: fix lost+found object replace
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8f98c54f9d7207c2a3f10f06cb913359cbf65a6d

Comment by Gerrit Updater [ 01/Nov/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33547
Subject: LU-11584 osd: OI scrub to remove corrupted LMA
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 14674945f442e1dccdf2ed4cfd81eb2a2a55e1cf

Comment by Mahmoud Hanafi [ 01/Nov/18 ]

Are the patches ready for us to try?

Comment by Alex Zhuravlev [ 01/Nov/18 ]

mhanafi, not yet; still in testing.

Comment by Jay Lan (Inactive) [ 02/Nov/18 ]

File lustre/include/uapi/linux/lustre/lustre_user.h does not exist in 2.10.5.
It looks like the file resides at lustre/include/lustre/lustre_user.h in b2_10. Is it safe for me to apply the change to that file, or should I wait for your backport?

Comment by Alex Zhuravlev [ 02/Nov/18 ]

I'm making a port right now.

Comment by Gerrit Updater [ 02/Nov/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33560
Subject: LU-11584 osd: OI scrub to ignore object with broken LMA
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 98b64203037c12c9df93b8d17e07370d05372b9c

Comment by Alex Zhuravlev [ 02/Nov/18 ]

First of all, you need to apply the patch and rebuild Lustre. The packages need to be installed on the MDS, and on the OSSes in case there is similar (but unseen) corruption there.

Then it makes sense to estimate the amount of recovery needed (i.e. identify the files with broken LMA).

Do the following steps (on the MDS):

1) mount MDS with OI scrub disabled:

 mount -t lustre -o user_xattr,noscrub <mdt device> <mdt mountpoint>

2) set debug level for subsequent analysis:

lctl set_param debug=+lfsck

3) start LFSCK in read-only mode:

 lctl lfsck_start -M nbp13_1-MDT0000 -t namespace -r --dryrun

4) wait for LFSCK completion by checking the status (a polling sketch follows step 6):

lctl get_param -n  mdd.*.lfsck_namespace | egrep "^status|inconsistent"

5) grab and post last lfsck_namespace status
6) attach Lustre kernel debug log from the MDS as well:

lctl dk | gzip -9 > /tmp/debug-lfsck-nbp_1-MDT0000.log.gz
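
For step 4, a minimal polling loop could look like the following (a sketch; the terminal states other than "completed" are assumptions based on the usual LFSCK status values):

while :; do
    # pull just the "status:" value from the lfsck_namespace report
    status=$(lctl get_param -n "mdd.*.lfsck_namespace" | awk '/^status:/ {print $2}')
    case "$status" in
        completed|failed|stopped) break ;;
    esac
    sleep 60
done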

Thanks

Comment by Andreas Dilger [ 02/Nov/18 ]

I've reviewed the patch and the backport. The LFSCK/scrub testing failed on the master version of the patch due to a known (unrelated) intermittent test error. The testing on the backported patch was delayed because we just added ARM builds to all patches and this was misconfigured for b2_10, but that has been resolved. The testing on the backported patch is expected to complete in about an hour, and no problems are expected.

It should be noted that this patch to LFSCK is intended to repair the specific LMA corruption that is seen on this system, and is not intended for long-term inclusion in your production release. There is no expectation of problems in the short term, but the fix bypasses specific consistency checks in the code that should be restored before the system is upgraded, and a different patch will be landed for long-term production use.

The above procedure runs LFSCK in "dry run" mode, so no fixes will be made to the filesystem, only a report of the number of files that would be repaired. If the dry run is successful and the number of files to be repaired is consistent with expectations, I'd recommend running in fixing mode (remove the "--dryrun" option) on the "test" filesystem and/or an MDT backup image to ensure it fixes the problem. Please attach logs to the ticket when LFSCK is finished, or if you have problems.

Comment by Jay Lan (Inactive) [ 02/Nov/18 ]

Thanks for the update, Andreas and Alex~

Comment by Mahmoud Hanafi [ 03/Nov/18 ]

We haven't run the new code, but here is one more example. Is this a bad LMA on the OST object?

[325981.396812] Lustre: Skipped 3 previous similar messages
[326747.450553] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[326747.482740] Lustre: Skipped 3 previous similar messages
[327512.978588] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[327513.010762] Lustre: Skipped 3 previous similar messages
[328279.688198] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[328279.720378] Lustre: Skipped 3 previous similar messages
nbp13-srv1 ~ # objid=`printf "%i" 0x2155af`
nbp13-srv1 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST1
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST1: catastrophic mode - not reading inode or group bitmaps
Inode: 1673602   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2828099384    Version: 0x00000003:005e7593
User: 30757   Group: 41548   Project:     0   Size: 2180
File ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5bd11ce7:00000000 -- Wed Oct 24 18:31:19 2018
 atime: 0x5bd11ce8:00000000 -- Wed Oct 24 18:31:20 2018
 mtime: 0x5bd11ce7:00000000 -- Wed Oct 24 18:31:19 2018
crtime: 0x5bd11c77:03872348 -- Wed Oct 24 18:29:27 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 01 00 01 00 00 00 ae 55 21 00 00 00 00 00 
  lma: fid=[0x100010000:0x2155ae:0x0] compat=8 incompat=0
  trusted.fid (44)
  fid: parent=[0x200002101:0x66b8:0x0] stripe=0 stripe_size=1048576 stripe_count=1 component_id=1 component_start=0 component_end=8388608
EXTENTS:
(0):3426781548
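
As an aside, those trusted.lma bytes can be decoded by hand as little-endian fields, assuming the standard LMA layout of u32 compat, u32 incompat, then the self FID as u64 seq / u32 oid / u32 ver; a rough shell sketch:

# positional parameters $1..$24 hold the xattr bytes from the debugfs dump
set -- 08 00 00 00 00 00 00 00 00 00 01 00 01 00 00 00 ae 55 21 00 00 00 00 00
echo "compat=0x$4$3$2$1 incompat=0x$8$7$6$5"
echo "seq=0x${16}${15}${14}${13}${12}${11}${10}$9 oid=0x${20}${19}${18}${17}"
# prints compat=0x00000008 incompat=0x00000000 seq=0x0000000100010000 oid=0x002155ae,
# matching the "lma: fid=[0x100010000:0x2155ae:0x0] compat=8" line above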



tpfe2 ~ # lfs fid2path /nobackupp13 0x200002101:0x66b8:0x0
/nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData$Builder.class

tpfe2 ~ # ls -l /nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData
ls: cannot access '/nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData': No such file or directory
Comment by Mahmoud Hanafi [ 03/Nov/18 ]

I ran LFSCK on nbp15, which has the same issues as nbp13. We are planning on reformatting it.

debug-lfsck-nbp15-MDT0000.gz

Comment by Andreas Dilger [ 03/Nov/18 ]
nbp13-srv1 ~ # objid=`printf "%i" 0x2155af`
nbp13-srv1 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST1

FYI, if you have the hex value for the object ID, you could directly use:

debugfs -c -R "stat O/0/d$((0x2155af % 32))/$((0x2155af))" /dev/mapper/nbp13_1-OST1

In any case, what is strange is that the object ID being looked up is 0x2155af, but the object that is found reports itself to be 0x2155ae:

Extended attributes:
  lma: fid=[0x100010000:0x2155ae:0x0] compat=8 incompat=0

Based on the fid2path output, it looks like this object is actually 0x2155ae, so it should be renamed from "/O/0/d15/2184623" to "/O/0/d14/2184622". It isn't clear why OI Scrub is not repairing this automatically.
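
A quick shell check of the O/0 hash-directory mapping for those two object IDs (the remainder mod 32 picks the dNN subdirectory):

printf 'dirent: d%d/%d\n' $((0x2155af % 32)) $((0x2155af))   # d15/2184623
printf 'LMA:    d%d/%d\n' $((0x2155ae % 32)) $((0x2155ae))   # d14/2184622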

Comment by Andreas Dilger [ 03/Nov/18 ]

I did see something interesting in the debug log... Among the files that LFSCK complained about were:

osd_handler.c:6401:osd_dirent_check_repair()) nbp15-MDT0000: the target inode does not recognize the dirent, dir = 237857984/19940587,  name = kplr011027624-2012004120508_llc.fits, ino = 237860402, [0x2000013af:0x8f13:0x0]: rc = -61
osd_handler.c:6401:osd_dirent_check_repair()) nbp15-MDT0000: the target inode does not recognize the dirent, dir = 238340766/19942571,  name = kplr005385471-2009259160929_llc.fits, ino = 238345690, [0x2000013ae:0x8f1c:0x0]: rc = -61

The filenames both end in "llc.fits", which is the same ASCII string that was corrupting the LMA FID. The lookup is returning -61 = -ENODATA, which is what Alex's patch is supposed to return when it finds a corrupted LMA FID, but it doesn't look like the entries were repaired:

        rc = osd_get_lma(info, inode, dentry, &info->oti_ost_attrs);
        if (rc == -ENODATA || !fid_is_sane(&lma->lma_self_fid))
                lma = NULL;
        :
        :
        if (!fid_is_zero(fid)) {
                rc = osd_verify_ent_by_linkea(env, inode, pfid, ent->oied_name,
                                              ent->oied_namelen);
                if (rc == -ENOENT ||
                    (rc == -ENODATA &&
                     !(dev->od_scrub.os_scrub.os_file.sf_flags & SF_UPGRADE))) {
                        /*
                         * linkEA does not recognize the dirent entry,
                         * it may because the dirent entry corruption
                         * and points to other's inode.
                         */
                        CDEBUG(D_LFSCK, "%s: the target inode does not "
                               "recognize the dirent, dir = %lu/%u, "
                               " name = %.*s, ino = %llu, "
                               DFID": rc = %d\n", devname, dir->i_ino,
                               dir->i_generation, ent->oied_namelen,
                               ent->oied_name, ent->oied_ino, PFID(fid), rc);
                        *attr |= LUDA_UNKNOWN;

                        GOTO(out, rc = 0);
                }

I suspect this is because the linkEA (the "link" xattr, which is also stored in the inode) is also missing. It looks like we need to set the SF_UPGRADE flag (maybe renamed to "SF_REBUILD_LMA") if the LMA has been removed (rc = -ENODATA), so that we fall through to the LMA repair code further down. We can't check for the LMAC_INIT_FID flag, since it is stored in the LMA itself, which is missing here.

Comment by Mahmoud Hanafi [ 03/Nov/18 ]

Here are the nbp13 LFSCK runs.


 nbp13-srv1 ~ # lctl get_param -n  mdd.*.lfsck_namespace
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: completed
flags: inconsistent
param: dryrun
last_completed_time: 1541281433
time_since_last_completed: 341 seconds
latest_start_time: 1541281072
time_since_latest_start: 702 seconds
last_checkpoint_time: 1541281433
time_since_last_checkpoint: 341 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 317719759, N/A, N/A
first_failure_position: 153388517, [0x2000020af:0x39d9:0x0], 0x753a410c57f07b3
checked_phase1: 30987846
checked_phase2: 111
inconsistent_phase1: 2
inconsistent_phase2: 3
failed_phase1: 21
failed_phase2: 3
directories: 2709152
dirent_inconsistent: 0
linkea_inconsistent: 2
nlinks_inconsistent: 0
multiple_linked_checked: 5
multiple_linked_inconsistent: 0
unknown_inconsistency: 0
unmatched_pairs_inconsistent: 0
dangling_inconsistent: 0
multiple_referenced_inconsistent: 3
bad_file_type_inconsistent: 0
lost_dirent_inconsistent: 0
local_lost_found_scanned: 3
local_lost_found_moved: 3
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_inconsistent: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_inconsistent: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_inconsistent: 0
linkea_overflow_inconsistent: 0
success_count: 3
run_time_phase1: 362 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 85601 items/sec
average_speed_phase2: 111 objs/sec
average_speed_total: 85366 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A

nbp13.lfsck.debug.out2.gz nbp13.lfsck.debug.out1.gz

Comment by Gerrit Updater [ 05/Nov/18 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/33576
Subject: LU-11584 e2fsck: check xattr 'system.data' before setting inline_data feature
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 64b71635ffa84a01946199e3cd31b1ee9fd9a15f

Comment by Mahmoud Hanafi [ 05/Nov/18 ]

Any comments on the output of nbp13.lfsck?

Comment by Alex Zhuravlev [ 05/Nov/18 ]

I'm modifying the test to also simulate a broken linkEA; I'll report the results ASAP.

Comment by Alex Zhuravlev [ 05/Nov/18 ]

I still don't understand why the nbp13 log doesn't contain the "unsupported incompat LMA feature" message.

Comment by Mahmoud Hanafi [ 18/Nov/18 ]

I was able to find all the inodes with bad LMA and delete them via ldiskfs. So what we have left are files that trigger OI scrub and that report "?" for size/uid/etc. The user has been able to recover all the affected files, so we just need a way to delete them.

If we delete the files via ldiskfs, how can we make sure that the OST objects will be cleaned up?

Comment by Andreas Dilger [ 21/Nov/18 ]

Mahmoud, the orphan OST objects can be cleaned up with LFSCK layout checking. The orphans are linked into the $MOUNT/.lustre/lost+found directory if "lctl lfsck_start -o -t layout" is used (the "-o" option can be used as part of a full LFSCK run as well).
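
A minimal sequence might look like the following (a sketch; the MDT name follows the earlier examples, and $MOUNT/.lustre/lost+found/MDTxxxx is the standard client-side location where orphans appear):

# layout LFSCK with orphan handling, then watch its status
lctl lfsck_start -M nbp13_1-MDT0000 -t layout -o
lctl get_param -n "mdd.*.lfsck_layout" | egrep "^status|repaired"
# once completed, recovered orphans are linked under the client mount
ls /nobackupp13/.lustre/lost+found/MDT0000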

Comment by Mahmoud Hanafi [ 06/Dec/18 ]

Opened new prio 1 case LU-11737: after deleting the quarantined files we are hitting an LBUG.

Comment by Gerrit Updater [ 27/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33546/
Subject: LU-11584 osd-ldiskfs: fix lost+found object replace
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 900352f2bc15906a8fba9cb889df4b166a53bade

Comment by Joseph Gmitter (Inactive) [ 25/Nov/19 ]

Patch landed to master.
