[LU-11584] kernel BUG at ldiskfs.h:1907! Created: 29/Oct/18 Updated: 08/Feb/20 Resolved: 25/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
server keeps crashing with the following error.
[ 981.957669] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 981.989579] Lustre: Skipped 11 previous similar messages
[ 1045.404615] ------------[ cut here ]------------
[ 1045.418484] kernel BUG at /tmp/rpmbuild-lustre-jlan-ItUrr9b3/BUILD/lustre-2.10.5/ldiskfs/ldiskfs.h:1907!
[ 1045.446989] invalid opcode: 0000 [#1] SMP
[ 1045.459302] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) dm_service_time ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) lpfc ib_iser(OE) libiscsi scsi_transport_iscsi crct10dif_generic scsi_transport_fc scsi_tgt rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) bonding ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) sunrpc dm_mirror dm_region_hash dm_log mlx5_ib(OE) ib_core(OE) intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit ttm dm_multipath aesni_intel drm_kms_helper lrw syscopyarea gf128mul sysfillrect sysimgblt glue_helper fb_sys_fops ablk_helper mlx5_core(OE) mlxfw(OE) tg3 ses cryptd mlx_compat(OE) drm ptp ipmi_si enclosure mei_me i2c_core pps_core hpwdt hpilo ipmi_devintf lpc_ich dm_mod mfd_core mei shpchp pcspkr wmi ipmi_msghandler acpi_power_meter binfmt_misc tcp_bic ip_tables virtio_scsi virtio_ring virtio xfs libcrc32c ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common sg usb_storage smartpqi(E) crc32c_intel scsi_transport_sas [last unloaded: pps_core]
[ 1045.776428] CPU: 5 PID: 11348 Comm: lfsck Tainted: G OE ------------ 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[ 1045.811992] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/15/2018
[ 1045.837624] task: ffff882ddca23f40 ti: ffff882bd280c000 task.ti: ffff882bd280c000
[ 1045.860117] RIP: 0010:[<ffffffffa10fbd04>] [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
[ 1045.891259] RSP: 0018:ffff882bd280f980 EFLAGS: 00010207
[ 1045.907218] RAX: 0000000000000000 RBX: ffff882bd280fb58 RCX: ffff882bd280f994
[ 1045.928666] RDX: 00000000ffffffac RSI: ffffffffffffff81 RDI: 00000000ffffff81
[ 1045.950113] RBP: ffff882bd280f980 R08: 00000000ffffff81 R09: ffffffffa10fded0
[ 1045.971560] R10: ffff88303f803b00 R11: 0000000000ffffff R12: 000000000000003c
[ 1045.993006] R13: ffff881e2eae7708 R14: ffff881e2eae7690 R15: 0000000000000000
[ 1046.014452] FS: 0000000000000000(0000) GS:ffff882f7ef40000(0000) knlGS:0000000000000000
[ 1046.038775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1046.056039] CR2: 00007ffff20df034 CR3: 0000002ef4268000 CR4: 00000000003607e0
[ 1046.077485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1046.098932] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1046.120378] Call Trace:
[ 1046.127717] [<ffffffffa10fe245>] htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]
[ 1046.149690] [<ffffffff8123002e>] ? __generic_file_splice_read+0x4ee/0x5e0
[ 1046.170356] [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.186052] [<ffffffff81234c4c>] ? __find_get_block+0xbc/0x120
[ 1046.203841] [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.219541] [<ffffffffa10cdfa0>] ? __ldiskfs_get_inode_loc+0x110/0x3e0 [ldiskfs]
[ 1046.242039] [<ffffffffa10c89ef>] ? ldiskfs_xattr_find_entry+0x9f/0x130 [ldiskfs]
[ 1046.264536] [<ffffffffa10c0277>] ldiskfs_htree_fill_tree+0x137/0x2f0 [ldiskfs]
[ 1046.286507] [<ffffffff811df826>] ? kmem_cache_alloc_trace+0x1d6/0x200
[ 1046.306126] [<ffffffffa10ae5ec>] ldiskfs_readdir+0x61c/0x850 [ldiskfs]
[ 1046.326012] [<ffffffffa1147640>] ? osd_declare_ref_del+0x130/0x130 [osd_ldiskfs]
[ 1046.348507] [<ffffffff812256b2>] ? generic_getxattr+0x52/0x70
[ 1046.366036] [<ffffffffa1145cde>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[ 1046.387747] [<ffffffffa1145eb7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[ 1046.408158] [<ffffffffa122808c>] lfsck_open_dir+0x11c/0x3a0 [lfsck]
[ 1046.427257] [<ffffffffa1228cb2>] lfsck_master_oit_engine+0x9a2/0x1190 [lfsck]
[ 1046.448969] [<ffffffff816946f7>] ? __schedule+0x477/0xa30
[ 1046.465453] [<ffffffffa1229d96>] lfsck_master_engine+0x8f6/0x1360 [lfsck]
[ 1046.486120] [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
[ 1046.502865] [<ffffffffa12294a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[ 1046.525360] [<ffffffff810b1131>] kthread+0xd1/0xe0
[ 1046.540011] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.558323] [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
[ 1046.574540] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.592852] Code: 44 04 02 48 8d 44 03 c8 48 01 c7 e8 b7 f6 22 e0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 1f 40 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 85 f6 48
[ 1046.650192] RIP [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
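The assertion that fires here is the rec_len sanity check in ldiskfs_rec_len_to_disk(). Assuming the ldiskfs helper mirrors the upstream ext4_rec_len_to_disk() BUG_ON() condition (an assumption, since only the line number is visible above), the check can be modeled in Python:

```python
# Sketch of the BUG_ON() condition assumed to sit at ldiskfs.h:1907,
# mirroring upstream ext4_rec_len_to_disk(): a directory entry rec_len
# must fit in the block, be 4-byte aligned, and the block size must not
# exceed 1 << 18.
def rec_len_valid(rec_len: int, blocksize: int) -> bool:
    return rec_len <= blocksize and blocksize <= (1 << 18) and rec_len % 4 == 0

# The register dump shows RDI = 0x00000000ffffff81, i.e. a garbage
# rec_len produced while walking inline-data directory entries:
print(rec_len_valid(0xffffff81, 4096))  # False -> BUG()
print(rec_len_valid(12, 4096))          # True for a sane entry
```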
|
| Comments |
| Comment by Mahmoud Hanafi [ 29/Oct/18 ] |
|
Some logs before the crash:
[ 373.561429] Lustre: 9389:0:(osd_handler.c:7051:osd_mount()) MGS-osd: device /dev/mapper/nbp13_1-MGS0 was upgraded from Lustre-1.x without enabling the dirdata feature. If you do not want to downgrade to Lustre-1.x again, you can enable it via 'tune2fs -O dirdata device'
[ 374.897846] Lustre: 9489:0:(osd_handler.c:371:osd_get_lma()) dm-1: unsupported incompat LMA feature(s) 0xffffffe1 for fid = [0x0:0x20af:0x2], ino = 153397641
[ 401.375821] Lustre: nbp13-OST0004: Will be in recovery for at least 5:00, or until 25 clients reconnect
[ 473.539046] Lustre: nbp13-MDT0000: Will be in recovery for at least 5:00, or until 24 clients reconnect
[ 473.567385] Lustre: Skipped 3 previous similar messages
[ 478.625631] Lustre: nbp13-OST0005: Will be in recovery for at least 5:00, or until 25 clients reconnect
[ 519.958976] LNet: 4020:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.151.26.154@o2ib: 96 seconds
[ 519.989838] LNet: 4020:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 5 previous similar messages
[ 530.053761] Lustre: 7860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1540855068/real 1540855068] req@ffff882da135b600 x1615703345988272/t0(0) o8->nbp13-OST0004-osc-MDT0000@0@lo:28/4 lens 520/544 e 0 to 1 dl 1540855223 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 667.563723] LustreError: 10029:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 667.566809] Lustre: 10692:0:(osd_handler.c:759:osd_check_lma()) nbp13-MDT0000: unsupported incompat LMA feature(s) 0xffffffe1 for fid = [0x200001db5:0x19764:0x0], ino = 162675645
[ 667.642235] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 667.695067] LustreError: 9617:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 677.552789] LustreError: 10453:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 677.583422] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 677.636261] LustreError: 9617:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 687.545335] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 687.577251] LustreError: 10029:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 687.607875] LustreError: 9617:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116 |
| Comment by Peter Jones [ 29/Oct/18 ] |
|
Dongyang is looking into this |
| Comment by Mahmoud Hanafi [ 29/Oct/18 ] |
|
Got the crash dump also, in case they need to pull something from it.
crash> bt
PID: 10665 TASK: ffff882f0e410fd0 CPU: 5 COMMAND: "lfsck"
#0 [ffff882909ccf630] machine_kexec at ffffffff8105b64b
#1 [ffff882909ccf690] __crash_kexec at ffffffff81105342
#2 [ffff882909ccf760] crash_kexec at ffffffff81105430
#3 [ffff882909ccf778] oops_end at ffffffff81699778
#4 [ffff882909ccf7a0] die at ffffffff8102e8ab
#5 [ffff882909ccf7d0] do_trap at ffffffff81698ec0
#6 [ffff882909ccf820] do_invalid_op at ffffffff8102b124
#7 [ffff882909ccf8d0] invalid_op at ffffffff816a487e
[exception RIP: ldiskfs_rec_len_to_disk+4]
RIP: ffffffffa1167d04 RSP: ffff882909ccf980 RFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff882909ccfb58 RCX: ffff882909ccf994
RDX: 00000000ffffffac RSI: ffffffffffffff81 RDI: 00000000ffffff81
RBP: ffff882909ccf980 R8: 00000000ffffff81 R9: ffffffffa1169ed0
R10: ffff88303f803b00 R11: 0000000000ffffff R12: 000000000000003c
R13: ffff882387ee3388 R14: ffff882387ee3310 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff882909ccf988] htree_inlinedir_to_tree at ffffffffa116a245 [ldiskfs]
#9 [ffff882909ccfb28] ldiskfs_htree_fill_tree at ffffffffa112c277 [ldiskfs]
#10 [ffff882909ccfbf0] ldiskfs_readdir at ffffffffa111a5ec [ldiskfs]
#11 [ffff882909ccfca0] osd_ldiskfs_it_fill at ffffffffa11b1cde [osd_ldiskfs]
#12 [ffff882909ccfce8] osd_it_ea_load at ffffffffa11b1eb7 [osd_ldiskfs]
#13 [ffff882909ccfd10] lfsck_open_dir at ffffffffa123f08c [lfsck]
#14 [ffff882909ccfd50] lfsck_master_oit_engine at ffffffffa123fcb2 [lfsck]
#15 [ffff882909ccfdf0] lfsck_master_engine at ffffffffa1240d96 [lfsck]
#16 [ffff882909ccfec8] kthread at ffffffff810b1131
#17 [ffff882909ccff50] ret_from_fork at ffffffff816a14dd
|
| Comment by Peter Jones [ 29/Oct/18 ] |
|
Could you please supply Lustre version details? |
| Comment by Dongyang Li [ 29/Oct/18 ] |
|
I can see inline_data is enabled for the OST: htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]. Currently we don't support inline_data on the targets, and mkfs.lustre should not enable it. How was the OST created? |
| Comment by Jay Lan (Inactive) [ 29/Oct/18 ] |
|
I have these LU patches on top of 2.10.5: |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Normal ldiskfs format operation. Here is a typical lustre.csv line:
service432-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp13_1-OST22,/mnt/lustre/nbp13_1-OST22,ost,nbp13,"10.151.26.183@o2ib:10.151.26.185@o2ib",22,,"-m 0 -i 10485760 -G 64 -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",10.151.26.185@o2ib:10.151.26.183@o2ib
nbp13_1-MGS0: Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
nbp13_1-MDT0000: Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata inline_data sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0003: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0005: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0006: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0008: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST000A: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0000: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0001: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0002: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0004: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0007: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
nbp13_1-OST0009: Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota |
| Comment by Dongyang Li [ 30/Oct/18 ] |
|
Just saw your updated comment. Looks like nbp13_1-MDT0000 has inline_data enabled. If it was created with e2fsprogs-1.44.3.wc1, mke2fs would have stopped with an error saying dirdata and inline_data cannot be enabled at the same time; an earlier version of e2fsprogs doesn't even know about the inline_data feature. Was inline_data enabled by tune2fs at some point after the target was created? |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
FYI, we had a hardware issue on this filesystem on Friday and had to run fsck on all targets. It found/fixed issues. This could be a side effect of that. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
It was created with e2fsprogs-1.42.13.wc6-7.el7.x86_64. Then an e2fsprogs-1.44.3.wc1 fsck was run this weekend. During the fsck there was an issue with the quota file, so I disabled and re-enabled it: tune2fs -O^quota; tune2fs -Oquota
|
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Should I remove the inline_data feature? |
| Comment by Dongyang Li [ 30/Oct/18 ] |
|
Do we still have the output of the e2fsck? I think there is a bug in e2fsck, where a corrupted inode flag made e2fsck set the inline_data feature in the superblock. If that's the case, then we need to clear the inline_data feature bit and rerun e2fsck with a patch to fix the inode. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Don't have the fsck output. I can run tune2fs -O^inline_data. What do you mean by 'e2fsck with a patch'?
|
| Comment by Andreas Dilger [ 30/Oct/18 ] |
|
Yes, the inline_data feature is not currently supported with Lustre. As you wrote, "tune2fs -O ^inline_data" will disable the feature, but e2fsck will automatically re-enable it if it finds an inode with EXT4_INLINE_DATA_FL set. If there are only a handful of inodes with this flag set, you could run e2fsck -f /dev/XXX (note no 'y' option), answer 'n' when it asks to enable the inline_data feature, and 'y' to clearing the inode. This would erase the whole inode, but it is also likely that these inodes just contain garbage anyway. If these are critical files, instead of having e2fsck clear the whole inode, it is also possible to run e2fsck -fn /dev/XXX after disabling the inline_data feature to get a list of inodes affected by this issue, and then use debugfs -w /dev/XXX on the unmounted filesystem: stat <inum>|/ROOT/path/to/inode prints the flags on each inode, and set_inode_field <inum>|/ROOT/path/to/inode clears the EXT4_INLINE_DATA_FL = 0x10000000 flag. Unfortunately, there is no debugfs interface to clear just a single flag from an inode, so the existing value is needed to know what to set. |
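To make the set_inode_field step above concrete, the new flags value can be computed from the one printed by debugfs stat. A small sketch; the 0x10080000 input is a hypothetical example, not a value taken from this system:

```python
EXT4_INLINE_DATA_FL = 0x10000000  # flag value quoted in the comment above

def flags_without_inline_data(flags: int) -> int:
    """Return the inode flags with EXT4_INLINE_DATA_FL cleared, suitable
    for writing back with debugfs set_inode_field."""
    return flags & ~EXT4_INLINE_DATA_FL

# e.g. a hypothetical inode showing Flags: 0x10080000 would be rewritten
# to 0x80000; one with only the inline-data bit set becomes 0x0:
print(hex(flags_without_inline_data(0x10080000)))  # 0x80000
print(hex(flags_without_inline_data(0x10000000)))  # 0x0
```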
| Comment by Dongyang Li [ 30/Oct/18 ] |
|
I agree with Andreas. Just want to mention that "tune2fs -O ^inline_data" won't work to disable inline_data; we need "debugfs -w /dev/XXX" and then "feature -inline_data". The patch I mentioned is to make e2fsck clear the inode rather than enable the inline_data feature: e2fsck currently trusts the inode if it has the inline_data flag set, but for us that inode is highly likely to contain garbage. You can disable inline_data and clear the inode, or clear the EXT4_INLINE_DATA_FL flag on the inode like Andreas said above, without the patch. The patch is just to prevent this from happening again. DY |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
tune2fs -O^inline_data /dev/mapper/nbp13_1-MDT0
1. I will run the debugfs command
2. run fsck -fn to get the list of files.
|
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
[root@nbp13-srv1 ~]# e2fsck -fn /dev/mapper/nbp13_1-MDT0 | tee /tmp/fsck.out
e2fsck 1.44.3.wc1 (23-July-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 140572827 has inline data, but superblock is missing INLINE_DATA feature
Clear? no
Inode 140572827 has INLINE_DATA_FL flag on filesystem without inline data support.
Clear? no
Inode 140572828 has inline data, but superblock is missing INLINE_DATA feature
Clear? no
Inode 140572828 has INLINE_DATA_FL flag on filesystem without inline data support.
Clear? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
'..' in /ROOT/pkolano (140572827) is <The NULL inode> (0), should be /ROOT (140569473). Fix? no
Unconnected directory inode 140572828 (/ROOT/pkolano/tmp) Connect to /lost+found? no
Unconnected directory inode 140572829 (/ROOT/pkolano/tmp/64.3) Connect to /lost+found? no
'..' in ... (140572829) is /ROOT/pkolano/tmp (140572828), should be <The NULL inode> (0). Fix? no
Unconnected directory inode 140572894 (/ROOT/pkolano/tmp/64.2) Connect to /lost+found? no
'..' in ... (140572894) is /ROOT/pkolano/tmp (140572828), should be <The NULL inode> (0). Fix? no
Pass 4: Checking reference counts
Inode 140569473 ref count is 9, should be 8. Fix? no
Inode 140572827 ref count is 3, should be 1. Fix? no
Inode 140572828 ref count is 4, should be 2. Fix? no
Inode 140572829 ref count is 2, should be 1. Fix? no
Inode 140572894 ref count is 2, should be 1. Fix? no
Pass 5: Checking group summary information
nbp13-MDT0000: ********** WARNING: Filesystem still has errors **********
nbp13-MDT0000: 28251917/317769600 files (0.1% non-contiguous), 83952122/3106406400 blocks
Both inodes can be deleted. |
| Comment by Andreas Dilger [ 30/Oct/18 ] |
|
You should be able to disable the inline_data feature via "debugfs -w 'feature ^inline_data' /dev/XXX" to bypass the tune2fs checks. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
I did disable the feature via debugfs. How do I clear the INLINE_DATA_FL from the inodes? |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
I got past the two inodes and mounted the filesystem. I see these errors:
[17342.023159] LustreError: 26378:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[17342.053760] LustreError: 26378:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 59 previous similar messages
[17342.082037] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[17342.135124] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 59 previous similar messages
[17342.165732] LustreError: 25151:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116 |
| Comment by Andreas Dilger [ 30/Oct/18 ] |
|
This looks like it only affects creating files on the one OST0008; the rest of the filesystem should be usable at this point, including reading data on the affected OSTs. If there are multiple OSTs similarly affected then that could be problematic over time, but not immediately, except for reduced performance. It should be possible to restart use of the OSTs by deleting the files lov_objids and lov_objseq on the MDT. |
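For reference, lov_objids on the MDT holds the MDS's view of the last precreated object id per OST. Assuming the plain array layout used by this vintage of Lustre (one little-endian 64-bit counter per OST index — an assumption, not confirmed in this ticket), its contents can be decoded like this; the demo bytes are synthetic:

```python
import struct

def decode_lov_objids(data: bytes):
    """Decode lov_objids content: one little-endian u64 last-used object
    id per OST index. (Assumed layout; synthetic data below.)"""
    return struct.unpack("<%dQ" % (len(data) // 8), data)

# Synthetic example with three OST slots; on a real MDT, slot 8 would
# correspond to OST0008:
demo = struct.pack("<3Q", 0x10, 0x217edc, 0x20)
print([hex(v) for v in decode_lov_objids(demo)])  # ['0x10', '0x217edc', '0x20']
```

Deleting the file (as suggested above) forces the MDS to re-learn these counters from the OSTs at next connect.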
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
I unmounted the MDT and remounted it as ldiskfs, removed the 2 files, and remounted using lustre. Still seeing the errors. Do I need to remount all OSTs? |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
This filesystem is having additional issues: ls -l hangs on some directories, and some directory owners and groups show up as "?".
tpfe2 /nobackupp13/spocops/git/sector/spoc/code/dist/logs # ls
metrics-dump-0.txt metrics-dump-0.txt.old tmq.wrapper.log tmq.wrapper.log.1 tmq.wrapper.log.2 worker.wrapper.log
tpfe2 /nobackupp13/spocops/git/sector/spoc/code/dist/logs # ls -l
ls: cannot access 'tmq.wrapper.log.1': No such file or directory
ls: cannot access 'metrics-dump-0.txt': No such file or directory
|
| Comment by Andreas Dilger [ 30/Oct/18 ] |
|
This typically indicates that the OST objects for those files are missing. OI Scrub on the OSTs should have already moved any objects from the OST's local lost+found directory back into the right place, but it wouldn't hurt to take a look (you could run "debugfs -c -R 'ls -l lost+found' /dev/XXXX" on the respective OSTs; there should only be "." and ".." and a few empty directory blocks reported). Other than that, if the OST objects are lost due to hardware corruption, then there isn't much that can be done for those files beyond deleting them (with "unlink" instead of "rm") and restoring them from backup. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
How do we clear this issue?
[17342.082037] LustreError: 25151:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
Deleting lov_objids and lov_objseq didn't work. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
There are entries listed in lost+found, but they look like empty directory blocks.
debugfs: ls -l
11 40700 (2) 0 0 139264 7-Aug-2018 21:08 .
2 40755 (2) 0 0 4096 7-Aug-2018 21:09 ..
0 0 (1) 0 0 0 #75852
0 0 (1) 0 0 0 #113934
0 0 (1) 0 0 0 #184111
0 0 (1) 0 0 0 #266679
0 0 (1) 0 0 0 #331827
0 0 (1) 0 0 0 #385401
0 0 (1) 0 0 0 #444954
0 0 (1) 0 0 0 #496838
0 0 (1) 0 0 0 #567511
0 0 (1) 0 0 0 #605846
0 0 (1) 0 0 0 #649369
0 0 (1) 0 0 0 #687206
0 0 (1) 0 0 0 #732707
0 0 (1) 0 0 0 #769520
0 0 (1) 0 0 0 #815218
0 0 (1) 0 0 0 #875528
0 0 (1) 0 0 0 #915005
0 0 (1) 0 0 0 #955684
0 0 (1) 0 0 0 #993221
0 0 (1) 0 0 0 #1028775
0 0 (1) 0 0 0 #1073199
0 0 (1) 0 0 0 #1111095
0 0 (1) 0 0 0 #1148688
0 0 (1) 0 0 0 #1191718
0 0 (1) 0 0 0 #1230579
0 0 (1) 0 0 0 #1273743
0 0 (1) 0 0 0 #1312334
0 0 (1) 0 0 0 #1353029
0 0 (1) 0 0 0 #1431710
0 0 (1) 0 0 0 #1472117
0 0 (1) 0 0 0 #1524449
0 0 (1) 0 0 0 #1605063
0 0 (1) 0 0 0 #1666014 |
| Comment by Alex Zhuravlev [ 30/Oct/18 ] |
|
can you please try to mount again with full debug enabled and attach logs from MDS and that OST? |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Do you want me to remount the OST and MDT or just the MDT? |
| Comment by Alex Zhuravlev [ 30/Oct/18 ] |
|
ideally - both, please: MDS, then OST. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Both ost and mdt are on the same host |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Filesystem level issues are:
1. files with ? for user and gid
How do we find and clear all these? |
| Comment by Alex Zhuravlev [ 30/Oct/18 ] |
|
Thanks, it will take some time to study the logs. Can you please also check the OI scrub status: lctl get_param osd*.*OST*.oi_scrub |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Attaching oi_scrub.out |
| Comment by Andreas Dilger [ 30/Oct/18 ] |
|
It looks like OST0008 is currently running an OI Scrub triggered by the object precreate from the MDS:
1540934094.152104:0:1589:0:(ofd_dev.c:1588:ofd_create_hdl()) ofd_create(0x0:2195196)
1540934094.152114:0:1589:0:(ofd_dev.c:1750:ofd_create_hdl()) nbp13-OST0008: reserve 32 objects in group 0x0 at 2195165
1540934094.152122:0:1589:0:(osd_handler.c:1003:osd_fid_lookup()) Process entered
1540934094.165749:0:1589:0:(osd_handler.c:728:osd_check_lma()) Process entered
1540934094.165750:0:1589:0:(osd_handler.c:793:osd_check_lma()) Process leaving (rc=-78) <************ -78 = -EREMCHG
1540934094.165757:0:1589:0:(osd_scrub.c:2654:osd_scrub_start()) Process entered
1540934094.165790:0:1589:0:(osd_scrub.c:2661:osd_scrub_start()) Process leaving (rc=0 : 0 : 0)
1540934094.165791:0:1589:0:(osd_handler.c:1139:osd_fid_lookup()) nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
1540934094.213780:0:1589:0:(ofd_dev.c:446:ofd_object_free()) object free, fid = [0x100080000:0x217edd:0x0]
1540934094.213783:0:1589:0:(ofd_objects.c:253:ofd_precreate_objects()) Process leaving via out (rc=-115)
1540934094.213785:0:1589:0:(ofd_objects.c:402:ofd_precreate_objects()) created 0/32 objects: -115
1540934094.213785:0:1589:0:(ofd_objects.c:405:ofd_precreate_objects()) Process leaving (rc=-115)
1540934094.213786:0:1589:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
1540934094.272318:0:11192:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
Based on the speed of the scrub of the other OSTs, this process should only take about 15s and should have completed already for OST0008, but it looks like it is either stuck or restarting the scrub repeatedly due to some inconsistency it is finding with the OST objects.
osd-ldiskfs.nbp13-OST0008.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: scanning
flags: auto
param:
time_since_last_completed: 9 seconds
time_since_latest_start: 8 seconds
time_since_last_checkpoint: 8 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 1170405
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 11061
run_time: 8 seconds
average_speed: 146300 objects/sec
real-time_speed: 155205 objects/sec
current_position: 1457233
lf_scanned: 0
lf_repaired: 0
lf_failed: 0
inodes_per_group: 16
current_iit_group: 91077
current_iit_base: 1457233
current_iit_offset: 1
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no
As for resolving the outstanding issues: |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
So... we need to resolve the nbp13-OST0008 issue first. The OI scrub keeps restarting due to the same fid.
[ 766.323537] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 766.355463] Lustre: Skipped 3 previous similar messages
[ 766.371175] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 766.401518] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 3 previous similar messages
[ 766.401539] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 766.401540] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 3 previous similar messages
[ 766.401543] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 766.401544] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 3 previous similar messages
[ 836.271099] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 836.303036] Lustre: Skipped 6 previous similar messages
[ 836.318743] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 836.349088] LustreError: 8836:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 6 previous similar messages
[ 836.349107] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 836.349108] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 6 previous similar messages
[ 836.349111] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 836.349112] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 6 previous similar messages
[ 867.763998] LNet: 3774:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.151.26.144@o2ib: 36 seconds
[ 867.794860] LNet: 3774:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 4 previous similar messages
[ 966.173700] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 966.205625] Lustre: Skipped 12 previous similar messages
[ 966.221594] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 966.251939] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 12 previous similar messages
[ 966.251958] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 966.251960] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 12 previous similar messages
[ 966.251962] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 966.251963] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 12 previous similar messages
[ 1225.994890] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 1226.026820] Lustre: Skipped 25 previous similar messages
[ 1226.042790] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) nbp13-OST0008: unable to precreate: rc = -115
[ 1226.073134] LustreError: 8837:0:(ofd_dev.c:1784:ofd_create_hdl()) Skipped 25 previous similar messages
[ 1226.073159] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) nbp13-OST0008-osc-MDT0000: precreate fid [0x100080000:0x217edc:0x0] < local used fid [0x100080000:0x217edc:0x0]: rc = -116
[ 1226.073161] LustreError: 8115:0:(osp_precreate.c:657:osp_precreate_send()) Skipped 25 previous similar messages
[ 1226.073164] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) nbp13-OST0008-osc-MDT0000: cannot precreate objects: rc = -116
[ 1226.073165] LustreError: 8115:0:(osp_precreate.c:1289:osp_precreate_thread()) Skipped 25 previous similar messages
How do I find this inode? |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
For #4: yes, I deleted one of the files, but there are more, which do contain user data. |
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
I located [0x100080000:0x217edd:0x0] on the OST:
# debugfs -c -R "stat <2195165>" /dev/mapper/nbp13_1-OST8
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST8: catastrophic mode - not reading inode or group bitmaps
Inode: 2195165 Type: bad type Mode: 0000 Flags: 0x0
Generation: 0 Version: 0x00000000
User: 0 Group: 0 Size: 0
File ACL: 0
Links: 0 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
Size of extra inode fields: 0
BLOCKS: |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Any updates? |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Some info:
unsupported incompat LMA feature(s) 0x70687320 for fid = [0x0:0x2bae:0x2], ino = 100026353
|
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Here is an example of an inode on which ls will hang.
[18623.900347] Lustre: 31365:0:(osd_handler.c:371:osd_get_lma()) dm-1: unsupported incompat LMA feature(s) 0x73746960 for fid = [0x0:0x13af:0x2], ino = 236893545
[18623.942973] Lustre: 31365:0:(osd_handler.c:371:osd_get_lma()) Skipped 138971 previous similar messages
nbp15-srv1 ~ # debugfs -c -R 'stat <236893545>' /dev/mapper/nbp15_1-MDT0
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp15_1-MDT0: catastrophic mode - not reading inode or group bitmaps
Inode: 236893545   Type: regular    Mode:  0640   Flags: 0x0
Generation: 20109448    Version: 0x00000003:10887bb2
User: 522602360   Group:  1179   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5bbf9603:00000000 -- Thu Oct 11 11:27:15 2018
atime: 0x5b9814ba:00000000 -- Tue Sep 11 12:17:14 2018
mtime: 0x565df0af:00000000 -- Tue Dec  1 11:10:39 2015
crtime: 0x5b99a181:a7491e2c -- Wed Sep 12 16:30:09 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 6c 6c 63 2e 66 69 74 73 00 00 00 00 00 00 00 00 af 13 00 00 02 00 00 00
  lma: fid=[0:0x13af:0x2] compat=2e636c6c incompat=73746966
  trusted.link (80)
  trusted.lov (128)
BLOCKS:

We need a way to find and clear these errors. |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
The information in the "lma" xattr looks to be total garbage. The compat=2e636c6c and incompat=73746966 flags are full of unknown values - only a small number of values are defined. It looks like the trusted.fid has been clobbered by ASCII text, which includes "6c 6c 63 2e 66 69 74 73 == llc.fit", "2e636c6c = .cll", and "73746966 = stif" (or the reverse, depending on byte ordering). One option is clearing the "lma" xattr, in case the "lov" xattr still contains a valid LOV_MAGIC value and a valid layout. The "trusted.lma" xattr can be rebuilt by OI Scrub if needed. To delete the trusted.lma xattr, the MDT needs to be mounted as type ldiskfs, since the MDS blocks direct access/modification to this xattr. Then "setfattr -x trusted.lma /path/to/file" to delete the xattr. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
But it looks like there are thousands of these inodes. How can we easily find them? |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
Is this the object O/0/d29/2195165 or how did you map this FID to that inode number? If it is, then that would imply directory corruption on the OST, since the directory entry shouldn't be pointing at an unused inode. Ah, to clarify, the 0x217edd part of the FID does not map directly to the inode number, it is just the OID part of the FID, an arbitrary sequential number. If O/0/d29/2195165 exists on OST0008, what does "stat" report for it? |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
RE: [0x100080000:0x217edd:0x0]: OK, I did that mapping incorrectly. Is there a way to find out what that object's inode is? Are you saying [0x100080000:0x217edd:0x0] maps to O/0/d29/2195165?
debugfs: stat O/0/d29/2195165
Inode: 1762634 Type: regular Mode: 07666 Flags: 0x80000
Generation: 3301012751 Version: 0x00000000:00000000
User: 0 Group: 0 Project: 0 Size: 0
File ACL: 0
Links: 2 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
mtime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
crtime: 0x5bd254c2:a90f833c -- Thu Oct 25 16:41:54 2018
Size of extra inode fields: 32
Extended attributes:
trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 08 00 01 00 00 00 9d 7e 21 00 00 00 00 00
lma: fid=[0x100080000:0x217e9d:0x0] compat=8 incompat=0
EXTENTS:
|
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Correct. The 0x100080000 part of the FID identifies it as an OST FID (0x1 part) on OST0008. The second part is the Object ID, which (in decimal) is the filename, and modulo 32 is the subdirectory. |
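The mapping described above can be written as a small helper; this is a minimal sketch, assuming the O/0 group directory seen in the examples in this ticket (objects under other sequence groups would live under a different group directory):

```shell
# Hypothetical helper: map an OST object ID (the OID part of the FID)
# to its object-index path. The OID in decimal is the filename, and
# OID mod 32 picks the d<N> subdirectory.
ost_object_path() {
    printf 'O/0/d%d/%d\n' "$(( $1 % 32 ))" "$(( $1 ))"
}

ost_object_path 0x217edd   # -> O/0/d29/2195165, the path discussed above
```

The result can then be fed to debugfs as in the earlier comments, e.g. `debugfs -c -R "stat $(ost_object_path 0x217edd)" /dev/...`.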
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
So it looks like there is a hard link to this object, likely from O/0/d29/2195101, which is probably the correct object for that inode due to the FID in the lma xattr, and O/0/d29/2195165 should be removed. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Should I delete it via an ldiskfs mount, or via debugfs -w mi? And how should we scan for the bad lma xattr? |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
The OST object should be deleted via ldiskfs. As for the bad lma xattr, I don't think that LFSCK can fix that problem right now, since the incompat flag is specifically intended to block old Lustre versions that don't understand particular feature flags from modifying the inode. For finding the objects, probably the easiest way is to run a namespace walk to find inodes that show errors when accessed. It may be that "lfs find <mountpoint>" might be enough to generate an error message for a file with the bad LMA. Unfortunately, we can't use e.g. "lfs fid2path" on the FIDs reported in the error message since they are not valid FIDs. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
ls -l will find these but it will hang on some. |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Probably Lester would be fastest. If it is already able to decode the LMA (which is probably yes, since that is how it finds the FID) it shouldn't be too hard to check the compat and incompat flags at the same time. The current known compat and incompat flags are:
enum lma_compat {
        LMAC_HSM         = 0x00000001,
/*      LMAC_SOM         = 0x00000002, obsolete since 2.8.0 */
        LMAC_NOT_IN_OI   = 0x00000004, /* the object does NOT need OI mapping */
        LMAC_FID_ON_OST  = 0x00000008, /* For OST-object, its OI mapping is
                                        * under /O/<seq>/d<x>. */
        LMAC_STRIPE_INFO = 0x00000010, /* stripe info in the LMA EA. */
        LMAC_COMP_INFO   = 0x00000020, /* Component info in the LMA EA. */
        LMAC_IDX_BACKUP  = 0x00000040, /* Has index backup. */
};

/**
 * Masks for all features that should be supported by a Lustre version to
 * access a specific file.
 * This information is stored in lustre_mdt_attrs::lma_incompat.
 */
enum lma_incompat {
        LMAI_RELEASED      = 0x00000001, /* file is released */
        LMAI_AGENT         = 0x00000002, /* agent inode */
        LMAI_REMOTE_PARENT = 0x00000004, /* the parent of the object
                                          * is on the remote MDT */
        LMAI_STRIPED       = 0x00000008, /* striped directory inode */
        LMAI_ORPHAN        = 0x00000010, /* inode is orphan */
        LMA_INCOMPAT_SUPP  = (LMAI_AGENT | LMAI_REMOTE_PARENT | \
                              LMAI_STRIPED | LMAI_ORPHAN)
};
 |
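As a sketch of that check: any bit outside the union of the defined flags marks the LMA word as suspect. The masks 0x7f and 0x1f below are assumptions derived from the enums above (0x7f includes the obsolete LMAC_SOM bit, which may still be present on old inodes):

```shell
# Succeeds (exit 0) when either flag word contains bits outside the
# defined enum values, i.e. the LMA looks overwritten with garbage.
lma_flags_corrupt() {  # args: <lma_compat> <lma_incompat>
    [ "$(( $1 & ~0x7f ))" -ne 0 ] || [ "$(( $2 & ~0x1f ))" -ne 0 ]
}

lma_flags_corrupt 0x2e636c6c 0x73746966 && echo corrupt   # the bad nbp15 inode
lma_flags_corrupt 0x8 0x0 || echo clean                   # LMAC_FID_ON_OST only
```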
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
How should I delete O/0/d29/2195101? |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Strictly speaking, if O/0/d29/2195101 exists, then if the O/0/d29/2195165 link is deleted it should be OK again. That said, this object has no data and has not been used by an MDT inode yet (or it would report a "parent" FID as well), so there is probably no huge risk to delete it as well, but I also don't think it is totally necessary. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Any chance we can get a tool to scan the MDT for the bad lma? I think that is our only chance; find or ls -l just hangs. |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Do you have an idea of how many bad objects exist in the filesystem? Have you been able to access file data for some files, but only some relatively small fraction (e.g. 1% or 5%) of the files are exhibiting the bad lma problem? Is this problem only happening on the MDT or also on the OST? The "right" tool for this would be to modify LFSCK to be able to detect an "obviously" corrupt LMA and erase and rebuild it, for some definition of "obviously corrupt", while preserving the original meaning of the incompat flag. However, that is not something that should be rushed, as we would need to test it fairly well to ensure it does not quickly and automatically do the wrong thing for the filesystem and cause more problems. Have you tried using something like "lfs find -uid 0 /mnt/XXX" to scan the mounted filesystem? It does not try to instantiate the file inodes on the client (to avoid cache pollution), but rather just fetches the inode attributes to the client and returns them to userspace. However, it does need to access the directory inodes, so there would still be some chance of the client hanging. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
I don't know for sure how many; I am guessing 5000 or more. If you run lfs find it will not find any; you need to do at least an ls -l. Likewise, a plain ls won't work. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
If we can clear the lma xattr, can we not read all the bad xattrs while mounted as ldiskfs? |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
If it is mounted as ldiskfs, then there would need to be a userspace tool written to decode the lma xattr from disk, since it is a binary structure. The debugfs utility decodes this for us for debugging purposes. |
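Such a decoder is small enough to sketch in shell. This sketch assumes the on-disk layout u32 lma_compat, u32 lma_incompat, then a lu_fid (u64 seq, u32 oid, u32 ver), all little-endian, and uses the bad trusted.lma hex dump quoted earlier in this ticket:

```shell
# Decode a trusted.lma hex dump as printed by debugfs.
# Layout assumption: u32 compat, u32 incompat, u64 seq, u32 oid,
# u32 ver, all stored little-endian on disk.
lma="6c 6c 63 2e 66 69 74 73 00 00 00 00 00 00 00 00 af 13 00 00 02 00 00 00"
set -- $lma
le32() { echo "0x$4$3$2$1"; }          # reassemble 4 little-endian bytes
le64() { echo "0x$8$7$6$5$4$3$2$1"; }  # reassemble 8 little-endian bytes
compat=$(le32 "$1" "$2" "$3" "$4")
incompat=$(le32 "$5" "$6" "$7" "$8")
seq=$(le64 "$9" "${10}" "${11}" "${12}" "${13}" "${14}" "${15}" "${16}")
oid=$(le32 "${17}" "${18}" "${19}" "${20}")
ver=$(le32 "${21}" "${22}" "${23}" "${24}")
decoded=$(printf 'fid=[%#x:%#x:%#x] compat=%x incompat=%x' \
          "$seq" "$oid" "$ver" "$compat" "$incompat")
echo "$decoded"
```

The output matches the interpretation debugfs printed for that inode, which gives some confidence in the layout assumption.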
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Alex is investigating a change to LFSCK to rewrite the LMA and clear the bad flags and incorrect FID. In the meantime, if it is possible I would suggest to make a device-level backup of the MDT filesystem in case there are any problems. This should be possible in a few hours if there is a suitable device available to hold it. |
| Comment by Alex Zhuravlev [ 31/Oct/18 ] |
|
Yes, I've been working on a patch for OI scrub to fix wrong names in /O/.., which seems to be the blocking point. |
| Comment by Alex Zhuravlev [ 31/Oct/18 ] |
|
As for the duplicated hardlinks (have you tried to remove O/0/d29/2195165 manually?), I think you can use the following command on a directly mounted OST filesystem:
find O -type f ! -links 1
as that is the object index and it's not supposed to have hardlinks at all. This way you can estimate how many objects may need recovery. |
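To illustrate, the same scan can be exercised against a scratch tree that mimics the duplicated link above (all paths here are fabricated for the demo; on a real server you would run the find from the ldiskfs mountpoint of the OST):

```shell
# Build a throwaway object-index tree with one object hard-linked under
# two names, then scan for multiply-linked regular files.
scratch=$(mktemp -d)
mkdir -p "$scratch/O/0/d29" "$scratch/O/0/d13"
echo data > "$scratch/O/0/d29/2195101"
ln "$scratch/O/0/d29/2195101" "$scratch/O/0/d29/2195165"   # the bogus extra link
touch "$scratch/O/0/d13/1000013"                           # a healthy object
cd "$scratch"
find O -type f ! -links 1 | sort
```

Only the two names of the multiply-linked object are printed; the healthy single-link object is skipped.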
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
mhanafi were you able to clear the bad (hard-linked) inode(s) on OST0008 to get beyond the precreate problem? For the LMA issue, Alex is still working on a patch. It would be useful to also dump the "trusted.lov" xattr on one of the inodes that have the LMA error to see if it still contains a valid layout. This would need to be done via If the LOV does not contain a valid layout then it needs to be removed as well. My understanding is that beyond the files impacted by the LMA issue, the filesystem should be usable at this point. Peter was mentioning that there were several filesystems affected at this time? Are they all hitting the same problems? How did multiple filesystems become corrupted at the same time? |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Yes, I deleted the O/0/d29/2195165 link and that got us past that.

We had 5 filesystems on similar RAID backends. They experienced the same issue during a firmware update: during the update the RAID T10-PI setting got turned off, which caused errors on the host side.

I was able to do a backend scan of all MDT inodes and dump out the lma_compat and lma_incompat. It looks like they are zero except for the bad inodes. On nbp10 I ran setfattr -x trusted.lma on the list of bad inodes, then mounted via Lustre and started LFSCK; it is still running.

This is an example of a bad inode. I'll try to get you some more examples.

nbp15-srv1 ~ # debugfs -c -R 'stat <236893545>' /dev/mapper/nbp15_1-MDT0
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp15_1-MDT0: catastrophic mode - not reading inode or group bitmaps
Inode: 236893545   Type: regular    Mode:  0640   Flags: 0x0
Generation: 20109448    Version: 0x00000003:10887bb2
User: 522602360   Group:  1179   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5bbf9603:00000000 -- Thu Oct 11 11:27:15 2018
atime: 0x5b9814ba:00000000 -- Tue Sep 11 12:17:14 2018
mtime: 0x565df0af:00000000 -- Tue Dec  1 11:10:39 2015
crtime: 0x5b99a181:a7491e2c -- Wed Sep 12 16:30:09 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 6c 6c 63 2e 66 69 74 73 00 00 00 00 00 00 00 00 af 13 00 00 02 00 00 00
  lma: fid=[0:0x13af:0x2] compat=2e636c6c incompat=73746966
  trusted.link (80)
  trusted.lov (128)
BLOCKS:

How do I figure out what file this is?
Lustre: nbp10-MDT0000: trigger OI scrub by RPC for the [0x2000033ce:0x2f:0x0] with flags 0x4a, rc = 0
fid2path is hanging:
tpfe2 ~ # lfs fid2path /nobackupp10 0x2000033ce:0x2f:0x0
The filesystems may be usable for the most part, but we have taken them offline to make sure all issues are resolved before releasing them back to the users.
|
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
nbp10 current lfsck status:
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: scanning-phase1
flags: inconsistent,incomplete
param: all_targets,orphan,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1541009812
time_since_latest_start: 2281 seconds
last_checkpoint_time: 1541012058
time_since_last_checkpoint: 35 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 116707678, N/A, N/A
first_failure_position: 81903790, [0x200003393:0x149:0x0], 0x21ece37a
checked_phase1: 72944086
checked_phase2: 0
updated_phase1: 1020
updated_phase2: 0
failed_phase1: 38
failed_phase2: 0
directories: 615982
dirent_repaired: 217
linkea_repaired: 802
nlinks_repaired: 0
multiple_linked_checked: 35712
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 1
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 0
run_time_phase1: 4136 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17636 items/sec
average_speed_phase2: N/A
average_speed_total: 17636 items/sec
real_time_speed_phase1: 274 items/sec
real_time_speed_phase2: N/A
current_position: 116925445, N/A, N/A

tpfe2 ~ # lfs fid2path /nobackupp10 0x200003393:0x149:0x0
/nobackupp10/hhashimo/data/GDM/data
This is the directory where ls -l is hanging. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
Another example of a bad inode:
debugfs: stat cmst_file
Inode: 229115909 Type: regular Mode: 0664 Flags: 0x0
Generation: 2422773946 Version: 0x00000001:00000028
User: 10376 Group: 1987 Project: 0 Size: 0
File ACL: 2239771779
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5b75bb84:00000000 -- Thu Aug 16 10:59:32 2018
atime: 0x5bd267f4:00000000 -- Thu Oct 25 18:03:48 2018
mtime: 0x5b75bb73:3e55a3d4 -- Thu Aug 16 10:59:15 2018
crtime: 0x5b75bb73:3e55a3d4 -- Thu Aug 16 10:59:15 2018
Size of extra inode fields: 32
Extended attributes:
trusted.lma (24) = 73 74 5f 66 69 6c 65 00 00 00 00 00 00 00 00 00 06 04 00 00 02 00 00 00
lma: fid=[0:0x406:0x2] compat=665f7473 incompat=656c69
trusted.link (51)
system.posix_acl_access (28) = 00 00 00 00 00 00 00 00 01 00 00 00 01 00 06 00 02 00 06 00 fb 28 00 00 04 00 04 00
trusted.lov (1688)
BLOCKS:
|
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
It is possible that fid2path is hanging because LFSCK is still running and rebuilding the OI files, so it is getting a return code of -EINPROGRESS, for which the client will wait indefinitely until the MDS completes LFSCK and either locates the respective FID or returns an error. That said, given that the FID is corrupt in the LMA, it is possible that the requested FID will no longer exist. Typically, LFSCK will trust the FID stored in the inode LMA over a FID in the directory entry, since the chance of the LMA FID being corrupted without actually corrupting the xattr structure itself (they are stored within a few bytes of each other) was considered to be extremely unlikely, though I guess we may have to reconsider this assumption. I'd need to check the LFSCK code to see if it does a validity check on the dirent FID vs. the LMA FID and excludes one if it is not valid. For the files where you removed the LMA xattr, are those files now accessible? It is water under the bridge at this point, but in the future I'd suggest a staged rollout of changes like this so that any issues seen during the upgrade are contained to a single filesystem. |
| Comment by Mahmoud Hanafi [ 31/Oct/18 ] |
|
After removing the lma xattr, an ls -l will still hang and trigger an OI scrub.
|
| Comment by Mahmoud Hanafi [ 01/Nov/18 ] |
|
More examples. ls -l output:
-????????? ? ? ? ? ? PrfToolParametersTest.class
debugfs: stat PrfToolParametersTest.class
Inode: 168298272 Type: regular Mode: 0640 Flags: 0x0
Generation: 1296031430 Version: 0x00000003:3e688f5d
User: 30757 Group: 41548 Project: 0 Size: 0
File ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
atime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
mtime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
crtime: 0x5bd11d80:79fa43d0 -- Wed Oct 24 18:33:52 2018
Size of extra inode fields: 32
Extended attributes:
trusted.lma (24) = 00 00 00 00 00 00 00 00 01 21 00 00 02 00 00 00 b3 6f 00 00 00 00 00 00
lma: fid=[0x200002101:0x6fb3:0x0] compat=0 incompat=0
trusted.lov (448)
trusted.link (69)
BLOCKS:
ls -l output
ls: cannot access './spocops/git/sector/spoc/code/commissioning-tools/build/src/main/matlab/write_LsqParameters.m': No such file or directory
debugfs: stat /ROOT/./spocops/git/sector/spoc/code/commissioning-tools/build/src/main/matlab/write_LsqParameters.m
Inode: 168297875 Type: regular Mode: 0640 Flags: 0x0
Generation: 1296029279 Version: 0x00000003:3e6662b8
User: 30757 Group: 41548 Project: 0 Size: 0
File ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5bd11cf9:00000000 -- Wed Oct 24 18:31:37 2018
atime: 0x5bd11d80:00000000 -- Wed Oct 24 18:33:52 2018
mtime: 0x5bd11cf9:00000000 -- Wed Oct 24 18:31:37 2018
crtime: 0x5bd11cf9:212fdac8 -- Wed Oct 24 18:31:37 2018
Size of extra inode fields: 32
Extended attributes:
trusted.lma (24) = 00 00 00 00 00 00 00 00 01 21 00 00 02 00 00 00 3c 67 00 00 00 00 00 00
lma: fid=[0x200002101:0x673c:0x0] compat=0 incompat=0
trusted.lov (448)
trusted.link (63)
BLOCKS:
|
| Comment by Andreas Dilger [ 01/Nov/18 ] |
|
These files look like the LMA xattr is valid. Can you check "lfs getstripe" for the files to get the objects, then on the respective OSTs use "objid=NNNN; debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/XXX" to see if the object is missing or maybe broken (wrong parent FID)? |
| Comment by Mahmoud Hanafi [ 01/Nov/18 ] |
lfs getstripe write_LsqParameters.m
write_LsqParameters.m
lcm_layout_gen: 4
lcm_entry_count: 4
lcme_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 8388608
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 21
lmm_objects:
0: { l_ost_idx: 21, l_fid: [0x100150000:0x215567:0x0] }
lcme_id: 2
lcme_flags: 0
lcme_extent.e_start: 8388608
lcme_extent.e_end: 17179869184
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
lcme_id: 3
lcme_flags: 0
lcme_extent.e_start: 17179869184
lcme_extent.e_end: 68719476736
lmm_stripe_count: 8
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
lcme_id: 4
lcme_flags: 0
lcme_extent.e_start: 68719476736
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
=========================================
nbp13-srv2 ~ # objid=`printf "%i\n" 0x215567`
nbp13-srv2 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST21
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST21: catastrophic mode - not reading inode or group bitmaps
O/0/d7/2184551: File not found by ext2_lookup
lfs getstripe PrfToolParametersTest.class
PrfToolParametersTest.class
lcm_layout_gen: 4
lcm_entry_count: 4
lcme_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 8388608
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 13
lmm_objects:
0: { l_ost_idx: 13, l_fid: [0x1000d0000:0x2156d1:0x0] }
lcme_id: 2
lcme_flags: 0
lcme_extent.e_start: 8388608
lcme_extent.e_end: 17179869184
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
lcme_id: 3
lcme_flags: 0
lcme_extent.e_start: 17179869184
lcme_extent.e_end: 68719476736
lmm_stripe_count: 8
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
lcme_id: 4
lcme_flags: 0
lcme_extent.e_start: 68719476736
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 65535
lmm_stripe_offset: -1
=================================
objid=`printf "%i\n" 0x2156d1`
debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST13
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST13: catastrophic mode - not reading inode or group bitmaps
O/0/d17/2184913: File not found by ext2_lookup
So the objects are missing.
|
| Comment by Gerrit Updater [ 01/Nov/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33546 |
| Comment by Gerrit Updater [ 01/Nov/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33547 |
| Comment by Mahmoud Hanafi [ 01/Nov/18 ] |
|
Are the patches ready for us to try?
|
| Comment by Alex Zhuravlev [ 01/Nov/18 ] |
|
mhanafi not yet, still in testing. |
| Comment by Jay Lan (Inactive) [ 02/Nov/18 ] |
|
File lustre/include/uapi/linux/lustre/lustre_user.h does not exist in 2.10.5. |
| Comment by Alex Zhuravlev [ 02/Nov/18 ] |
|
I'm making a port right now. |
| Comment by Gerrit Updater [ 02/Nov/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33560 |
| Comment by Alex Zhuravlev [ 02/Nov/18 ] |
|
First of all, you need to apply the patch and rebuild Lustre. The packages need to be installed on the MDS, and on the OSS in case there is similar (but unseen) corruption there. Then it makes sense to estimate the amount of recovery needed (i.e. identify files with broken LMA). Do the following steps (on the MDS):
1) mount the MDT with OI scrub disabled:
mount -t lustre -o user_xattr,noscrub <mdt device> <mdt mountpoint>
2) set the debug level for subsequent analysis:
lctl set_param debug=+lfsck
3) start LFSCK in read-only mode:
lctl lfsck_start -M nbp13_1-MDT0000 -t namespace -r --dryrun
4) wait for LFSCK completion, checking status with:
lctl get_param -n mdd.*.lfsck_namespace | egrep "^status|inconsistent"
5) grab and post the last lfsck_namespace status and the debug log:
lctl dk | gzip -9 > /tmp/debug-lfsck-nbp_1-MDT0000.log.gz
Thanks |
| Comment by Andreas Dilger [ 02/Nov/18 ] |
|
I've reviewed the patch and the backport. The LFSCK/scrub testing failed on the master version of the patch due to a known (unrelated) intermittent test error. The testing on the backported patch was delayed because we just added ARM builds to all patches and this was misconfigured for b2_10, but that has been resolved. The testing on the backported patch is expected to complete in about an hour, and no problems are expected. It should be noted that this patch to LFSCK is intended to repair the specific LMA corruption that is seen on this system, and is not intended for long-term inclusion in your production release. There is no expectation of problems in the short term, but the fix bypasses specific consistency checks in the code that should be restored before the system is upgraded, and a different patch will be landed for long-term production use. The above procedure is running LFSCK in "dry run" mode, so no fixes will be made to the filesystem, only a report of the number of files that will be repaired. If the dry run is successful and the number of files being repaired is consistent with expectations, I'd recommend to run in fixing mode (remove "--dryrun" option) on the "test" filesystem and/or MDT backup image to ensure it fixes the problem. Please attach logs to the ticket when LFSCK is finished, or if you have problems. |
| Comment by Jay Lan (Inactive) [ 02/Nov/18 ] |
|
Thanks for the update, Andreas and Alex~ |
| Comment by Mahmoud Hanafi [ 03/Nov/18 ] |
|
We haven't run the new code, but here is one more example. Is this a bad lma on the OST object?
[325981.396812] Lustre: Skipped 3 previous similar messages
[326747.450553] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[326747.482740] Lustre: Skipped 3 previous similar messages
[327512.978588] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[327513.010762] Lustre: Skipped 3 previous similar messages
[328279.688198] Lustre: nbp13-OST0001: trigger OI scrub by RPC for the [0x100010000:0x2155af:0x0] with flags 0x4a, rc = 0
[328279.720378] Lustre: Skipped 3 previous similar messages
nbp13-srv1 ~ # objid=`printf "%i" 0x2155af`
nbp13-srv1 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST1
debugfs 1.44.3.wc1 (23-July-2018)
/dev/mapper/nbp13_1-OST1: catastrophic mode - not reading inode or group bitmaps
Inode: 1673602   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2828099384    Version: 0x00000003:005e7593
User: 30757   Group: 41548   Project:     0   Size: 2180
File ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5bd11ce7:00000000 -- Wed Oct 24 18:31:19 2018
atime: 0x5bd11ce8:00000000 -- Wed Oct 24 18:31:20 2018
mtime: 0x5bd11ce7:00000000 -- Wed Oct 24 18:31:19 2018
crtime: 0x5bd11c77:03872348 -- Wed Oct 24 18:29:27 2018
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 01 00 01 00 00 00 ae 55 21 00 00 00 00 00
  lma: fid=[0x100010000:0x2155ae:0x0] compat=8 incompat=0
  trusted.fid (44)
  fid: parent=[0x200002101:0x66b8:0x0] stripe=0 stripe_size=1048576 stripe_count=1 component_id=1 component_start=0 component_end=8388608
EXTENTS:
(0):3426781548
tpfe2 ~ # lfs fid2path /nobackupp13 0x200002101:0x66b8:0x0
/nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData$Builder.class
tpfe2 ~ # ls -l
/nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData
ls: cannot access '/nobackupp13/quarantine/spocops/git/sector/spoc/code/dist/dist/classes/java/main/gov/nasa/tess/dv/outputs/DvAbstractTargetTableData': No such file or directory |
| Comment by Mahmoud Hanafi [ 03/Nov/18 ] |
|
I ran LFSCK on nbp15, which has the same issues as nbp13. We are planning on reformatting it.
|
| Comment by Andreas Dilger [ 03/Nov/18 ] |
nbp13-srv1 ~ # objid=`printf "%i" 0x2155af`
nbp13-srv1 ~ # debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/mapper/nbp13_1-OST1
FYI, if you have the hex value for the object ID, you could directly use:
debugfs -c -R "stat O/0/d$((0x2155af % 32))/$((0x2155af))" /dev/mapper/nbp13_1-OST1
In any case, what is strange is that the object ID being looked up is 0x2155af, but the object that is found reports itself to be 0x2155ae:
Extended attributes:
  lma: fid=[0x100010000:0x2155ae:0x0] compat=8 incompat=0
Based on the fid2path output, it looks like this object is actually 0x2155ae, so it should be renamed from "/O/0/d15/2184623" to "/O/0/d14/2184622". It isn't clear why OI Scrub is not repairing this automatically. |
| Comment by Andreas Dilger [ 03/Nov/18 ] |
|
I did see something interesting in the debug log... Two of the files that LFSCK complained about were:
osd_handler.c:6401:osd_dirent_check_repair()) nbp15-MDT0000: the target inode does not recognize the dirent, dir = 237857984/19940587, name = kplr011027624-2012004120508_llc.fits, ino = 237860402, [0x2000013af:0x8f13:0x0]: rc = -61
osd_handler.c:6401:osd_dirent_check_repair()) nbp15-MDT0000: the target inode does not recognize the dirent, dir = 238340766/19942571, name = kplr005385471-2009259160929_llc.fits, ino = 238345690, [0x2000013ae:0x8f1c:0x0]: rc = -61
The filenames both end in "llc.fits", which is the same ASCII string that was corrupting the LMA FID. This is returning "-61 = -ENODATA", which is what Alex's patch is supposed to return when it finds a corrupted LMA FID, but it doesn't look like it repaired them:
rc = osd_get_lma(info, inode, dentry, &info->oti_ost_attrs);
if (rc == -ENODATA || !fid_is_sane(&lma->lma_self_fid))
lma = NULL;
:
:
if (!fid_is_zero(fid)) {
rc = osd_verify_ent_by_linkea(env, inode, pfid, ent->oied_name,
ent->oied_namelen);
if (rc == -ENOENT ||
(rc == -ENODATA &&
!(dev->od_scrub.os_scrub.os_file.sf_flags & SF_UPGRADE))) {
/*
* linkEA does not recognize the dirent entry,
* it may because the dirent entry corruption
* and points to other's inode.
*/
CDEBUG(D_LFSCK, "%s: the target inode does not "
"recognize the dirent, dir = %lu/%u, "
" name = %.*s, ino = %llu, "
DFID": rc = %d\n", devname, dir->i_ino,
dir->i_generation, ent->oied_namelen,
ent->oied_name, ent->oied_ino, PFID(fid), rc);
*attr |= LUDA_UNKNOWN;
GOTO(out, rc = 0);
}
I'd suspect that this is because the linkEA ("link" xattr which is also stored in the inode) is also missing? It looks like we need to set the SF_UPGRADE flag (maybe renamed to "SF_REBUILD_LMA") if the LMA has been removed (rc = -ENODATA) so that we fall through to the LMA repair code further down? We can't check for the LMAC_INIT_FID flag, since it is stored in the LMA itself, which is missing here. |
| Comment by Mahmoud Hanafi [ 03/Nov/18 ] |
|
Here are the nbp13 lfsck runs.
nbp13-srv1 ~ # lctl get_param -n mdd.*.lfsck_namespace
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: completed
flags: inconsistent
param: dryrun
last_completed_time: 1541281433
time_since_last_completed: 341 seconds
latest_start_time: 1541281072
time_since_latest_start: 702 seconds
last_checkpoint_time: 1541281433
time_since_last_checkpoint: 341 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 317719759, N/A, N/A
first_failure_position: 153388517, [0x2000020af:0x39d9:0x0], 0x753a410c57f07b3
checked_phase1: 30987846
checked_phase2: 111
inconsistent_phase1: 2
inconsistent_phase2: 3
failed_phase1: 21
failed_phase2: 3
directories: 2709152
dirent_inconsistent: 0
linkea_inconsistent: 2
nlinks_inconsistent: 0
multiple_linked_checked: 5
multiple_linked_inconsistent: 0
unknown_inconsistency: 0
unmatched_pairs_inconsistent: 0
dangling_inconsistent: 0
multiple_referenced_inconsistent: 3
bad_file_type_inconsistent: 0
lost_dirent_inconsistent: 0
local_lost_found_scanned: 3
local_lost_found_moved: 3
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_inconsistent: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_inconsistent: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_inconsistent: 0
linkea_overflow_inconsistent: 0
success_count: 3
run_time_phase1: 362 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 85601 items/sec
average_speed_phase2: 111 objs/sec
average_speed_total: 85366 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A |
| Comment by Gerrit Updater [ 05/Nov/18 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/33576 |
| Comment by Mahmoud Hanafi [ 05/Nov/18 ] |
|
Any comments on the output of nbp13.lfsck? |
| Comment by Alex Zhuravlev [ 05/Nov/18 ] |
|
I'm modifying the test to simulate additional broken LinkEA, going to report results ASAP. |
| Comment by Alex Zhuravlev [ 05/Nov/18 ] |
|
I still don't understand why the nbp13 log doesn't contain the "unsupported incompat LMA feature" message. |
| Comment by Mahmoud Hanafi [ 18/Nov/18 ] |
|
I was able to find all the inodes with bad LMA and delete them via ldiskfs. So what we have left are files that trigger OI scrub and report "?" for size/uid/etc. The user has been able to recover all the affected files, so we just need a way to delete the remaining ones. If we delete the files via ldiskfs, how can we make sure that the objects will be cleaned up?
|
| Comment by Andreas Dilger [ 21/Nov/18 ] |
|
Mahmoud, the orphan OST objects can be cleaned up with LFSCK layout checking. The orphans are linked into the $MOUNT/.lustre/lost+found directory if "lctl lfsck_start -o -t layout" is used (the "-o" option can be used as part of a full LFSCK run as well). |
| Comment by Mahmoud Hanafi [ 06/Dec/18 ] |
|
Opened a new prio-1 case: after deleting the quarantined files we are hitting an LBUG. |
| Comment by Gerrit Updater [ 27/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33546/ |
| Comment by Joseph Gmitter (Inactive) [ 25/Nov/19 ] |
|
Patch landed to master. |