[LU-11481] corrupt directory Created: 08/Oct/18  Updated: 25/Feb/19  Resolved: 15/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: Lustre 2.10.7

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

server: RHEL 7.4 derivative, zfs-0.7.11-4llnl.ch6.x86_64, lustre-2.10.4_1.chaos
client: RHEL 7.4 derivative, lustre-2.8.2_4.chaos-1
We are using DNE 1 with two MDTs on two servers, porter81 and porter82
for the zfs tag, see https://github.com/LLNL/zfs/releases
for our 2.10 tag, see https://github.com/LLNL/lustre/
for our 2.8 tag, see lustre-release-fe-llnl on Gerrit


Attachments: File console.porter81.gz     File console.porter82.gz    
Issue Links:
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

A directory has an entry for subdirectory "2fe", but the object ID stored for that entry does not exist:

alias ll="ls -l"
[root@catalyst101:~]# ll /p/lustre3/videousr/YLI/mmcommons/data/images_v1
ls: cannot access /p/lustre3/videousr/YLI/mmcommons/data/images_v1/2fe: No such file or directory
total 0
d????????? ? ? ? ?            ? 2fe

And when using zdb on the MDT to examine images_v1, one sees that 2fe refers to an object ID that is invalid:

[root@porter81:snap]# zdb -ddddd porter81/mdt0 533741247
Dataset porter81/mdt0 [ZPL], ID 148, cr_txg 98, 910G, 61852198 objects, rootbp DVA[0]=<4:88d9c400:200> DVA[1]=<5:25ca03c200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1214040L/1214040P fill=61852198 cksum=139cf672b7:5dc8d6146f6:f8e6add4f57c:1e27e38477f5c0                                                                                

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
 533741247    2   128K    16K   231K     512   528K  100.00  ZFS directory
                                               192   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED SPILL_BLKPTR
        dnode maxblkid: 32                                                           
        path    ???<object#533741247>                                                
        uid     0                                                                    
        gid     2093                                                                 
        atime   Mon Oct  8 11:01:28 2018                                             
        mtime   Wed Oct  3 15:53:08 2018                                             
        ctime   Wed Oct  3 15:53:08 2018                                             
        crtime  Mon Oct  1 20:53:54 2018                                             
        gen     1090081                                                              
        mode    42700                                                                
        size    2                                                                    
        parent  533740502                                                            
        links   3                                                                    
        pflags  0                                                                    
        rdev    0x0000000000000000                                                   
        SA xattrs: 204 bytes, 3 entries                                              

                trusted.lma = \000\000\000\000\000\000\000\0002@\000\000\002\000\000\000\245\037\001\000\000\000\000\000                                                                    
                trusted.link = \337\361\352\021\001\000\000\0003\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\033\000\000\000\002\000\000@F\000\0001\213\000\000\000\000images_v1                                                                                      
                trusted.version = \022\231\236+\011\000\000\000                               
        Fat ZAP stats:                                                                        
                Pointer table:                                                                
                        1024 elements                                                         
                        zt_blk: 0                                                             
                        zt_numblks: 0                                                         
                        zt_shift: 10                                                          
                        zt_blks_copied: 0                                                     
                        zt_nextblk: 0                                                         
                ZAP entries: 1                                                                
                Leaf blocks: 32                                                               
                Total blocks: 33                                                              
                zap_block_type: 0x8000000000000001                                            
                zap_magic: 0x2f52ab2ab                                                        
                zap_salt: 0x3e3cbee7f                                                         
                Leafs with 2^n pointers:                                                      
                          5:     32 ********************************                          
                Blocks with n*5 entries:                                                      
                          0:     32 ********************************                          
                Blocks n/10 full:                                                             
                          1:     32 ********************************                          
                Entries with n chunks:                                                        
                          4:      1 *                                                         
                Buckets with n entries:                                                       
                          0:  16383 ****************************************                  
                          1:      1 *                                                         

                2fe = 533742980 (type: Directory)
Indirect blocks:
               0 L1  6:1a0095d000:a00 20000L/a00P F=33 B=1133009/1133009
               0  L0 4:d99372200:200 4000L/200P F=1 B=1133009/1133009
            4000  L0 4:2b78affa00:e00 4000L/e00P F=1 B=1132989/1132989
            8000  L0 4:1a409fa00:e00 4000L/e00P F=1 B=1133008/1133008
            c000  L0 4:dbecc8800:e00 4000L/e00P F=1 B=1133003/1133003
           10000  L0 4:2d07544a00:e00 4000L/e00P F=1 B=1132997/1132997
           14000  L0 5:11130c9600:e00 4000L/e00P F=1 B=1133005/1133005
           18000  L0 5:1053a11c00:e00 4000L/e00P F=1 B=1132991/1132991
           1c000  L0 4:2d07545800:e00 4000L/e00P F=1 B=1132997/1132997
           20000  L0 6:1a41dd7c00:e00 4000L/e00P F=1 B=1133002/1133002
           24000  L0 5:112ca4cc00:e00 4000L/e00P F=1 B=1133007/1133007
           28000  L0 5:559e31000:e00 4000L/e00P F=1 B=1133000/1133000
           2c000  L0 4:d91a7e000:e00 4000L/e00P F=1 B=1133004/1133004
           30000  L0 4:d99372400:e00 4000L/e00P F=1 B=1133009/1133009
           34000  L0 4:265bf62800:e00 4000L/e00P F=1 B=1132993/1132993
           38000  L0 6:134c5fcc00:e00 4000L/e00P F=1 B=1132992/1132992
           3c000  L0 5:559e31e00:e00 4000L/e00P F=1 B=1133000/1133000
           40000  L0 5:11130ca400:e00 4000L/e00P F=1 B=1133005/1133005
           44000  L0 4:dbeccac00:e00 4000L/e00P F=1 B=1133003/1133003
           48000  L0 4:2b78b02200:e00 4000L/e00P F=1 B=1132989/1132989
           4c000  L0 6:134c5ff400:e00 4000L/e00P F=1 B=1132992/1132992
           50000  L0 4:1a40a2400:e00 4000L/e00P F=1 B=1133008/1133008
           54000  L0 5:11130cb200:e00 4000L/e00P F=1 B=1133005/1133005
           58000  L0 6:19f0f10c00:e00 4000L/e00P F=1 B=1132991/1132991
           5c000  L0 4:1a40a3200:e00 4000L/e00P F=1 B=1133008/1133008
           60000  L0 7:b97b6aa00:e00 4000L/e00P F=1 B=1133004/1133004
           64000  L0 5:112ca4f400:e00 4000L/e00P F=1 B=1133007/1133007
           68000  L0 4:17f825800:e00 4000L/e00P F=1 B=1132999/1132999
           6c000  L0 6:1a2429de00:e00 4000L/e00P F=1 B=1132995/1132995
           70000  L0 6:1a41dd9a00:e00 4000L/e00P F=1 B=1133002/1133002
           74000  L0 7:129d29e800:e00 4000L/e00P F=1 B=1133007/1133007
           78000  L0 4:dbeccca00:e00 4000L/e00P F=1 B=1133003/1133003
           7c000  L0 4:17f826600:e00 4000L/e00P F=1 B=1132999/1132999
           80000  L0 5:569fa5000:e00 4000L/e00P F=1 B=1132994/1132994

                segment [0000000000000000, 0000000000084000) size  528K

[root@porter81:snap]# zdb -ddddd porter81/mdt0 533742980
Dataset porter81/mdt0 [ZPL], ID 148, cr_txg 98, 910G, 61852198 objects, rootbp DVA[0]=<4:88d9c400:200> DVA[1]=<5:25ca03c200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1214040L/1214040P fill=61852198 cksum=139cf672b7:5dc8d6146f6:f8e6add4f57c:1e27e38477f5c0

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
zdb: dmu_bonus_hold(533742980) failed, errno 2

This is on a new file system that has not yet been used by end-users, but to which we attempted to copy data. More specifically:
1. We copied about 500 million files/dirs to it.
2. We tried to use lfs migrate -M to move some large subtrees from one MDT to another, but that failed due to a Lustre 2.8 bug with lfs migrate (the general form of this command is shown after the lists below).
3. We deleted most of the files/dirs.

  • The servers did not crash, as far as I recall, while we were performing the copy and delete operations, but I cannot be certain of that.
  • We inspected the console logs on the servers and clients but found nothing that suggested object creation or destruction had failed.
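
For context, the general form of the directory-restriping command in Lustre 2.10 is sketched below; the MDT index and path are illustrative placeholders, not the exact invocation used in step 2:

# Directory migration between MDTs, run from a client (placeholders only):
lfs migrate -m <target_MDT_index> <directory>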


 Comments   
Comment by Olaf Faaland [ 08/Oct/18 ]

We are uncertain whether lfs migrate was involved. If there is anything I could look for on either of the MDTs to determine whether lfs migrate attempted to migrate this directory or its parent, and so corroborate or rule that out, please let me know.

Comment by Olaf Faaland [ 08/Oct/18 ]

I've attached the console logs for the two MDSs. The files are named "console.porter{81,82}.tgz".

This corruption was discovered early in the day on 2018-10-04, and the first attempt to copy this directory tree to the new file system was started on 2018-09-22, so the damage must have occurred during the period covered by these logs.

Comment by Olaf Faaland [ 08/Oct/18 ]

MDT0000 is in pool porter81.
MDT0001 is in pool porter82.

Both pools report state "ONLINE", which means no errors.

We do not have debug logs for MDT0000; it was stopped and re-started on 2018-10-04.

Comment by Peter Jones [ 09/Oct/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Lai Siyao [ 09/Oct/18 ]

The only clue I see is from porter81:

2018-10-04 04:29:58 [125075.971428] LustreError: 166475:0:(mdt_open.c:1515:mdt_reint_open()) lustre3-MDT0000: name '2fe' present, but FID [0x200004031:0x1216b:0x0] is invalid

But it only tells us that the FID of '2fe' doesn't exist. Can you use LFSCK to fix this inconsistency?
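
For reference, a minimal sketch of how this could be checked and repaired from here; the mount point, fsname, and MDT name below are taken from this ticket, and the exact invocation is an assumption rather than a prescribed procedure:

# On a client: ask the MDS which paths reference the FID from the console error;
# for a dangling entry this is expected to fail or return nothing.
lfs fid2path /p/lustre3 [0x200004031:0x1216b:0x0]

# On the MDS: run only the namespace phase of LFSCK on the affected MDT and
# read its progress and statistics.
lctl lfsck_start -M lustre3-MDT0000 -t namespace
lctl get_param -n mdd.lustre3-MDT0000.lfsck_namespace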

Comment by Olaf Faaland [ 09/Oct/18 ]

Yes, I can run lfsck. Is there anything else I can look for, that might give a clue as to how this happened?

Comment by Lai Siyao [ 10/Oct/18 ]

It may be related to directory migration: in 2.10, directory migration first migrates the dirents of all sub-files to the target and then migrates the inodes of the sub-files. If it fails in the middle, it may leave some dirents that point to nowhere.
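
A local-filesystem analogy of that failure mode (not Lustre code; the paths below are made up) is sketched here: the name is published before the object it refers to exists, so an interruption in between leaves a name that resolves to nothing, which is the same symptom as the '2fe' entry above.

# Phase 1: publish the new name on the "target"; a symlink stands in for a
# dirent that records only a FID, not the object itself.
mkdir -p demo/tgt demo/objects
ln -s ../objects/sub demo/tgt/sub
# Phase 2 would create the object the name refers to; simulate a failure in the
# middle by never running it:
#   mkdir demo/objects/sub
ls -l demo/tgt        # the entry "sub" is still listed ...
stat demo/tgt/sub     # ... but resolving it fails: No such file or directory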

Comment by Olaf Faaland [ 11/Oct/18 ]

I started lfsck about 40 hours ago, via

pdsh -w e81 lctl lfsck_start --all --create-ostobj on --create-mdtobj on --delay-create-ostobj on --orphan

The invalid directory entry has been removed. Ignoring the zero-valued results, it now reports:

[root@porteri:toss-4318]# pdsh -w e81 lctl lfsck_query  | awk "\$NF > 0 {print}"
e81: layout_mdts_completed: 2
e81: layout_osts_completed: 79
e81: layout_osts_unknown: 1
e81: namespace_mdts_completed: 2
e81: namespace_repaired: 1

1. Is there any way for me to know in more detail what it changed? (See the note after these questions.)
2. What does layout_osts_unknown mean?
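
For reference, more detail than the aggregated lfsck_query counters is available in the per-target LFSCK statistics on the servers; the target names below are assumed from this file system:

# On the MDS nodes: namespace LFSCK statistics for each MDT
lctl get_param mdd.lustre3-MDT000*.lfsck_namespace
# On the OSS nodes: layout LFSCK statistics for each OST
lctl get_param obdfilter.lustre3-OST*.lfsck_layout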

Comment by Olaf Faaland [ 11/Oct/18 ]

I found that many (possibly all) of the OSTs had BUG: reports in their console logs, and that one OST had been failed over to its partner. I will create a ticket for that issue if there isn't one already, and link to it here.

After I powered on the OSS that was off and the OST moved back, I re-ran lfsck_query and got this instead:

[root@porteri:toss-4318]# pdsh -w e81 lctl lfsck_query  | awk "\$NF > 0 {print}"
e81: layout_mdts_completed: 2
e81: layout_osts_completed: 80
e81: namespace_mdts_completed: 2
e81: namespace_repaired: 1

This file system has 80 OSTs and 2 MDTs.

Does that mean that:
1. layout_osts_unknown meant it could not fetch the LFSCK status of one OST, and
2. it is now telling me that all 2 MDTs and all 80 OSTs finished the LFSCK?

thanks

Comment by Lai Siyao [ 12/Oct/18 ]

layout_osts_unknown looks like a bug; maybe it's because that OST crashed, or the server replied with an unknown status.

The second run should have finished lfsck on all MDTs and OSTs.

I need to check the BUG info to understand it better, and see whether I can inject an error in the code to reproduce that output.

Comment by Lai Siyao [ 01/Nov/18 ]

Can you provide the BUG info?

Comment by Olaf Faaland [ 05/Nov/18 ]

Hello Lai,

Sorry for the long delay.  Please see https://jira.whamcloud.com/browse/LU-11620 for one stack related to lfsck.

Comment by Olaf Faaland [ 03/Jan/19 ]

Hi.  What's the plan for this issue (the creation of the inconsistency, not lfsck)?  Thanks.

Comment by Lai Siyao [ 04/Jan/19 ]

Directory migration has been rewritten, which fixed many issues in migration, but the rewrite is only in 2.12.

Comment by Olaf Faaland [ 04/Jan/19 ]

Directory migration has been rewritten, which fixed many issues in migration, but the rewrite is only in 2.12.

OK, then you're saying:
(a) directory migration in 2.10 is unsafe (it risks data loss) and should not be used, and
(b) there is nothing more to do on this issue and no additional debug code is necessary.

Is that correct?
Thanks

Comment by Gerrit Updater [ 04/Jan/19 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/33960
Subject: LU-11481 utils: disable lfs migrate -m
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: d08d4a3b232c0e1a6a1fb9d2ee6f315fd26ae498

Comment by Olaf Faaland [ 04/Jan/19 ]

In case the answer to (a) is "yes", I've uploaded a patch for b2_10.

Comment by Lai Siyao [ 14/Jan/19 ]

Yes, Olaf.

Comment by Olaf Faaland [ 17/Jan/19 ]

Hello Lai,

I've added you as a reviewer on my patch, which as of the last update passed all tests except sanity-scrub.test_9, which seems to me to be unrelated to my patch, but maybe I'm mistaken. Can you kick it so that review-dne-part-2, which includes sanity-scrub, is re-tested?

thanks

Comment by Peter Jones [ 18/Jan/19 ]

I re-triggered it

Comment by Gerrit Updater [ 29/Jan/19 ]

Pushed against master by mistake. This one will be abandoned.

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/34130
Subject: LU-11481 utils: disable lfs migrate -m
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 459ba774583997e616a04715709fc2f671dbe0bb

Comment by Gerrit Updater [ 15/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33960/
Subject: LU-11481 utils: disable lfs migrate -m
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 3b7e4ac3bb896d66613e9a6bafbcf6c01a1ac63d

Comment by Peter Jones [ 15/Feb/19 ]

Landed for 2.10.7. Not needed on master
