
LU-2240: implement index range lookup for osd-zfs.

Details


    Description

      ZFS needs an index range lookup for DNE.
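
      For context, osd-zfs keeps its indexes in ZAP objects, so a range/iterator
      lookup amounts to walking a ZAP from a saved position. Below is a minimal,
      hypothetical sketch using the stock ZFS zap_cursor API; the cursor calls are
      real ZFS interfaces, but the wrapper, its arguments, and the error handling
      are illustrative only, not the actual LU-2240 patch. Note that ZAP cursors
      iterate in hash order, so any key-range filtering has to be done by the
      caller.

      /*
       * Hypothetical sketch only. Walks a ZAP index object from a previously
       * saved cursor cookie, the way an OSD index iterator might resume
       * iteration for DNE.
       */
      #include <sys/dmu.h>
      #include <sys/zap.h>
      #include <sys/errno.h>

      static int
      index_iterate_from(objset_t *os, uint64_t zapobj, uint64_t start_cookie,
                         uint64_t *next_cookie)
      {
          zap_cursor_t zc;
          zap_attribute_t za;
          int rc;

          /* Resume from a serialized (64-bit) cursor position. */
          zap_cursor_init_serialized(&zc, os, zapobj, start_cookie);

          while ((rc = zap_cursor_retrieve(&zc, &za)) == 0) {
              /* za.za_name is the key; for simple 8-byte entries
               * za.za_first_integer holds the value (e.g. an object
               * number). Key-range filtering would go here. */
              zap_cursor_advance(&zc);
          }

          /* Remember where we stopped so the next call can resume. */
          *next_cookie = zap_cursor_serialize(&zc);
          zap_cursor_fini(&zc);

          return (rc == ENOENT ? 0 : rc); /* ENOENT just means end of index */
      }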


          Activity

            [LU-2240] implement index range lookup for osd-zfs.

            I'd say post-2.4.0 would be a bit safer. But yes, we don't need to keep it around too long.

            morrone Christopher Morrone (Inactive) added a comment

            Well, we still need to upgrade our production side of things which needs the conversion code. But since it landed in a tag already (2.3.63), I'm personally OK with dropping it from master. We can upgrade using a 2.3.63-based tag which will fix the FIDs, and then later upgrade to a newer tag which wouldn't have the conversion code. I'd imagine that would work just fine, and then the conversion code won't be in the actual 2.4 release.

            morrone, how does that sound to you?

            prakash Prakash Surya (Inactive) added a comment (edited)

            Prakash, do you think we need to keep this conversion code around for a while? My preference is to drop it as soon as possible.

            bzzz Alex Zhuravlev added a comment

            I reformatted our Grove-Test file system using our 2.3.62-4chaos tag. Our Grove-Production filesystem doesn't have any entries in oi.7/0x200000007*, so we should be OK to simply upgrade that side of things without a reformat (as far as I can tell). So I'll go ahead and resolve this ticket.

            prakash Prakash Surya (Inactive) added a comment

            I can make another patch to remove those objects, but frankly this isn't a nice way to go (we've made a number of changes to the on-disk format since the beginning). So if at all possible, it'd be much better to start from a released version.
            To some extent we do check on-disk consistency, though only with ldiskfs. The good thing is that attributes like nlink are manipulated the same way on ZFS.

            bzzz Alex Zhuravlev added a comment

            After talking with Brian some more, I definitely think the issue is the improper handling of the "links" field. The first "rm" actually deleted the object from the dataset, and the subsequent removes got ENOENT because the object was already deleted. So I think the only path forward is to either hack the ZPL or Lustre to remove the entries we're interested in from the ZAPs, or reformat the filesystem. Assuming we won't have this problem on our production FS (which I still need to verify), I'm going to pursue a reformat of our test FS to get around this.

            prakash Prakash Surya (Inactive) added a comment
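
            (Editor's sketch: in its simplest form, the "remove the entries we're
            interested in from the ZAPs" idea above might look like the following.
            zap_remove() and the dmu_tx calls are real ZFS interfaces, but the
            wrapper name, its arguments, and the error handling are hypothetical;
            this is not the code that was actually used.)

            #include <sys/dmu.h>
            #include <sys/dmu_tx.h>
            #include <sys/zap.h>

            /* Remove one stale name from a directory ZAP inside a DMU transaction. */
            static int
            drop_stale_zap_entry(objset_t *os, uint64_t dirobj, const char *name)
            {
                dmu_tx_t *tx = dmu_tx_create(os);
                int rc;

                dmu_tx_hold_zap(tx, dirobj, B_FALSE, name); /* removal only */
                rc = dmu_tx_assign(tx, TXG_WAIT);
                if (rc != 0) {
                    dmu_tx_abort(tx);
                    return (rc);
                }

                /* Drop the name; the object itself is left alone, which is why
                 * the mismatched link count still matters afterwards. */
                rc = zap_remove(os, dirobj, name, tx);

                dmu_tx_commit(tx);
                return (rc);
            }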

            Sigh... Well, it let me remove files oi.7/0x200000007:0x3:0x0, oi.7/0x200000007:0x4:0x0, and oi.7/0x200000007:0x1:0x0 (inode numbers 414211, 414213, and 414209, respectively), but I'm getting ENOENT when removing the others. Using systemtap, I can see it failing in zfs_zget:

            # grove-mds2 /mnt/grove-mds2/mdt0 > stap /usr/share/doc/systemtap-1.6/examples/general/para-callgraph.stp 'module("zfs").function("*")' -c "rm ./oi.7/0x200000007:0x2:0x0/0x1010000"
            
            ... [snip] ...
            
               677 rm(94074):    ->dmu_buf_get_user db_fake=0xffff880d717f1e40
               679 rm(94074):    <-dmu_buf_get_user return=0xffff880d52c28478
               684 rm(94074):    ->sa_get_userdata hdl=0xffff880d52c28478
               687 rm(94074):    <-sa_get_userdata return=0xffff880e6030ba70
               691 rm(94074):    ->sa_buf_rele db=0xffff880d717f1e40 tag=0x0
               694 rm(94074):     ->dbuf_rele db=0xffff880d717f1e40 tag=0x0
               696 rm(94074):      ->dbuf_rele_and_unlock db=0xffff880d717f1e40 tag=0x0
               698 rm(94074):      <-dbuf_rele_and_unlock 
               699 rm(94074):     <-dbuf_rele 
               701 rm(94074):    <-sa_buf_rele 
               703 rm(94074):   <-zfs_zget return=0x2
               707 rm(94074):   ->zfs_dirent_unlock dl=0xffff880f521949c0
               710 rm(94074):   <-zfs_dirent_unlock 
               712 rm(94074):  <-zfs_dirent_lock return=0x2
               714 rm(94074):  ->rrw_exit rrl=0xffff880d5a100290 tag=0xffffffffa0505727
               716 rm(94074):  <-rrw_exit 
               718 rm(94074): <-zfs_remove return=0x2
               720 rm(94074):<-zpl_unlink return=0xfffffffffffffffe
            

            I tried removing the files in the order that they were listed in the "find" command in my previous comment. So the first "rm" for each distinct inode number succeeded, but the following calls for files referencing the same inode number failed. Perhaps due to incorrect accounting of the number of links for a given inode?

            In case it's useful, the zdb info regarding these objects is below (AFAIK the inode number corresponds to its DMU object number):

            # grove-mds2 /mnt/grove-mds2/mdt0 > zdb grove-mds2/mdt0 414209 414211 414213
            Dataset grove-mds2/mdt0 [ZPL], ID 45, cr_txg 110, 4.05G, 2088710 objects
            
                Object  lvl   iblk   dblk  dsize  lsize   %full  type
                414209    1    16K   128K   128K   128K  100.00  ZFS plain file
                414211    2     4K     4K     4K     8K  100.00  ZFS directory
                414213    2     4K     4K     4K     8K  100.00  ZFS directory
            

            I'm beginning to think a reformat is our best option moving forward...

            prakash Prakash Surya (Inactive) added a comment
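
            (Editor's note: in the trace above, zfs_zget's return=0x2 is ENOENT and
            zpl_unlink's 0xfffffffffffffffe is -2, i.e. -ENOENT, which fits the
            theory that the first unlink already freed the object. One quick
            userspace check of the link-count theory is to compare st_nlink against
            the number of names that "find -inum" reports for the same inode; a
            minimal sketch follows, with the paths supplied on the command line.)

            #include <stdio.h>
            #include <sys/stat.h>

            /* Print inode number and link count for each path given on the
             * command line, e.g. the names returned by "find -inum" above. */
            int main(int argc, char **argv)
            {
                struct stat st;

                for (int i = 1; i < argc; i++) {
                    if (stat(argv[i], &st) != 0) {
                        perror(argv[i]);
                        continue;
                    }
                    printf("%s: ino=%llu nlink=%llu\n", argv[i],
                           (unsigned long long)st.st_ino,
                           (unsigned long long)st.st_nlink);
                }
                return 0;
            }

            If nlink comes back as 1 for an object that still has two directory
            entries, the second rm is expected to fail exactly as shown in the
            trace.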

            Yes, I'd suggest removing them, and I'd suggest taking a snapshot just before that. Unfortunately I'm unable to reproduce the case locally:
            I can't generate such an image (I can't even find the code in Gerrit that uses 0x200000007 for quota).

            bzzz Alex Zhuravlev added a comment

            Here's what I see on the MDS:

            # grove-mds2 /tmp/zfs > ls -li oi.3/0x200000003* oi.5/0x200000005* oi.6/0x200000006* oi.7/0x200000007* seq* quota*
               176 -rw-r--r-- 1 root root  8 Dec 31  1969 oi.3/0x200000003:0x1:0x0
               180 -rw-r--r-- 1 root root  0 Dec 31  1969 oi.3/0x200000003:0x3:0x0
            414212 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.5/0x200000005:0x1:0x0
            414214 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.5/0x200000005:0x2:0x0
            417923 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.6/0x200000006:0x10000:0x0
            417924 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.6/0x200000006:0x1010000:0x0
            417927 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.6/0x200000006:0x1020000:0x0
            417926 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.6/0x200000006:0x20000:0x0
            414209 -rw-r--r-- 1 root root  8 Dec 31  1969 oi.7/0x200000007:0x1:0x0
            414211 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.7/0x200000007:0x3:0x0
            414213 -rw-r--r-- 1 root root  2 Dec 31  1969 oi.7/0x200000007:0x4:0x0
            414209 -rw-r--r-- 1 root root  8 Dec 31  1969 seq-200000007-lastid
               173 -rw-rw-rw- 1 root root 24 Dec 31  1969 seq_ctl
               174 -rw-rw-rw- 1 root root 24 Dec 31  1969 seq_srv
            
            oi.3/0x200000003:0x2:0x0:
            total 0
            
            oi.3/0x200000003:0x4:0x0:
            total 9
            417925 drwxr-xr-x 2 root root 2 Dec 31  1969 dt-0x0
            417922 drwxr-xr-x 2 root root 2 Dec 31  1969 md-0x0
            
            oi.3/0x200000003:0x5:0x0:
            total 9
            417923 -rw-r--r-- 1 root root 2 Dec 31  1969 0x10000
            417924 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1010000
            
            oi.3/0x200000003:0x6:0x0:
            total 9
            417927 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1020000
            417926 -rw-r--r-- 1 root root 2 Dec 31  1969 0x20000
            
            oi.7/0x200000007:0x2:0x0:
            total 18
            414211 -rw-r--r-- 1 root root 2 Dec 31  1969 0x10000
            414212 -rw-r--r-- 1 root root 2 Dec 31  1969 0x10000-MDT0000
            414213 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1010000
            414214 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1010000-MDT0000
            
            quota_master:
            total 9
            417925 drwxr-xr-x 2 root root 2 Dec 31  1969 dt-0x0
            417922 drwxr-xr-x 2 root root 2 Dec 31  1969 md-0x0
            
            quota_slave:
            total 18
            414211 -rw-r--r-- 1 root root 2 Dec 31  1969 0x10000
            414212 -rw-r--r-- 1 root root 2 Dec 31  1969 0x10000-MDT0000
            414213 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1010000
            414214 -rw-r--r-- 1 root root 2 Dec 31  1969 0x1010000-MDT0000
            

            I'm somewhat guessing as to what the on-disk format is supposed to look like, but it does appear to be using the new quota sequence numbers (0x200000005ULL and 0x200000006ULL).

            So, does this mean I can go ahead and remove the following files?

            # grove-mds2 /tmp/zfs > find . -inum 414209 -o -inum 414211 -o -inum 414213
            ./oi.7/0x200000007:0x3:0x0
            ./oi.7/0x200000007:0x4:0x0
            ./oi.7/0x200000007:0x2:0x0/0x1010000
            ./oi.7/0x200000007:0x2:0x0/0x10000
            ./oi.7/0x200000007:0x1:0x0
            ./seq-200000007-lastid
            ./quota_slave/0x1010000
            ./quota_slave/0x10000
            


            prakash Prakash Surya (Inactive) added a comment

            Could you check whether your filesystem is using the new quota files now? They're supposed to be in the following sequences:
            FID_SEQ_QUOTA = 0x200000005ULL,
            FID_SEQ_QUOTA_GLB = 0x200000006ULL,

            If so, then it should be OK to just remove the old quota files in the 0x200000007 sequence.

            bzzz Alex Zhuravlev added a comment
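
            (Editor's sketch: the two constants above are quoted from the comment
            itself; a check of whether a file's FID sequence is one of the new
            quota sequences or the retired 0x200000007 one could look like the
            snippet below. OLD_QUOTA_SEQ is just a local name for illustration,
            not an official Lustre identifier.)

            #include <stdio.h>
            #include <stdint.h>

            #define FID_SEQ_QUOTA      0x200000005ULL /* new quota (slave) sequence  */
            #define FID_SEQ_QUOTA_GLB  0x200000006ULL /* new quota (global) sequence */
            #define OLD_QUOTA_SEQ      0x200000007ULL /* sequence of the stale files */

            static const char *classify_seq(uint64_t seq)
            {
                if (seq == FID_SEQ_QUOTA || seq == FID_SEQ_QUOTA_GLB)
                    return "new quota file - keep";
                if (seq == OLD_QUOTA_SEQ)
                    return "old quota file - removable once the new ones are in use";
                return "not a quota sequence";
            }

            int main(void)
            {
                uint64_t seqs[] = { 0x200000005ULL, 0x200000006ULL, 0x200000007ULL };

                for (size_t i = 0; i < sizeof(seqs) / sizeof(seqs[0]); i++)
                    printf("0x%llx: %s\n",
                           (unsigned long long)seqs[i], classify_seq(seqs[i]));
                return 0;
            }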

            People

              bzzz Alex Zhuravlev
              di.wang Di Wang
              Votes: 0
              Watchers: 11
