  Lustre / LU-13823

Two hard links to the same directory

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: None
    • Environment:
      kernel-3.10.0-1127.0.0.1chaos.ch6.x86_64
      zfs-0.7.11-9.4llnl.ch6.x86_64
      lustre-2.10.8_9.chaos-1.ch6.x86_64
    • Severity: 3

    Description

      Two directories in the same filesystem have the same inode number:

      [root@rzslic5]==> ls -lid /p/czlustre2/reza2/5_star_pattern_J
      288233885643309090 drwx------ 3 58904 58904 33280 Jul 24 15:55 /p/czlustre2/reza2/5_star_pattern_J
      [root@rzslic5]==> ls -lid /p/czlustre2/reza2/5_star_pattern_J_2/
      288233885643309090 drwx------ 3 58904 58904 33280 Jul 24 15:55 /p/czlustre2/reza2/5_star_pattern_J_2/
      

      and the same FID:

      [root@rzslic2:reza2]# lfs path2fid 5_star_pattern_J
      [0x40003311e:0x22:0x0]
      [root@rzslic2:reza2]# lfs path2fid 5_star_pattern_J_2
      [0x40003311e:0x22:0x0] 
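
      A quick cross-check, assuming /p/czlustre2 is the client mount point, is to ask the MDS which names it has recorded for this FID (lfs fid2path resolves names from the link xattr kept with the directory). This is a sketch, not output from the affected system:

      # List every pathname the MDS has recorded for the FID shared by
      # both directories; with two properly recorded links, both names
      # should appear.
      lfs fid2path /p/czlustre2 '[0x40003311e:0x22:0x0]'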

      The directory has one subdirectory:

      [root@oslic7:reza2]# ls -al 5_star_pattern_J
      total 130
      drwx------   3 pearce7 pearce7 33280 Jul 24 15:55 .
      drwx------ 155 reza2   reza2   57856 Jul 27 12:54 ..
      drwx------   2 pearce7 pearce7 41472 Sep 21  2019 0
      

          Activity

            ofaaland Olaf Faaland added a comment -

            Removing "topllnl" tag because I do not see any way for us to get more information about what happened. I'm going to leave it open in case we see the same problem again, or someone else does.

            ofaaland Olaf Faaland added a comment -

            Andreas,

            I just noticed that the bash_history contents I pasted in above are for the wrong file system (/p/czlustre3 AKA /p/lustre3, != /p/lustre2). It's typical that our users have quota and a directory on multiple Lustre file systems.

            So we have no context at all. I don't see what else we can do. If you can think of anything else we should look at, or a debug patch that would be helpful, let me know.

            thanks


            ofaaland Olaf Faaland added a comment -

            Andreas,

            We weren't able to get any good information about what led up to this. We believe it occurred during the "mv" operation below (this is bash history from a node the sysadmin was using):

            40 cd /p/czlustre3/pearce7
            41 ls
            42 cd reza2
            43 ls
            44 ls /p/czlustre3/reza2/
            45 pwd
            46 mv * /p/czlustre3/reza2/

            and that before the "mv" command:

            • /p/czlustre3/reza2/ was empty, and
            • /p/czlustre3/pearce7/reza2/5_star_pattern_J was an apparently normal directory.

            During the "mv" command the sysadmin got an error message that 5_star_pattern_J already existed in the target. At that point he looked and saw that both the source and target directory had a subdirectory by that name, and then found they were two references to the same directory.

            It's hard to see how this sequence of events could create the problem, but that's unfortunately all we were able to find out.
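
            For what it's worth, the error itself is expected once a non-empty directory of the same name already exists in the target, so the "mv" most likely exposed the duplicate rather than created it. A minimal local sketch, with hypothetical /tmp paths:

            # Both trees already contain a non-empty "5_star_pattern_J";
            # rename(2) cannot replace a non-empty directory, so mv reports
            # an error for that entry and leaves both names in place.
            mkdir -p /tmp/src/5_star_pattern_J/0 /tmp/dst/5_star_pattern_J/0
            cd /tmp/src && mv * /tmp/dst/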


            ofaaland Olaf Faaland added a comment -

            > Is it possible to ask the user how these two directories were created? Was there a rename, or were they possibly created in parallel? Was "lfs migrate -m" used on the directory to migrate between MDTs?

            I'm working on getting that information. There's been a complex chain of ownership so we're working through it.

            > The "trusted.link" xattr shows only "5_star_pattern_J_2" for the name of the directory.

            > The dnode shows "links 3", but that could be because of a subdirectory rather than multiple hard links to the file; it would be useful to check. If the client had (somehow) allowed multiple hard links to the directory, it should also have added the filename to the "trusted.link" xattr at that time.

            There is one subdirectory, named "0". Sorry I left that out of the description.
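
            That single subdirectory is enough to account for the count by itself: a directory's link count is 2 ("." plus its entry in the parent) plus one per subdirectory (each child's ".."), so one subdirectory gives links=3 even with only one name. A minimal sketch on any local filesystem, using a hypothetical scratch path:

            # An empty directory has nlink=2; each subdirectory adds one.
            mkdir -p /tmp/nlinkdemo/0
            stat -c '%h %n' /tmp/nlinkdemo    # prints: 3 /tmp/nlinkdemo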

            > Have you tried creating hard links to a directory with ZFS? This should be caught by the client VFS, and also by ldiskfs, but I'm wondering whether ldiskfs implements such a check while ZFS does it in the ZPL, which is not checked by osd-zfs?

            I haven't yet figured out where it's checked, but neither ZPL nor our lustre 2.10.8 backed by ZFS 0.7 allowed hard linking to a directory via link(3) when I tried it. In both cases link() failed and errno was set to EPERM, as you saw with your test. But there was nothing exciting going on while I tried that, like many processes in parallel, or a failover, etc.

            bash-4.2$ ll -d existing newlink
            ls: cannot access newlink: No such file or directory
            drwx------ 2 faaland1 faaland1 33280 Jul 27 16:50 existing
            
            bash-4.2$ strace -e link ./dolink existing newlink;
            link("existing", "newlink")             = -1 EPERM (Operation not permitted)
            +++ exited with 255 +++
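
            The dolink helper isn't attached to the ticket; an equivalent test that skips ln's front-end checks, assuming python3 is available on the client, would be:

            # os.link() issues the hard-link call directly, without ln's
            # stat()/type checks; for a directory the kernel refuses with EPERM.
            python3 -c 'import os, sys; os.link(sys.argv[1], sys.argv[2])' existing newlink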
            

            > and this generates an RPC to the MDS, so it seems possible that if some user binary was calling link(3) itself instead of ln(1), it might trigger this?

            Yes, maybe. I hope we're able to find out how these directories were created.


            adilger Andreas Dilger added a comment -

            Note that birth time will be directly accessible on Lustre 2.14 clients with a suitably new (maybe RHEL8?) stat command that uses statx() under the covers.
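
            For reference, with a coreutils stat new enough to use statx(), the birth time appears in the regular output or via the %w format specifier; the 2.10 clients here still print "-":

            # Prints the file's birth (creation) time when the kernel and
            # filesystem report it via statx(); otherwise prints "-".
            stat -c 'Birth: %w' 5_star_pattern_J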

            Is it possible to ask the user how these two directories were created? Was there a rename, or were they possibly created in parallel? Was "lfs migrate -m" used on the directory to migrate between MDTs?

            The "trusted.link" xattr shows only "5_star_pattern_J_2" for the name of the directory.

            The dnode shows "links 3", but that could be because of a subdirectory rather than multiple hard links to the file; it would be useful to check. If the client had (somehow) allowed multiple hard links to the directory, it should also have added the filename to the "trusted.link" xattr at that time.

            Have you tried creating hard links to a directory with ZFS? This should be caught by the client VFS, and also by ldiskfs, but I'm wondering whether ldiskfs implements such a check while ZFS does it in the ZPL, which is not checked by osd-zfs?

            I ran a quick test on ldiskfs, and I was surprised to see that a hard-link RPC is sent to the MDS in such a case when the "ln" binary is not used. The ln binary stat()s the source and target names itself, to see whether they exist and what their file type is, before even calling the link() syscall:

            stat("/mnt/testfs/link", 0x7ffee35aa3c0) = -1 ENOENT (No such file or directory)
            lstat("/mnt/testfs/newdir", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
            write(2, "ln: /mnt/testfs/newdir: hard link not allowed for directory") = 61
            

            multiop (from the lustre-tests RPM), on the other hand, does no such sanity checking before calling the link() syscall:

            mkdir /mnt/testfs/newdir
            strace multiop /mnt/testfs/newdir L /mnt/testfs/link
            
            link("/mnt/testfs2/newerdir", "/mnt/testfs2/link3") = -1 EPERM (Operation not permitted)
            write(3, "link(): Operation not permitted\n", 32
            

            and this generates an RPC to the MDS, so it seems possible that if some user binary was calling link(3) itself instead of ln(1), it might trigger this?
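
            One way to confirm the request really goes over the wire, assuming root on a client, would be to raise the RPC debug mask around the failing call and dump the log; this is a sketch, not something run for this ticket:

            lctl set_param debug=+rpctrace                  # log outgoing RPCs
            multiop /mnt/testfs/newdir L /mnt/testfs/link   # fails with EPERM
            lctl dk /tmp/link-debug.log                     # inspect for the MDS_REINT (link) request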

            ofaaland Olaf Faaland added a comment -

            The underlying object was created in September 2019:

            [root@zinc9:~]# zdb -dddddddd zinc9/mdt1@toss-4847 117510086
            Dataset zinc9/mdt1@toss-4847 [ZPL], ID 529, cr_txg 184688294, 162G, 57992145 objects, rootbp DVA[0]=<1:1fd6afb000:1000> DVA[1]=<2:3b680fe000:1000> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique double size=800L/800P birth=184688294L/184688294P fill=57992145 cksum=c37b11463:e9a37744f6d:b651035441589:6a1e4605d58ac95
            
            
                Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
             117510086    2   128K    16K    32K     512    32K  100.00  ZFS directory (K=inherit) (Z=inherit)
                                                           192   bonus  System attributes
            	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED SPILL_BLKPTR
            	dnode maxblkid: 1
            	path	???<object#117510086>
            	uid     58904
            	gid     58904
            	atime	Mon Jul 27 10:08:25 2020
            	mtime	Fri Jul 24 15:55:46 2020
            	ctime	Fri Jul 24 15:55:46 2020
            	crtime	Fri Sep 20 19:23:47 2019
            	gen	95807852
            	mode	40700
            	size	2
            	parent	8807988
            	links	3
            	pflags	0
            	rdev	0x0000000000000000
            	SA xattrs: 212 bytes, 3 entries
            
            
            		trusted.lma = \000\000\000\000\000\000\000\000\0361\003\000\004\000\000\000"\000\000\000\000\000\000\000
            		trusted.version = Y4\3243(\000\000\000
            		trusted.link = \337\361\352\021\001\000\000\000<\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000$\000\000\000\002\200\005`H\000\000\000\007\000\000\000\0005_star_pattern_J_2
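
            A possible follow-up, not shown in the ticket, would be to dump the parent object recorded above (parent 8807988) from the same snapshot and check which of its directory entries point at dnode 117510086; note that the SA "parent" field can only record a single parent even if two entries exist somewhere:

            # Same zdb invocation as above, aimed at the parent directory
            # object, to list its name -> object-number entries.
            zdb -dddddddd zinc9/mdt1@toss-4847 8807988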
             
            ofaaland Olaf Faaland added a comment -

            stat shows:

            [root@rzslic2:reza2]# stat 5_star_pattern_J
              File: '5_star_pattern_J'
              Size: 33280     	Blocks: 65         IO Block: 131072 directory
            Device: a2f6a642h/2734073410d	Inode: 288233885643309090  Links: 3
            Access: (0700/drwx------)  Uid: (58904/ UNKNOWN)   Gid: (58904/ UNKNOWN)
            Access: 2020-07-27 10:08:25.000000000 -0700
            Modify: 2020-07-24 15:55:46.000000000 -0700
            Change: 2020-07-24 15:55:46.000000000 -0700
             Birth: - 
            ofaaland Olaf Faaland added a comment - - edited

            Perhaps the underlying cause of LU-13758?


            People

              adilger Andreas Dilger
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 4
