getdents() against empty striped directory always returns 1365 dirents

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • None
    • Lustre 2.10.8, Lustre 2.12.4

    Description

      Created a directory with "lfs mkdir -c2 <thepath>" on a Lustre 2.10.8-based client.
      I may have attempted to create subdirs or files in it (not sure; I had run a series of mdtest runs across multiple directories and canceled via Ctrl-C partway through).
      ls <thepath> appears to hang.
      Running it under strace shows that the getdents() calls always return the same contents and never return EOF.

      No error messages are reported on the console of the servers or the client.

      [faaland1@opal174 branch:master src] $ps -f
      UID        PID  PPID  C STIME TTY          TIME CMD
      faaland1 52122 52121  0 09:58 pts/0    00:00:00 -bash
      faaland1 59859 52122 99 10:16 pts/0    01:20:38 ls -l /p/lforge/faaland1/mdtest/mdt0 /p/lforge/faaland1/mdtest/mdt1 /p
      faaland1 64327 52122  0 11:36 pts/0    00:00:00 ps -f
      [faaland1@opal174 branch:master src] $strace -p 59859 2>&1 | head -n4
      strace: Process 59859 attached
      getdents(3, /* 1365 entries */, 32768)  = 32760
      getdents(3, /* 1365 entries */, 32768)  = 32760
      getdents(3, /* 1365 entries */, 32768)  = 32760
      [faaland1@opal174 branch:master src] $ls -l /proc/59859/fd
      total 0
      lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 0 -> /dev/pts/0
      lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 1 -> /dev/pts/0
      lrwx------ 1 faaland1 faaland1 64 Mar 13 10:16 2 -> /dev/pts/0
      lr-x------ 1 faaland1 faaland1 64 Mar 13 10:41 3 -> /p/lforge/faaland1/mdtest/mdtcount2
      

      The 1365 entries returned are always the same - 676 entries for ".." and 689 entries for "."

          676 {d_ino=144115272398143489, d_off=0, d_reclen=24, d_name="..", d_type=DT_DIR}
          689 {d_ino=144115339574197893, d_off=0, d_reclen=24, d_name=".", d_type=DT_DIR}
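
The numbers in the strace output are consistent with each call filling the 32768-byte buffer with 24-byte records. A small check of that arithmetic, using only figures copied from the output above (the variable names are illustrative):

```python
# Figures copied from the strace output in this report.
BUF_SIZE = 32768      # third argument to getdents()
RECLEN = 24           # d_reclen of every returned entry
DOTDOT_COUNT = 676    # entries named ".."
DOT_COUNT = 689       # entries named "."

entries = DOTDOT_COUNT + DOT_COUNT     # total entries per call
bytes_returned = entries * RECLEN      # should equal the getdents() return value
max_entries = BUF_SIZE // RECLEN       # most 24-byte records that fit in the buffer

print(entries, bytes_returned, max_entries)  # prints: 1365 32760 1365
```

So 1365 is simply the number of 24-byte records that fit in the buffer; it says nothing about the real size of the directory.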
      

      I mounted the filesystem on another Lustre 2.10.8 node and ls of that directory produces the same symptoms.

      I mounted the filesystem on a Lustre 2.12.4 node and ls of that directory behaves normally: getdents() is called twice. The first call returns 2 entries, and the second returns 0 entries and 0 bytes (end of directory).
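
The difference between the two clients can be modeled as a reader loop that stops only when a call returns nothing. The following is a minimal sketch, not Lustre or glibc code: `list_dir`, `healthy_reader`, and `stuck_reader` are hypothetical helpers, the record is simplified to (d_name, d_off), and the d_off values and entry order in the healthy case are made up.

```python
from typing import Callable, List, Tuple

Dirent = Tuple[str, int]  # simplified, hypothetical record: (d_name, d_off)


def list_dir(read_batch: Callable[[], List[Dirent]], max_calls: int = 8) -> List[str]:
    """Call read_batch() until it returns an empty batch (end of directory).

    Raises RuntimeError if two consecutive batches are identical, which is
    the symptom described in this report, or if max_calls is exceeded.
    """
    names: List[str] = []
    previous = None
    for _ in range(max_calls):
        batch = read_batch()
        if not batch:
            return names              # zero entries: end of directory
        if batch == previous:
            raise RuntimeError("directory stream is not advancing")
        names.extend(name for name, _ in batch)
        previous = batch
    raise RuntimeError("no end of directory after %d calls" % max_calls)


def healthy_reader() -> Callable[[], List[Dirent]]:
    # Models the 2.12.4 behavior: one batch of 2 entries, then an empty batch.
    batches = iter([[(".", 1), ("..", 2)], []])
    return lambda: next(batches)


def stuck_reader() -> Callable[[], List[Dirent]]:
    # Models the 2.10.8 behavior: the same 1365 entries, all with d_off=0.
    batch = [("..", 0)] * 676 + [(".", 0)] * 689
    return lambda: list(batch)
```

With these helpers, `list_dir(healthy_reader())` returns the two names, while `list_dir(stuck_reader())` raises on the second call. A plain loop without the repetition check, like the one ls is effectively running, would never terminate on the stuck stream.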

        Activity

          [LU-13360] getdents() against empty striped directory always returns 1365 dirents

          ofaaland Olaf Faaland added a comment -

          We're moving along on our updates to 2.12, so I'm going to abandon this. Thank you for your help, though.

          ofaaland Olaf Faaland added a comment -

          Thanks Lai, that helped me figure it out. Looks like it was the format of my commit messages.

          laisiyao Lai Siyao added a comment -

          I tried the first patch, and it looks like it works: https://review.whamcloud.com/#/c/38826/

          ofaaland Olaf Faaland added a comment -

          Hi Lai,
          I cherry-picked those patches and tried to push them to gerrit against b2_10 fortestonly, but I'm being rejected in a way I haven't seen before. I'm not sure if there's something with my patch stack, or something else - maybe a permissions thing bubbling up with a non-obvious message.

          The patch stack is here:
          https://github.com/ofaaland/lustre/tree/b-toss-4558-stripedir

          and the error I get is:

          [faaland1@oslic5 branch:b-toss-4558-stripedir lustre-210] $git push wcrev HEAD:refs/for/b2_10
          Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
          Counting objects: 126, done.
          Delta compression using up to 36 threads.
          Compressing objects: 100% (67/67), done.
          Writing objects: 100% (95/95), 87.24 KiB | 0 bytes/s, done.
          Total 95 (delta 69), reused 36 (delta 28)
          remote: Resolving deltas: 100% (69/69)
          remote: Processing changes: refs: 1, done    
          To ssh://review.whamcloud.com/fs/lustre-release
           ! [remote rejected] HEAD -> refs/for/b2_10 (not Signed-off-by author/committer/uploader in commit message footer)
          error: failed to push some refs to 'ssh://review.whamcloud.com/fs/lustre-release'
          

          It looks straightforward, but I do not see patches without the Signed-off-by matching the author, and I also found that even just pushing a branch with just one commit produces the same result:
          https://github.com/ofaaland/lustre/tree/b-test-b210-push

          Can you either try pushing that branch for me, or help me troubleshoot this? I don't know if the error message is coming from a script I can inspect to understand what's going on. It doesn't seem to be from anything under contrib in lustre.

          thanks
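
The rejection message suggests the server looks for a Signed-off-by line in the commit message footer that matches the author, committer, or uploader. The sketch below approximates such a check locally. It is an assumption about the rule, not Gerrit's actual implementation: it treats the last paragraph of the message as the footer, and the function names, sample messages, and e-mail address are all hypothetical.

```python
import re

# "Signed-off-by: Name <email>" on a line of its own.
SIGNOFF = re.compile(r"^Signed-off-by:\s*(.+?)\s*<([^<>]+)>\s*$")


def footer_signoff_emails(message):
    """Return e-mail addresses of Signed-off-by lines in the last paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", message.strip()) if p.strip()]
    if len(paragraphs) < 2:
        return []  # a message with only a subject has no footer
    emails = []
    for line in paragraphs[-1].splitlines():
        match = SIGNOFF.match(line.strip())
        if match:
            emails.append(match.group(2).lower())
    return emails


def has_matching_signoff(message, identities):
    """True if any footer sign-off e-mail is one of the given identities."""
    wanted = {e.lower() for e in identities}
    return any(e in wanted for e in footer_signoff_emails(message))


# Hypothetical messages: the sign-off is in the footer of GOOD only.
GOOD = "subject line\n\nBody text.\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
BAD = ("subject line\n\nSigned-off-by: Jane Doe <jane@example.com>\n\n"
       "Tested on two clients.\n")
```

Here `has_matching_signoff(GOOD, ["jane@example.com"])` is true and the same call on `BAD` is false, because in `BAD` the sign-off is followed by another paragraph and is no longer part of the footer.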

          laisiyao Lai Siyao added a comment -

          2.12 contains several fixes for readdir of striped directory:
          https://review.whamcloud.com/#/c/27663
          https://review.whamcloud.com/#/c/28548
          https://review.whamcloud.com/#/c/32180

          You can apply them and try again. Enabling 'trace' in the debug log would help identify the exact cause, but IMO it should already be fixed by the above patches.

          ofaaland Olaf Faaland added a comment -

          Hello Lai,

          I've attached debug.toss-4558.a.tar.gz which contains debug logs from two clients.  debug was +rpctrace and +vfstrace.  An ls -al of the same striped directory was performed on each of the two nodes.

          opal64: Lustre 2.12, the ls is successful and shows "." and ".." and exits.
          opal174: Lustre 2.10, the ls is stuck in the loop calling getdents() as described above.

          pjones Peter Jones added a comment -

          Lai

          Could you please investigate

          Peter

          ofaaland Olaf Faaland added a comment -

          Contents of the two FIDs reported by lfs getdirstripe, in case it helps.

          lfs getdirstripe:

          $lfs getdirstripe /p/lforge/faaland1/mdtest/mdtcount2
          lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
          mdtidx		 FID[seq:oid:ver]
               0		 [0x200002340:0x7:0x0]		
               1		 [0x280002340:0x7:0x0]
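
For scripting against this output, the stripe table can be turned into numeric FIDs. This is a minimal sketch that assumes the two-column layout shown above (MDT index, then a bracketed seq:oid:ver triple in hex); `Fid` and `parse_stripes` are hypothetical names, not part of any Lustre tool.

```python
import re
from typing import Dict, NamedTuple


class Fid(NamedTuple):
    seq: int
    oid: int
    ver: int


FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")


def parse_stripes(output: str) -> Dict[int, Fid]:
    """Map MDT index to Fid for each stripe line of the output."""
    stripes = {}
    for line in output.splitlines():
        fields = line.split()
        # Stripe lines have exactly two fields: a decimal index and a FID.
        if len(fields) == 2 and fields[0].isdigit():
            match = FID_RE.fullmatch(fields[1])
            if match:
                stripes[int(fields[0])] = Fid(*(int(g, 16) for g in match.groups()))
    return stripes


# The output quoted in this comment.
SAMPLE = """\
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx\t\t FID[seq:oid:ver]
     0\t\t [0x200002340:0x7:0x0]
     1\t\t [0x280002340:0x7:0x0]
"""
```

`parse_stripes(SAMPLE)` yields two entries, one per MDT, whose FIDs differ only in the sequence number.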
          

          MDT0:

          [root@zwicky1:toss4558]# find . -name '*0x200002340:0x7:0x0*' 2> /dev/null
          ./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
          ./oi.64/0x200002340:0x7:0x0
          [root@zwicky1:toss4558]# ls -l ./oi.64/0x200002340:0x7:0x0
          ls: cannot access ./oi.64/0x200002340:0x7:0x0: Input/output error
          [root@zwicky1:toss4558]# ls -al ./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
          total 22
          drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
          drwx------ 4 faaland1 faaland1 2 Mar 13 10:10 ..
          

          MDT1:

          [root@zwicky2:toss4558]# find . -name "*0x280002340:0x7:0x0*" 2> /dev/null
          ./oi.64/0x280002340:0x7:0x0
          ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
          [root@zwicky2:toss4558]# ls -al ./oi.64/0x280002340:0x7:0x0 ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
          ls: cannot access ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0: Input/output error
          ./oi.64/0x280002340:0x7:0x0:
          total 22
          drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
          drwxr-xr-x 0 root     root     0 Dec 31  1969 ..
          
          ofaaland Olaf Faaland added a comment -

          For my tracking purposes, my internal ticket is TOSS4558


          People

            laisiyao Lai Siyao
            ofaaland Olaf Faaland