Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • Severity: 3
    • 7394

    Description

      We are using lustre 2.1.4-3chaos on our server clusters.

      Running a test application, one of our archive storage folks discovered that Lustre's directory listings are rather unreliable. The first thing she noticed is that directory entries can appear multiple times:

      > cd /p/lscratchrza/apotts/divt_rzstagg0/htar_1st_27475
      > find . -type f > ../test.lst0 ; echo $? ; wc -l ../test.lst0
      0
      34339 ../test.lst0
      > find . -type f > ../test.lst1 ; echo $? ; wc -l ../test.lst1
      0
      35006 ../test.lst1
      

      When the two directory listings are sorted and run through uniq, there are only 34339 unique entries.
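
      The duplication is easy to confirm from the listings themselves; a minimal sketch, assuming a bash shell and the ../test.lst0 and ../test.lst1 files produced above:

      sort ../test.lst1 | uniq -d                  # entries repeated within a single run
      sort ../test.lst1 | uniq -d | wc -l          # how many entries are duplicated
      sort -u ../test.lst0 ../test.lst1 | wc -l    # unique entries across both runs (34339 here)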

      One of our sysadmins investigated further and found that sometimes entries are missing from the listing altogether. But when the missing files are checked with an ls, they are present.

      This has been noticed with the above find command, and also using "/bin/ls -laR .". Both files and subdirectories have appeared twice in the directory listing.
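
      A similar cross-check for the missing-entry case (again a sketch, assuming bash and the listing files from the find runs above): diff the two sorted runs, then stat whatever only shows up in one of them.

      comm -3 <(sort ../test.lst0) <(sort ../test.lst1) > ../diff.lst    # entries seen in only one of the two runs
      while read -r f; do ls -ld "$f"; done < ../diff.lst                # the "missing" entries are present when checked directly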

      The Lustre clients that have reproduced this behaviour are running 2.1.2-4chaos and 1.8.5.0-6chaos.

      Attachments

        Activity

          [LU-3029] Directory listings are unreliable
          pjones Peter Jones added a comment -

          Excellent - thanks Ned!

          nedbass Ned Bass (Inactive) added a comment -

          Peter, yes I believe we've had no further reports of this issue since we rolled out the patch. Marking resolved.

          pjones Peter Jones added a comment -

          A version of this patch has landed for both 2.4.0 and 2.1.6. Has LLNL been able to confirm that this work does correct the issue? Are we able to mark this issue as resolved?

          yong.fan nasf (Inactive) added a comment -

          Patch for b2_1: http://review.whamcloud.com/#change,6176

          nedbass Ned Bass (Inactive) added a comment -

          The patch should be backported to b2_1 as well.

          morrone Christopher Morrone (Inactive) added a comment -

          Actually, the BG/P systems are 32bit (at least the I/O nodes are)! BG/Q (Sequoia and Vulcan) are the first PPC systems that are fully 64-bit.

          yong.fan nasf (Inactive) added a comment -

          So please apply the patch:

          http://review.whamcloud.com/#change,6138

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

          We do have PPC64 clients on our BG/P systems that are stuck at 1.8. The udawn front end (login) nodes are running 1.8.5.0-5chaos. The IONs are running 1.8.5-8chaos.

          prakash Prakash Surya (Inactive) added a comment -

          > It is NOT true that all the lustre-1.8.5 releases support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

          Nice catch! We do have some clients running a 1.8.5-5chaos tag (old BG/P systems), and interestingly enough, they are only mounting the two filesystems that we see this issue on. So it all seems to add up, IMO. The 1.8.5-5chaos clients appear to have tainted a subset of the MDS threads, causing 64bithash-enabled clients to see this issue when a readdir takes more than one RPC and is serviced by a mix of tainted and untainted MDS threads.
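
          In case it is useful to anyone hitting the same symptom: a rough way to survey which client nodes are still on a pre-1.8.5.0-6chaos build (and therefore lack 64bithash support per the comments above). This is only a sketch; it assumes pdsh/dshbak are available, and the host list is illustrative.

          pdsh -w login[1-4],ion[0-255] 'cat /proc/fs/lustre/version' 2>/dev/null | dshbak -c
          # dshbak -c groups hosts whose output is identical, so nodes running an
          # out-of-date Lustre client build stand out as their own group
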
          yong.fan nasf (Inactive) added a comment - edited

          It is NOT true that all the lustre-1.8.5 releases support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: