
Sharded DNE directory full of files that don't exist

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • None
    • 3

    Description

      On our DNE testbed, one of our sharded directories seems to contain files that are all in a broken state. Currently both servers and clients are running 2.8.0_0.0.llnlpreview.40 (see the lustre-release-fe-llnl repo).

      We can get a directory listing, but nothing listed is actually accessible. Here is an excerpt from running ls -l:

      # pwd
      /p/lquake/casses1/opal-jet/simul_2
      # ls -l
      ls: cannot access simul_link.2243: No such file or directory
      ls: cannot access simul_link.3161: No such file or directory
      ls: cannot access simul_link.3129: No such file or directory
      ls: cannot access simul_link.3893: No such file or directory
      ls: cannot access simul_link.691: No such file or directory
      ls: cannot access simul_link.3233: No such file or directory
      ls: cannot access simul_link.235: No such file or directory
      ls: cannot access simul_link.1653: No such file or directory
      ls: cannot access simul_link.3167: No such file or directory
      ls: cannot access simul_link.681: No such file or directory
      ls: cannot access simul_link.835: No such file or directory
      ls: cannot access simul_link.3857: No such file or directory
      ls: cannot access simul_link.1591: No such file or directory
      ls: cannot access simul_link.1175: No such file or directory
      [cut]
      -????????? ? ? ? ?            ? simul_link.937
      -????????? ? ? ? ?            ? simul_link.94
      -????????? ? ? ? ?            ? simul_link.940
      -????????? ? ? ? ?            ? simul_link.941
      -????????? ? ? ? ?            ? simul_link.942
      -????????? ? ? ? ?            ? simul_link.943
      -????????? ? ? ? ?            ? simul_link.944
      -????????? ? ? ? ?            ? simul_link.947
      [cut]
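The symptom in this listing is that the directory read returns names whose attributes then cannot be fetched. A minimal Python sketch for enumerating such entries follows; the function name `find_dangling_entries` and its injectable `lstat` parameter are hypothetical helpers introduced here for illustration, not part of Lustre or of the original report.

```python
import errno
import os


def find_dangling_entries(directory, lstat=os.lstat):
    """Return names that the directory lists but lstat() reports as ENOENT.

    The lstat parameter is injectable only so the function can be exercised
    on a healthy file system; it is an assumption of this sketch.
    """
    dangling = []
    for name in sorted(os.listdir(directory)):
        try:
            lstat(os.path.join(directory, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                # Listed by readdir, but "No such file or directory" on stat.
                dangling.append(name)
            else:
                # Any other failure (EACCES, EIO, ...) is not this symptom.
                raise
    return dangling
```

Calling `find_dangling_entries("/p/lquake/casses1/opal-jet/simul_2")` on an affected client would be expected to return the same names that `ls -l` complains about; on a healthy directory it returns an empty list.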
      

      Here is the striping information:

      # lfs getdirstripe .
      .
      lmv_stripe_count: 16 lmv_stripe_offset: 12
      mdtidx           FID[seq:oid:ver]
          12           [0x50000996c:0x14fed:0x0]
          13           [0x54000919d:0x14fed:0x0]
          14           [0x58000a086:0x14fed:0x0]
          15           [0x5c000996b:0x14fed:0x0]
           0           [0x200006b03:0x14fed:0x0]
           1           [0x3000089cc:0x14fed:0x0]
           2           [0x38000996d:0x14fed:0x0]
           3           [0x4c000b0df:0x14fed:0x0]
           4           [0x2c000a142:0xec09:0x0]
           5           [0x3c000b8b2:0xec09:0x0]
           6           [0x34000a143:0xec09:0x0]
           7           [0x40000a143:0xec09:0x0]
           8           [0x44000a142:0xec09:0x0]
           9           [0x24000a143:0xec09:0x0]
          10           [0x2800091a4:0xec09:0x0]
          11           [0x4800091a3:0xec09:0x0]
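Each row of this output pairs an MDT index with the FID of the stripe on that MDT, in `[seq:oid:ver]` form. In the listing above, the stripes on MDTs 12-15 and 0-3 share oid 0x14fed, while those on MDTs 4-11 share oid 0xec09; the report does not say whether this is related to the failure. The following Python sketch parses saved `lfs getdirstripe` output and groups MDT indices by oid to make such patterns easy to spot. The function names are hypothetical, and the parser assumes only the row layout shown above.

```python
import re
from collections import defaultdict

# Matches rows such as "    12           [0x50000996c:0x14fed:0x0]".
STRIPE_LINE = re.compile(
    r"^\s*(\d+)\s+\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]\s*$"
)


def parse_dir_stripes(text):
    """Return (mdt_index, seq, oid, ver) tuples in the order they appear."""
    stripes = []
    for line in text.splitlines():
        match = STRIPE_LINE.match(line)
        if match:
            mdt, seq, oid, ver = match.groups()
            stripes.append((int(mdt), int(seq, 16), int(oid, 16), int(ver, 16)))
    return stripes


def group_mdts_by_oid(stripes):
    """Map each FID oid to the list of MDT indices whose stripe uses it."""
    groups = defaultdict(list)
    for mdt, _seq, oid, _ver in stripes:
        groups[oid].append(mdt)
    return dict(groups)
```

Header lines such as `lmv_stripe_count: 16 lmv_stripe_offset: 12` do not match the row pattern and are skipped.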
      

      I ran lfsck on all services (at least those started by the "--all" option), but that did not repair the directory.
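The exact command line used is not given in the report. As a sketch, a namespace LFSCK run across all targets is typically started with `lctl lfsck_start` and the `--all` option, and the Python helper below only assembles such a command line. The device name `lquake-MDT0000` is an assumption inferred from the mount path, and the function name is hypothetical.

```python
def lfsck_start_argv(device, lfsck_type="namespace", all_targets=True):
    """Build an argv list for `lctl lfsck_start` (not executed here).

    device:      MDT device to start from, e.g. "lquake-MDT0000" (assumed name).
    lfsck_type:  LFSCK component to run; "namespace" checks directory entries.
    all_targets: when True, add --all so the scan covers all targets.
    """
    argv = ["lctl", "lfsck_start", "-M", device]
    if all_targets:
        argv.append("--all")
    argv += ["-t", lfsck_type]
    return argv
```

The resulting list can be passed to `subprocess.run` on the MDS node; for `lfsck_start_argv("lquake-MDT0000")` it corresponds to `lctl lfsck_start -M lquake-MDT0000 --all -t namespace`.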

      The problem files cannot be unlinked:

      # rm simul_link.999
      rm: cannot remove 'simul_link.999': No such file or directory
      

      Attachments

        1. getstripelogs.tar.gz
          0.2 kB
        2. jet-link-logs-part1.tar.gz
          0.2 kB
        3. jet-link-logs-part2.tar.gz
          0.2 kB
        4. jet-link-logs-part3.tar.gz
          0.2 kB
        5. jet-link-logs-part4.tar.gz
          0.2 kB
        6. lfsck_namespace_state-9-28-2016.log
          24 kB

        Issue Links

          Activity

            [LU-8569] Sharded DNE directory full of files that don't exist
            nedbass Ned Bass (Inactive) made changes -
            Labels Original: llnl topllnl New: llnl
            mdiep Minh Diep made changes -
            Link Original: This issue is related to JFC-17 [ JFC-17 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-20 [ JFC-20 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-21 [ JFC-21 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to LDEV-301 [ LDEV-301 ]
            mdiep Minh Diep made changes -
            Link Original: This issue is related to LDEV-341 [ LDEV-341 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to LDEV-342 [ LDEV-342 ]

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Apologies Peter, I went ahead and created LU-9037 to track the porting, so that those who are interested can follow its progress.
            pjones Peter Jones added a comment -

            We'll post the links on the ticket and mark with llnlfixready when it's ready for you to pick up

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Peter,

            Are there tasks created so I can keep track of the 2.8 FE port?

            Joe

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 8
