
directory entries for non existing files

    Description

      We have several directories with entries for nonexistent files. For example:

      [root@quartz2311:~]# ls -l /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0                                                                                 
      ls: cannot access /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003: No such file or directory
      total 3154
      -rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.000
      -rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.001
      -rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.002
      -????????? ? ?       ?             ?            ? filler.003
      drwx------ 2 casses1 casses1   25600 Dec 21 16:43 ~dmtmp
      

      The directory itself is a remote directory on one MDT:

      [root@quartz2311:~]# lfs getdirstripe -d /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0
      lmv_stripe_count: 0 lmv_stripe_offset: 3
      

      We are able to get striping information for this file:

      [root@quartz2311:~]# lfs getstripe /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003
      /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003
      lmm_stripe_count:   1
      lmm_stripe_size:    1048576
      lmm_pattern:        1
      lmm_layout_gen:     0
      lmm_stripe_offset:  27
              obdidx           objid           objid           group
                  27        20538776      0x1396598      0xcc0000402
      

      It looks like the OSS serving that OST was rebooted and the OST went through recovery around the time the missing file was created. In particular, we note that the object number falls in the range of orphan objects that were deleted:

      [root@zinci:~]# grep 0xcc0000402 /var/log/conman/console.zinc*
      /var/log/conman/console.zinc43:2016-12-21 16:30:56 [189484.767900] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538706 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:33:30 [189639.110247] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:35:41 [189769.704490] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:40:19 [190047.449320] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:44:45 [190313.751155] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649
      /var/log/conman/console.zinc44:2016-12-21 16:49:27 [  159.838420] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649
      

      I will attach server console logs separately.
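
      The range check described above can be reproduced with a short script. This is only a sketch: the regular expression, the function names, and the choice to treat both ends of each range as inclusive are assumptions made for illustration, not anything taken from Lustre itself. The log lines and object numbers are copied from the output quoted above.

      ```python
      import re

      # Matches the console message quoted above; the group names are our own.
      ORPHAN_RE = re.compile(
          r"deleting orphan objects from "
          r"(?P<seq>0x[0-9a-fA-F]+):(?P<start>\d+) to 0x[0-9a-fA-F]+:(?P<end>\d+)"
      )

      def orphan_ranges(lines, seq):
          """Yield (start, end) for each orphan-cleanup message with the given sequence."""
          for line in lines:
              m = ORPHAN_RE.search(line)
              if m and int(m.group("seq"), 16) == seq:
                  yield int(m.group("start")), int(m.group("end"))

      def covering_ranges(lines, seq, objid):
          """Return the ranges that contain objid (endpoints assumed inclusive)."""
          return [(s, e) for s, e in orphan_ranges(lines, seq) if s <= objid <= e]

      LOG = """\
      /var/log/conman/console.zinc43:2016-12-21 16:30:56 [189484.767900] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538706 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:33:30 [189639.110247] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:35:41 [189769.704490] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:40:19 [190047.449320] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
      /var/log/conman/console.zinc43:2016-12-21 16:44:45 [190313.751155] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649
      /var/log/conman/console.zinc44:2016-12-21 16:49:27 [  159.838420] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649
      """.splitlines()

      SEQ = 0xcc0000402   # "group" column from lfs getstripe
      OBJID = 20538776    # "objid" column from lfs getstripe (0x1396598)

      hits = covering_ranges(LOG, SEQ, OBJID)
      for start, end in hits:
          print(f"object {OBJID} lies within deleted range {start}..{end}")
      ```

      On the six messages quoted above, the four logged between 16:30:56 and 16:40:19 cover object 20538776; the two later messages start at 20538820 and do not.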

          Activity

            [LU-8967] directory entries for non existing files

            tappro Mikhail Pershin added a comment -

            Ned, so this issue is solved by LU-8562 in general, but the patch itself contains a defect. I checked your patch; does it solve your problem, or is more work required in that area?

            Interesting that LU-8562 itself is quite a recent change, and we didn't observe a lot of issues similar to LU-8967 without it. I wonder what changed in your system when you started seeing it. Was it just a software update, or hardware as well?

            nedbass Ned Bass (Inactive) added a comment -

            Hi Mikhail, each occurrence that I've investigated happened immediately after the OST completed recovery. The object numbers of the missing files all fall at the beginning of the range of deleted orphans. It does not continue to occur when all OSTs are up.

            I can remove the files as root. The rm command fails for an unprivileged user because stat() returns ENOENT and rm treats that as fatal unless you're root.

            I have confirmed that I can reproduce LU-8562 on our system using the test case from that patch, and it looks just like this issue. I tested https://review.whamcloud.com/#/c/22211/ on a single-node setup and wasn't able to reproduce the bug. However, I ran into a defect with that patch that causes the osp_precreate thread to hang, as I described in LU-8562.
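
            The symptom described in this comment (a name that the directory lists but that stat() rejects with ENOENT) can be searched for with a few lines of Python. This is a minimal sketch, not a Lustre tool: the function name and the injectable lstat parameter are hypothetical and exist only to make the check easy to exercise. A file removed between the listing and the lstat call would be reported too, so any hit should be rechecked.

            ```python
            import os
            import sys

            def find_dangling_entries(directory, lstat=os.lstat):
                """Return names listed in directory whose lstat() fails with ENOENT."""
                dangling = []
                for name in sorted(os.listdir(directory)):
                    try:
                        lstat(os.path.join(directory, name))
                    except FileNotFoundError:
                        # The name came back from readdir, but the inode lookup failed.
                        dangling.append(name)
                return dangling

            if __name__ == "__main__":
                # Usage: pass one or more directories to scan on the command line.
                for directory in sys.argv[1:]:
                    for name in find_dangling_entries(directory):
                        print(os.path.join(directory, name))
            ```

            Only the top level of each directory is scanned; walking a whole tree would need os.walk or similar on top of this.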

            tappro Mikhail Pershin added a comment -

            Ned, did these entries occur once when the OST failed over, or do they still continue to occur? Is it possible to remove them?

            I am checking the patches you've mentioned.

            nedbass Ned Bass (Inactive) added a comment -

            I suspect this is related to LU-8562.
            pjones Peter Jones added a comment -

            Mike

            Could you please assist with this issue?

            Thanks

            Peter


            nedbass Ned Bass (Inactive) added a comment -

            Also please note that we first observed this problem after our most recent Lustre update to 2.8.0_6chaos last Friday (December 16). The patches added in that update were:

            * 353716b (tag: 2.8.0_6.chaos, llnl/2.8.0-llnl) LU-8753 llog: add some debug patch
            * 17d469a LU-8936 llite: use percpu env correctly in ll_invalidatepage
            * a15b2ef LU-8361 lfsck: detect Lustre device automatically
            * 0220e0b LU-7648 man: new man pages for LFSCK commands
            * 1638a07 LU-7256 tests: wait current LFSCK to exit before next test
            * 1d8cfaa LU-8407 recovery: more clear message about recovery failure
            * fdea0d2 LU-7732 ldlm: silence verbose "waking for gap" log messages
            * 82e924c LU-8753 llog: remove lgh_write_offset
            * 3a8db9a LU-8493 osp: Do not set stale for new osp obj
            * 38c062b LU-7660 dne: support fs default stripe
            * bc3df36 Revert "LU-8422 update: add more debug info for the ticket"
            * 10170a0 Revert "LU-8422 llog: extended debug info"
            * 490414a Revert "LU-6635 lfsck: more debug message for sanity-lfsck test_18e"

            People

              tappro Mikhail Pershin
              nedbass Ned Bass (Inactive)