2.6 DNE stress testing: EINVAL when attempting to delete file

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0
    • 3
    • 14529

    Description

      After our stress testing this weekend, we are unable to delete some (perhaps any?) of the files on a particular OST (OST 38). All of them give EINVAL.

      For example:
      [root@galaxy-esf-mds008 tmp]# rm -f posix_shm_open
      rm: cannot remove `posix_shm_open': Invalid argument
      [root@galaxy-esf-mds008 tmp]# lfs getstripe posix_shm_open
      posix_shm_open
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 38
      obdidx     objid      objid    group
          38    907263    0xdd7ff        0

      However, OST 38 (OST0027) is showing up in lctl dl, and as far as I know, there are no issues with it. (The dk logs on the OSS don't show any issues.)
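As a cross-check on the target name: Lustre target names carry the target index as four hexadecimal digits, so it may be worth confirming which name corresponds to stripe offset 38. The sketch below does the conversion in both directions. The helper names `ost_name` and `ost_index` are hypothetical, and the only assumption is the hex-suffix naming convention.

```python
def ost_name(index: int) -> str:
    """Format a decimal OST index as a target-name suffix, e.g. 10 -> 'OST000a'."""
    if not 0 <= index <= 0xFFFF:
        raise ValueError(f"OST index out of range: {index}")
    return f"OST{index:04x}"


def ost_index(name: str) -> int:
    """Parse the hexadecimal index out of a suffix such as 'OST0027'."""
    if not name.startswith("OST"):
        raise ValueError(f"not an OST name: {name!r}")
    return int(name[3:], 16)


print(ost_name(38))          # OST0026
print(ost_index("OST0027"))  # 39
```

Under that convention, index 38 would appear as OST0026 and OST0027 would be index 39, so it may be worth double-checking which device was inspected in lctl dl.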

      Here's the relevant part of the log from MDT0000:
      00000004:00020000:2.0:1402947131.685511:0:25039:0:(lod_lov.c:695:validate_lod_and_idx()) esfprod-MDT0000-mdtlov: bad idx: 38 of 64
      00000004:00000001:2.0:1402947131.685513:0:25039:0:(lod_lov.c:757:lod_initialize_objects()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
      00000004:00000010:2.0:1402947131.685515:0:25039:0:(lod_lov.c:782:lod_initialize_objects()) kfreed 'stripe': 8 at ffff8807fc208a00.
      00000004:00000001:2.0:1402947131.685516:0:25039:0:(lod_lov.c:788:lod_initialize_objects()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685519:0:25039:0:(lod_lov.c:839:lod_parse_striping()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685520:0:25039:0:(lod_lov.c:885:lod_load_striping_locked()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685522:0:25039:0:(lod_object.c:2754:lod_declare_object_destroy()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685524:0:25039:0:(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
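The rc values in the trace above are printed as unsigned 64-bit numbers alongside their signed and hex forms. A minimal sketch for pulling the function name and errno out of such lines follows; the names `LINE_RE`, `to_signed64`, and `summarize` are hypothetical helpers written for this illustration, not part of any Lustre tool, and the regular expression assumes only the line layout visible in the excerpt.

```python
import errno
import re

# Matches the "(file:line:function())" location and the first rc field of a trace line.
LINE_RE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\).*?rc=(?P<rc>\d+)"
)


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def summarize(line: str):
    """Return (function, signed rc, errno name), or None if the line has no rc field."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = to_signed64(int(m.group("rc")))
    name = errno.errorcode.get(-rc, "unknown") if rc < 0 else "none"
    return m.group("func"), rc, name


sample = (
    "00000004:00000001:2.0:1402947131.685524:0:25039:0:"
    "(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop "
    "(rc=18446744073709551594 : -22 : 0xffffffffffffffea)"
)
print(summarize(sample))  # ('mdd_unlink', -22, 'EINVAL')
```

Applied to each line of the excerpt, this shows the same -22 (EINVAL) propagating from lod_initialize_objects() up to mdd_unlink(), which matches the error that rm reports.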

      I don't know for certain if this is related to DNE2 or not, but this is not an error I've seen before. The file system and objects are still around, so I can provide further data if needed.

      Any thoughts?

      Attachments

        1. invalid_object_client_mdt0007
          506 kB
        2. invalid_object_mds_mdt0000
          133 kB
        3. lctl_dl_from_client
          4 kB
        4. lctl_dl_from_mds001_mdt0000
          4 kB
        5. LU-5204_mds0_start_log.tar.gz
          0.2 kB
        6. LU-5204_start_log_with_oss.tar.gz
          0.3 kB
        7. mdt0.config.log
          57 kB

          Activity

            [LU-5204] 2.6 DNE stress testing: EINVAL when attempting to delete file
            jlevi Jodi Levi (Inactive) made changes -
            Fix Version/s Original: Lustre 2.7.0 [ 10631 ]
            adilger Andreas Dilger made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment - Unable to figure out what the problem is, please reopen if it is hit again.
            jlevi Jodi Levi (Inactive) made changes -
            Fix Version/s Original: Lustre 2.7.0 [ 10631 ]
            di.wang Di Wang added a comment - Since we cannot reproduce the problem locally, I cannot figure out why the config log is "corrupted". If it happens again in DNE testing, please note the steps needed to reproduce it. That will probably give us more ideas.
            emoly.liu Emoly Liu added a comment - Sorry for my late update. I can't reproduce this issue in my testing environment.
            jlevi Jodi Levi (Inactive) made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Fix Version/s Original: Lustre 2.6.0 [ 10595 ]
            adilger Andreas Dilger made changes -
            Affects Version/s New: Lustre 2.6.0 [ 10595 ]

            adilger Andreas Dilger added a comment - The one obvious problem that I see is that it should ALWAYS be possible to delete a file, even if the OST is unavailable, or configured out of the system. Regardless of what the root cause of the problem is, there needs to be a patch to allow the file to be deleted.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-5233 [ LU-5233 ]

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 7
