
[LU-5204] 2.6 DNE stress testing: EINVAL when attempting to delete file

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0
    • Severity: 3
    • 14529

    Description

      After our stress testing this weekend, we are unable to delete some (perhaps any?) of the files on a particular OST (OST 38). All of them give EINVAL.

      For example:
      [root@galaxy-esf-mds008 tmp]# rm -f posix_shm_open
      rm: cannot remove `posix_shm_open': Invalid argument
      [root@galaxy-esf-mds008 tmp]# lfs getstripe posix_shm_open
      posix_shm_open
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 38
      obdidx objid objid group
      38 907263 0xdd7ff 0
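
      (For reference, other files with objects on this OST could be enumerated
      with lfs find; the mount point below is hypothetical, and lfs versions of
      this vintage may want the OST UUID, e.g. esfprod-OST0026_UUID, rather
      than the bare index:)

      lfs find /mnt/esfprod --type f --ost 38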

      However, OST 38 (OST0026) is showing up in lctl dl, and as far as I know, there are no issues with it. (The dk logs on the OSS don't show any issues.)

      Here's the relevant part of the log from MDT0000:
      00000004:00020000:2.0:1402947131.685511:0:25039:0:(lod_lov.c:695:validate_lod_and_idx()) esfprod-MDT0000-mdtlov: bad idx: 38 of 64
      00000004:00000001:2.0:1402947131.685513:0:25039:0:(lod_lov.c:757:lod_initialize_objects()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
      00000004:00000010:2.0:1402947131.685515:0:25039:0:(lod_lov.c:782:lod_initialize_objects()) kfreed 'stripe': 8 at ffff8807fc208a00.
      00000004:00000001:2.0:1402947131.685516:0:25039:0:(lod_lov.c:788:lod_initialize_objects()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685519:0:25039:0:(lod_lov.c:839:lod_parse_striping()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685520:0:25039:0:(lod_lov.c:885:lod_load_striping_locked()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685522:0:25039:0:(lod_object.c:2754:lod_declare_object_destroy()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685524:0:25039:0:(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
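
      A quick cross-check for this kind of "bad idx" failure, assuming the
      standard lov parameter layout (the device name is taken from the log
      lines above; everything else here is a sketch):

      # On the MDS, list the OST targets as the MDT's LOV/LOD layer sees them:
      lctl get_param lov.esfprod-MDT0000-mdtlov.target_obd
      # A healthy index 38 should show up as something like:
      #   38: esfprod-OST0026_UUID ACTIVE
      # If index 38 is missing or inactive here while the OSS side looks fine,
      # the MDT's copy of the config log is the likely culprit.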

      I don't know for certain if this is related to DNE2 or not, but this is not an error I've seen before. The file system and objects are still around, so I can provide further data if needed.

      Any thoughts?

      Attachments

        1. invalid_object_client_mdt0007
          506 kB
        2. invalid_object_mds_mdt0000
          133 kB
        3. lctl_dl_from_client
          4 kB
        4. lctl_dl_from_mds001_mdt0000
          4 kB
        5. LU-5204_mds0_start_log.tar.gz
          0.2 kB
        6. LU-5204_start_log_with_oss.tar.gz
          0.3 kB
        7. mdt0.config.log
          57 kB


          Activity


            adilger Andreas Dilger added a comment -

            Unable to figure out what the problem is; please reopen if it is hit again.
            di.wang Di Wang added a comment -

            Since we cannot reproduce the problem locally, I cannot figure out why the config log is "corrupted". If it happens again in DNE testing, please note the steps taken to reproduce it; we will probably have more ideas then.
            emoly.liu Emoly Liu added a comment -

            Sorry for my late update. I can't reproduce this issue in my testing environment.

            adilger Andreas Dilger added a comment -

            The one obvious problem that I see is that it should ALWAYS be possible to delete a file, even if the OST is unavailable or configured out of the system. Regardless of the root cause of the problem, there needs to be a patch to allow the file to be deleted.
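
            For context, the expected behavior can be sanity-checked on a
            healthy system: unlink should still succeed with the OST's OSC
            deactivated on the client, since object destruction is driven from
            the MDS. A minimal sketch (the device name is hypothetical; the
            real one comes from 'lctl dl' on the client):

            # Deactivate the client-side OSC for the suspect OST:
            lctl --device %esfprod-OST0026-osc-ffff8807fc208a00 deactivate
            rm /mnt/esfprod/somefile    # expected: still succeeds
            # Re-enable it afterwards:
            lctl --device %esfprod-OST0026-osc-ffff8807fc208a00 activate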

            paf Patrick Farrell (Inactive) added a comment -

            Opened LU-5233 for the MDS1 LBUG I mentioned above.

            paf Patrick Farrell (Inactive) added a comment -

            Andreas,

            It's really unlikely. No one should have been mucking with the system. I can't say it's impossible, but...

            Now that we've tracked it down to such a strange error, I'm planning to go ahead and fix it, and not worry unless it occurs again in further stress testing. In fact, I'm going to do exactly that unless someone has further information they'd like from the system. (Speak up soon - I'm going to fix it for our stress testing slot tonight.)

            I've also (in further testing) hit an MDS0 crash bug that could possibly be related to this one; I'm going to open a ticket for it shortly. I'll reference that LU here once I've got it open.

            adilger Andreas Dilger added a comment -

            Is it possible that OST0026 was ever deactivated during testing (e.g. lctl conf_param esfprod-OST0026.osc.active=0 or similar)? That would permanently disable the OST in the config log and seems to me to be the most likely cause of this problem.
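
            If that is what happened, one way to check, assuming the standard
            lctl commands (the config log names below are inferred from the
            fsname "esfprod" and are not confirmed in this ticket):

            # On the MGS: dump the config logs and look for a record that
            # marked the OST inactive:
            lctl --device MGS llog_print esfprod-client | grep -B1 -A1 OST0026
            lctl --device MGS llog_print esfprod-MDT0000 | grep -B1 -A1 OST0026
            # If an "active=0" record is found, the OST can be re-enabled:
            lctl conf_param esfprod-OST0026.osc.active=1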

            paf Patrick Farrell (Inactive) added a comment -

            Emoly,

            Unfortunately, I don't really know how many are enough. We have 8 MDSes and 8 MDTs, and 4 OSSes and 40 OSTs. It's a test bed system for DNE, which is why it's such a weird configuration.

            We do have a separate MGT and MDT.

            As for what we did: we ran a bunch of different IO tests, like IOR and a large number of tests from the Linux Test Project, in various configurations, all with mkdir replaced by a script that would randomly create striped or remote directories (see the sketch after this comment). It would also sometimes create normal directories.

            We did that last weekend, and had this problem on Monday. No idea what was running when it started.

            Sorry for not having many specifics on testing; it's a large test suite.

            We're probably going to fix the system soon by doing a writeconf, so we can continue stress testing DNE2. Let me know if there's anything else I can give you first.
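
            For reference, a minimal sketch of the kind of mkdir wrapper
            described above, assuming 8 MDTs and the standard lfs mkdir
            options (the actual test script is not shown in this ticket):

            #!/bin/bash
            # Randomly create a normal, remote, or striped directory.
            NMDT=8
            case $((RANDOM % 3)) in
                0) mkdir "$1" ;;                                      # normal directory
                1) lfs mkdir -i $((RANDOM % NMDT)) "$1" ;;            # remote directory on a random MDT
                2) lfs mkdir -c $((2 + RANDOM % (NMDT - 1))) "$1" ;;  # striped directory (DNE2)
            esac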
            emoly.liu Emoly Liu added a comment -

            Patrick,

            I will try to upgrade a Lustre file system from 2.5.1 to 2.6 to reproduce this problem. Could you please suggest how many OSTs and MDTs are enough for this test? Also, I know the MGS and MDS should be separate in this test; is there anything else I should pay attention to?

            Thanks.
            pjones Peter Jones added a comment -

            Emoly

            Could you please try reproducing this issue?

            Thanks

            Peter

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 7
