  1. Lustre
  2. LU-5204

2.6 DNE stress testing: EINVAL when attempting to delete file

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.6.0
    • 3
    • 14529

    Description

      After our stress testing this weekend, we are unable to delete some (perhaps any?) of the files on a particular OST (OST 38). All of them give EINVAL.

      For example:
      [root@galaxy-esf-mds008 tmp]# rm -f posix_shm_open
      rm: cannot remove `posix_shm_open': Invalid argument
      [root@galaxy-esf-mds008 tmp]# lfs getstripe posix_shm_open
      posix_shm_open
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 38
      obdidx     objid     objid     group
          38    907263   0xdd7ff         0

      However, OST 38 (OST0027) is showing up in lctl dl, and as far as I know, there are no issues with it. (The dk logs on the OSS don't show any issues.)
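      When matching a decimal stripe index from lfs getstripe against device names in lctl dl, it may help to double-check the conversion. The sketch below assumes the usual convention that a target name carries its index as four hexadecimal digits; the helper names are hypothetical and are not part of any Lustre tool.

```python
# Minimal sketch, assuming OST target names encode the index as 4 hex digits.
# ost_target_name() and ost_index() are hypothetical helpers for illustration.

def ost_target_name(index: int) -> str:
    """Return the OSTxxxx suffix expected for a decimal OST index."""
    return f"OST{index:04x}"


def ost_index(name: str) -> int:
    """Return the decimal index encoded in the last four hex digits of a name."""
    return int(name[-4:], 16)


if __name__ == "__main__":
    # Index reported by lfs getstripe (lmm_stripe_offset / obdidx).
    print(ost_target_name(38))
    # Device name as it would appear in lctl dl.
    print(ost_index("OST0027"))
```

      Under that assumption, comparing both directions shows which lctl dl entry corresponds to the index the MDT rejects.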

      Here's the relevant part of the log from MDT0000:
      00000004:00020000:2.0:1402947131.685511:0:25039:0:(lod_lov.c:695:validate_lod_and_idx()) esfprod-MDT0000-mdtlov: bad idx: 38 of 64
      00000004:00000001:2.0:1402947131.685513:0:25039:0:(lod_lov.c:757:lod_initialize_objects()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
      00000004:00000010:2.0:1402947131.685515:0:25039:0:(lod_lov.c:782:lod_initialize_objects()) kfreed 'stripe': 8 at ffff8807fc208a00.
      00000004:00000001:2.0:1402947131.685516:0:25039:0:(lod_lov.c:788:lod_initialize_objects()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685519:0:25039:0:(lod_lov.c:839:lod_parse_striping()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685520:0:25039:0:(lod_lov.c:885:lod_load_striping_locked()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685522:0:25039:0:(lod_object.c:2754:lod_declare_object_destroy()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
      00000004:00000001:2.0:1402947131.685524:0:25039:0:(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
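      The trace above can be read mechanically: the rc values are the same 64-bit number printed as unsigned, signed, and hex, and the function names show where the error is raised and how far it propagates. The following is a rough sketch of that reading. The field layout in the regular expression is inferred only from the lines quoted here, and parse_line and to_signed64 are hypothetical helpers, not part of any Lustre tooling.

```python
import errno
import re

# Field layout inferred from the quoted lines only (an assumption):
# subsystem:mask:cpu:timestamp:<n>:pid:<n>:(file:line:function()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]+):(?P<mask>[0-9a-f]+):(?P<cpu>[\d.]+):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)
RC_RE = re.compile(r"rc=(\d+)")


def to_signed64(raw: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed one."""
    return raw - (1 << 64) if raw >= (1 << 63) else raw


def parse_line(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rc = RC_RE.search(rec["msg"])
    rec["rc"] = to_signed64(int(rc.group(1))) if rc else None
    return rec


# Two lines copied from the log excerpt: the first error and the last return.
SAMPLE = [
    "00000004:00020000:2.0:1402947131.685511:0:25039:0:"
    "(lod_lov.c:695:validate_lod_and_idx()) esfprod-MDT0000-mdtlov: bad idx: 38 of 64",
    "00000004:00000001:2.0:1402947131.685524:0:25039:0:"
    "(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop "
    "(rc=18446744073709551594 : -22 : 0xffffffffffffffea)",
]

if __name__ == "__main__":
    for rec in filter(None, map(parse_line, SAMPLE)):
        name = errno.errorcode.get(-rec["rc"], "?") if rec["rc"] else "-"
        print(f'{rec["file"]}:{rec["line"]} {rec["func"]} rc={rec["rc"]} {name}')
```

      Read this way, the excerpt shows the index check in validate_lod_and_idx() failing first, and the same -22 (EINVAL) being returned up through lod_initialize_objects(), lod_parse_striping(), lod_load_striping_locked(), and lod_declare_object_destroy() to mdd_unlink().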

      I don't know for certain if this is related to DNE2 or not, but this is not an error I've seen before. The file system and objects are still around, so I can provide further data if needed.

      Any thoughts?

      Attachments

        1. invalid_object_client_mdt0007
          506 kB
        2. invalid_object_mds_mdt0000
          133 kB
        3. lctl_dl_from_client
          4 kB
        4. lctl_dl_from_mds001_mdt0000
          4 kB
        5. LU-5204_mds0_start_log.tar.gz
          0.2 kB
        6. LU-5204_start_log_with_oss.tar.gz
          0.3 kB
        7. mdt0.config.log
          57 kB

            People

              emoly.liu Emoly Liu
              paf Patrick Farrell