[LU-9725] Mount commands don't return for targets in LFS with DNE and 3 MDTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3

    Description

      kernel version: 3.10.0-514.21.1.el7_lustre.x86_64
      lustre version: 2.10.0_RC1-1.el7
      OS: CentOS Linux release 7.3.1611 (Core)

      The failure occurs consistently in test_md0_undeleteable() in test_filesystem_dne.py during IML SSI automated test runs against Lustre b2_10.

      This is the only test we have that creates a filesystem with 3 MDTs.

      When the LFS is recreated outside the test infrastructure, through IML, in a similar configuration (MGS, 3 MDTs and 1 OST), the mount commands for all other targets return successfully, but the OST mount command never returns.

      While the MDT mount commands are being issued, the kernel messages log shows a lot of activity, including multiple LustreErrors and stack traces, warnings of high CPU usage, and then:

      kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_fs1-:13630]
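When scanning a large messages log for this kind of event, it can help to pull the CPU, duration, task name and PID out of each soft lockup line. The following is a minimal sketch; the regular expression is derived only from the single message quoted above, so other kernel versions may format the line differently, and the function name is hypothetical.

```python
import re

# Pattern based solely on the one message quoted in this ticket.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup message, or None if no match."""
    m = SOFT_LOCKUP.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "task": m.group("task"),
        "pid": int(m.group("pid")),
    }
```

Applied to every line of messages.txt, this would give a quick list of which kernel threads were stuck and for how long.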

      This is on an ldiskfs-only LFS with DNE enabled. The OST mount command used is as follows; the MDT mount commands have a similar format:

      mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk5 /mnt/fs1-OST0000
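Because the symptom is a mount command that never returns, a reproduction script may want to bound the wait instead of blocking forever. The sketch below assumes the command form quoted above; the helper names are hypothetical and are not part of IML or Lustre.

```python
import subprocess


def mount_command(device, mountpoint):
    """Build the argv for mounting a Lustre target, in the form quoted above."""
    return ["mount", "-t", "lustre", device, mountpoint]


def wait_or_report_hang(proc, timeout_s):
    """Return the exit code, or None if the process has not exited in time.

    The process is deliberately not killed here: a mount blocked inside the
    kernel may ignore signals, and waiting for it to die could hang as well.
    """
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
```

A caller would start the mount with `subprocess.Popen(mount_command(device, mountpoint))` and treat a `None` result as a hang to be investigated (for example by capturing sysrq-t output) rather than as an ordinary failure.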

      The following gists show excerpts from the /var/log/messages log during instances of this type of failure (MDT mounting in DNE):

      https://gist.github.com/tanabarr/1adb35a7e7da2581be79df8f45417411
      https://gist.github.com/tanabarr/70d3bfa66c4fc474b82c7c02adcda511
      https://gist.github.com/tanabarr/9f54584621aacfdeb3899f59687cb918

      The last gist link is an extended excerpt giving more contextual log information regarding the attempted mounting of the MDTs and the subsequent CPU load warnings. The entire logfile for that failure instance (in addition to other IML related log files) is attached to this ticket.

      original IML ticket: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/108

      Attachments

        1. yum.log.txt
          147 kB
        2. sysrq-t
          375 kB
        3. messages.txt
          2.13 MB
        4. job_scheduler.log.txt
          8.02 MB
        5. chroma-agent-console.log.txt
          1.22 MB
        6. chroma-agent.log.txt
          1.57 MB

          Activity

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28357/
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: e53c0fbeefc1c29d7b5256c6a4cc6ead96ae41e8

            pjones Peter Jones added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28356/
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ce8ca7d3564439285a56982430f380354b697f68


            simmonsja James A Simmons added a comment -

            I just tested this patch, and this is the bug that was preventing my debugfs port. Because lwp was not being fully unregistered, the debugfs kobjects were not being freed, so the second attempt to mount the MDT failed because the debugfs files already existed. You can't register a debugfs file twice.
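The failure mode described in this comment can be illustrated with a toy model: a namespace that rejects duplicate names, and a teardown path that sometimes skips deregistration. This is only a sketch of the general pattern; it is not derived from the actual patch or from Lustre source, and every name in it (including the `cleanup_enabled` flag) is hypothetical.

```python
class NamespaceModel:
    """Toy stand-in for a namespace that refuses duplicate entries."""

    def __init__(self):
        self.entries = set()

    def register(self, name):
        if name in self.entries:
            raise FileExistsError(name)
        self.entries.add(name)

    def unregister(self, name):
        self.entries.discard(name)


def stop_conditional(ns, name, cleanup_enabled):
    """Teardown that deregisters only under some condition (leaky)."""
    if cleanup_enabled:
        ns.unregister(name)


def stop_always(ns, name):
    """Teardown that always deregisters, so a later register succeeds."""
    ns.unregister(name)
```

With `stop_conditional(ns, "lwp", cleanup_enabled=False)`, a second `ns.register("lwp")` raises `FileExistsError`; with `stop_always`, the second registration succeeds.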

            brian Brian Murrell (Inactive) added a comment -

            laisiyao: Sure. I will do a cherry-pick to b2_10 and test from there.

            Brian J. Murrell (brian.murrell@intel.com) uploaded a new patch: https://review.whamcloud.com/28357
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 728454ee8f85d3074124933d0d83e42f10515500

            laisiyao Lai Siyao added a comment -

            Brian, I uploaded a patch, could you help verify it?


            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28356
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1a5df6bc53e8356d8ae83d4031ea4397ebc03af3


            brian Brian Murrell (Inactive) added a comment -

            laisiyao: No problem. Let me know if there is anything else I can do to help.
            laisiyao Lai Siyao added a comment -

            Thanks Brian, I'll continue checking the code.


            People

              laisiyao Lai Siyao
              tanabarr Tom Nabarro (Inactive)
              Votes: 0
              Watchers: 10
