Mount commands don't return for targets in LFS with DNE and 3 MDTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      kernel version: 3.10.0-514.21.1.el7_lustre.x86_64
      lustre version: 2.10.0_RC1-1.el7
      OS: CentOS Linux release 7.3.1611 (Core)

      The failure occurs consistently in test_md0_undeleteable() in test_filesystem_dne.py during IML SSI automated test runs against Lustre b2.10.

      This is the only test we have that creates a filesystem with 3 MDTs.

      When the LFS is recreated through IML outside the test infrastructure in a similar configuration (MGS, 3 MDTs and 1 OST), the mount commands for all other targets return successfully, but the OST mount command never returns.

      While the MDT mount commands are being issued, there is a lot of activity in the kernel messages log, including multiple LustreErrors, stack traces and warnings of high CPU usage, followed by:

      kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_fs1-:13630]

      This is on an LDISKFS-only LFS with DNE enabled. The OST mount command used is as follows, and the MDT mount commands have a similar format:

      mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk5 /mnt/fs1-OST0000
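A minimal sketch of how a test harness might detect a mount command that never returns, rather than blocking on it forever. The helper name `command_returns` and the timeout value are illustrative assumptions, not part of IML or Lustre. A process stuck inside the kernel may not die when killed, so the helper deliberately does not wait for it after the timeout.

```python
import subprocess


def command_returns(cmd, timeout_s):
    """Return (True, exit_code) if cmd exits within timeout_s seconds,
    otherwise (False, None).

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return True, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Request termination but do not wait for it: a mount blocked in the
        # kernel may not exit until the underlying lockup clears.
        proc.kill()
        return False, None


# Example call (commented out so the sketch has no side effects); the
# 300 second limit is an arbitrary assumption:
# command_returns(
#     ["mount", "-t", "lustre",
#      "/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk5", "/mnt/fs1-OST0000"],
#     timeout_s=300,
# )
```

A result of `(False, None)` would correspond to the behaviour reported here for the OST mount.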

      The following gists show excerpts from the /var/log/messages log during instances of this type of failure (MDT mounting in DNE):

      https://gist.github.com/tanabarr/1adb35a7e7da2581be79df8f45417411
      https://gist.github.com/tanabarr/70d3bfa66c4fc474b82c7c02adcda511
      https://gist.github.com/tanabarr/9f54584621aacfdeb3899f59687cb918

      The last gist link is an extended excerpt giving more contextual log information regarding the attempted mounting of the MDTs and the subsequent CPU load warnings. The entire logfile for that failure instance (in addition to other IML related log files) is attached to this ticket.
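For working through the attached messages log, a small parser along these lines could pull out the soft lockup reports. It is only a sketch: the regular expression is derived from the single watchdog line quoted in this ticket, and the function name `find_soft_lockups` is a made-up helper, so other kernels or log formats may need a different pattern.

```python
import re

# Pattern based on the one watchdog line quoted in this ticket.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("seconds")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return hits


# Typical use, assuming the attachment has been saved locally as messages.txt:
# with open("messages.txt", errors="replace") as log:
#     for hit in find_soft_lockups(log):
#         print(hit)
```

Grouping the hits by task name would show whether the lockups are always attributed to the same thread.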

      Original IML ticket: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/108

      Attachments

        1. chroma-agent.log.txt
          1.57 MB
        2. chroma-agent-console.log.txt
          1.22 MB
        3. job_scheduler.log.txt
          8.02 MB
        4. messages.txt
          2.13 MB
        5. sysrq-t
          375 kB
        6. yum.log.txt
          147 kB

          Activity

            [LU-9725] Mount commands don't return for targets in LFS with DNE and 3 MDTs

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28356/
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ce8ca7d3564439285a56982430f380354b697f68

            simmonsja James A Simmons added a comment -

            I just tested this patch, and this is the bug that was preventing my debugfs port. Because lwp was not being fully unregistered, the debugfs kobjects were not being freed, so a second attempt to mount the MDT would fail because the debugfs files already existed. You can't register a debugfs file twice.
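The failure mode described in this comment can be illustrated with a toy model. This is not Lustre or debugfs code: `ToyRegistry` and `mount_cycle` are hypothetical names, and the sketch only shows why a registration that is not always undone makes a second mount fail.

```python
class ToyRegistry:
    """Toy stand-in for a namespace that rejects duplicate entries."""

    def __init__(self):
        self._entries = set()

    def register(self, name):
        if name in self._entries:
            raise FileExistsError(name)
        self._entries.add(name)

    def unregister(self, name):
        self._entries.discard(name)


def mount_cycle(registry, name, always_deregister):
    """Register an entry at mount time and optionally remove it at unmount."""
    registry.register(name)
    # ... the target would be in use here ...
    if always_deregister:
        registry.unregister(name)


leaky = ToyRegistry()
mount_cycle(leaky, "lwp-entry", always_deregister=False)
try:
    mount_cycle(leaky, "lwp-entry", always_deregister=False)
    second_mount_ok = True
except FileExistsError:
    second_mount_ok = False  # the stale entry blocks the second mount

clean = ToyRegistry()
mount_cycle(clean, "lwp-entry", always_deregister=True)
mount_cycle(clean, "lwp-entry", always_deregister=True)  # succeeds both times
```

In the toy model, unconditional cleanup is what lets the second cycle succeed, which matches the intent suggested by the patch subject "always deregister lwp".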

            brian Brian Murrell (Inactive) added a comment -

            laisiyao: Sure. I will do a cherry-pick to b2_10 and test from there.

            gerrit Gerrit Updater added a comment -

            Brian J. Murrell (brian.murrell@intel.com) uploaded a new patch: https://review.whamcloud.com/28357
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 728454ee8f85d3074124933d0d83e42f10515500
            laisiyao Lai Siyao added a comment -

            Brian, I uploaded a patch, could you help verify it?

            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28356
            Subject: LU-9725 quota: always deregister lwp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1a5df6bc53e8356d8ae83d4031ea4397ebc03af3

            brian Brian Murrell (Inactive) added a comment -

            laisiyao: No problem. Let me know if there is anything else I can do to help.
            laisiyao Lai Siyao added a comment -

            Thanks Brian, I'll continue checking the code.

            brian Brian Murrell (Inactive) added a comment - - edited

            "Brian, is it true that 2.10.0 includes this fix?"

            No, 2.10.0 does not include this fix. b2_10 does contain it, though, and that is what we are testing with. You can see from the comment above that we tested with 2.10.0_5_gbb3c407, which is a build with this commit in it which is 5 commits newer than the landed patch for this ticket. So we definitely did test the patch from this ticket.

            "Even tag 2.10.50 doesn't include this; can you test with a master build?"

            We cannot test with master due to issues that Jenkins has trying to be an HTTP server. But given the above, we really shouldn't need to test master, since we have tested the patch on b2_10.
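Build strings such as 2.10.0_5_gbb3c407 can be split into their parts when checking which build was tested. The sketch below assumes the string follows a `git describe`-like layout of base tag, commit count and abbreviated commit hash, joined by underscores. That layout is an assumption, not something stated in this ticket, and `parse_build_version` is a hypothetical helper.

```python
import re

# Assumed layout: <base tag>_<commit count>_g<abbreviated hash>
BUILD_RE = re.compile(
    r"^(?P<tag>\d+(?:\.\d+)*)_(?P<commits>\d+)_g(?P<sha>[0-9a-f]+)$"
)


def parse_build_version(version):
    """Split a build string into (tag, commit_count, short_hash).

    Returns None when the string does not match the assumed layout.
    """
    match = BUILD_RE.match(version)
    if match is None:
        return None
    return match.group("tag"), int(match.group("commits")), match.group("sha")


parsed = parse_build_version("2.10.0_5_gbb3c407")
```

The short hash can then be compared against the branch history, for example with `git merge-base --is-ancestor`, to confirm whether a given commit is contained in the build.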

            laisiyao Lai Siyao added a comment -

            Brian, is it true that 2.10.0 includes this fix? Even tag 2.10.50 doesn't include this; can you test with a master build?

            pjones Peter Jones added a comment -

            Lai

            Could you please advise on this one?

            Thanks

            Peter

            People

              Assignee: laisiyao Lai Siyao
              Reporter: tanabarr Tom Nabarro (Inactive)
              Votes: 0
              Watchers: 10
