Lustre / LU-15742

lockup in LNetMDUnlink during filesystem migration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Environment: TOSS 3.7-19 based
      RHEL kernel 3.10.0-1160.59.1
      lustre 2.12.8_6.llnl
    • Severity: 3

    Description

      We are having significant LNet issues that have caused us to disable Lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

      When the routers for catalyst are on, we see many errors and have connectivity problems on multiple clusters.

      This ticket may be useful for explaining our LNet setup: https://jira.whamcloud.com/browse/LU-15234

      UPDATE: The initial issues have been resolved; our clusters and file systems are working, and we no longer have to turn off clusters and/or routers. This ticket is now focused on the stack trace containing LNetMDUnlink() as a possible root cause. The OS update and underlying network issues we had appear to have been confounders.

      Related to https://jira.whamcloud.com/browse/LU-11895

      Attachments

        1. call-2022-4-19.tar.gz
          943 kB
        2. console.catalyst153
          1.96 MB
        3. console.orelic.tar.gz
          368 kB
        4. console.orelic2
          430 kB
        5. console.pascal128
          1.73 MB
        6. console.zrelic.tar.gz
          304 kB
        7. console.zrelic2
          485 kB
        8. lustre_network_updated.jpg
          202 kB
        9. opensm.orelic.log.gz
          132 kB
        10. opensm.zrelic.log.gz
          194 kB
        11. pascal128-vmcore-dmesg.txt
          772 kB
        12. pfstest-nodes.tar.gz
          7.51 MB

        Activity

          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21 [ JFC-21 ]
          ofaaland Olaf Faaland made changes -
          Description Original: We are having significant lnet issues that have caused us to disable lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

          When the routers for catalyst are on we see lots of errors and have connectivity problems on multiple clusters.

          This ticket may be useful to explain our lnet setup. https://jira.whamcloud.com/browse/LU-15234

          UPDATE: The initial issue have been resolved and our clusters and file systems are working and we don't have to turn off clusters and/or routers anymore. This ticket is now focused on the LNetMDUnlink() containing stack trace as a possible root cause. The OS update and underlying network issues we had seem to have been confounders.
          New: We are having significant lnet issues that have caused us to disable lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

          When the routers for catalyst are on we see lots of errors and have connectivity problems on multiple clusters.

          This ticket may be useful to explain our lnet setup. https://jira.whamcloud.com/browse/LU-15234

          UPDATE: The initial issue have been resolved and our clusters and file systems are working and we don't have to turn off clusters and/or routers anymore. This ticket is now focused on the LNetMDUnlink() containing stack trace as a possible root cause. The OS update and underlying network issues we had seem to have been confounders.

          Related to https://jira.whamcloud.com/browse/LU-11895
          defazio Gian-Carlo Defazio made changes -
          Labels Original: llnl topllnl New: llnl
          defazio Gian-Carlo Defazio made changes -
          Summary Original: lnet router problems resulting in disabling cluster New: lockup in LNetMDUnlink during filesystem migration
          defazio Gian-Carlo Defazio made changes -
          Description Original: We are having significant lnet issues that have caused us to disable lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

          When the routers for catalyst are on we see lots of errors and have connectivity problems on multiple clusters.

          This ticket may be useful to explain our lnet setup. https://jira.whamcloud.com/browse/LU-15234

           
          New: We are having significant lnet issues that have caused us to disable lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

          When the routers for catalyst are on we see lots of errors and have connectivity problems on multiple clusters.

          This ticket may be useful to explain our lnet setup. https://jira.whamcloud.com/browse/LU-15234

          UPDATE: The initial issue have been resolved and our clusters and file systems are working and we don't have to turn off clusters and/or routers anymore. This ticket is now focused on the LNetMDUnlink() containing stack trace as a possible root cause. The OS update and underlying network issues we had seem to have been confounders.
          defazio Gian-Carlo Defazio made changes -
          Attachment New: pascal128-vmcore-dmesg.txt [ 43360 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: console.pascal128 [ 43359 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: pfstest-nodes.tar.gz [ 43312 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: call-2022-4-19.tar.gz [ 43311 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: opensm.zrelic.log.gz [ 43310 ]

          People

            Assignee: ssmirnov Serguei Smirnov
            Reporter: defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated: