Lustre / LU-16146

after dropping mgs/mdt0 for test: mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Environment: TOSS 4.4-4.1
      4.18.0-372.19.1.1toss.t4.x86_64
      lustre 2.15.0_3.llnl
    • Severity: 3

    Description

      After a test that involved powering off the MGS/MDT0 node of the test cluster garter, the MDTs fail to come back up. They are stuck in "WAITING".

      An IOR test was in progress when the MGS node was powered off.

      Failover seemed to work, but the IOR job was unable to finish. The cluster was eventually rebooted.

      After the reboot, the logs show that all 4 MDTs are failing an lfsck_start with a line such as:

      2022-09-07 14:35:12 [ 5791.188613] Lustre: 32474:0:(mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6
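To check whether every MDT reports the same code, the rc value can be pulled out of each console line. The following is a minimal sketch in C, assuming each line of interest ends with an "rc = N" field as in the message above. parse_rc() and parse_rc_self_check() are hypothetical helper names and are not part of Lustre.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Lustre function): extract the value of the
 * last "rc = N" field in a console log line.
 * Returns 0 and stores N in *rc on success, -1 if no such field exists.
 */
static int parse_rc(const char *line, int *rc)
{
	const char *last = NULL;
	const char *scan = line;

	/* Walk forward so that 'last' ends up at the final "rc = " match. */
	while ((scan = strstr(scan, "rc = ")) != NULL) {
		last = scan;
		scan += 5;	/* length of "rc = " */
	}
	if (last == NULL)
		return -1;

	return sscanf(last, "rc = %d", rc) == 1 ? 0 : -1;
}

/* Small self-check using the message text quoted in this ticket. */
static void parse_rc_self_check(void)
{
	int rc = 0;

	assert(parse_rc("lflood-MDT0000: auto trigger paused LFSCK failed: "
			"rc = -6", &rc) == 0);
	assert(rc == -6);
	assert(parse_rc("no return code here", &rc) == -1);
}
```

Calling parse_rc() on the matching line from each MDT's console log would show whether all four report -6 or whether some fail with a different code.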

      I believe that the -6 (-ENXIO) is coming from lfsck_start():

              lfsck = lfsck_instance_find(key, true, false);                                                           
              if (unlikely(lfsck == NULL))                                                                             
                      RETURN(-ENXIO);                                                                                  
                                                                                                                       
              if (unlikely(lfsck->li_stopping))                                                                        
                      GOTO(put, rc = -ENXIO); 
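Both early exits in the excerpt return the same code, so the log line alone does not show whether the LFSCK instance was missing or was in the process of stopping. The following is a small userspace model of that logic, offered only as a sketch. struct model_lfsck and model_lfsck_precheck() are hypothetical stand-ins, not the real Lustre definitions, and the check that ENXIO equals 6 assumes Linux errno numbering.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the LFSCK instance; only the one field
 * used in the excerpt is modeled. */
struct model_lfsck {
	int li_stopping;
};

/*
 * Models the two early exits shown in the excerpt.
 * A NULL pointer plays the role of lfsck_instance_find() finding nothing.
 */
static int model_lfsck_precheck(const struct model_lfsck *lfsck)
{
	if (lfsck == NULL)
		return -ENXIO;	/* no instance found */

	if (lfsck->li_stopping)
		return -ENXIO;	/* instance exists but is stopping */

	return 0;		/* neither early exit taken */
}

/* Self-check: both failure paths yield the rc seen in the log. */
static void model_self_check(void)
{
	struct model_lfsck stopping = { .li_stopping = 1 };

	assert(ENXIO == 6);	/* assumption: Linux errno numbering */
	assert(model_lfsck_precheck(NULL) == -6);
	assert(model_lfsck_precheck(&stopping) == -6);
}
```

Because the two paths are indistinguishable by return code, telling them apart would need additional information, for example a separate debug message on each path.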
      

      For our reference, our local ticket is TOSS5875.

      Attachments

        Activity

          defazio Gian-Carlo Defazio made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Open [ 1 ] New: Closed [ 6 ]
          defazio Gian-Carlo Defazio made changes -
          Labels Original: llnl topllnl New: llnl
          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21 [ JFC-21 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: 2022-09-07_console.garter2.gz [ 45660 ]
          defazio Gian-Carlo Defazio made changes -
          Attachment New: 2022-09-07_console.garter1.gz [ 45659 ]
          ofaaland Olaf Faaland made changes -
          Labels New: llnl topllnl
          pjones Peter Jones made changes -
          Link New: This issue is related to JFC-21 [ JFC-21 ]

          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Gian-Carlo Defazio (defazio)
            Votes: 0
            Watchers: 5
