Lustre / LU-16146

after dropping mgs/mdt0 for test: mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.15.0
    • TOSS 4.4-4.1
      4.18.0-372.19.1.1toss.t4.x86_64
      lustre 2.15.0_3.llnl
    • 3

    Description

      After doing a test that involved turning off the mgs/mdt0 node for the test cluster garter, the mdts fail to come back up.  They are stuck "WAITING".

      When the mgs node was powered off, an ior test was in progress.

      Failover seemed to work, but the ior job was unable to finish. The cluster was eventually rebooted.

      After the reboot, the logs show that all 4 mdts are failing an lfsck_start with a line such as

      2022-09-07 14:35:12 [ 5791.188613] Lustre: 32474:0:(mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6

      I believe that the -6 (-ENXIO) is coming from one of these checks in lfsck_start():

              lfsck = lfsck_instance_find(key, true, false);
              if (unlikely(lfsck == NULL))
                      RETURN(-ENXIO);

              if (unlikely(lfsck->li_stopping))
                      GOTO(put, rc = -ENXIO);
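
To double-check the reading of "rc = -6" above, here is a minimal userspace sketch that maps a kernel-style negative return code to an errno name. The helpers rc_errno_name() and describe_rc() are hypothetical names made up for this illustration and are not part of Lustre; only a few errno values are listed, and the numeric values are those of Linux.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Lustre: name a few errno values that a
 * kernel-style negative return code might carry. Only a handful are listed.
 */
const char *rc_errno_name(int rc)
{
    if (rc >= 0)
        return "not an error";

    switch (rc) {
    case -ENOENT:
        return "ENOENT";
    case -ENXIO:
        return "ENXIO";
    case -ENODEV:
        return "ENODEV";
    default:
        return "unlisted";
    }
}

/*
 * Hypothetical helper: write "rc = <n> -> <NAME> (<strerror text>)" into buf.
 * Kernel-style error codes lie in [-4095, -1]; anything else is passed to
 * strerror() as 0 so that we never negate an out-of-range value.
 */
int describe_rc(int rc, char *buf, size_t len)
{
    int err = (rc < 0 && rc >= -4095) ? -rc : 0;

    return snprintf(buf, len, "rc = %d -> %s (%s)", rc,
                    rc_errno_name(rc), strerror(err));
}
```

On a Linux/glibc system, calling describe_rc(-6, buf, sizeof(buf)) should fill buf with "rc = -6 -> ENXIO (No such device or address)", which is consistent with both -ENXIO paths in the lfsck_start() excerpt. It does not tell us which of the two checks was hit.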
      

      For our reference, our local ticket is TOSS5875

        Activity

          defazio Gian-Carlo Defazio added a comment -

          We were unable to reproduce this when bringing up our file system, either manually or using pacemaker.

          defazio Gian-Carlo Defazio added a comment -

          pjones
          Sorry for the delay.
          The system I'd need to generate those logs on, garter, is being used for higher-priority work right now. It's using new MDTs and OSTs. I have the old MDTs that failed on lfsck_start saved, but I can't switch back to them right now.

          So yes, I can get logs for trying to start up the MDTs, but I'm not sure when.
          pjones Peter Jones added a comment -

          defazio will you be able to gather this additional debug info?

          laisiyao Lai Siyao added a comment -

          I'll review the related code. It would help if you could enable tracing with "lctl set_param debug=+trace" and dump the debug log on the MDS.

          defazio Gian-Carlo Defazio added a comment -

          I've added the console logs for garter[1,2]

          defazio Gian-Carlo Defazio added a comment -

          For the failover, I could see pacemaker moving MGS and MDT0 from garter1 to garter2 after garter1 was turned off, then ltop showed MDT1 running on garter2.
          ofaaland Olaf Faaland added a comment - - edited

          Gian-Carlo,

          Can you describe what you saw that gave the impression that "Failover seemed to work"? I.e., did the MGS and MDTs mount but never exit recovery? Also, please attach the console logs for the garter nodes that were hosting the MGS and MDT0000 at any point during this sequence of events.

          thanks

          pjones Peter Jones added a comment -

          Lai

          Could you please advise on this one?

          Thanks

          Peter


          People

            laisiyao Lai Siyao
            defazio Gian-Carlo Defazio