Fix for LU-10826 is problematic and skips recovery

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.12.0
    • None
    • 2
    • 9223372036854775807

    Description

      I think the patch https://review.whamcloud.com/#/c/31690/ for LU-10826 is itself problematic.
      After applying the patch https://review.whamcloud.com/#/c/31690/ and setting test_req_buffer_pressure=1, the OOM is prevented, but some clients are skipped during recovery.

      [root@voss05 ~]#  lctl get_param obdfilter.*.recovery_status
      obdfilter.scratch-OST0024.recovery_status=
      status: COMPLETE
      recovery_start: 1525317355
      recovery_duration: 54
      completed_clients: 7249/7249
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST0025.recovery_status=
      status: COMPLETE
      recovery_start: 1525317353
      recovery_duration: 56
      completed_clients: 7031/7031
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST0026.recovery_status=
      status: COMPLETE
      recovery_start: 1525317352
      recovery_duration: 57
      completed_clients: 8168/8168
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST0027.recovery_status=
      status: COMPLETE
      recovery_start: 1525317350
      recovery_duration: 59
      completed_clients: 8195/8195
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST0028.recovery_status=
      status: COMPLETE
      recovery_start: 1525317355
      recovery_duration: 54
      completed_clients: 7984/7984
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST0029.recovery_status=
      status: COMPLETE
      recovery_start: 1525317352
      recovery_duration: 57
      completed_clients: 7985/7985
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST002a.recovery_status=
      status: COMPLETE
      recovery_start: 1525317354
      recovery_duration: 55
      completed_clients: 8329/8329
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST002b.recovery_status=
      status: COMPLETE
      recovery_start: 1525317351
      recovery_duration: 58
      completed_clients: 8291/8291
      replayed_requests: 0
      last_transno: 98784247808
      VBR: DISABLED
      IR: ENABLED
      obdfilter.scratch-OST002c.recovery_status=
      status: COMPLETE
      recovery_start: 1525317350
      recovery_duration: 59
      completed_clients: 8286/8286
      replayed_requests: 0
      last_transno: 94489280512
      VBR: DISABLED
      IR: ENABLED
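
      To compare the per-OST figures above without reading them by eye, the output can be parsed into a small table. The following is a minimal sketch, not part of the original report. It assumes the text format shown above, and it assumes that the second number in completed_clients is the count of clients the target expected to recover. The function names parse_recovery_status and client_counts are hypothetical, and the sample is a shortened copy of the first two targets.

```python
import re

def parse_recovery_status(text):
    """Parse `lctl get_param obdfilter.*.recovery_status` output into
    {target_name: {field: value}}. Assumes the layout shown in this report."""
    targets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"^obdfilter\.(\S+)\.recovery_status=$", line)
        if header:
            current = targets.setdefault(header.group(1), {})
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return targets

def client_counts(targets):
    """Return {target: (completed, expected)} from completed_clients.
    Assumption: the field is 'completed/expected'."""
    counts = {}
    for name, fields in targets.items():
        completed, _, expected = fields.get("completed_clients", "0/0").partition("/")
        counts[name] = (int(completed), int(expected))
    return counts

# Shortened sample copied from the output above.
sample = """\
obdfilter.scratch-OST0024.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7249/7249
obdfilter.scratch-OST0025.recovery_status=
status: COMPLETE
recovery_start: 1525317353
recovery_duration: 56
completed_clients: 7031/7031
"""

counts = client_counts(parse_recovery_status(sample))
for name, (done, expected) in sorted(counts.items()):
    print(f"{name}: {done}/{expected}")

# Every target reports completed == expected, yet the expected totals differ.
distinct_expected = {expected for _, expected in counts.values()}
print("expected-client counts differ across OSTs:", len(distinct_expected) > 1)
```

      On this sample each target reports as many completed clients as it expected, while the expected totals differ between targets. That difference alone does not show whether clients were skipped or simply never connected to a given OST.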
      

      Also, recovery is sometimes never triggered at all, e.g. in a failover situation.
      I see the following messages after restarting the OSTs:

      [ 9169.158440] Lustre: 14598:0:(events.c:368:request_in_callback()) All ost request buffers busy
      [ 9169.158447] Lustre: 14598:0:(events.c:368:request_in_callback()) Skipped 3508 previous similar messages
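
      To get a rough idea of how often the buffers were reported busy, the printed and rate-limited messages can be tallied from the console log. This is a minimal sketch, not part of the original report. It assumes that a "Skipped N previous similar messages" line from request_in_callback() refers to the same "All ost request buffers busy" message. The function name count_busy_events is hypothetical.

```python
import re

BUSY = re.compile(r"All ost request buffers busy")
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages")

def count_busy_events(log_text):
    """Return (printed, suppressed) counts for the busy-buffer message.
    Assumption: 'Skipped N' lines from request_in_callback() refer to it."""
    printed = 0
    suppressed = 0
    for line in log_text.splitlines():
        if "request_in_callback()" not in line:
            continue
        if BUSY.search(line):
            printed += 1
        match = SKIPPED.search(line)
        if match:
            suppressed += int(match.group(1))
    return printed, suppressed

# The two console lines quoted above.
log = """\
[ 9169.158440] Lustre: 14598:0:(events.c:368:request_in_callback()) All ost request buffers busy
[ 9169.158447] Lustre: 14598:0:(events.c:368:request_in_callback()) Skipped 3508 previous similar messages
"""

printed, suppressed = count_busy_events(log)
print(f"printed={printed} suppressed={suppressed}")
```

      A tally like this only measures how noisy the condition was. It does not by itself indicate whether recovery was affected.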
      

          Activity

            [LU-10993] Fix for LU-10826 is problematic and skips recovery
            pjones Peter Jones added a comment -

            Descoping from 2.12 for now as there is not enough to work on. We can certainly continue to work this as soon as there is some more data available

            pjones Peter Jones added a comment -

            Mike

            Could you please assess this situation?

            Thanks

            Peter


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Shuichi,
            Just a small update to let you know that all attempts to reproduce this problem have been unsuccessful so far.
            BTW, did you find some time to reproduce it again on your side, in order to provide the information I requested before?

            bfaccini Bruno Faccini (Inactive) added a comment -

            > ok, let me know what exact information do you need.
            Well, as I already indicated in my previous comment: "can you try to reduce the test to a minimal subset of OSS's OSTs and connected clients, and then take a full Lustre debug log on the OSS and the clients? I would like to get at least the trace from the OSS and from both a successful and a failed client."
            ihara Shuichi Ihara (Inactive) added a comment - - edited

            ok, let me know what exact information do you need.
            At least the situation where recovery is not re-triggered should be easy to reproduce.

            bfaccini Bruno Faccini (Inactive) added a comment -

            I know it is not a simple task, but as you seem to be able to reproduce easily, can you try to reduce the test to a minimal subset of OSS's OSTs and connected clients, and then take a full Lustre debug log on the OSS and the clients? I would like to get at least the trace from the OSS and from both a successful and a failed client.
            In the meantime I will try to reproduce on a test platform.

            ihara Shuichi Ihara (Inactive) added a comment -

            Yes, and I've checked the client side: the clients didn't connect to the OST even though the recovery status is reported as completed.
            Another problem: there are 40 OSSs here, and some of them triggered recovery, but many still did not.
            Actually, if we unmount the OSTs and remount them on those OSSs, recovery is re-triggered, but not all clients recover.
            The patch https://review.whamcloud.com/#/c/31690/ might be incomplete.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Shuichi,
            Why do you think that some clients are being missed during recovery? Is it because you expect the number of completed_clients to be the same for each OST?

            Also, the "(events.c:368:request_in_callback()) All ost request buffers busy" message is expected to occur when running with test_req_buffer_pressure=1.

            People

              tappro Mikhail Pershin
              ihara Shuichi Ihara (Inactive)