Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7073

racer with OST object migration hangs on cleanup

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.7.0
    • Fix Version/s: None
    • Labels:
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      Pulling in file migration patch into my tree (pretty much master) and running racer in a loop makes racer not being able to stop lustre on second or so iteration.

      What I see in the logs:

      [ 8420.766878] Lustre: DEBUG MARKER: == racer test complete, duration 2754 sec == 00:07:26 (1440994046)
      [ 8420.967636] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
      [ 8421.140263] Lustre: 27652:0:(llite_lib.c:2617:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.10.210@tcp:/lustre/fid: [0x200000401:0x1434e:0x0]/ may get corrupted (rc -108)
      [ 8421.141516] Lustre: 27650:0:(llite_lib.c:2617:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.10.210@tcp:/lustre/fid: [0x200000402:0x1418e:0x0]/ may get corrupted (rc -108)
      [ 8421.141559] Lustre: 27651:0:(llite_lib.c:2617:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.10.210@tcp:/lustre/fid: [0x200000401:0x1433e:0x0]/ may get corrupted (rc -108)
      [ 8421.441181] Lustre: Unmounted lustre-client
      [ 8421.587773] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
      [ 8421.589214] Lustre: Skipped 2 previous similar messages
      [ 8421.770504] Lustre: 27654:0:(llite_lib.c:2617:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.10.210@tcp:/lustre/fid: [0x200000402:0x1414f:0x0]/ may get corrupted (rc -108)
      [ 8422.089900] Lustre: Unmounted lustre-client
      [ 8422.692354] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 0@lo failed: rc = -107
      [ 8422.692393] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 8422.692573] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
      [ 8422.696661] LustreError: Skipped 1 previous similar message
      [ 8422.728159] LustreError: 27652:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880017a6dce8 x1510988910285696/t0(0) o13->lustre-OST0000-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [ 8425.805649] LustreError: 27651:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880036aa3ce8 x1510988910285704/t0(0) o6->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [ 8425.807388] LustreError: 27651:0:(client.c:1138:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      [ 8427.692280] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
      [ 8427.693352] Lustre: Skipped 2 previous similar messages
      [ 8428.654000] Lustre: server umount lustre-MDT0000 complete
      [ 8429.084825] LustreError: 7702:0:(ldlm_resource.c:835:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x29b0:0x0:0x0].0 (ffff880093089ef8) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 8429.086152] LustreError: 7702:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x29b0:0x0:0x0].0 (ffff880093089ef8) refcount = 2
      [ 8434.084126] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8439.084079] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8439.692145] Lustre: 27655:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994058/real 1440994058]  req@ffff88009fac0ce8 x1510988910286992/t0(0) o400->MGC192.168.10.210@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1440994065 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8439.694855] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [ 8444.084112] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8445.696141] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994065/real 1440994065]  req@ffff880033d4cce8 x1510988910287004/t0(0) o250->MGC192.168.10.210@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1440994071 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8448.696152] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994058/real 1440994058]  req@ffff88000a215ce8 x1510988910286996/t0(0) o38->lustre-MDT0000-lwp-OST0001@0@lo:12/10 lens 520/544 e 0 to 1 dl 1440994074 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8449.084099] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8454.084088] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8460.696064] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994075/real 1440994075]  req@ffff88004cc54ce8 x1510988910287008/t0(0) o250->MGC192.168.10.210@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1440994086 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8464.084075] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8464.085092] LustreError: Skipped 1 previous similar message
      [ 8470.696064] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994075/real 1440994075]  req@ffff8800a10a5ce8 x1510988910287012/t0(0) o38->lustre-MDT0000-lwp-OST0001@0@lo:12/10 lens 520/544 e 0 to 1 dl 1440994096 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8480.696143] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994090/real 1440994090]  req@ffff88000e31bce8 x1510988910287016/t0(0) o250->MGC192.168.10.210@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1440994106 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8484.084065] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8484.084866] LustreError: Skipped 3 previous similar messages
      [ 8499.696134] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994100/real 1440994100]  req@ffff88000a789ce8 x1510988910287020/t0(0) o38->lustre-MDT0000-lwp-OST0001@0@lo:12/10 lens 520/544 e 0 to 1 dl 1440994125 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8519.084087] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8519.084900] LustreError: Skipped 6 previous similar messages
      [ 8534.696118] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994135/real 1440994135]  req@ffff8800186e6ce8 x1510988910287032/t0(0) o250->MGC192.168.10.210@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1440994160 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8534.697994] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 8584.084096] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8584.084930] LustreError: Skipped 12 previous similar messages
      [ 8619.696078] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994220/real 1440994220]  req@ffff88000a789ce8 x1510988910287052/t0(0) o38->lustre-MDT0000-lwp-OST0001@0@lo:12/10 lens 520/544 e 0 to 1 dl 1440994245 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8619.697726] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      [ 8714.084086] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      [ 8714.085350] LustreError: Skipped 25 previous similar messages
      [ 8769.696090] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440994370/real 1440994370]  req@ffff8800aebeece8 x1510988910287092/t0(0) o38->lustre-MDT0000-lwp-OST0001@0@lo:12/10 lens 520/544 e 0 to 1 dl 1440994395 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 8769.698845] Lustre: 27649:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      [ 8974.084085] LustreError: 0-0: Forced cleanup waiting for filter-lustre-OST0000_UUID namespace with 1 resources in use, (rc=-110)
      

      So it looks like there was a leaked lock somewhere along the way.

      Comparing a succesful run vs unsuccesful, this message pops out in the unsuccesful run:

      [ 7616.286047] LustreError: 32525:0:(ldlm_request.c:452:ldlm_cli_enqueue_local()) ### delayed lvb init failed (rc -2) ns: filter-lustre-OST0000_UUID lock: ffff880094d41db8/0x96ec5ed20a281423 lrc: 2/0,0 mode: --/PR res: [0x29b0:0x0:0x0].0 rrc: 1 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 32525 timeout: 0 lvb_type: 0
      

      Resource seems to be matching stuck one, so I bet it's related.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                riauxjb Jean-Baptiste Riaux (Inactive)
                Reporter:
                green Oleg Drokin
              • Votes:
                0 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated: