LU-5805: tgt_recov blocked and "waking for gap in transno"

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Severity: 3
    • 16286

    Description

      We are testing our 2.5.3-based branch using osd-zfs. The clients, LNet router, and server nodes all had version 2.5.3-1chaos installed (see github.com/chaos/lustre).

      On the recommendation from LU-5803, I made a test build of Lustre consisting of 2.5.3-1chaos + http://review.whamcloud.com/12365.

      I installed this build on the servers only. At the time, we had the SWL IO test running (a mixture of ior, mdtest, simul, etc., all running at the same time).

      I then rebooted just the servers onto the test build. The OSS nodes showed many startup error messages, a number of which we did not see when running without this new patch. Granted, it was just one attempt.

      See the attached file named simply "log". It is the console log from one of the OSS nodes.

      Here's my initial view of what is going on:

      The OSS nodes boot significantly faster than the MGS/MDS node. We have retry set to 32. I suspect that this noise is related to the MGS not yet having started:

      2014-10-24 14:21:38 LustreError: 7421:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fef274800 x1482881169883144/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.169@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2014-10-24 14:21:38 LustreError: 7421:0:(obd_mount_server.c:1120:server_register_target()) lcy-OST0001: error registering with the MGS: rc = -5 (not fatal)
      
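      For reference, the "retry set to 32" above is the mount retry count on the server targets. A minimal sketch of what that looks like, assuming it is passed as the mount.lustre retry= option, which retries the mount attempt; the device and mount point names below are made up, and on our systems the actual mounts are driven by the init scripts:

      # hypothetical example: mount an osd-zfs OST, retrying the mount attempt up to 32 times
      mount -t lustre -o retry=32 lcy-ost1/ost1 /mnt/lustre/local/lcy-OST0001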

      The MGS/MDS node doesn't start mounting the MGS and MDS devices until 14:25:47 and 14:25:47, respectively.

      The MDS enters recovery at this time:

      2014-10-24 14:26:23 zwicky-lcy-mds1 login: Lustre: lcy-MDT0000: Will be in recovery for at least 5:00, or until 134 clients reconnect.
      
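      For reference, recovery progress can be watched from the servers while this is happening; a minimal sketch using the recovery_status parameters (the target names are the ones from this filesystem):

      # on the MDS
      lctl get_param mdt.lcy-MDT0000.recovery_status
      # on an OSS
      lctl get_param obdfilter.lcy-OST*.recovery_status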

      So there are at least 4 problems here. We may need to split them up into separate subtickets:

      1. OSS noise before MGS/MDS has started
      2. tgt_recov "blocked for more than 102 seconds" (some of the 16 OSS nodes did this)
      3. "waking for gap in transno", the MDS and some of the OSS nodes show a swath of these
      4. Many OSS nodes hit "2014-10-24 14:36:23 LustreError: 7479:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel" within a few minutes of recovery being complete


            People

              Assignee: Oleg Drokin (green)
              Reporter: Christopher Morrone (Inactive) (morrone)
