
mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid XXX has mismatched opc: new 101 old 0

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.4.0
    • Lustre 2.3.0
    • LLNL Hyperion, CHAOS 5 servers/clients, Lustre 2.2.92
    • 3
    • 6355

    Description

      Running SWL tests, a mix of various workloads (IOR, mdtest, simul, mir, fdtree).
      We are seeing this sequence repeatedly. A Lustre dump was sent to the FTP site; file name: lu-1442.dump.gz

      Aug 7 09:51:29 ehyperion-rst6 kernel: LustreError: 29701:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid 1409327760945603 has mismatched opc: new 101 old 0
      Aug 7 09:51:29 ehyperion-rst6 kernel: LustreError: 29701:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 5 previous similar messages
      Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 29701:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff8802d1a96000 x1409327760945603.t537972417766 o0 NID 192.168.117.9@o2ib1
      Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 29701:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Skipped 5 previous similar messages
      Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 4710:0:(service.c:2095:ptlrpc_handle_rs()) All locks stolen from rs ffff8802d1a96000 x1409327760945603.t537972417766 o0 NID 192.168.117.9@o2ib1

          Activity

            Ian Colle (Inactive) added a comment -

            Patch landed

            Li Wei (Inactive) added a comment -

            http://review.whamcloud.com/4271

            This does what Oleg suggested.
            Oleg Drokin added a comment -

            I don't really know how you are losing the messages.

            The resent xid can only occur if a reply was not seen by a client and the client decided to resend the message (there probably should be a client-side message about that too).
            The specific message you see can only happen when that lost reply was a so-called "difficult" reply, where a lock is being returned to the client.
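            The resend path described in this comment can be sketched roughly as follows. This is a minimal illustration, not the Lustre code: `struct saved_reply`, `enum resend_result`, and `handle_resend()` are hypothetical names, and the assumption is that the server keeps each "difficult" reply (with its locks) until the client acknowledges it, then matches a resent request against those saved replies by xid. As in the log above, an opcode mismatch is only reported; the locks are taken over either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a saved "difficult" reply; this is not the
 * real Lustre reply-state structure. */
struct saved_reply {
    uint64_t xid;    /* xid of the request this reply answered */
    uint32_t opc;    /* opcode recorded when the reply was saved */
    int      nlocks; /* locks held until the client acks the reply */
};

enum resend_result {
    RESEND_NO_MATCH,     /* no saved reply carries this xid */
    RESEND_OPC_MISMATCH, /* xid matched, but the recorded opcode differs */
    RESEND_LOCKS_STOLEN  /* xid and opcode matched */
};

/* Match a resent request against the saved replies by xid. On a match the
 * locks move to the new request, whether or not the opcodes agree. */
static enum resend_result
handle_resend(struct saved_reply *saved, size_t n,
              uint64_t req_xid, uint32_t req_opc, int *stolen)
{
    assert(saved != NULL && stolen != NULL);
    *stolen = 0;

    for (size_t i = 0; i < n; i++) {
        enum resend_result rc = RESEND_LOCKS_STOLEN;

        if (saved[i].xid != req_xid)
            continue;

        /* This is where "mismatched opc: new N old M" would be logged. */
        if (saved[i].opc != req_opc)
            rc = RESEND_OPC_MISMATCH;

        /* Take over the locks from the saved reply. */
        *stolen = saved[i].nlocks;
        saved[i].nlocks = 0;
        return rc;
    }
    return RESEND_NO_MATCH;
}
```

            With a saved reply for xid 1409327760945603 whose recorded opcode is 0, a resend carrying opcode 101 would report the mismatch and still take over the one saved lock, which is the sequence shown in the log.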

            Christopher Morrone (Inactive) added a comment -

            Oleg, if your assumption about lost replies is correct, then I think we have a bigger problem here. We do not have lnet routers on Sequoia so we should have a reliable communication fabric.

            How are we losing messages so often?
            Oleg Drokin added a comment -

            You would see this message when a reply from the server to the client was lost and the client did a resend.
            The message is harmless (and wrong, and will be fixed).

            Jay Lan (Inactive) added a comment -

            We are seeing this repeatedly on our 2.1.2 and 2.1.3 servers, but I cannot pin it to a particular reported problem.

            Christopher Morrone (Inactive) added a comment -

            In particular, we are still seeing this in newer master code at 2.3.53-3chaos.

            Christopher Morrone (Inactive) added a comment -

            We are also seeing this repeatedly.
            Oleg Drokin added a comment -

            I think this might be a case of improper init:

            in target_send_reply() we have:

            rs->rs_opc = lustre_msg_get_opc(rs->rs_msg);

            rs->rs_msg does not seem to be properly initialized at that point (it points to a reply buffer that has not been filled in yet),

            so it probably should be lustre_msg_get_opc(req->rq_reqmsg);

            The Stealing.../Stolen... messages should be silenced.

            Technically, these messages should only be seen when a reply is lost en route to the client.
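            The suspected initialization problem can be illustrated with a small standalone sketch. Everything here is a simplified, hypothetical model (`struct msg`, `struct reply_state`, `struct request`, `msg_get_opc()`, and the two `save_opc_*` helpers are not the Lustre definitions). It also assumes the unfilled reply buffer happens to read as zero; a truly uninitialized buffer could hold anything, and zero is chosen only because it matches the "old 0" in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified message: only the opcode field is modeled. */
struct msg {
    uint32_t opc;
};

/* Hypothetical reply state: points at the reply buffer and remembers the
 * opcode so that a later resend can be compared against it. */
struct reply_state {
    struct msg *rs_msg; /* reply buffer, possibly not filled in yet */
    uint32_t    rs_opc; /* opcode saved for resend matching */
};

/* Hypothetical request: points at the message received from the client. */
struct request {
    struct msg *rq_reqmsg;
};

static uint32_t msg_get_opc(const struct msg *m)
{
    assert(m != NULL);
    return m->opc;
}

/* Suspect pattern: the opcode is read from the reply buffer, which may not
 * have been filled in when the reply state is saved. */
static void save_opc_from_reply(struct reply_state *rs)
{
    rs->rs_opc = msg_get_opc(rs->rs_msg);
}

/* Suggested pattern: the opcode is read from the request message, which is
 * already complete when the reply is being sent. */
static void save_opc_from_request(struct reply_state *rs,
                                  const struct request *req)
{
    rs->rs_opc = msg_get_opc(req->rq_reqmsg);
}
```

            In this model, saving the opcode from a zero-filled reply buffer records 0, so a resend with opcode 101 looks mismatched; saving it from the request records 101 and the comparison succeeds.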

            People

              liwei Li Wei (Inactive)
              cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6