LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64

    Description

      grove-mds1 crashed 2015-07-29 with the following LBUG:

      2015-07-29 03:05:17 LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG
      2015-07-29 03:05:17 Call Trace:
      2015-07-29 03:05:17 [<ffffffffa07b28f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-07-29 03:05:17 Jul 29 03:05:17 [<ffffffffa07b2ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-07-29 03:05:17 grove-mds1 kerne [<ffffffffa0fcf9d8>] mdt_handle_common+0x13d8/0x1470 [mdt]
      2015-07-29 03:05:17 l: LustreError:  [<ffffffffa100b625>] mds_regular_handle+0x15/0x20 [mdt]
      2015-07-29 03:05:17 50126:0:(mdt_han [<ffffffffa0b05095>] ptlrpc_server_handle_request+0x305/0xc00 [ptlrpc]
      2015-07-29 03:05:17 dler.c:3409:mdt_ [<ffffffffa07b352e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2015-07-29 03:05:17 recovery()) LBUG [<ffffffffa07c4845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      

      It was preceded by a ptlrpc debug message

      2015-07-29 03:05:17 Lustre:50126:0:(mdt_handler.c:4508:mdt_recovery()) @@@ rq_xid 15027...0684 matches last_xid, expected REPLAY or RESENT flag (0) req@ffff...d1400 x15027...0684/t0(0) o101->28e0...cc83@172.20.15.14@o2ib500:0/0 lens 4616/0 e 0 to 0 dl 1438165072 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      

      For this system, I cannot extract bulk logs and add them to the ticket. We do have a crash dump and console logs, so I can obtain specific information that would help.

      The MDS was under severe memory pressure at the time of the LBUG.

      The MDS was responding very slowly at the time. At 3:05:03 it appears to have dropped 84,316 timed out requests (output from one DEBUG_REQ() call from within ptlrpc_server_handle_request() appears in the console log, followed by Skipped 84315 previous similar messages).

        Activity

          [LU-6953] LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG
          pjones Peter Jones added a comment -

          ok thanks

          ofaaland Olaf Faaland added a comment -

          Mikhail,

          This has not occurred again, so go ahead and close it. If it happens again we can reopen it.

          thanks,
          Olaf


          tappro Mikhail Pershin added a comment -

          Were there other occurrences of this issue? There is not enough information to solve it. If it is happening regularly, it is possible to add more debug.
          ofaaland Olaf Faaland added a comment -

          Debug logs from the MDS crash dump had nothing obviously of interest. The XID in question did not occur in any message other than the one given above. The rest of the log messages are variations on request timed out, request took too long to process, etc.


          ofaaland Olaf Faaland added a comment -

          Mikhail, that XID doesn't appear in the client's console log. The client with NID 172.20.15.14@o2ib500 logged nothing but the expected "lost connection" and "connection restored" messages during the hour leading up to the LBUG. Unfortunately we have no additional information from the client.

          I'm extracting the Lustre debug logs from the crash dump and I'll check for that XID and post anything I find.

          tappro Mikhail Pershin added a comment -

          Olaf, is it possible to find more information about the request with that XID in the log? Especially in the client log, where the request was sent from. We have two possibilities here: either the request flag (RESENT or REPLAY) was dropped somehow, or the XID was assigned improperly. The client log may help determine whether that was a resent case or a normal request.

          morrone Christopher Morrone (Inactive) added a comment -

          He meant mostly that those things cannot be shared outside of the lab.
          green Oleg Drokin added a comment -

          So this is another case of "client sent us something that we don't understand, let's panic", I guess.
          We need to drop this LBUG at the very least as the first step and avoid this crash.
          Though I am not sure how the condition might arise at all.

          Also, what do you mean you cannot extract bulk logs? Is crash plus the module to extract the logs not working?
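
          As an illustration of the suggestion above (refuse the request rather than LBUG), here is a minimal sketch in C. It is not the Lustre source: the flag values, the enum, the function name sketch_classify, and the choice of -EPROTO as the error code are all hypothetical and exist only for this example. It assumes the check has the shape implied by the console message, namely that a request whose XID equals last_xid is expected to carry a RESENT or REPLAY flag.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch only; not taken from Lustre headers. */
#define SKETCH_MSG_RESENT 0x2u
#define SKETCH_MSG_REPLAY 0x4u

enum sketch_action {
    SKETCH_PROCESS_NEW,  /* XID differs from last_xid: handle as a new request */
    SKETCH_RECONSTRUCT,  /* XID matches and RESENT/REPLAY is set: rebuild the reply */
    SKETCH_REJECT        /* XID matches but neither flag is set: refuse the request */
};

/*
 * Classify an incoming request. In the unexpected case (matching XID with no
 * RESENT or REPLAY flag) return an error to the caller instead of asserting.
 * The error code is an assumption made for illustration.
 */
static enum sketch_action
sketch_classify(uint64_t rq_xid, uint64_t last_xid, uint32_t flags, int *rc)
{
    *rc = 0;

    if (rq_xid != last_xid)
        return SKETCH_PROCESS_NEW;

    if (flags & (SKETCH_MSG_RESENT | SKETCH_MSG_REPLAY))
        return SKETCH_RECONSTRUCT;

    *rc = -EPROTO;  /* hypothetical choice: reject, do not panic the server */
    return SKETCH_REJECT;
}
```

          With the values from the console message (XID equal to last_xid, flags 0), this sketch would return SKETCH_REJECT with a negative error code, so the server would stay up and the client would see a failed request.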

          pjones Peter Jones added a comment -

          Mike

          Could you please advise here?

          Thanks

          Peter


          People

            tappro Mikhail Pershin
            ofaaland Olaf Faaland