[LU-1094] general protection fault in _debug_req()

    Description

      We had five occurrences of this crash on OSS nodes in our classified Lustre 2.1 cluster. Timeframe coincided with LU-1085. Like the other bugs in that window, this crash was preceded by hundreds of messages like:

      LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client [UUID]@<unknown>

      general protection fault: 0000 [#1] SMP
      Pid: 13890, comm: ll_ost_34

      machine_kexec
      crash_kexec
      oops_end
      die
      do_general_protection
      general_protection
      [exception RIP: strnlen+9]
      string
      vsnprintf
      libcfs_debug_vmsg2
      _debug_req
      target_send_reply_msg
      target_send_reply
      ost_handle
      ptlrpc_main
      kernel_thread
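
The fault type is itself a hint. On x86-64, dereferencing a non-canonical address commonly raises a general protection fault rather than a page fault, so a GPF inside strnlen() suggests that the pointer handed to a %s conversion was garbage rather than NULL or an unmapped but well-formed address. The following is a minimal sketch of that check, assuming 48-bit virtual addressing; is_canonical_48() and canonical_examples() are hypothetical helpers for illustration, not Lustre or kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging). An address is
 * canonical when bits 63..47 are all copies of bit 47.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* keeps bits 63..47, i.e. 17 bits */

    return top == 0 || top == 0x1ffff;
}

/* Usage sketch with the address classes relevant to this ticket. */
static void canonical_examples(void)
{
    /* NULL is canonical: dereferencing it would page-fault, not GPF. */
    assert(is_canonical_48(0x0ULL));

    /* A byte-repeated pattern such as the exp_connection value reported
     * in the comments on this ticket is non-canonical. */
    assert(!is_canonical_48(0x5a5a5a5a5a5a5a5aULL));
}
```

Under these assumptions, a pointer filled with a repeated byte pattern fails the check, which is consistent with a GPF rather than an ordinary NULL-pointer oops.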

        Activity

          pjones Peter Jones added a comment -

          Believed to be a duplicate of LU-1092

          nedbass Ned Bass (Inactive) added a comment -

          Sorry, disregard previous comment. We hit a new GPF, not this one.

          nedbass Ned Bass (Inactive) added a comment -

          FYI, we did in fact hit this again with the LU-106 patch here:

          http://review.whamcloud.com/326

          green Oleg Drokin added a comment -

          I think this one also has a chance of being related to LU-106, so let's see whether runs with the patch help.

          nedbass Ned Bass (Inactive) added a comment -

          Comment copied from LU-1085:

          I did some digging in crash to see what state the ptlrpc_request was in. I dug up the pointer address from the backtrace (let's call it <addr1> to save typing). Then, resolving some of the strings that get passed to libcfs_debug_vmsg2() from _debug_req(), I see:

          crash> struct ptlrpc_request.rq_import <addr1>
           rq_import = 0x0
          crash> struct ptlrpc_request.rq_export <addr1>
           rq_export = <addr2>
          crash> struct obd_export.exp_connection <addr2>
           exp_connection = 0x5a5a5a5a5a5a5a5a
          crash> struct obd_export.exp_client_uuid <addr2>
           exp_client_uuid = {
                  uuid = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"
           }

          So the presence of the poison value in exp_connection and the bogus uuid (0x5a is ASCII 'Z') suggests this export has already been destroyed.

          For reference, here is a snippet from _debug_req() that uses these values:

          void _debug_req(struct ptlrpc_request *req,
                          struct libcfs_debug_msg_data *msgdata,
                          const char *fmt, ... )
          {
                  va_list args;
                  va_start(args, fmt);
                  libcfs_debug_vmsg2(msgdata, fmt, args,
                                     " req@%p x"LPU64"/t"LPD64"("LPD64") o%d->%s@%s:%d/%d"
                                     " lens %d/%d e %d to %d dl "CFS_TIME_T" ref %d "
                                     "fl "REQ_FLAGS_FMT"/%x/%x rc %d/%d\n",
                                     req, req->rq_xid, req->rq_transno,
                                     req->rq_reqmsg ? lustre_msg_get_transno(req->rq_reqmsg) : 0,
                                     req->rq_reqmsg && req_ptlrpc_body_swabbed(req) ?
                                     lustre_msg_get_opc(req->rq_reqmsg) : -1,
                                     req->rq_import ? obd2cli_tgt(req->rq_import->imp_obd) :
                                     req->rq_export ?
                                     (char*)req->rq_export->exp_client_uuid.uuid : "<?>",
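
The reasoning in the comment above (a pointer field and a string field both filled with 0x5a imply the structure was freed and poisoned) can be sketched as a small user-space check. Everything here is illustrative: struct demo_export is a hypothetical stand-in for the two obd_export fields shown in the crash output, the 40-byte uuid size is an assumption, and POISON_BYTE is simply the value observed in the dump. The sketch simulates the poisoning with memset() instead of reading freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0x5a  /* assumption: the fill byte seen in the dump */

/* Hypothetical stand-in: the real obd_export has many more members. */
struct demo_export {
    void *exp_connection;
    char  exp_client_uuid[40];  /* size is an assumption */
};

/* Returns true when every byte of buf[0..len) equals POISON_BYTE. */
static bool all_poison(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (p[i] != POISON_BYTE)
            return false;
    return true;
}

/* Heuristic: both inspected fields consist solely of the poison byte. */
static bool export_looks_freed(const struct demo_export *exp)
{
    return all_poison(&exp->exp_connection, sizeof(exp->exp_connection)) &&
           all_poison(exp->exp_client_uuid, sizeof(exp->exp_client_uuid));
}

/* Simulates what a poisoning free leaves behind, without freeing anything. */
static void demo_poison(struct demo_export *exp)
{
    memset(exp, POISON_BYTE, sizeof(*exp));
}
```

A zero-initialised struct demo_export fails export_looks_freed(), while one passed through demo_poison() passes it and its uuid bytes read back as 'Z'. This is only a heuristic: a live object could in principle contain the same bytes, so a match suggests a freed export without proving it.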

          People

            green Oleg Drokin
            nedbass Ned Bass (Inactive)