Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3698

sec.c:1060:sptlrpc_cli_unwrap_reply() ASSERTION( req->rq_repdata == ((void *)0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.5.0
    • Lustre 2.5.0
    • None
    • 3
    • 9544

    Description

      As seen on a Hyperion client iwc1

      There was serious memory pressure lots of allocation were failing.

      Build:
      http://build.whamcloud.com/job/lustre-master/1594/

      Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-1594-gbdf591f-PRISTINE-2.6.32-358.11.1.el6.x86_64
      
      2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) prep_req failed: -12
      2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) Skipped 4 previous similar messages
      2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Write request failed with -12
      2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Skipped 5 previous similar messages
      2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) ASSERTION( req->rq_repdata == ((void *)0) ) failed:
      2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) LBUG
      2013-08-05 10:27:53 Pid: 84566, comm: ptlrpcd_7
      2013-08-05 10:27:53
      2013-08-05 10:27:53 Call Trace:
      2013-08-05 10:27:53  [<ffffffffa056e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-08-05 10:27:53  [<ffffffffa056ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-08-05 10:27:53  [<ffffffffa0c52256>] sptlrpc_cli_unwrap_reply+0x1d6/0x240 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffffa0c1735f>] after_reply+0x6f/0xd90 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffffa057efb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      2013-08-05 10:27:53  [<ffffffffa0c1c941>] ptlrpc_check_set+0xfd1/0x1b20 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffffa0c4752b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffffa0c47a43>] ptlrpcd+0x223/0x380 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffff81063310>] ? default_wake_function+0x0/0x20
      2013-08-05 10:27:53  [<ffffffffa0c47820>] ? ptlrpcd+0x0/0x380 [ptlrpc]
      2013-08-05 10:27:53  [<ffffffff81096936>] kthread+0x96/0xa0
      2013-08-05 10:27:53  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-08-05 10:27:53  [<ffffffff810968a0>] ? kthread+0x0/0xa0
      2013-08-05 10:27:53  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      

      There should be crashdump for this that I will take a look at if needed.

      I will do a little more digging and further updating of this LU it looks like we are not handling the ENOMEM (-12) quite right in this code path.

      Attachments

        1. iwc1-console
          229 kB
          Keith Mannthey

        Activity

          [LU-3698] sec.c:1060:sptlrpc_cli_unwrap_reply() ASSERTION( req->rq_repdata == ((void *)0) ) failed
          pjones Peter Jones made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.5.0

          pjones Peter Jones added a comment - Landed for 2.5.0

          Peng Tao, I agree with your idea and have created a patch for it. http://review.whamcloud.com/7411

          A rpc expert will need to take a good look at this idea.

          keith Keith Mannthey (Inactive) added a comment - Peng Tao, I agree with your idea and have created a patch for it. http://review.whamcloud.com/7411 A rpc expert will need to take a good look at this idea.
          bergwolf Peng Tao added a comment -

          We just saw this on one of our test machines with b2_4 HEAD as well.

          I took a look at related code and it seems that the problem is that in ptl_send_rpc() req is not sent and req flags are not reset, so in ptlrpc_check_set() reply flag checking thinks req has got reply and goes on to process the reply and thus LBUG.

          I guess something like following should fix the LBUG. But I'm not familiar with ptlrpc code so not entirely sure.

           
          diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c
          index 8b96c5d..9008e62 100644
          --- a/lustre/ptlrpc/client.c
          +++ b/lustre/ptlrpc/client.c
          @@ -1737,6 +1737,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
                                                  spin_lock(&req->rq_lock);
                                                  req->rq_net_err = 1;
                                                  spin_unlock(&req->rq_lock);
          +                                       continue;
                                          }
                                          /* need to reset the timeout */
                                          force_timer_recalc = 1;
          
          bergwolf Peng Tao added a comment - We just saw this on one of our test machines with b2_4 HEAD as well. I took a look at related code and it seems that the problem is that in ptl_send_rpc() req is not sent and req flags are not reset, so in ptlrpc_check_set() reply flag checking thinks req has got reply and goes on to process the reply and thus LBUG. I guess something like following should fix the LBUG. But I'm not familiar with ptlrpc code so not entirely sure. diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c index 8b96c5d..9008e62 100644 --- a/lustre/ptlrpc/client.c +++ b/lustre/ptlrpc/client.c @@ -1737,6 +1737,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set) spin_lock(&req->rq_lock); req->rq_net_err = 1; spin_unlock(&req->rq_lock); + continue; } /* need to reset the timeout */ force_timer_recalc = 1;
          adilger Andreas Dilger made changes -
          Affects Version/s New: Lustre 2.5.0 [ 10295 ]

          Jinshan: can you comment on Keith's question.

          doug Doug Oucharek (Inactive) added a comment - Jinshan: can you comment on Keith's question.
          adilger Andreas Dilger made changes -
          Fix Version/s New: Lustre 2.5.0 [ 10295 ]
          doug Doug Oucharek (Inactive) made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Keith Mannthey [ keith ]

          Well I took a quick look at the code and I think this may be a little more complicated than a usual out of memory issue"

          osc_check_rpcs has the following code:

                          osc_object_lock(osc);
                          if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
                                  rc = osc_send_write_rpc(env, cli, osc, pol);
                                  if (rc < 0) {
                                          CERROR("Write request failed with %d\n", rc);
          
                                          /* osc_send_write_rpc failed, mostly because of
                                           * memory pressure.
                                           *
                                           * It can't break here, because if:
                                           *  - a page was submitted by osc_io_submit, so
                                           *    page locked;
                                           *  - no request in flight
                                           *  - no subsequent request
                                           * The system will be in live-lock state,
                                           * because there is no chance to call
                                           * osc_io_unplug() and osc_check_rpcs() any
                                           * more. pdflush can't help in this case,
                                           * because it might be blocked at grabbing
                                           * the page lock as we mentioned.
                                           *
                                           * Anyway, continue to drain pages. */
                                          /* break; */
                                  }
                          }
          

          So we don't handle the enomem on write it seems??

          It seems like this might be the outcome of this decision?

          keith Keith Mannthey (Inactive) added a comment - Well I took a quick look at the code and I think this may be a little more complicated than a usual out of memory issue" osc_check_rpcs has the following code: osc_object_lock(osc); if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) { rc = osc_send_write_rpc(env, cli, osc, pol); if (rc < 0) { CERROR("Write request failed with %d\n", rc); /* osc_send_write_rpc failed, mostly because of * memory pressure. * * It can't break here, because if: * - a page was submitted by osc_io_submit, so * page locked; * - no request in flight * - no subsequent request * The system will be in live-lock state, * because there is no chance to call * osc_io_unplug() and osc_check_rpcs() any * more. pdflush can't help in this case, * because it might be blocked at grabbing * the page lock as we mentioned. * * Anyway, continue to drain pages. */ /* break; */ } } So we don't handle the enomem on write it seems?? It seems like this might be the outcome of this decision?
          keith Keith Mannthey (Inactive) made changes -
          Attachment New: iwc1-console [ 13329 ]

          People

            keith Keith Mannthey (Inactive)
            keith Keith Mannthey (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: