Lustre / LU-1682

Memory allocation failure in mgc_enqueue() tends to cause LBUG and refcount issue


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.3.0, Lustre 1.8.8
    • Environment: Originally observed with FEFS (based on Lustre 1.8.5) in the RIKEN K computer environment:
      MDS x1, OSS x2592, OST x5184, Client x84672
    • 3
    • 4517

    Description

      Hi, I believe I found some bugs in mgc_enqueue().
      I'm sure about the 1.8.x case because I've actually run into it often,
      but I'm less sure about the 2.x case because that is based only on my
      source code inspection, not on actually hitting the problem.

      So could someone verify whether or not it's also true in the 2.x case?
      An overview of the bugs is below.

      1)
      mgc_blocking_ast() is never called when mgc_enqueue() fails to
      allocate memory for a new ldlm_lock for the cld, because there is no
      ldlm_lock holding the callback functions after the allocation failure.
      As a result the cld remains, which causes the MGC's obd_device to
      remain forever, even after the umount command is executed.

      IMHO, in order to fix the problem we should call config_log_put() when
      that allocation failure happens. But we have no way to detect exactly
      this failure, because ldlm_cli_enqueue() returns only -ENOMEM to its
      caller, and there are several other functions that can also return
      -ENOMEM.

      Locally I've fixed it by adding one more argument (though my case is
      FEFS, based on Lustre 1.8.5), but I don't think that approach is very
      good.

      So I'd like to discuss the right way to fix it.
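      The reference leak can be illustrated with a toy model. All names
      below (mgc_enqueue_model, cld_refcount, the get/put helpers) are
      simplified stand-ins that mirror the config_llog_data handling, not
      the real Lustre code:

      ```c
      #include <stdio.h>

      /* Toy stand-in for config_llog_data; only the refcount matters. */
      struct config_llog_data {
              int cld_refcount;
      };

      static void config_log_get(struct config_llog_data *cld)
      {
              cld->cld_refcount++;
      }

      static void config_log_put(struct config_llog_data *cld)
      {
              cld->cld_refcount--;
      }

      /*
       * Sketch of the mgc_enqueue() lifecycle: a reference is taken for
       * the lock, and mgc_blocking_ast() is what normally drops it on
       * cancel.  If the ldlm_lock allocation fails, no lock ever exists,
       * so no blocking AST ever runs and the reference is never dropped,
       * unless the proposed config_log_put()-on-failure fix is applied.
       */
      static int mgc_enqueue_model(struct config_llog_data *cld,
                                   int lock_alloc_fails, int apply_fix)
      {
              config_log_get(cld);            /* ref held for the lock */
              if (lock_alloc_fails) {
                      if (apply_fix)
                              config_log_put(cld); /* proposed fix */
                      return -12;             /* -ENOMEM */
              }
              config_log_put(cld);    /* normally via mgc_blocking_ast() */
              return 0;
      }

      int main(void)
      {
              struct config_llog_data cld = { 0 };

              mgc_enqueue_model(&cld, 1, 0);  /* alloc fails, no fix */
              printf("unfixed: refcount=%d\n", cld.cld_refcount);

              cld.cld_refcount = 0;
              mgc_enqueue_model(&cld, 1, 1);  /* alloc fails, with fix */
              printf("fixed:   refcount=%d\n", cld.cld_refcount);
              return 0;
      }
      ```

      In the unfixed path the refcount never returns to zero, which is
      why the cld, and with it the MGC obd_device, outlives umount.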

      2)
      When mgc_enqueue() fails to allocate memory for a new ptlrpc_request
      for the cld, failed_lock_cleanup() is called, and the call path
      follows the list below ...

      failed_lock_cleanup()
          ldlm_lock_decref_internal()
              ldlm_handle_bl_callback()
                  mgc_blocking_ast()
                      ldlm_cli_cancel()
                          ldlm_cli_cancel_local()
      

      Then it reaches ldlm_cli_cancel_local() and takes the else branch
      below, because lock->l_conn_export has not been set yet.
      But it is caught by LBUG() because it is a client-side lock.

      static int ldlm_cli_cancel_local(struct ldlm_lock *lock)
      {
              int rc = LDLM_FL_LOCAL_ONLY;
              ENTRY;
      
              if (lock->l_conn_export) { <---- NULL in the case
                      ....
              } else {
                      if (ns_is_client(ldlm_lock_to_ns(lock))) { <---- This is a client side lock, then it will be True
                              LDLM_ERROR(lock, "Trying to cancel local lock"); <---- caught here
                              LBUG();
                      }
                      LDLM_DEBUG(lock, "server-side local cancel");
                      ldlm_lock_cancel(lock);
                      ldlm_reprocess_all(lock->l_resource);
              }
      
              RETURN(rc);
      } 
      

      So I believe ldlm_cli_enqueue() should be modified as in the pseudo-diff below,
      moving the assignments ahead of the request allocation.

      +        lock->l_conn_export = exp;
      +        lock->l_export = NULL;
      +        lock->l_blocking_ast = einfo->ei_cb_bl;
      
              if (reqp == NULL || *reqp == NULL) {
                      req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                      &RQF_LDLM_ENQUEUE,
                                                      LUSTRE_DLM_VERSION,
                                                      LDLM_ENQUEUE);
                      if (req == NULL) {
                              failed_lock_cleanup(ns, lock, einfo->ei_mode);
                              LDLM_LOCK_RELEASE(lock);
                              RETURN(-ENOMEM);
                      }
                      req_passed_in = 0;
                      if (reqp)
                              *reqp = req;
              } else {
                      int len;
      
                      req = *reqp;
                      len = req_capsule_get_size(&req->rq_pill, &RMF_DLM_REQ,
                                                 RCL_CLIENT);
                      LASSERTF(len >= sizeof(*body), "buflen[%d] = %d, not %d\n",
                               DLM_LOCKREQ_OFF, len, (int)sizeof(*body));
              }
      
      -        lock->l_conn_export = exp;
      -        lock->l_export = NULL;
      -        lock->l_blocking_ast = einfo->ei_cb_bl;
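      To see why reordering the assignments matters, the branch taken in
      ldlm_cli_cancel_local() can be modeled as follows. The types here
      are simplified stand-ins (ns_is_client is folded into a plain flag),
      not the real Lustre structures:

      ```c
      #include <stdio.h>
      #include <stddef.h>

      /* Simplified stand-ins for the real Lustre types. */
      struct obd_export { int unused; };

      struct ldlm_lock {
              struct obd_export *l_conn_export;
              int ns_is_client;       /* stand-in for ns_is_client(...) */
      };

      /* Returns 1 if ldlm_cli_cancel_local() would hit the LBUG() branch. */
      static int hits_lbug(const struct ldlm_lock *lock)
      {
              if (lock->l_conn_export != NULL)
                      return 0;          /* client cancel RPC path, no LBUG */
              return lock->ns_is_client; /* client lock, no export: LBUG */
      }

      int main(void)
      {
              struct obd_export exp = { 0 };

              /* Before the fix: the request allocation fails before the
               * export is assigned, so failed_lock_cleanup() ends up
               * cancelling a client lock whose l_conn_export is NULL. */
              struct ldlm_lock unpatched = { NULL, 1 };

              /* After the fix: l_conn_export is assigned before the
               * request allocation, so the cancel takes the normal
               * client path even if the allocation fails. */
              struct ldlm_lock patched = { &exp, 1 };

              printf("unpatched: LBUG=%d\n", hits_lbug(&unpatched));
              printf("patched:   LBUG=%d\n", hits_lbug(&patched));
              return 0;
      }
      ```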
      

      People

        Assignee: jay Jinshan Xiong (Inactive)
        Reporter: nozaki Hiroya Nozaki (Inactive)