LU-17149

TBF: req_capsule_extend() ASSERTION( fmt->rf_fields[i].nr >= old->rf_fields[i].nr )

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.15.3
    • "tbf" activated on "mdt" service
    • 3

    Description

      We hit the following crash on Lustre 2.15.3 with the TBF NRS policy activated on the "mdt" service:

      [892127.117400] LustreError: 8949:0:(layout.c:2467:req_capsule_extend()) ASSERTION( fmt->rf_fields[i].nr >= old->rf_fields[i].nr ) failed:
      [892127.118895] LustreError: 8949:0:(layout.c:2467:req_capsule_extend()) LBUG
      [892127.119727] Pid: 8949, comm: mdt03_008 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Tue May 30 14:53:41 EDT 2023
      [892127.120846] Call Trace TBD:
      [892127.121216] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
      [892127.121874] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      [892127.122485] [<0>] req_capsule_extend+0x174/0x1b0 [ptlrpc]
      [892127.123422] [<0>] nrs_tbf_id_cli_set+0x1ee/0x2a0 [ptlrpc]
      [892127.124165] [<0>] nrs_tbf_generic_cli_init+0x50/0x180 [ptlrpc]
      [892127.124986] [<0>] nrs_tbf_res_get+0x1fe/0x430 [ptlrpc]
      [892127.125670] [<0>] nrs_resource_get+0x6c/0xe0 [ptlrpc]
      [892127.126382] [<0>] nrs_resource_get_safe+0x87/0xe0 [ptlrpc]
      [892127.127126] [<0>] ptlrpc_nrs_req_initialize+0x58/0xb0 [ptlrpc]
      [892127.127919] [<0>] ptlrpc_server_request_add+0x248/0xa20 [ptlrpc]
      [892127.128771] [<0>] ptlrpc_server_handle_req_in+0x36a/0x8c0 [ptlrpc]
      [892127.129607] [<0>] ptlrpc_main+0xb97/0x1530 [ptlrpc]
      [892127.130284] [<0>] kthread+0x134/0x150
      [892127.130826] [<0>] ret_from_fork+0x1f/0x40
      

      ldlm_tbf_id_cli_set() tries to extend a request capsule that has already been extended.
      At that point we have pill->rc_fmt == RQF_LDLM_INTENT_GETATTR,
      and we try to do: req_capsule_extend(&req->rq_pill, &RQF_LDLM_INTENT_BASIC);

      RQF_LDLM_INTENT_GETATTR has 7 fields:

      static const struct req_msg_field *ldlm_intent_getattr_client[] = {           
              &RMF_PTLRPC_BODY,                                                     
              &RMF_DLM_REQ,                                                         
              &RMF_LDLM_INTENT,                                                     
              &RMF_MDT_BODY,     /* coincides with mds_getattr_name_client[] */     
              &RMF_CAPA1,                                                           
              &RMF_NAME,                                                            
              &RMF_FILE_SECCTX_NAME                                                 
      };                                                              
      

      RQF_LDLM_INTENT_BASIC has only 3 fields, fewer than the 7 already set on the pill, which trips the assertion:

      static const struct req_msg_field *ldlm_intent_basic_client[] = { 
              &RMF_PTLRPC_BODY,                                         
              &RMF_DLM_REQ,                                             
              &RMF_LDLM_INTENT,                                         
      };                                                                
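
      The field-count check behind the assertion can be illustrated with a minimal sketch. This is a simplified model, not the Lustre code: the model_* names are hypothetical, only the client-side field count is tracked, and the function returns false where the real req_capsule_extend() would LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Lustre's req_format / req_capsule. */
struct model_format {
	const char *name;
	size_t nr_client;	/* number of client-side fields */
};

struct model_pill {
	const struct model_format *fmt;	/* plays the role of rc_fmt */
};

/* Field counts taken from the two arrays quoted in this ticket. */
static const struct model_format model_intent_basic = {
	"LDLM_INTENT_BASIC", 3
};
static const struct model_format model_intent_getattr = {
	"LDLM_INTENT_GETATTR", 7
};

/*
 * Model of the sanity check: a format may only be replaced by one
 * that has at least as many fields. Returns false instead of
 * asserting so the failure can be observed by the caller.
 */
static bool model_capsule_extend(struct model_pill *pill,
				 const struct model_format *fmt)
{
	assert(pill->fmt != NULL);	/* extending needs an existing format */

	if (fmt->nr_client < pill->fmt->nr_client)
		return false;		/* the real code would LBUG here */

	pill->fmt = fmt;
	return true;
}
```

      In this model, extending a pill from model_intent_basic to model_intent_getattr succeeds, while the reverse direction (the one seen in the crash) is rejected and leaves the pill unchanged.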
      

      This became possible with the patch: https://review.whamcloud.com/45272 ("LU-15118 ldlm: no free thread to process resend request")

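      With that patch, ldlm_enqueue_hpreq_check() is called before nrs_resource_get_safe(). For an LDLM_ENQUEUE request that carries the MSG_RESENT flag (and not MSG_REPLAY), it initializes the pill and sets its format to RQF_LDLM_ENQUEUE, so rc_fmt is no longer NULL when the NRS code runs: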

      static int ldlm_enqueue_hpreq_check(struct ptlrpc_request *req)                          
      {                                                                                        
      ....                                                                                                                                                  
              if ((lustre_msg_get_flags(req->rq_reqmsg) & (MSG_REPLAY|MSG_RESENT)) !=    
                  MSG_RESENT)                                                            
                      RETURN(0);                                                         
                                                                                         
              req_capsule_init(&req->rq_pill, req, RCL_SERVER);                          
              req_capsule_set(&req->rq_pill, &RQF_LDLM_ENQUEUE);                         
                                                              
      ....
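
      The flag test above only lets pure resends through. A minimal sketch of that predicate follows; the MODEL_* flag values are assumptions meant to mirror Lustre's MSG_RESENT and MSG_REPLAY bits and should be checked against the tree in use, and model_is_pure_resend() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration; verify against lustre_idl.h. */
#define MODEL_MSG_RESENT 0x0002u
#define MODEL_MSG_REPLAY 0x0004u

static_assert((MODEL_MSG_RESENT & MODEL_MSG_REPLAY) == 0,
	      "the two flags must be distinct bits");

/*
 * True only when RESENT is set and REPLAY is clear, i.e. the case in
 * which the hpreq check goes on to initialize the pill.
 */
static bool model_is_pure_resend(uint32_t flags)
{
	return (flags & (MODEL_MSG_REPLAY | MODEL_MSG_RESENT)) ==
	       MODEL_MSG_RESENT;
}
```

      A request with both flags set, or with only the replay flag, fails the predicate, which matches the observation that only resent requests are affected.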
      

      Then nrs_tbf_id_cli_set() is called twice in nrs_tbf_res_get():

      • o_cli_find(): nrs_tbf_id_cli_find()
      • o_cli_init(): nrs_tbf_id_cli_init()

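      After nrs_tbf_id_cli_find() returns, rc_fmt == RQF_LDLM_INTENT_GETATTR, because the format was not NULL on entry and is therefore not restored.
      So the second call chain, nrs_tbf_id_cli_init() -> nrs_tbf_id_cli_set() -> ldlm_tbf_id_cli_set() -> req_capsule_extend(), crashes.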

      This crash does not occur if rc_fmt was initially NULL because nrs_tbf_id_cli_set() restores the NULL pointer before returning:

       static int nrs_tbf_id_cli_set(struct ptlrpc_request *req ...
      ....
              req_capsule_init(&req->rq_pill, req, RCL_SERVER); 
              if (req->rq_pill.rc_fmt == NULL) {                
                      req_capsule_set(&req->rq_pill, fmt);      
                      fmt_unset = true;                         
              }                                                 
      ....
             /* restore it to the initialized state */        
             if (fmt_unset)                                   
                     req->rq_pill.rc_fmt = NULL;              
             return rc;                                       
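
      The whole sequence can be replayed with a small model. This is a sketch under stated assumptions, not the Lustre implementation: all model_* names are hypothetical, the LDLM_ENQUEUE format is assumed to have 2 client fields, the request is assumed to be a getattr intent, and a failed extend is reported as an error instead of an LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_format { const char *name; size_t nr_client; };
struct model_pill { const struct model_format *fmt; };

/* 2 fields for LDLM_ENQUEUE is an assumption; 3 and 7 come from the ticket. */
static const struct model_format model_enqueue = { "LDLM_ENQUEUE", 2 };
static const struct model_format model_intent_basic = { "LDLM_INTENT_BASIC", 3 };
static const struct model_format model_intent_getattr = { "LDLM_INTENT_GETATTR", 7 };

/* Same rule as the assertion: never shrink the field count. */
static bool model_extend(struct model_pill *pill,
			 const struct model_format *fmt)
{
	assert(pill->fmt != NULL);
	if (fmt->nr_client < pill->fmt->nr_client)
		return false;
	pill->fmt = fmt;
	return true;
}

/* Model of nrs_tbf_id_cli_set() for a getattr intent enqueue. */
static int model_cli_set(struct model_pill *pill)
{
	bool fmt_unset = false;
	int rc = 0;

	if (pill->fmt == NULL) {
		pill->fmt = &model_enqueue;
		fmt_unset = true;
	}

	/* Extend to BASIC to read the intent, then to GETATTR. */
	if (!model_extend(pill, &model_intent_basic) ||
	    !model_extend(pill, &model_intent_getattr))
		rc = -1;

	/* The format is restored only if it was NULL on entry. */
	if (fmt_unset)
		pill->fmt = NULL;
	return rc;
}

/* Model of the find + init pair; returns the number of failed calls. */
static int model_res_get(struct model_pill *pill)
{
	int failures = 0;

	if (model_cli_set(pill) != 0)	/* o_cli_find() */
		failures++;
	if (model_cli_set(pill) != 0)	/* o_cli_init() */
		failures++;
	return failures;
}
```

      Starting from a NULL format, both calls succeed and the pill ends up NULL again. Starting from model_enqueue, as after the hpreq check on a resent request, the first call leaves the pill at model_intent_getattr and the second call fails on the extend to model_intent_basic.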
      

       

      Reproducer
      I was not able to reproduce the issue in a test environment, but it appeared in production when the server was heavily loaded. It occurs only for resent requests, not for replays.

      The impacted versions are:

      • 2.15.3
      • master with clients running 2.15.3 or 2.12 (without the LU-16077 patch)

            People

              eaujames Etienne Aujames
              eaujames Etienne Aujames