Lustre / LU-10985

Attempting to send a mkdir create intent crashes server

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.11.0

    Description

      When testing WBC code against unpatched 2.11 servers, it is easy to observe that sending a create intent (a valid intent handled by mdt_intent_reint) crashes the MDT with:

      [  850.056294] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) ASSERTION( fmt->rf_fields[i].nr >= old->rf_fields[i].nr ) failed: 
      [  850.058033] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) LBUG
      [  850.058796] Pid: 2568, comm: mdt01_002
      [  850.059467] 
      Call Trace:
      [  850.060682]  [<ffffffffa01ab7ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [  850.061433]  [<ffffffffa01ab85c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [  850.062203]  [<ffffffffa05b32a9>] req_capsule_extend+0x159/0x1c0 [ptlrpc]
      [  850.062920]  [<ffffffffa0c5d237>] mdt_create_unpack+0x157/0x4b0 [mdt]
      [  850.063630]  [<ffffffffa0c5dd78>] mdt_reint_unpack+0xa8/0x210 [mdt]
      [  850.064290]  [<ffffffffa0c4824f>] mdt_reint_internal+0x3f/0x990 [mdt]
      [  850.064992]  [<ffffffffa0c54bc7>] mdt_intent_reint+0x157/0x420 [mdt]
      [  850.065693]  [<ffffffffa0c4b8e2>] mdt_intent_opc+0x442/0xad0 [mdt]
      [  850.066381]  [<ffffffffa058fdd0>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [  850.067065]  [<ffffffffa0c533b6>] mdt_intent_policy+0x1a6/0x360 [mdt]
      [  850.067786]  [<ffffffffa053ed63>] ldlm_lock_enqueue+0x363/0xa40 [ptlrpc]
      [  850.068160]  [<ffffffffa01bcb05>] ? cfs_hash_rw_unlock+0x15/0x20 [libcfs]
      [  850.068554]  [<ffffffffa01bfe96>] ? cfs_hash_add+0xa6/0x180 [libcfs]
      [  850.068958]  [<ffffffffa05671a3>] ldlm_handle_enqueue0+0x933/0x1540 [ptlrpc]
      [  850.069354]  [<ffffffffa058fe50>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [  850.070049]  [<ffffffffa05edd72>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  850.070493]  [<ffffffffa05f424b>] tgt_request_handle+0xb1b/0x15c0 [ptlrpc]
      [  850.070889]  [<ffffffffa01b76a7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [  850.071272]  [<ffffffffa05995b1>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
      [  850.071964]  [<ffffffffa059d3ce>] ptlrpc_main+0xabe/0x1fd0 [ptlrpc]
      [  850.072373]  [<ffffffff810af904>] ? finish_task_switch+0x44/0x180
      [  850.072758]  [<ffffffff81703c00>] ? __schedule+0x240/0x950
      [  850.073150]  [<ffffffffa059c910>] ? ptlrpc_main+0x0/0x1fd0 [ptlrpc]
      [  850.073545]  [<ffffffff810a2eda>] kthread+0xea/0xf0
      [  850.074636]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
      [  850.074997]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
      [  850.075354]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
      

      This actually highlights an even bigger problem with this assertion, I think, since it also allows various ill-formed requests to crash the server.
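
To make the concern concrete, here is a minimal model of the check that fires in req_capsule_extend(). It is only a sketch: `fmt_model`, `fmt_can_extend` and the two-direction layout are assumptions made for illustration, not the actual ptlrpc structures, and returning -EPROTO is just one possible way to reject a bad request without an LBUG, not necessarily what the fix does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a request format: it keeps only
 * the per-direction field counts that the failing assertion compares. */
enum { DIR_CLIENT = 0, DIR_SERVER = 1, DIR_COUNT = 2 };

struct fmt_model {
	const char *name;
	size_t nr[DIR_COUNT];	/* number of fields in each direction */
};

/* Returns 0 when new_fmt has at least as many fields as old_fmt in every
 * direction, and -EPROTO otherwise. The assertion in the trace above
 * checks the same condition but calls LBUG when it does not hold. */
static int fmt_can_extend(const struct fmt_model *old_fmt,
			  const struct fmt_model *new_fmt)
{
	for (int i = 0; i < DIR_COUNT; i++) {
		if (new_fmt->nr[i] < old_fmt->nr[i])
			return -EPROTO;
	}
	return 0;
}
```

With a check shaped like this, a request whose current format has more fields than the target format is refused with an error code instead of taking the whole server down.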

      Anyway, the specific problem here is the lack of an "RMF_EADATA" component in the selected pill format, RQF_MDS_REINT_CREATE_ACL. In reality that format is only valid for a regular reint RPC. Intent RPCs already get their capsules extended as part of ldlm processing (and obviously they are not happy that we are changing the format), so we can skip this step entirely for intents.
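
The idea of skipping the format switch for intents can be sketched roughly as follows. This is a toy model, not the mdt code: `unpack_model`, its `is_intent` flag and `create_unpack_model` are hypothetical names, and the counter merely stands in for the call to req_capsule_extend().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the create unpack step looks at. */
struct unpack_model {
	bool is_intent;		/* request arrived as an ldlm intent */
	int extend_calls;	/* times the capsule format was switched */
};

/* Only regular reint RPCs get their capsule switched to the create
 * format; intents keep the format that ldlm processing already set. */
static void create_unpack_model(struct unpack_model *req)
{
	if (!req->is_intent)
		req->extend_calls++;	/* stands in for req_capsule_extend() */
}
```

How the real code would tell an intent apart from a plain reint is left open here; the boolean is just a placeholder for whatever state makes that distinction.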

      The other problem, once we overcome this one, is that mdt_reint_create unconditionally assumes that any request with an ldlm handle in it (determined by info->mti_dlm_req being set) is an ELC cancel request and calls ldlm_request_cancel right away. That is great for normal reint requests, but it crashes for intent requests because the lock handle provided is not yet granted, or not properly referenced, or some such.
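
A rough sketch of the kind of guard this suggests is below. Everything in it is an assumption for illustration: `reint_model`, `is_intent` and `reint_create_model` are hypothetical, `has_dlm_req` only mimics info->mti_dlm_req being set, and the counter stands in for ldlm_request_cancel().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs to the early-cancel decision. */
struct reint_model {
	bool has_dlm_req;	/* mimics info->mti_dlm_req being set */
	bool is_intent;		/* request arrived as an ldlm intent */
	int cancel_calls;	/* times the ELC cancel path was taken */
};

/* Treat the embedded ldlm request as an ELC cancel only for regular
 * reint RPCs. For intents the handle refers to the lock being enqueued,
 * so the cancel path is skipped. */
static void reint_create_model(struct reint_model *r)
{
	if (r->has_dlm_req && !r->is_intent)
		r->cancel_calls++;	/* stands in for ldlm_request_cancel() */
}
```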

      As such, we really need to rework the current intent-create logic so that it does not crash right away.

          People

            green Oleg Drokin