Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.11.0
Description
While testing WBC code against unpatched 2.11 servers, it is easy to observe that sending a create intent (a valid intent handled by mdt_intent_reint) crashes the MDT with:
[ 850.056294] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) ASSERTION( fmt->rf_fields[i].nr >= old->rf_fields[i].nr ) failed:
[ 850.058033] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) LBUG
[ 850.058796] Pid: 2568, comm: mdt01_002
[ 850.059467] Call Trace:
[ 850.060682] [<ffffffffa01ab7ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 850.061433] [<ffffffffa01ab85c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 850.062203] [<ffffffffa05b32a9>] req_capsule_extend+0x159/0x1c0 [ptlrpc]
[ 850.062920] [<ffffffffa0c5d237>] mdt_create_unpack+0x157/0x4b0 [mdt]
[ 850.063630] [<ffffffffa0c5dd78>] mdt_reint_unpack+0xa8/0x210 [mdt]
[ 850.064290] [<ffffffffa0c4824f>] mdt_reint_internal+0x3f/0x990 [mdt]
[ 850.064992] [<ffffffffa0c54bc7>] mdt_intent_reint+0x157/0x420 [mdt]
[ 850.065693] [<ffffffffa0c4b8e2>] mdt_intent_opc+0x442/0xad0 [mdt]
[ 850.066381] [<ffffffffa058fdd0>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[ 850.067065] [<ffffffffa0c533b6>] mdt_intent_policy+0x1a6/0x360 [mdt]
[ 850.067786] [<ffffffffa053ed63>] ldlm_lock_enqueue+0x363/0xa40 [ptlrpc]
[ 850.068160] [<ffffffffa01bcb05>] ? cfs_hash_rw_unlock+0x15/0x20 [libcfs]
[ 850.068554] [<ffffffffa01bfe96>] ? cfs_hash_add+0xa6/0x180 [libcfs]
[ 850.068958] [<ffffffffa05671a3>] ldlm_handle_enqueue0+0x933/0x1540 [ptlrpc]
[ 850.069354] [<ffffffffa058fe50>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[ 850.070049] [<ffffffffa05edd72>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 850.070493] [<ffffffffa05f424b>] tgt_request_handle+0xb1b/0x15c0 [ptlrpc]
[ 850.070889] [<ffffffffa01b76a7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 850.071272] [<ffffffffa05995b1>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[ 850.071964] [<ffffffffa059d3ce>] ptlrpc_main+0xabe/0x1fd0 [ptlrpc]
[ 850.072373] [<ffffffff810af904>] ? finish_task_switch+0x44/0x180
[ 850.072758] [<ffffffff81703c00>] ? __schedule+0x240/0x950
[ 850.073150] [<ffffffffa059c910>] ? ptlrpc_main+0x0/0x1fd0 [ptlrpc]
[ 850.073545] [<ffffffff810a2eda>] kthread+0xea/0xf0
[ 850.074636] [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[ 850.074997] [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[ 850.075354] [<ffffffff810a2df0>] ? kthread+0x0/0xf0
This actually highlights an even bigger problem with this assertion, I think, since it allows various ill-formed requests to cause crashes too.
Anyway, the specific problem here is the lack of the RMF_EADATA component in the selected pill format, RQF_MDS_REINT_CREATE_ACL. In reality that extension is only valid for a regular reint RPC; intent RPCs already get their capsules extended as part of LDLM processing (and the assertion is obviously not happy that we are changing the format again), so we can skip this step entirely for intents.
The other problem, once we get past this one, is that mdt_reint_create unconditionally assumes that any request carrying an LDLM handle (determined by info->mti_dlm_req being set) is an ELC cancel request and calls ldlm_request_cancel right away. That is fine for normal reint requests, but crashes for intent requests, because the lock handle provided there is not yet granted, or not properly referenced, or some such.
As such, we really need to rework the current intent-create logic so that it does not crash right away.