[LU-10985] Attempting to send a mkdir create intent crashes server Created: 02/May/18  Updated: 21/Jan/19  Resolved: 06/May/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing WBC code against unpatched 2.11 servers, it was easy to observe that sending a create intent (a valid intent handled by mdt_intent_reint) crashes the mdt with:

[  850.056294] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) ASSERTION( fmt->rf_fields[i].nr >= old->rf_fields[i].nr ) failed: 
[  850.058033] LustreError: 2568:0:(layout.c:2398:req_capsule_extend()) LBUG
[  850.058796] Pid: 2568, comm: mdt01_002
[  850.059467] 
Call Trace:
[  850.060682]  [<ffffffffa01ab7ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[  850.061433]  [<ffffffffa01ab85c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[  850.062203]  [<ffffffffa05b32a9>] req_capsule_extend+0x159/0x1c0 [ptlrpc]
[  850.062920]  [<ffffffffa0c5d237>] mdt_create_unpack+0x157/0x4b0 [mdt]
[  850.063630]  [<ffffffffa0c5dd78>] mdt_reint_unpack+0xa8/0x210 [mdt]
[  850.064290]  [<ffffffffa0c4824f>] mdt_reint_internal+0x3f/0x990 [mdt]
[  850.064992]  [<ffffffffa0c54bc7>] mdt_intent_reint+0x157/0x420 [mdt]
[  850.065693]  [<ffffffffa0c4b8e2>] mdt_intent_opc+0x442/0xad0 [mdt]
[  850.066381]  [<ffffffffa058fdd0>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[  850.067065]  [<ffffffffa0c533b6>] mdt_intent_policy+0x1a6/0x360 [mdt]
[  850.067786]  [<ffffffffa053ed63>] ldlm_lock_enqueue+0x363/0xa40 [ptlrpc]
[  850.068160]  [<ffffffffa01bcb05>] ? cfs_hash_rw_unlock+0x15/0x20 [libcfs]
[  850.068554]  [<ffffffffa01bfe96>] ? cfs_hash_add+0xa6/0x180 [libcfs]
[  850.068958]  [<ffffffffa05671a3>] ldlm_handle_enqueue0+0x933/0x1540 [ptlrpc]
[  850.069354]  [<ffffffffa058fe50>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[  850.070049]  [<ffffffffa05edd72>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  850.070493]  [<ffffffffa05f424b>] tgt_request_handle+0xb1b/0x15c0 [ptlrpc]
[  850.070889]  [<ffffffffa01b76a7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  850.071272]  [<ffffffffa05995b1>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[  850.071964]  [<ffffffffa059d3ce>] ptlrpc_main+0xabe/0x1fd0 [ptlrpc]
[  850.072373]  [<ffffffff810af904>] ? finish_task_switch+0x44/0x180
[  850.072758]  [<ffffffff81703c00>] ? __schedule+0x240/0x950
[  850.073150]  [<ffffffffa059c910>] ? ptlrpc_main+0x0/0x1fd0 [ptlrpc]
[  850.073545]  [<ffffffff810a2eda>] kthread+0xea/0xf0
[  850.074636]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[  850.074997]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[  850.075354]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0

This actually highlights an even bigger problem with this assertion, I think, since it also allows various ill-formed requests to crash the server.

Anyway, the specific problem here is that the selected pill format, RQF_MDS_REINT_CREATE_ACL, lacks the "RMF_EADATA" component. In reality that format is only valid for a regular reint RPC. Intent RPCs already get their capsules extended as part of ldlm processing (and obviously they are not happy when we change the format), so we can skip this step entirely for intents.
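
To illustrate the idea, here is a minimal standalone sketch of "do not re-extend the capsule for intents, and reject a shrinking format with an error instead of an assertion". It is not the landed patch and not Lustre code: every model_* name, the two-slot field-count layout, and the choice of -EPROTO are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one field count per message direction. */
enum { MODEL_RCL_CLIENT, MODEL_RCL_SERVER, MODEL_RCL_NR };

struct model_format {
	const char *name;
	size_t nr[MODEL_RCL_NR];	/* number of fields per direction */
};

struct model_capsule {
	const struct model_format *fmt;	/* format currently in effect */
	bool is_intent;			/* request arrived via an ldlm intent */
};

/*
 * Switch the capsule to a new format. A format with fewer fields than
 * the current one is treated as a bad request and rejected with an
 * error code (assumed -EPROTO) rather than an assertion.
 */
int model_capsule_extend(struct model_capsule *pill,
			 const struct model_format *fmt)
{
	/* NULL pointers are a programmer error, not wire data. */
	assert(pill != NULL && pill->fmt != NULL && fmt != NULL);

	for (int i = 0; i < MODEL_RCL_NR; i++)
		if (fmt->nr[i] < pill->fmt->nr[i])
			return -EPROTO;

	pill->fmt = fmt;
	return 0;
}

/*
 * Unpack step for a create request. Intent requests are assumed to
 * have had their format chosen earlier, so the format is left alone.
 */
int model_create_unpack(struct model_capsule *pill,
			const struct model_format *create_acl_fmt)
{
	assert(pill != NULL);

	if (pill->is_intent)
		return 0;

	return model_capsule_extend(pill, create_acl_fmt);
}
```

In this model, an intent capsule keeps its original format after model_create_unpack(), while a regular request offered a smaller format gets an error back to the caller instead of bringing the process down.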

The other problem, once we overcome this one, is that mdt_reint_create unconditionally assumes that any request with an ldlm handle in it (determined by info->mti_dlm_req being set) is an ELC cancel request and calls ldlm_request_cancel right away. That is fine for normal reint requests, but it crashes for intent requests because the lock handle provided is not yet granted, or not properly referenced, or some such.
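
The distinction described above can be sketched as a small predicate. This is a hypothetical model rather than the actual mdt code: the model_* types, the from_intent flag, and the function name are all assumptions, and only the "dlm request present" test mirrors the info->mti_dlm_req check mentioned in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ldlm request carried by an RPC. */
struct model_dlm_req {
	int lock_count;
};

/* Hypothetical stand-in for the per-request reint state. */
struct model_reint_info {
	const struct model_dlm_req *dlm_req;	/* models info->mti_dlm_req */
	bool from_intent;			/* assumed flag: came via an intent */
};

/*
 * Decide whether the embedded lock handles should be treated as an
 * early lock cancel (ELC) list. For an intent request the handle
 * describes the lock being enqueued, so it must not be cancelled.
 */
bool model_should_cancel_elc(const struct model_reint_info *info)
{
	assert(info != NULL);	/* programmer error, not wire data */

	if (info->dlm_req == NULL)
		return false;	/* nothing to cancel */

	return !info->from_intent;
}
```

A caller in this model would invoke its cancel routine only when model_should_cancel_elc() returns true, which keeps the regular reint path unchanged while leaving the intent lock handle untouched.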

As such, we really need to rework the current intent-create logic so that it does not crash right away.



 Comments   
Comment by Gerrit Updater [ 02/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/32237
Subject: LU-10985 mdt: properly handle unknown intent requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: edec9a5693ca0c749009ff94c5f75abf2bf00679

Comment by Gerrit Updater [ 06/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32237/
Subject: LU-10985 mdt: properly handle unknown intent requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6a39600f641cc3e179b0149af5ff17ba44d2319f

Comment by Peter Jones [ 06/May/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 23/May/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32521
Subject: LU-10985 mdt: properly handle unknown intent requests
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: ac535c47902875ac6c7ec7312f9f1ef7526614a0

Comment by Gerrit Updater [ 19/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32521/
Subject: LU-10985 mdt: properly handle unknown intent requests
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 179bf9a009cd27b0055e23c1478d7b298833ce35
