[LU-9266] Mount hung due to double HSM RESTORE records Created: 27/Mar/17 Updated: 16/Aug/17 Resolved: 09/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Cheremencev | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Usually, when an agent sends several RESTORE requests for the same FID, the MDT processes only the first.
int mdt_hsm_add_actions(struct mdt_thread_info *mti,
			struct hsm_action_list *hal, __u64 *compound_id)
{
	...
	rc = hsm_find_compatible(mti->mti_env, mdt, hal);
	...
		/* test result of hsm_find_compatible()
		 * if request redundant or cancel of nothing
		 * do not record
		 */
		/* redundant case */
		if (hai->hai_action != HSMA_CANCEL && hai->hai_cookie != 0)
			continue;
		...
			/* take LAYOUT lock so that accessing the layout will
			 * be blocked until the restore is finished */
			mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
			rc = mdt_object_lock(mti, obj, &crh->crh_lh,
			...
		/* record request */
		rc = mdt_agent_record_add(mti->mti_env, mdt, *compound_id,
					  archive_id, flags, hai);
Even if the MDT doesn't find a compatible request in the llog, it tries to take the LAYOUT lock. This lock is already held by the first RESTORE request.
lrh=[type=10680000 len=136 idx=1/1] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f65/0x58d95f65 action=ARCHIVE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
lrh=[type=10680000 len=136 idx=1/2] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f66/0x58d95f66 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=1/3] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f67/0x58d95f67 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
Such records cause the mount to hang when the llog is processed:
PID: 15524 TASK: ffff880068b5b540 CPU: 4 COMMAND: "lctl"
#0 [ffff8800bacd9728] schedule at ffffffff81525d30
#1 [ffff8800bacd97f0] ldlm_completion_ast at ffffffffa08527f5 [ptlrpc]
#2 [ffff8800bacd9890] ldlm_cli_enqueue_local at ffffffffa0851b8e [ptlrpc]
#3 [ffff8800bacd9910] mdt_object_lock0 at ffffffffa0e4ec4c [mdt]
#4 [ffff8800bacd99c0] mdt_object_lock at ffffffffa0e4f694 [mdt]
#5 [ffff8800bacd99d0] mdt_object_find_lock at ffffffffa0e4f9c1 [mdt]
#6 [ffff8800bacd9a00] hsm_restore_cb at ffffffffa0e9b533 [mdt]
#7 [ffff8800bacd9a50] llog_process_thread at ffffffffa05fd699 [obdclass]
#8 [ffff8800bacd9b10] llog_process_or_fork at ffffffffa05fdbaf [obdclass]
#9 [ffff8800bacd9b60] llog_cat_process_cb at ffffffffa0601250 [obdclass] |
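The condition described above (two pending RESTORE records for the same FID) can be illustrated with a small standalone check. This is only a sketch under stated assumptions: the `sketch_*` types and functions are hypothetical stand-ins for the fields printed in the llog dump, not Lustre structures or APIs, and treating WAITING and STARTED as "pending" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the fields shown in the llog dump;
 * these are NOT the real Lustre structures. */
enum sketch_action { SKETCH_ARCHIVE, SKETCH_RESTORE };
enum sketch_status { SKETCH_WAITING, SKETCH_STARTED, SKETCH_SUCCEED };

struct sketch_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

struct sketch_record {
	struct sketch_fid fid;
	uint64_t cookie;
	enum sketch_action action;
	enum sketch_status status;
};

static bool sketch_fid_eq(const struct sketch_fid *a,
			  const struct sketch_fid *b)
{
	return a->seq == b->seq && a->oid == b->oid && a->ver == b->ver;
}

/* Assumption: a RESTORE that is WAITING or STARTED still needs the lock. */
static bool sketch_pending_restore(const struct sketch_record *r)
{
	return r->action == SKETCH_RESTORE &&
	       (r->status == SKETCH_WAITING || r->status == SKETCH_STARTED);
}

/* Return the index of the first pending RESTORE record that repeats an
 * earlier pending RESTORE for the same FID, or -1 if there is none. */
static int sketch_find_duplicate_restore(const struct sketch_record *recs,
					 size_t n)
{
	assert(recs != NULL || n == 0);
	for (size_t i = 1; i < n; i++) {
		if (!sketch_pending_restore(&recs[i]))
			continue;
		for (size_t j = 0; j < i; j++) {
			if (sketch_pending_restore(&recs[j]) &&
			    sketch_fid_eq(&recs[i].fid, &recs[j].fid))
				return (int)i;
		}
	}
	return -1;
}

/* The three records from the dump in the description. */
static const struct sketch_record sketch_dump[] = {
	{ { 0x200000402ULL, 0x1, 0x0 }, 0x58d95f65ULL, SKETCH_ARCHIVE, SKETCH_SUCCEED },
	{ { 0x200000402ULL, 0x1, 0x0 }, 0x58d95f66ULL, SKETCH_RESTORE, SKETCH_WAITING },
	{ { 0x200000402ULL, 0x1, 0x0 }, 0x58d95f67ULL, SKETCH_RESTORE, SKETCH_WAITING },
};
```

Calling `sketch_find_duplicate_restore(sketch_dump, 3)` flags the third record (index 2, cookie 0x58d95f67), which is the one whose lock request waits behind the first RESTORE in the stack trace.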
| Comments |
| Comment by Gerrit Updater [ 27/Mar/17 ] |
|
Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/26215 |
| Comment by Peter Jones [ 31/Mar/17 ] |
|
Hongchao, could you please review this proposed change? Thanks, Peter |
| Comment by Bruno Faccini (Inactive) [ 21/Apr/17 ] |
|
Sergei, Hongchao, |
| Comment by Sergey Cheremencev [ 26/Apr/17 ] |
|
Thanks for the feedback.
Yes, at first look it is very unlikely. On the other hand, a Seagate customer hit this problem.
Correct.
Because it is easier not to add new requests while the CDT is stopped than to parse the llog later during the mount. Furthermore, mdt_hsm_add_actions has the same condition (cdt->cdt_state == CDT_STOPPED) at the beginning, so ideally we shouldn't serve any requests when the coordinator is stopped. |
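The idea in the comment above (refuse new requests while the coordinator is stopped, rather than cleaning up the llog at mount time) can be sketched as a simple state guard. This is a minimal illustration, not the merged patch: the `sketch_*` names are hypothetical, and returning `-EAGAIN` is an assumption chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical coordinator model; not the Lustre coordinator structure. */
enum sketch_cdt_state { SKETCH_CDT_STOPPED, SKETCH_CDT_RUNNING };

struct sketch_coordinator {
	enum sketch_cdt_state state;
	size_t nr_recorded;	/* number of requests recorded so far */
};

/* Record one request only if the coordinator is running.
 * Returns 0 on success, or -EAGAIN (assumed error code) when stopped,
 * in which case nothing is recorded. */
static int sketch_add_action(struct sketch_coordinator *cdt)
{
	assert(cdt != NULL);
	if (cdt->state == SKETCH_CDT_STOPPED)
		return -EAGAIN;
	cdt->nr_recorded++;
	return 0;
}
```

With the guard placed before any record is written, a stopped coordinator leaves no half-processed entries behind for the next mount to replay.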
| Comment by Gerrit Updater [ 09/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26215/ |
| Comment by Peter Jones [ 09/Aug/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 09/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28441 |
| Comment by Gerrit Updater [ 16/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28441/ |