[LU-6802] sanity test_208 fail: “lease not broken over recovery” Created: 06/Jul/15  Updated: 08/Sep/17  Resolved: 04/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Autotest


Issue Links:
Duplicate
is duplicated by LU-6596 GPF: RIP [<ffffffffa076924b>] ptlrpc_... Resolved
Related
is related to LU-8459 Need LU-6596 fix in Lustre 2.5 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 208 is failing occasionally. From the test_log:

==== test 5: lease broken can't be regained by replay
multiop: apply lease error: Device or resource busy
Failing mds1 on shadow-17vm12
…
shadow-17vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
shadow-17vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
/usr/lib64/lustre/tests/sanity.sh: line 11648: kill: (19839) - No such process
 sanity test_208: @@@@@@ FAIL: lease not broken over recovery 

Test logs for this failure and past failures are at:
2015-07-03 01:45:25 - https://testing.hpdd.intel.com/test_sets/0965361e-213f-11e5-a388-5254006e85c2
2015-07-03 03:06:42 - https://testing.hpdd.intel.com/test_sets/a04d4718-214b-11e5-bb73-5254006e85c2
2015-07-03 16:23:00 - https://testing.hpdd.intel.com/test_sets/38d9e8f0-21de-11e5-a388-5254006e85c2
2015-07-03 18:18:18 - https://testing.hpdd.intel.com/test_sets/b163f5f2-21ef-11e5-a388-5254006e85c2
2015-07-04 19:45:45 - https://testing.hpdd.intel.com/test_sets/2e5a5992-2372-11e5-aced-5254006e85c2
2015-07-05 12:09:44 - https://testing.hpdd.intel.com/test_sets/86e51cfa-234a-11e5-9feb-5254006e85c2
2015-07-05 15:22:59 - https://testing.hpdd.intel.com/test_sets/a26fb9a2-2341-11e5-aced-5254006e85c2
2015-07-05 15:58:01 - https://testing.hpdd.intel.com/test_sets/04aecbb8-236e-11e5-9feb-5254006e85c2



 Comments   
Comment by Peter Jones [ 06/Jul/15 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 10/Jul/15 ]

Hit this three times while testing two patches: http://review.whamcloud.com/14911 and http://review.whamcloud.com/14953.

Comment by Lai Siyao [ 11/Jul/15 ]

sanity 208 fails on different subtests, and it often fails with EBUSY (lockrep->lock_policy_res2 is -EBUSY) in ll_lease_open -> md_intent_lock -> mdc_finish_enqueue, while this value should be 0 since the MDS doesn't set it for a lease open.

I need more time to review related code.

Comment by Lai Siyao [ 15/Jul/15 ]

The log shows this -EBUSY comes from the end of mdt_object_open_lock():

                if (atomic_read(&obj->mot_open_count) > open_count)
                        GOTO(out, rc = -EBUSY);

I don't understand why this failure is random, and I don't see anything special before subtest 5.

Jinshan, do you have any ideas?

Comment by James Nunez (Inactive) [ 15/Jul/15 ]

Some recent failures on master in review-dne-part-1:
2015-07-14 09:41:51 - https://testing.hpdd.intel.com/test_sets/7eb0672e-2a4d-11e5-a37e-5254006e85c2
2015-07-14 11:05:48 - https://testing.hpdd.intel.com/test_sets/8bd84aa2-2a5c-11e5-a37e-5254006e85c2
2015-07-14 11:47:39 - https://testing.hpdd.intel.com/test_sets/13d7bb3e-2a5f-11e5-ba1d-5254006e85c2
2015-07-14 12:43:45 - https://testing.hpdd.intel.com/test_sets/32ab3cf8-2a68-11e5-a37e-5254006e85c2
2015-07-14 15:43:16 - https://testing.hpdd.intel.com/test_sets/229bd900-2a7f-11e5-a37e-5254006e85c2
2015-07-14 20:03:58 - https://testing.hpdd.intel.com/test_sets/c3304c60-2ac1-11e5-be51-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 15/Jul/15 ]

Hi Lai,

The lease failed to apply in subtest 5, so it probably has nothing to do with this subtest. It's more likely that the file was not really closed in subtest 4. Would you please take a further look to verify?

Comment by Lai Siyao [ 16/Jul/15 ]

As far as I can see from the log, the file was closed in subtest 4. I'll take a deeper look.

Comment by Doug Oucharek (Inactive) [ 17/Jul/15 ]

Hi Jinshan. Can you look at this issue, which is blocking landings?

Lai is off next week so he won't be able to work on it.

Comment by Jinshan Xiong (Inactive) [ 18/Jul/15 ]

Hi James,

Can you please enable the 'inode' debug log option on the MDT side when you're running the test? For example:

lctl set_param debug+=inode

Comment by Jian Yu [ 20/Jul/15 ]

Here is a for-test-only patch trying to reproduce the failure and gather debug logs with "PTLDEBUG=+inode" in the test parameters: http://review.whamcloud.com/15649

Comment by Jian Yu [ 21/Jul/15 ]

Hi Jinshan,

The failure was reproduced by running only sanity test 208: https://testing.hpdd.intel.com/test_sessions/35e878d2-2f3a-11e5-92dd-5254006e85c2
Please check the debug logs. Thanks.

Comment by Jinshan Xiong (Inactive) [ 21/Jul/15 ]

This turned out to be a recovery problem, though I don't understand why we have only started seeing it recently.

In ptlrpc_replay_next(), the replay cursor is moved forward before the request has been replayed successfully:

                /* The last request on committed_list hasn't been replayed */
                if (req->rq_transno > last_transno) {
                        /* Since the imp_committed_list is immutable before
                         * all of it's requests being replayed, it's safe to
                         * use a cursor to accelerate the search */
                        imp->imp_replay_cursor = imp->imp_replay_cursor->next;

If that request failed to replay, it may not have a chance to be replayed again.

In this test, there are two open requests that need replaying. The first one failed due to a ptlrpc timeout, then the client reconnected and skipped the first open. This leaked an open request, which in turn caused the lease open to fail.

The import's replay cursor should be reset every time before replay starts.
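
A minimal sketch of that idea, assuming the obd_import fields and locking quoted elsewhere in this ticket (the exact placement in the reconnect path is simplified here):

        /* Sketch: on a successful reconnect, forget the old replay progress
         * so that replay restarts from the head of imp_committed_list and
         * no committed open request can be skipped. */
        spin_lock(&imp->imp_lock);
        imp->imp_replay_cursor = &imp->imp_committed_list;
        imp->imp_last_replay_transno = 0;
        spin_unlock(&imp->imp_lock);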

Comment by Gerrit Updater [ 21/Jul/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15669
Subject: LU-6802 ptlrpc: reset replay cursor on reconnection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7fdf6c195899388e87d66a2a84757aefedcac636

Comment by Gerrit Updater [ 27/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15669/
Subject: LU-6802 ptlrpc: reset replay cursor on reconnection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c9f9e9c54f5284093b689dd0d7fa8baf5a5b2c97

Comment by Niu Yawei (Inactive) [ 06/Aug/15 ]

I saw this fix while working on another ticket, and it looks incorrect:

        /* LU-6802: Reset replay cursor on successful reconnect */
        spin_lock(&imp->imp_lock);
        imp->imp_replay_cursor = &imp->imp_committed_list;
        imp->imp_last_replay_transno = 0;
        spin_unlock(&imp->imp_lock);

We shouldn't reset imp_last_replay_transno on each reconnect; otherwise, requests which have been replayed but not committed will be replayed again after the reconnect. Furthermore, resetting imp_replay_cursor on each reconnect isn't optimal; we should just move it one step back in that case (after a reconnect, if imp_replay_cursor != &imp->imp_committed_list, set imp_replay_cursor to imp_replay_cursor->prev), as in the sketch below.
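
A minimal sketch of this suggested alternative, again assuming the obd_import fields quoted above (illustration only, not the code that eventually landed in patch 15871 below):

        /* Sketch of the suggestion: after a reconnect, step the cursor back
         * by one entry instead of resetting it, and leave
         * imp_last_replay_transno untouched so that requests which were
         * already replayed and committed are not replayed again. */
        spin_lock(&imp->imp_lock);
        if (imp->imp_replay_cursor != &imp->imp_committed_list)
                imp->imp_replay_cursor = imp->imp_replay_cursor->prev;
        spin_unlock(&imp->imp_lock);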

Comment by Gerrit Updater [ 06/Aug/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/15871
Subject: LU-6802 recovery: don't skip open replay on reconnect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e3a7253ae8537ce35916a218fddf7b861469aef

Comment by Jinshan Xiong (Inactive) [ 06/Aug/15 ]

Indeed, this can be optimized. The previous patch is logically correct because a) replaying an already-replayed request should be okay and the target must handle this case, and b) it's not common to reconnect during replay. Actually, this bug has existed for a long time but was never hit until now; it would be more interesting to investigate why reconnecting during replay became so common recently.

Comment by Niu Yawei (Inactive) [ 07/Aug/15 ]

Not sure why the test didn't fail before, but replaying a request twice doesn't look okay to me; at the very least, the second execution will fail and emit an error.

Comment by Jinshan Xiong (Inactive) [ 07/Aug/15 ]

it will reconstruct a reply

Comment by Niu Yawei (Inactive) [ 07/Aug/15 ]

it will reconstruct a reply

Speaking of resend, the original fix will break resend of replayed requests, because it doesn't restart replay from the last timed-out request on reconnect, so the resend flag will be set on the wrong request. (Actually, on the server side, only one piece of reply data can be available, which is for the last timed-out replay; the reply data for all other replays must have been freed.)

Comment by Gerrit Updater [ 18/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15871/
Subject: LU-6802 recovery: don't skip open replay on reconnect
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 76040fdbbb739053695386e7ed6d0dcb1ea7539a

Comment by Peter Jones [ 18/Aug/15 ]

Landed for 2.8

Comment by Ben Evans (Inactive) [ 02/Sep/15 ]

Should the check made in the patch be made every time the imp_replay_cursor is advanced?

Comment by Gerrit Updater [ 25/Nov/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17351
Subject: LU-6802 ptlrpc: reset imp_replay_cursor
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 24efe8099b751714c6a9bd89745fc6b30b2f542b

Comment by Gerrit Updater [ 04/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17351/
Subject: LU-6802 ptlrpc: reset imp_replay_cursor
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 78335a98654f7c52dbe820cf575d2f9f87a6f595

Comment by Joseph Gmitter (Inactive) [ 04/Dec/15 ]

Landed for 2.8
