[LU-15086] replay-dual: test_10: @@@@@@ FAIL: test_10 failed with 2 Created: 12/Oct/21  Updated: 20/Nov/21  Resolved: 20/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Vladimir Saveliev <vlaidimir.saveliev@hpe.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2f1edc30-7ddf-4a6c-bc7a-fffbf58435b8

This is relatively regular failure:

Error: 'test_10 failed with 2' 
Failure Rate: 6.00% of most recent 100 runs, 0 skipped (all branches)


 Comments   
Comment by Andreas Dilger [ 12/Oct/21 ]

It looks like this has been failing intermittently (once per 4-5 days on full sessions only) since at least 2021-01-01 (the earliest I checked). Since 2021-10-10 it has also failed on 4 review test sessions, and as such is failing more frequently (6x in 2 days). It would be worthwhile to check what changes landed on 10-10 that might have triggered this.

Comment by Alex Zhuravlev [ 20/Oct/21 ]

my bisection points to one patch:

COMMIT		TESTED	PASSED	 FAILED		COMMIT DESCRIPTION
8a770616a5	5	4	1	BAD	 LU-14797 sec: add projid to nodemap
bbfdc7c167	17	15	2	BAD	 LU-14739 quota: fix quota with root squash enabled
da1d93513f	7	6	1	BAD	 LU-14475 log: Rewrite some log messages
885b494632	18	17	1	BAD	 LU-12362 ptlrpc: use wait_woken() in ptlrpcd()
9ec5e2329b	90	90	0		 LU-6142 lod: return pools_hash_params to being static.
2a24b6ec67	90	90	0		 LU-14734 ldiskfs: improve message for large_dir
f7f31f8f96	90	90	0		 LU-12567 ptlrpc: handle reply and resend reorder
1a409a3e6a	90	90	0		 LU-14711 osc: Do not attempt sending empty pages

I think there are two potential issues in LU-12362:

-               if (timeout == 0)
-                       wait_event_idle(set->set_waitq,
-                                       ptlrpcd_check(&env, pc));
-               else if (wait_event_idle_timeout(set->set_waitq,
-                                                ptlrpcd_check(&env, pc),
-                                                cfs_time_seconds(timeout))
-                        == 0)
+
+               add_wait_queue(&set->set_waitq, &wait);
+               while (!ptlrpcd_check(&env, pc)) {
+                       int ret;
+
+                       if (timeout == 0)
+                               ret = wait_woken(&wait, TASK_IDLE,
+                                                MAX_SCHEDULE_TIMEOUT);
+                       else
+                               ret = wait_woken(&wait, TASK_IDLE,
+                                                cfs_time_seconds(timeout));
+                       if (ret != 0)
+                               continue;
+                       /* Timed out */
                        ptlrpc_expired_set(set);
+                       break;
+               }
+               remove_wait_queue(&set->set_waitq, &wait);
+

1) timeout is reused after spurious wakeup, so this can last longer than expected
2) ptlrpc_expired_set() can be called for the case when timeout=0

Comment by Alex Zhuravlev [ 20/Oct/21 ]

added a simple check to pltrpcd to catch too long waiting:

LustreError: 5019:0:(ptlrpcd.c:515:ptlrpcd()) ASSERTION( timeout == 0 || end - start < timeout + 4 ) failed: timeout 10, end 97, start 83, diff 14, count 3

i.e. 10 seconds wait was requested, but actually spent 14 seconds waiting, wait_woken() was called 3 times.

yet another interesting example from failed replay-dual:

LustreError: 5009:0:(ptlrpcd.c:513:ptlrpcd()) timeout 19, end 711, start 620, diff 91, count 9
Comment by Gerrit Updater [ 20/Oct/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45308
Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a495c00874105caa299fdf4dfaa482df9b24ad2e

Comment by Peter Jones [ 20/Oct/21 ]

Would we be better off reverting LU-12362 from master?

Comment by Alex Zhuravlev [ 20/Oct/21 ]

with the patch above I can't reproduce the problem. would like to hear Neil's opinion.

Comment by Gerrit Updater [ 20/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45308/
Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b8383035406a4b7bee2e6d8674eaef480b3e3b35

Comment by Peter Jones [ 20/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:15:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.