[LU-769] Test failure on test suite conf-sanity, subtest test_23a Created: 18/Oct/11 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.1.1, Lustre 1.8.7 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4851 |
| Description |
|
This issue was created by maloo for jay <jay@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/0d83cd4e-f936-11e0-a451-52540025f9af. The sub-test test_23a failed with the following error:
Info required for matching: conf-sanity 23a |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 18/Oct/11 ] |
|
I took a look at this bug. I think the root cause of this issue is in ptlrpc_set_wait(), where a pending signal can end up never being handled: if (timeout == 0 && !cfs_signal_pending())
/*
* No requests are in-flight (ether timed out
* or delayed), so we can allow interrupts.
* We still want to block for a limited time,
* so we allow interrupts during the timeout.
*/
lwi = LWI_TIMEOUT_INTR_ALL(cfs_time_seconds(1),
ptlrpc_expired_set,
ptlrpc_interrupted_set, set);
else
/*
* At least one request is in flight, so no
* interrupts are allowed. Wait until all
* complete, or an in-flight req times out.
*/
lwi = LWI_TIMEOUT(cfs_time_seconds(timeout? timeout : 1),
ptlrpc_expired_set, set);
rc = l_wait_event(set->set_waitq, ptlrpc_check_set(NULL, set), &lwi);
So if there is already a signal pending, it takes the non-interruptible LWI_TIMEOUT branch, and that signal will never be handled. |
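For illustration, a minimal userspace sketch of that branch selection (this is not Lustre code; choose_wait_mode() and the boolean stand-in for cfs_signal_pending() are invented for the example):
#include <stdbool.h>
#include <stdio.h>

enum wait_mode { WAIT_TIMEOUT_INTR, WAIT_TIMEOUT_NOINTR };

/* Mirrors: if (timeout == 0 && !cfs_signal_pending()) ... else ... */
static enum wait_mode choose_wait_mode(int timeout, bool signal_pending)
{
        if (timeout == 0 && !signal_pending)
                return WAIT_TIMEOUT_INTR;    /* interruptible 1s wait */
        return WAIT_TIMEOUT_NOINTR;          /* non-interruptible wait */
}

int main(void)
{
        /* Failing case: nothing in flight (timeout == 0) but a signal is
         * already pending, so the non-interruptible branch is chosen and
         * the pending signal is never acted upon. */
        printf("%s\n", choose_wait_mode(0, true) == WAIT_TIMEOUT_INTR ?
               "interruptible" : "non-interruptible");
        return 0;
}
With timeout == 0 and a pending signal, the non-interruptible mode is chosen, matching the behaviour described above. |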
| Comment by nasf (Inactive) [ 18/Oct/11 ] |
|
Another failure instance: https://maloo.whamcloud.com/sub_tests/06ffdd26-f6db-11e0-a451-52540025f9af |
| Comment by Peter Jones [ 18/Oct/11 ] |
|
Bobijam, this looks similar to work that you did a while back for LU-346. Could you please look into whether Jinshan's theory is correct and how we can eliminate these autotest failures? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 19/Oct/11 ] |
|
I don't quite understand this piece of code. From the code, it looks like RPCs in RQ_PHASE_RPC/BULK/NEW state are considered in-flight, and if there are any in-flight requests in a set, interrupts are not allowed. However, in ptlrpc_interrupted_set(), only RPCs in RQ_PHASE_RPC/UNREG are allowed to be marked interrupted. These two are definitely not in sync, not to mention that the original implementation allows interrupts but excludes the cfs_signal_pending() case. I tend to think ptlrpc experts should be involved to explain it, and the comment should be fixed as well. |
| Comment by Jinshan Xiong (Inactive) [ 19/Oct/11 ] |
|
Anyway, I made a patch at: http://review.whamcloud.com/#change,1543 |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
I think the problem was introduced by the patch from bug 11710, commit 2957787997522e3967176903dd7c6f1e04964dc4. There we started counting unsent RPCs as in-flight, and of course that breaks everything. I'll try to submit a patch that reverts this bit. |
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/11 ] |
|
Maybe I missed something, but I'd appreciate it if you could point out where I went wrong. I used this run: https://maloo.whamcloud.com/sub_tests/06ffdd26-f6db-11e0-a451-52540025f9af, and this client debug log: https://maloo.whamcloud.com/test_logs/6819eac0-f6db-11e0-a451-52540025f9af/show_text
00000100:00000001:0.0:1318632256.053081:0:17090:0:(client.c:1336:ptlrpc_send_new_req()) Process entered
00000100:00000040:0.0:1318632256.053082:0:17090:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff81006b07b400 x1382686103961631/t0(0) o-1->lustre-MDT0000_UUID@10.10.4.186@tcp:12/10 lens 192/336 e 0 to 0 dl 0 ref 2 fl New:/ffffffff/ffffffff rc 0/-1
00000100:00000001:0.0:1318632256.053086:0:17090:0:(client.c:1033:ptlrpc_import_delay_req()) Process entered
00000100:00000001:0.0:1318632256.053087:0:17090:0:(client.c:1077:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1)
00000100:00080000:0.0:1318632256.053088:0:17090:0:(client.c:1357:ptlrpc_send_new_req()) @@@ req from PID 17090 waiting for recovery: (FULL != CONNECTING) req@ffff81006b07b400 x1382686103961631/t0(0) o-1->lustre-MDT0000_UUID@10.10.4.186@tcp:12/10 lens 192/336 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
00000100:00000001:0.0:1318632256.053092:0:17090:0:(client.c:1362:ptlrpc_send_new_req()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1318632256.053093:0:17090:0:(client.c:1924:ptlrpc_set_next_timeout()) Process entered
00000100:00000001:0.0:1318632256.053093:0:17090:0:(client.c:1961:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0)
00000100:00100000:0.0:1318632256.053094:0:17090:0:(client.c:1993:ptlrpc_set_wait()) set ffff810075732cc0 going to sleep for 0 seconds
00000100:00000001:0.0:1318632256.053096:0:17090:0:(client.c:1416:ptlrpc_check_set()) Process entered
00000100:00000001:0.0:1318632256.053097:0:17090:0:(client.c:2224:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:0.0:1318632256.053098:0:17090:0:(client.c:1033:ptlrpc_import_delay_req()) Process entered
00000100:00000001:0.0:1318632256.053099:0:17090:0:(client.c:1077:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:0.0:1318632256.053100:0:17090:0:(client.c:1747:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1318632256.053101:0:17090:0:(client.c:1416:ptlrpc_check_set()) Process entered
00000100:00000001:0.0:1318632256.053102:0:17090:0:(client.c:2224:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:0.0:1318632256.053103:0:17090:0:(client.c:1033:ptlrpc_import_delay_req()) Process entered
00000100:00000001:0.0:1318632256.053104:0:17090:0:(client.c:1077:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:0.0:1318632256.053106:0:17090:0:(client.c:1747:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0)
When the process is doing statfs(), it finds that the connection state is CONNECTING, so the request will be delayed and ptlrpc_request::rq_waiting will be set to 1. Then let's take a look at ptlrpc_set_next_timeout():
/*
* Request in-flight?
*/
if (!(((req->rq_phase == RQ_PHASE_RPC) && !req->rq_waiting) ||
(req->rq_phase == RQ_PHASE_BULK) ||
(req->rq_phase == RQ_PHASE_NEW)))
continue;
The current state of this RPC is RQ_PHASE_RPC with rq_waiting set to 1, so this statement is true and we continue; as a result, ptlrpc_set_next_timeout() returns 0. Now take a look at ptlrpc_set_wait(), which goes into: if (timeout == 0 && !cfs_signal_pending())
/*
* No requests are in-flight (ether timed out
* or delayed), so we can allow interrupts.
* We still want to block for a limited time,
* so we allow interrupts during the timeout.
*/
lwi = LWI_TIMEOUT_INTR_ALL(cfs_time_seconds(1),
ptlrpc_expired_set,
ptlrpc_interrupted_set, set);
else
/*
* At least one request is in flight, so no
* interrupts are allowed. Wait until all
* complete, or an in-flight req times out.
*/
lwi = LWI_TIMEOUT(cfs_time_seconds(timeout? timeout : 1),
ptlrpc_expired_set, set);
If there is a pending signal, it will go to the 2nd branch and then go bust. I'm afraid you cannot fix this problem even if you revert 2957787997522e3967176903dd7c6f1e04964dc4: /* request in-flight? */
if (!((req->rq_phase == RQ_PHASE_RPC && !req->rq_waiting) ||
- (req->rq_phase == RQ_PHASE_BULK)))
+ (req->rq_phase == RQ_PHASE_BULK) ||
+ (req->rq_phase == RQ_PHASE_NEW)))
continue;
That change only made RQ_PHASE_NEW requests also count as in flight; though I don't agree with it, it has nothing to do with our problem. |
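To make the sequence above concrete, here is a small userspace sketch (again not Lustre code; struct fake_req, req_in_flight() and the arbitrary "10" timeout are invented, and only mimic the rq_phase/rq_waiting fields quoted above):
#include <stdbool.h>
#include <stdio.h>

enum rq_phase { RQ_PHASE_NEW, RQ_PHASE_RPC, RQ_PHASE_BULK };

struct fake_req {
        enum rq_phase rq_phase;
        bool          rq_waiting;
};

/* Mirrors the "request in-flight?" test in ptlrpc_set_next_timeout(). */
static bool req_in_flight(const struct fake_req *req)
{
        return (req->rq_phase == RQ_PHASE_RPC && !req->rq_waiting) ||
               req->rq_phase == RQ_PHASE_BULK ||
               req->rq_phase == RQ_PHASE_NEW;
}

int main(void)
{
        /* The statfs() RPC delayed by recovery: phase RPC, rq_waiting == 1,
         * so it is not counted as in-flight and the set timeout stays 0. */
        struct fake_req req = { .rq_phase = RQ_PHASE_RPC, .rq_waiting = true };
        bool signal_pending = true;          /* the mounting process was signalled */
        int timeout = req_in_flight(&req) ? 10 : 0;

        /* Same branch choice as in ptlrpc_set_wait(). */
        if (timeout == 0 && !signal_pending)
                printf("interruptible 1s wait\n");
        else
                printf("non-interruptible wait, pending signal ignored\n");
        return 0;
}
Even with timeout == 0, the pending signal pushes the wait into the non-interruptible branch, so reverting that commit alone would not help here. |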
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/11 ] |
|
I tend to think this piece of logic should be rebuilt overall. I understand every single line, but have no idea what it all means when put together. |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
Well, the case that you indicate involves a reconnect in progress. So when the connect fails, another one will be filed by the pinger soon; that's how everything will get unstuck, I think. |
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/11 ] |
|
Well, in this test case it's impossible for the process to mount the mountpoint, because the recovery window on the MDT will not close. So what we're testing is the ability of the process to error out in this case. BTW, what's your expectation here? |
| Comment by nasf (Inactive) [ 05/Nov/11 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/7b1c412a-0803-11e1-b0d9-52540025f9af |
| Comment by Oleg Drokin [ 06/Nov/11 ] |
|
I have an alternative approach here: http://review.whamcloud.com/1657 and it holds up well in my local testing. |
| Comment by Build Master (Inactive) [ 11/Nov/11 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Jinshan Xiong (Inactive) [ 03/Jan/12 ] |
|
This issue appeared again at: https://maloo.whamcloud.com/test_sets/d093ea76-2d11-11e1-ab2e-5254004bbbd3 |
| Comment by Jinshan Xiong (Inactive) [ 04/Jan/12 ] |
|
My patch caused this problem. Sorry for the noise. |
| Comment by Jian Yu [ 13/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC2. The same issue occurred: https://maloo.whamcloud.com/test_sets/d92dbaba-55cd-11e1-9aa8-5254004bbbd3 |
| Comment by Peter Jones [ 26/Feb/12 ] |
|
Landed for 2.1.1 and 2.2 |
| Comment by Keith Mannthey (Inactive) [ 10/Jan/13 ] |
|
Sorry to bring such an old issue back to life again, but this was just seen on 1.8: https://maloo.whamcloud.com/test_sets/66e4cc86-fa48-11e0-bbc0-52540025f9af Error: 'MOUNT_PID 29076 and MOUNT_LUSTRE_PID 29077 still not killed in 30 secs', along with other related errors. |
| Comment by Keith Mannthey (Inactive) [ 10/Jan/13 ] |
|
This error was seen on 1.8 testing. |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old ticket for an unsupported version. |