[LU-8582] Interop: master<->b2_8 - sanity test_255a: test failed to respond and timed out Created: 06/Sep/16 Updated: 14/Jul/23 Resolved: 18/Jul/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Li Xi |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MON |
| Environment: | Interop: 2.8.0 EL7.2 Server/EL7.2 Client |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a3f680b4-7276-11e6-8afd-5254006e85c2.

The sub-test test_255a failed with the following error:

test failed to respond and timed out

Client dmesg:

[ 7909.928568] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
[ 7916.983080] Lustre: 13272:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472870457/real 1472870457] req@ffff880043d21b00 x1544415137983184/t0(0) o21->lustre-OST0003-osc-ffff880000b5c800@10.2.4.226@tcp:6/4 lens 504/432 e 0 to 1 dl 1472870464 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 7916.993670] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7917.002143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
[ 7917.004970] Lustre: Skipped 1 previous similar message
[ 7938.019143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7938.030107] Lustre: Skipped 2 previous similar messages
[ 7973.060145] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7973.070547] Lustre: Skipped 10 previous similar messages
[ 7987.075859] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
[ 7987.082573] Lustre: Skipped 15 previous similar messages
[ 8040.574146] INFO: task lfs:24449 blocked for more than 120 seconds.
[ 8040.579853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8040.582026] lfs D ffff88003cb47b10 0 24449 24265 0x00000080
[ 8040.584167] ffff88005462fbe0 0000000000000082 ffff880036479700 ffff88005462ffd8
[ 8040.586392] ffff88005462ffd8 ffff88005462ffd8 ffff880036479700 ffff8800378a5790
[ 8040.588540] ffff8800378a5798 7fffffffffffffff ffff880036479700 ffff88003cb47b10
[ 8040.590683] Call Trace:
[ 8040.592422] [<ffffffff8163bc59>] schedule+0x29/0x70
[ 8040.594318] [<ffffffff81639949>] schedule_timeout+0x209/0x2d0
[ 8040.596362] [<ffffffffa0945c5c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 8040.598446] [<ffffffffa0923c30>] ? lustre_swab_ladvise_hdr+0x40/0x40 [ptlrpc]
[ 8040.600571] [<ffffffff8163c026>] wait_for_completion+0x116/0x170
[ 8040.602607] [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
[ 8040.604615] [<ffffffffa0b4f3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 8040.606715] [<ffffffffa0aeee68>] osc_io_ladvise_end+0x38/0x50 [osc]
[ 8040.608765] [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8040.610725] [<ffffffffa0b4f48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 8040.612715] [<ffffffffa0b4f9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 8040.614647] [<ffffffffa0b4fa96>] lov_io_end+0x36/0xb0 [lov]
[ 8040.616502] [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8040.618371] [<ffffffffa075d103>] cl_io_loop+0xb3/0x190 [obdclass]
[ 8040.620279] [<ffffffffa0bd1c29>] ll_file_ioctl+0x3069/0x37f0 [lustre]
[ 8040.622168] [<ffffffff811f2585>] do_vfs_ioctl+0x2e5/0x4c0
[ 8040.624012] [<ffffffff811fd057>] ? __alloc_fd+0xa7/0x130
[ 8040.625813] [<ffffffff811fd167>] ? __fd_install+0x47/0x60
[ 8040.627680] [<ffffffff811f2801>] SyS_ioctl+0xa1/0xc0
[ 8040.629464] [<ffffffff81646c49>] system_call_fastpath+0x16/0x1b

OST console:

19:43:40:[ 7912.118286] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
19:43:40:[ 7912.161108] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0002: no handlers for opcode 0x15
19:43:40:[ 7912.166191] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 6 previous similar messages
19:43:40:[ 7919.176071] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7919.185035] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7926.185274] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7926.189136] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7933.189463] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7933.196824] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7940.211153] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7947.211321] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7947.216921] Lustre: Skipped 1 previous similar message
19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7954.228583] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 1 previous similar message
19:43:40:[ 7968.227179] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7968.233536] Lustre: Skipped 8 previous similar messages
19:43:40:[ 7975.251445] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7975.257113] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 8 previous similar messages
19:43:40:[ 8000.268676] Lustre: lustre-OST0000: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 8000.275601] Lustre: Skipped 5 previous similar messages
19:43:40:[ 8010.282726] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 8010.287623] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 10 previous similar messages
19:43:40:[ 8066.316121] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
20:41:39:********** Timeout by autotest system **********

This issue was first seen on 08/16/2016 |
| Comments |
| Comment by Andreas Dilger [ 09/Feb/17 ] |
|
There is a version check in this test that should avoid this interop issue:

[ $(lustre_version_code ost1) -lt $(version_code 2.8.54) ] &&
	skip "lustre < 2.8.54 does not support ladvise " && return

If this problem is hit again, please investigate why the interop version check is not working. For now I'm going to close this issue, since I believe it is fixed, and there is a different failure being hit on sanity test_255a that is being annotated against this one. |
| Comment by Andreas Dilger [ 10/Feb/17 ] |
|
The version check is not at the start of the test, so some of the lfs ladvise code was still being run before the version was checked. |
| Comment by Andreas Dilger [ 10/Feb/17 ] |
|
Li Xi, could you please take a look at this issue. While it is possible to skip this test during interop testing (which I will push a patch for), the real problem is that the -ENOTSUPP error is not being returned to lfs ladvise and instead the client is hanging forever. The same would be true for regular users if they have a new client with lfs ladvise support connecting to an older server, so this isn't just a testing problem. |
| Comment by Gerrit Updater [ 10/Feb/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25362 |
| Comment by Gerrit Updater [ 09/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25362/ |
| Comment by Peter Jones [ 09/Mar/17 ] |
|
Closing again |
| Comment by Andreas Dilger [ 01/Dec/17 ] |
|
This was not actually fixed, just the test was disabled. |
| Comment by Andreas Dilger [ 01/Dec/17 ] |
|
James wrote:
|
| Comment by Andreas Dilger [ 01/Dec/17 ] |
|
Hi Li Xi, |
| Comment by Andreas Dilger [ 01/Dec/17 ] |
|
A quick look at the code shows it is stuck in wait_for_completion() in:

static void osc_io_ladvise_end(const struct lu_env *env,
			       const struct cl_io_slice *slice)
{
	struct cl_io *io = slice->cis_io;
	struct osc_io *oio = cl2osc_io(env, slice);
	struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
	int result = 0;
	struct cl_ladvise_io *lio = &io->u.ci_ladvise;

	if ((!(lio->li_flags & LF_ASYNC)) && cbargs->opc_rpc_sent) {
		wait_for_completion(&cbargs->opc_sync);
		result = cbargs->opc_rc;
	}
	slice->cis_io->ci_result = result;
}

but my knowledge of CLIO isn't strong enough to know whether it is waiting there indefinitely, or just repeating this wait for each failed RPC that it is sending. In any case, the OSS is returning -EOPNOTSUPP to the client, but the client is not handling the RPC error properly and returning it to userspace.

Also, it would seem to me that these handlers should be using wait_for_completion_interruptible() so that user processes can be killed if the server is not responsive. Ideally, there would be a wait_for_completion_io_interruptible() so that the time could be accounted as IO wait instead of just sleeping, but that doesn't seem to exist. |
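For reference, the interruptible wait suggested in the comment above would look roughly like the sketch below. This is an illustration only, not code from any landed patch; propagating the wait's -ERESTARTSYS return value as the IO result is an assumption.

	/* Sketch: a killable variant of the synchronous wait in
	 * osc_io_ladvise_end(), so a user process stuck behind an
	 * unresponsive server can be interrupted by a fatal signal. */
	if (!(lio->li_flags & LF_ASYNC) && cbargs->opc_rpc_sent) {
		int rc = wait_for_completion_interruptible(&cbargs->opc_sync);

		/* rc is -ERESTARTSYS if a signal arrived before completion */
		result = rc ? rc : cbargs->opc_rc;
	}
	slice->cis_io->ci_result = result;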
| Comment by Andreas Dilger [ 21/Jan/22 ] |
|
This is going to become an issue with patch https://review.whamcloud.com/43170 "LU-14380 statahead: divide hash space evenly among the ranks" adding a new "LU_LADVISE_STATAHEAD" method. The need for wait_for_completion_interruptible() in osc_io_ladvise_end() still exists also. |
| Comment by Li Xi [ 21/Jun/22 ] |
|
A Lustre 2.8 server does not have ladvise support and so cannot handle the ladvise RPC, which is why it complains:

19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15

Opcode 0x15 is OST_LADVISE. It seems no error reply about the unknown opcode is sent back to the client, so the client hangs there waiting forever. Is there really no mechanism for sending an error reply back when an RPC has an unknown opcode? If so, adding a new opcode to the protocol could hit a similar problem.

I don't think this will cause any problem for new ladvise types. LU-14380 should work fine when testing interop with old servers (as long as they are not older than 2.9), but it is worth confirming that. I will push a patch to use wait_for_completion_interruptible() in osc_io_ladvise_end(). |
| Comment by Li Xi [ 21/Jun/22 ] |
|
Hmm, it seems that changing to wait_for_completion_interruptible() does not fix the problem; it just lets users interrupt the stuck process. The client was not able to get a reply, so it was stuck there and got disconnected.

When a new client connects to an old server, what would be the mechanism for checking whether the server supports a feature? The client could add another OBD_CONNECT_* flag, e.g. OBD_CONNECT2_LADVISE, to check whether the feature is supported on the server. But is there any way for the server to send an error reply to an RPC with an unknown opcode? |
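For illustration, a client-side connect-flag gate of the kind described above might look like the following sketch. OBD_CONNECT2_LADVISE is the hypothetical flag named in the comment, and the helper itself is an assumption, not code from any patch.

/* Sketch: fail fast on the client when the server never advertised
 * ladvise support at connect time, instead of sending an RPC that the
 * server cannot handle. OBD_CONNECT2_LADVISE is hypothetical. */
static int osc_ladvise_server_check(struct obd_export *exp)
{
	if (!(exp_connect_flags2(exp) & OBD_CONNECT2_LADVISE))
		return -EOPNOTSUPP;

	return 0;
}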
| Comment by Andreas Dilger [ 21/Jun/22 ] |
|
We had a similar issue with fallocate - the bad opcode caused clients to retry forever. It would be better to fix the server to return -EOPNOTSUPP for bad opcodes (which will be returned to userspace), but also to allow clients to handle -EPROTO and abort instead of retrying. You are correct that wait_event_idle_interruptible() only allows users to interrupt the process (and should preferably be used anywhere user applications are waiting), but that will only help with the next problem hit in this code. The above two changes should prevent old/new client/server combinations from getting stuck at all. |
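To illustrate the first of those two changes (the actual fix landed later via https://review.whamcloud.com/47761), the server-side dispatch would conceptually do something like the sketch below when no handler exists for an opcode. The surrounding names and control flow are assumptions for illustration, not the merged code.

/* Sketch: if the lookup finds no handler for the request's opcode,
 * send an error reply (e.g. -EOPNOTSUPP) back to the client instead
 * of dropping the request, so the client can fail the operation
 * rather than resending forever. */
h = tgt_handler_find_check(req);
if (IS_ERR(h)) {
	req->rq_status = PTR_ERR(h);
	rc = ptlrpc_error(req);		/* pack and send the error reply */
	GOTO(out, rc);
}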
| Comment by Li Xi [ 24/Jun/22 ] |
|
A fix patch is coming. To reproduce, apply the following debug change in osc_ladvise_base() so the client sends an RPC opcode the server will not handle:

- rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_LADVISE);
+ rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_QUOTACHECK);

Without the fix patch:

With the fix patch, the command quits immediately:

# lfs ladvise -a willread a
lfs ladvise: cannot give advice: Operation not supported (95)
ladvise: cannot give advice 'willread' to file 'a': Operation not supported |
| Comment by Gerrit Updater [ 24/Jun/22 ] |
|
"Li Xi <lixi@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47761 |
| Comment by Gerrit Updater [ 18/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47761/ |
| Comment by Peter Jones [ 18/Jul/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 05/Jul/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51568 |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51568/ |