[LU-8582] Interop: master<->b2_8 - sanity test_255a: test failed to respond and timed out Created: 06/Sep/16  Updated: 14/Jul/23  Resolved: 18/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Li Xi
Resolution: Fixed Votes: 0
Labels: MON
Environment:

Interop: 2.8.0 EL7.2 Server/EL7.2 Client
Server: b2_8, build# 12
Client: master, build# 3431


Issue Links:
Duplicate
Related
is related to LU-14139 batched statahead processing Resolved
is related to LU-14380 Make statahead better support Breadth... Open
is related to LU-8811 mdc_close() may return a freed reques... Resolved
is related to LU-9097 sanity test_253 test_255: ZFS list co... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a3f680b4-7276-11e6-8afd-5254006e85c2.

The sub-test test_255a failed with the following error:

test failed to respond and timed out

Client dmesg:

[ 7909.928568] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
[ 7916.983080] Lustre: 13272:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472870457/real 1472870457]  req@ffff880043d21b00 x1544415137983184/t0(0) o21->lustre-OST0003-osc-ffff880000b5c800@10.2.4.226@tcp:6/4 lens 504/432 e 0 to 1 dl 1472870464 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 7916.993670] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7917.002143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
[ 7917.004970] Lustre: Skipped 1 previous similar message
[ 7938.019143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7938.030107] Lustre: Skipped 2 previous similar messages
[ 7973.060145] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7973.070547] Lustre: Skipped 10 previous similar messages
[ 7987.075859] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
[ 7987.082573] Lustre: Skipped 15 previous similar messages
[ 8040.574146] INFO: task lfs:24449 blocked for more than 120 seconds.
[ 8040.579853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8040.582026] lfs             D ffff88003cb47b10     0 24449  24265 0x00000080
[ 8040.584167]  ffff88005462fbe0 0000000000000082 ffff880036479700 ffff88005462ffd8
[ 8040.586392]  ffff88005462ffd8 ffff88005462ffd8 ffff880036479700 ffff8800378a5790
[ 8040.588540]  ffff8800378a5798 7fffffffffffffff ffff880036479700 ffff88003cb47b10
[ 8040.590683] Call Trace:
[ 8040.592422]  [<ffffffff8163bc59>] schedule+0x29/0x70
[ 8040.594318]  [<ffffffff81639949>] schedule_timeout+0x209/0x2d0
[ 8040.596362]  [<ffffffffa0945c5c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 8040.598446]  [<ffffffffa0923c30>] ? lustre_swab_ladvise_hdr+0x40/0x40 [ptlrpc]
[ 8040.600571]  [<ffffffff8163c026>] wait_for_completion+0x116/0x170
[ 8040.602607]  [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
[ 8040.604615]  [<ffffffffa0b4f3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 8040.606715]  [<ffffffffa0aeee68>] osc_io_ladvise_end+0x38/0x50 [osc]
[ 8040.608765]  [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8040.610725]  [<ffffffffa0b4f48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 8040.612715]  [<ffffffffa0b4f9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 8040.614647]  [<ffffffffa0b4fa96>] lov_io_end+0x36/0xb0 [lov]
[ 8040.616502]  [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8040.618371]  [<ffffffffa075d103>] cl_io_loop+0xb3/0x190 [obdclass]
[ 8040.620279]  [<ffffffffa0bd1c29>] ll_file_ioctl+0x3069/0x37f0 [lustre]
[ 8040.622168]  [<ffffffff811f2585>] do_vfs_ioctl+0x2e5/0x4c0
[ 8040.624012]  [<ffffffff811fd057>] ? __alloc_fd+0xa7/0x130
[ 8040.625813]  [<ffffffff811fd167>] ? __fd_install+0x47/0x60
[ 8040.627680]  [<ffffffff811f2801>] SyS_ioctl+0xa1/0xc0
[ 8040.629464]  [<ffffffff81646c49>] system_call_fastpath+0x16/0x1b

OST console:

19:43:40:[ 7912.118286] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
19:43:40:[ 7912.161108] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0002: no handlers for opcode 0x15
19:43:40:[ 7912.166191] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 6 previous similar messages
19:43:40:[ 7919.176071] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7919.185035] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7926.185274] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7926.189136] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7933.189463] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7933.196824] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7940.211153] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7947.211321] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7947.216921] Lustre: Skipped 1 previous similar message
19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7954.228583] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 1 previous similar message
19:43:40:[ 7968.227179] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 7968.233536] Lustre: Skipped 8 previous similar messages
19:43:40:[ 7975.251445] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 7975.257113] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 8 previous similar messages
19:43:40:[ 8000.268676] Lustre: lustre-OST0000: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
19:43:40:[ 8000.275601] Lustre: Skipped 5 previous similar messages
19:43:40:[ 8010.282726] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
19:43:40:[ 8010.287623] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 10 previous similar messages
19:43:40:[ 8066.316121] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
20:41:39:********** Timeout by autotest system **********

This issue was first seen on 08/16/2016



 Comments   
Comment by Andreas Dilger [ 09/Feb/17 ]

There is a version check in this test that should avoid this interop issue:

        [ $(lustre_version_code ost1) -lt $(version_code 2.8.54) ] &&
                skip "lustre < 2.8.54 does not support ladvise " && return

If this problem is hit again, please investigate why the interop version check is not working.

For now I'm going to close this issue, since I believe it is fixed, and there is a different failure being hit on sanity test_255a that is being annotated as this one.

Comment by Andreas Dilger [ 10/Feb/17 ]

The version check is not at the start of the test, so some of the lfs ladvise code was still being run before the version was checked.

Comment by Andreas Dilger [ 10/Feb/17 ]

Li Xi, could you please take a look at this issue. While it is possible to skip this test during interop testing (which I will push a patch for), the real problem is that the -ENOTSUPP error is not being returned to lfs ladvise and instead the client is hanging forever. The same would be true for regular users if they have a new client with lfs ladvise support connecting to an older server, so this isn't just a testing problem.

Comment by Gerrit Updater [ 10/Feb/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25362
Subject: LU-8582 tests: skip sanity test_255a for interop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a5fa2a6f2f2ee7138d9bdd3560fd73dd8c6c040d

Comment by Gerrit Updater [ 09/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25362/
Subject: LU-8582 tests: skip sanity test_255a for interop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 35958f8c07beb7cff0763232c1aa16f0080ec38f

Comment by Peter Jones [ 09/Mar/17 ]

Closing again

Comment by Andreas Dilger [ 01/Dec/17 ]

This was not actually fixed; only the test was disabled.

Comment by Andreas Dilger [ 01/Dec/17 ]

James wrote:

For sanity test 255a, an ladvise test, we have

test_255a() {
	[ $(lustre_version_code ost1) -lt $(version_code 2.8.54) ] &&
		skip "lustre < 2.8.54 does not support ladvise " && return

We need to add the same check to test 255b.

Comment by Andreas Dilger [ 01/Dec/17 ]

Hi Li Xi,
could you please look into this problem. Running lfs ladvise with a new (2.9+) client causes it to hang when using an old (2.8-) server. It should just return an -EOPNOTSUPP error to the caller.

Comment by Andreas Dilger [ 01/Dec/17 ]

A quick look at the code shows it is stuck in wait_for_completion() in:

static void osc_io_ladvise_end(const struct lu_env *env, 
                               const struct cl_io_slice *slice)
{               
        struct cl_io            *io = slice->cis_io;
        struct osc_io           *oio = cl2osc_io(env, slice);
        struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
        int                      result = 0;
        struct cl_ladvise_io    *lio = &io->u.ci_ladvise;
        
        if ((!(lio->li_flags & LF_ASYNC)) && cbargs->opc_rpc_sent) {
                wait_for_completion(&cbargs->opc_sync);
                result = cbargs->opc_rc;
        }
        slice->cis_io->ci_result = result;
}

but my knowledge of CLIO isn't strong enough to know whether it is waiting there indefinitely, or just repeating this wait for each failed RPC that it sends. In any case, the OSS is returning -EOPNOTSUPP to the client, but the client is not handling the RPC error properly and returning it to the caller.

Also, it would seem to me that these handlers should be using wait_for_completion_interruptible() so that user processes can be killed if the server is not responsive. Ideally, there would be a wait_for_completion_io_interruptible() so that the time could be accounted as IO wait instead of just sleeping, but that doesn't seem to exist.
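
To make that concrete, here is a minimal sketch of what the interruptible wait could look like, applied directly to the osc_io_ladvise_end() snippet quoted above (an illustration only, not a reviewed patch; the handling of the interrupted case is an assumption):

static void osc_io_ladvise_end(const struct lu_env *env,
                               const struct cl_io_slice *slice)
{
        struct cl_io            *io = slice->cis_io;
        struct osc_io           *oio = cl2osc_io(env, slice);
        struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
        int                      result = 0;
        struct cl_ladvise_io    *lio = &io->u.ci_ladvise;

        if ((!(lio->li_flags & LF_ASYNC)) && cbargs->opc_rpc_sent) {
                /* interruptible wait: returns -ERESTARTSYS if a signal
                 * arrives, so the user process can be killed even when the
                 * server never replies (assumed error handling) */
                result = wait_for_completion_interruptible(&cbargs->opc_sync);
                if (result == 0)
                        result = cbargs->opc_rc;
        }
        slice->cis_io->ci_result = result;
}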

Comment by Andreas Dilger [ 21/Jan/22 ]

This is going to become an issue with patch https://review.whamcloud.com/43170 "LU-14380 statahead: divide hash space evenly among the ranks", which adds a new "LU_LADVISE_STATAHEAD" method.

The need for wait_for_completion_interruptible() in osc_io_ladvise_end() also still exists.

Comment by Li Xi [ 21/Jun/22 ]

The Lustre 2.8 server does not have ladvise support and thus is not able to handle the ladvise RPC, so it complains:

19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15

Opcode 0x15 is OST_LADVISE.

It seems that no error reply about the wrong-opcode RPC is sent back to the client, so the client hangs there waiting forever. Is there no mechanism for sending an error reply back when an RPC has a wrong opcode? If so, adding new opcodes to the protocol might hit similar problems.

I don't think this will cause any problem for new ladvise types. LU-14380 should work fine when testing interop with old servers (as long as they are not older than 2.9), but it is worth confirming that.

I will push a patch to use wait_for_completion_interruptible() in osc_io_ladvise_end().

Comment by Li Xi [ 21/Jun/22 ]

Hmm, it seems like changing to wait_for_completion_interruptible() does not fix the problem. It just enables users to interrupt the process.

The client was not able to get a reply, so it got stuck there and was eventually disconnected.

When a new client connects to an old server, what mechanisms are there to check whether the server supports a feature? The client could add another OBD_CONNECT_* flag, e.g. OBD_CONNECT2_LADVISE, to check whether the feature is supported on the server. But is there any way for the server to send an error reply to an RPC with an unknown opcode?
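
For reference, a hedged sketch of the connect-flag approach mentioned above. OBD_CONNECT2_LADVISE is a hypothetical flag taken from the sentence above (it does not exist in the tree); the import/field accessors are the ones normally used for such checks, but the exact placement is an assumption:

        /* hypothetical client-side feature check before building the RPC;
         * OBD_CONNECT2_LADVISE is an assumed flag, not an existing one */
        struct obd_import *imp = class_exp2cliimp(exp);

        if (!(imp->imp_connect_data.ocd_connect_flags2 & OBD_CONNECT2_LADVISE))
                RETURN(-EOPNOTSUPP);

With such a flag the client could fail fast with -EOPNOTSUPP instead of sending an RPC the server cannot handle, but it does not answer the second question of how the server should reply to an unknown opcode.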

Comment by Andreas Dilger [ 21/Jun/22 ]

We had a similar issue with fallocate - the bad opcode caused clients to retry forever. It would be better to fix the server to return -EOPNOTSUPP for bad opcodes (which will be returned to userspace), but also to allow clients to handle -EPROTO and abort instead of retrying.

You are correct that wait_event_idle_interruptible() only allows users to interrupt the process (and should preferably be used in places where user applications are waiting), but that will only help with the next problem that is hit in this code. The above two changes should prevent old/new client/server combinations from getting stuck at all.
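
For illustration only, a hedged sketch of the server-side direction described above (this is not the patch that eventually landed via https://review.whamcloud.com/47761; the surrounding variable and helper names are assumptions based on the log excerpt in the description):

        /* when the handler lookup fails for an unknown opcode, pack an
         * error-only reply so the client sees the failure instead of
         * resending the request until it times out */
        h = tgt_handler_find_check(req);
        if (IS_ERR(h)) {
                req->rq_status = PTR_ERR(h);    /* e.g. -ENOTSUPP */
                rc = ptlrpc_error(req);         /* send the error reply */
                RETURN(rc);
        }

On the client side the returned error would then surface through the ladvise I/O path and out to lfs ladvise, instead of the RPC being resent forever.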

Comment by Li Xi [ 24/Jun/22 ]

A fix patch is coming. If we apply the following debug change:

osc_ladvise_base()

 - rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_LADVISE);
 + rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_QUOTACHECK);
 

Without the fix patch, the ladvise command gets stuck there forever.

With the fix patch, the command quits immediately:

# lfs ladvise -a willread a
lfs ladvise: cannot give advice: Operation not supported (95)
ladvise: cannot give advice 'willread' to file 'a': Operation not supported

Comment by Gerrit Updater [ 24/Jun/22 ]

"Li Xi <lixi@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47761
Subject: LU-8582 target: send error reply on wrong opcode
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b089334b46d19b5efe4e2197ffdca9f3dbc70916

Comment by Gerrit Updater [ 18/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47761/
Subject: LU-8582 target: send error reply on wrong opcode
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 566edb8c43e65ed4cb99614c7c753965bb201366

Comment by Peter Jones [ 18/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 05/Jul/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51568
Subject: LU-8582 tests: skip sanity/905 for old OSTs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ced1e0898aacd741c95c25d44350dfefa953853

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51568/
Subject: LU-8582 tests: skip sanity/905 for old OSTs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 839b43515004828dec212ed0183fe51929c91b5b
