[LU-8582] Interop: master<->b2_8 - sanity test_255a: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.9.0
    • Environment: Interop: 2.8.0 EL7.2 Server/EL7.2 Client
      Server: b2_8, build# 12
      Client: master, build# 3431

    Description

      This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a3f680b4-7276-11e6-8afd-5254006e85c2.

      The sub-test test_255a failed with the following error:

      test failed to respond and timed out
      

      Client dmesg:

      [ 7909.928568] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
      [ 7916.983080] Lustre: 13272:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472870457/real 1472870457]  req@ffff880043d21b00 x1544415137983184/t0(0) o21->lustre-OST0003-osc-ffff880000b5c800@10.2.4.226@tcp:6/4 lens 504/432 e 0 to 1 dl 1472870464 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 7916.993670] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7917.002143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
      [ 7917.004970] Lustre: Skipped 1 previous similar message
      [ 7938.019143] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7938.030107] Lustre: Skipped 2 previous similar messages
      [ 7973.060145] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection to lustre-OST0003 (at 10.2.4.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7973.070547] Lustre: Skipped 10 previous similar messages
      [ 7987.075859] Lustre: lustre-OST0003-osc-ffff880000b5c800: Connection restored to 10.2.4.226@tcp (at 10.2.4.226@tcp)
      [ 7987.082573] Lustre: Skipped 15 previous similar messages
      [ 8040.574146] INFO: task lfs:24449 blocked for more than 120 seconds.
      [ 8040.579853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 8040.582026] lfs             D ffff88003cb47b10     0 24449  24265 0x00000080
      [ 8040.584167]  ffff88005462fbe0 0000000000000082 ffff880036479700 ffff88005462ffd8
      [ 8040.586392]  ffff88005462ffd8 ffff88005462ffd8 ffff880036479700 ffff8800378a5790
      [ 8040.588540]  ffff8800378a5798 7fffffffffffffff ffff880036479700 ffff88003cb47b10
      [ 8040.590683] Call Trace:
      [ 8040.592422]  [<ffffffff8163bc59>] schedule+0x29/0x70
      [ 8040.594318]  [<ffffffff81639949>] schedule_timeout+0x209/0x2d0
      [ 8040.596362]  [<ffffffffa0945c5c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
      [ 8040.598446]  [<ffffffffa0923c30>] ? lustre_swab_ladvise_hdr+0x40/0x40 [ptlrpc]
      [ 8040.600571]  [<ffffffff8163c026>] wait_for_completion+0x116/0x170
      [ 8040.602607]  [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
      [ 8040.604615]  [<ffffffffa0b4f3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
      [ 8040.606715]  [<ffffffffa0aeee68>] osc_io_ladvise_end+0x38/0x50 [osc]
      [ 8040.608765]  [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
      [ 8040.610725]  [<ffffffffa0b4f48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
      [ 8040.612715]  [<ffffffffa0b4f9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
      [ 8040.614647]  [<ffffffffa0b4fa96>] lov_io_end+0x36/0xb0 [lov]
      [ 8040.616502]  [<ffffffffa075a95d>] cl_io_end+0x5d/0x150 [obdclass]
      [ 8040.618371]  [<ffffffffa075d103>] cl_io_loop+0xb3/0x190 [obdclass]
      [ 8040.620279]  [<ffffffffa0bd1c29>] ll_file_ioctl+0x3069/0x37f0 [lustre]
      [ 8040.622168]  [<ffffffff811f2585>] do_vfs_ioctl+0x2e5/0x4c0
      [ 8040.624012]  [<ffffffff811fd057>] ? __alloc_fd+0xa7/0x130
      [ 8040.625813]  [<ffffffff811fd167>] ? __fd_install+0x47/0x60
      [ 8040.627680]  [<ffffffff811f2801>] SyS_ioctl+0xa1/0xc0
      [ 8040.629464]  [<ffffffff81646c49>] system_call_fastpath+0x16/0x1b
      

      OST console:

      19:43:40:[ 7912.118286] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 19:40:57 (1472870457)
      19:43:40:[ 7912.161108] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0002: no handlers for opcode 0x15
      19:43:40:[ 7912.166191] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 6 previous similar messages
      19:43:40:[ 7919.176071] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 7919.185035] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7926.185274] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 7926.189136] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7933.189463] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 7933.196824] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7940.211153] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7947.211321] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 7947.216921] Lustre: Skipped 1 previous similar message
      19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7954.228583] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 1 previous similar message
      19:43:40:[ 7968.227179] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 7968.233536] Lustre: Skipped 8 previous similar messages
      19:43:40:[ 7975.251445] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 7975.257113] LustreError: 23546:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 8 previous similar messages
      19:43:40:[ 8000.268676] Lustre: lustre-OST0000: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      19:43:40:[ 8000.275601] Lustre: Skipped 5 previous similar messages
      19:43:40:[ 8010.282726] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15
      19:43:40:[ 8010.287623] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) Skipped 10 previous similar messages
      19:43:40:[ 8066.316121] Lustre: lustre-OST0003: Client 95455275-3e21-dec7-8ff1-6b8c07c6d061 (at 10.2.4.219@tcp) reconnecting
      20:41:39:********** Timeout by autotest system **********
      

      This issue was first seen on 08/16/2016

    Activity

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51568/
            Subject: LU-8582 tests: skip sanity/905 for old OSTs
            Project: fs/lustre-release
            Branch: master
            Commit: 839b43515004828dec212ed0183fe51929c91b5b

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51568
            Subject: LU-8582 tests: skip sanity/905 for old OSTs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2ced1e0898aacd741c95c25d44350dfefa953853
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47761/
            Subject: LU-8582 target: send error reply on wrong opcode
            Project: fs/lustre-release
            Branch: master
            Commit: 566edb8c43e65ed4cb99614c7c753965bb201366

            gerrit Gerrit Updater added a comment -

            "Li Xi <lixi@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47761
            Subject: LU-8582 target: send error reply on wrong opcode
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b089334b46d19b5efe4e2197ffdca9f3dbc70916
            lixi_wc Li Xi added a comment - edited

            A fix patch is coming. If we apply the following debug change in osc_ladvise_base():

             - rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_LADVISE);
             + rc = ptlrpc_request_pack(req, LUSTRE_OST_VERSION, OST_QUOTACHECK);

            Without the fix patch, the ladvise command gets stuck there forever.

            With the fix patch, the command quits immediately:

            # lfs ladvise -a willread a
            lfs ladvise: cannot give advice: Operation not supported (95)
            ladvise: cannot give advice 'willread' to file 'a': Operation not supported

            adilger Andreas Dilger added a comment -

            We had a similar issue with fallocate - the bad opcode caused clients to retry forever. It would be better to fix the server to return -EOPNOTSUPP for bad opcodes (which will be returned to userspace), and also to allow clients to handle -EPROTO and abort instead of retrying.

            You are correct that wait_event_idle_interruptible() only allows users to interrupt the process (and should preferably be used in places where user applications are waiting), but that alone only helps with the next problem hit in this code. The two changes above should prevent any old/new client/server combination from getting stuck at all.
            lixi_wc Li Xi added a comment -

            Hmm, it seems that changing to wait_for_completion_interruptible() does not fix the problem. It just enables users to interrupt the process.

            The client was not able to get a reply, so it got stuck there and was eventually disconnected.

            When a new client connects to an old server, what is the mechanism for checking whether the server supports a feature? The client could add another OBD_CONNECT_* flag, e.g. OBD_CONNECT2_LADVISE, to check whether the feature is supported on the server. But is there any way for the server to send an error reply to an RPC with an unknown opcode?
            lixi_wc Li Xi added a comment -

            The Lustre 2.8 server has no ladvise support and thus cannot handle the ladvise RPC, so it complains:

            19:43:40:[ 7954.222110] LustreError: 12896:0:(tgt_handler.c:593:tgt_handler_find_check()) lustre-OST0003: no handlers for opcode 0x15

            Opcode 0x15 is OST_LADVISE.

            It seems no error message about the wrong-opcode RPC is sent back to the client, so the client hangs there waiting forever. Is there really no mechanism for sending an error reply back when an RPC has a wrong opcode? If so, adding any new opcode to the protocol might hit a similar problem.

            I don't think this will cause any problem for new ladvise types. LU-14380 should work fine when testing interop with old servers (as long as they are not < 2.9), but it is worth confirming that.

            I will push a patch to use wait_for_completion_interruptible() in osc_io_ladvise_end().

            adilger Andreas Dilger added a comment -

            This is going to become an issue with patch https://review.whamcloud.com/43170 "LU-14380 statahead: divide hash space evenly among the ranks", which adds a new "LU_LADVISE_STATAHEAD" method.

            The need for wait_for_completion_interruptible() in osc_io_ladvise_end() also still exists.

            People

              lixi_wc Li Xi
              maloo Maloo