Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3187

Interop 2.3.64<->2.3/2.1: sanity test_180a: lustre-OST0000: client sent bad object 0x2ce2:2: rc = -EPROTO

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0, Lustre 2.1.6
    • Lustre 2.4.0
    • server: lustre-master tag-2.3.64 build #1411
      client: 2.3.0
    • 3
    • 7775

    Description

      Hit following error when running sanity subtest 180a:

      client console:

      Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 09:59:45 (1366304385)
      Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304386/real 1366304386]  req@ffff88030bf2cc00 x1432673905016940/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: lustre-OST0000_osc: Connection restored to lustre-OST0000 (at 10.10.4.134@tcp)
      Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304398/real 1366304398]  req@ffff88030bf2cc00 x1432673905016959/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304410 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: lustre-OST0000_osc: Connection restored to lustre-OST0000 (at 10.10.4.134@tcp)
      Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304410/real 1366304410]  req@ffff88030bf2cc00 x1432673905016978/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304422 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
      

      OST console:

      Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 01:56:15 (1366188975)
      LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO
      LustreError: 16656:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
      Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.10.4.5@tcp) reconnecting
      LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO
      LustreError: 16656:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
      Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.10.4.5@tcp) reconnecting
      LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO
      

      Attachments

        Issue Links

          Activity

            [LU-3187] Interop 2.3.64<->2.3/2.1: sanity test_180a: lustre-OST0000: client sent bad object 0x2ce2:2: rc = -EPROTO

            Patch landed to master.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to master.
            adilger Andreas Dilger added a comment - Minor fixup for b2_1: http://review.whamcloud.com/6319
            di.wang Di Wang added a comment - http://review.whamcloud.com/6287
            di.wang Di Wang added a comment -

            Hmm, echo client does not use OBD_CONNECT_FID yet, so we can still add OBD_CONNECT_FID for 2.4 echo client. Then ost_validate_obdo can tell new and old echo client. I will cook a patch.

            di.wang Di Wang added a comment - Hmm, echo client does not use OBD_CONNECT_FID yet, so we can still add OBD_CONNECT_FID for 2.4 echo client. Then ost_validate_obdo can tell new and old echo client. I will cook a patch.
            pjones Peter Jones added a comment -

            Di

            Could you please comment?

            Thanks

            Peter

            pjones Peter Jones added a comment - Di Could you please comment? Thanks Peter
            adilger Andreas Dilger added a comment - - edited

            This was introduced with the LU-2684 landing to swap o_id/o_seq with f_seq/f_oid.

            Bumping to be a blocker until we decide whether interop of echo_client and obdfilter-survey is a requirement for 2.4.0. I don't think this could easily be fixed after 2.4.0 anymore.

            Possible options for fixing this include:

            • increase LUSTRE_FID_INIT_OID to be "2", so that the first allocated OID is "3" and we can ensure that o_seq=2 is not f_oid=2.
            • check on the server for older clients, but this won't fix new clients on older servers
            • introduce a new OBD_CONNECT_OSTFID flag that decides what order the fields are in, with a fixup at the OST level so that lower layers of the code do not need to handle this everywhere. Sadly, the obvious OBD_CONNECT_FID flag is useless, since older clients send these to the OSTs even if they mean nothing.

            If we don't consider this fix itself to be a blocker, it probably makes sense to have the clients and MDS send OBD_CONNECT_OSTFID and the servers accept this, so that it is at least possible to detect and fix this situation later.

            adilger Andreas Dilger added a comment - - edited This was introduced with the LU-2684 landing to swap o_id/o_seq with f_seq/f_oid. Bumping to be a blocker until we decide whether interop of echo_client and obdfilter-survey is a requirement for 2.4.0. I don't think this could easily be fixed after 2.4.0 anymore. Possible options for fixing this include: increase LUSTRE_FID_INIT_OID to be "2", so that the first allocated OID is "3" and we can ensure that o_seq=2 is not f_oid=2. check on the server for older clients, but this won't fix new clients on older servers introduce a new OBD_CONNECT_OSTFID flag that decides what order the fields are in, with a fixup at the OST level so that lower layers of the code do not need to handle this everywhere. Sadly, the obvious OBD_CONNECT_FID flag is useless, since older clients send these to the OSTs even if they mean nothing. If we don't consider this fix itself to be a blocker, it probably makes sense to have the clients and MDS send OBD_CONNECT_OSTFID and the servers accept this, so that it is at least possible to detect and fix this situation later.
            sarah Sarah Liu added a comment -

            Also hit the similar issue between 2.1.5 client and 2.4 server in bdfilter-survey

            https://maloo.whamcloud.com/test_sets/777db7c8-a78b-11e2-b3cc-52540035b04c

            OST console:

            10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO
            10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages
            10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
            10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages
            10:07:50:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0
            10:07:52:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages
            10:12:17:Lustre: lustre-OST0003: Client lustre-OST0003_osc_UUID (at 10.10.4.117@tcp) reconnecting
            10:12:17:Lustre: Skipped 349 previous similar messages
            10:12:40:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO
            10:12:41:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages
            10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
            10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages
            10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0
            10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages
            
            sarah Sarah Liu added a comment - Also hit the similar issue between 2.1.5 client and 2.4 server in bdfilter-survey https://maloo.whamcloud.com/test_sets/777db7c8-a78b-11e2-b3cc-52540035b04c OST console: 10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO 10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages 10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids 10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages 10:07:50:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0 10:07:52:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages 10:12:17:Lustre: lustre-OST0003: Client lustre-OST0003_osc_UUID (at 10.10.4.117@tcp) reconnecting 10:12:17:Lustre: Skipped 349 previous similar messages 10:12:40:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO 10:12:41:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages 10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids 10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages 10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0 10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages
            sarah Sarah Liu added a comment -

            The system just hang there, I just updated the client console logs, what other logs do you need?
            This is hit in the interop testing between 2.3.0 client and master server

            sarah Sarah Liu added a comment - The system just hang there, I just updated the client console logs, what other logs do you need? This is hit in the interop testing between 2.3.0 client and master server

            Do you have a link to more logs? Your output just show the start of 180a.

            This was 2.3 master interop testing?

            keith Keith Mannthey (Inactive) added a comment - Do you have a link to more logs? Your output just show the start of 180a. This was 2.3 master interop testing?

            People

              di.wang Di Wang
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: