[LU-3187] Interop 2.3.64<->2.3/2.1: sanity test_180a: lustre-OST0000: client sent bad object 0x2ce2:2: rc = -EPROTO Created: 17/Apr/13  Updated: 20/May/13  Resolved: 13/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0, Lustre 2.1.6

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: MB
Environment:

server: lustre-master tag-2.3.64 build #1411
client: 2.3.0


Issue Links:
Related
is related to LU-2684 convert ost_id to lu_fid for FID_SEQ_... Resolved
Severity: 3
Rank (Obsolete): 7775

 Description   

Hit following error when running sanity subtest 180a:

client console:

Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 09:59:45 (1366304385)
Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304386/real 1366304386]  req@ffff88030bf2cc00 x1432673905016940/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0000_osc: Connection restored to lustre-OST0000 (at 10.10.4.134@tcp)
Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304398/real 1366304398]  req@ffff88030bf2cc00 x1432673905016959/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304410 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0000_osc: Connection restored to lustre-OST0000 (at 10.10.4.134@tcp)
Lustre: 26569:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1366304410/real 1366304410]  req@ffff88030bf2cc00 x1432673905016978/t0(0) o4->lustre-OST0000_osc@10.10.4.134@tcp:6/4 lens 488/448 e 0 to 1 dl 1366304422 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Lustre: lustre-OST0000_osc: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete

OST console:

Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 01:56:15 (1366188975)
LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO
LustreError: 16656:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.10.4.5@tcp) reconnecting
LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO
LustreError: 16656:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.10.4.5@tcp) reconnecting
LustreError: 16656:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0000: client 10.10.4.5@tcp sent bad object 0x2ce2:2: rc = -EPROTO


 Comments   
Comment by Keith Mannthey (Inactive) [ 17/Apr/13 ]

Do you have a link to more logs? Your output just show the start of 180a.

This was 2.3 master interop testing?

Comment by Sarah Liu [ 18/Apr/13 ]

The system just hang there, I just updated the client console logs, what other logs do you need?
This is hit in the interop testing between 2.3.0 client and master server

Comment by Sarah Liu [ 22/Apr/13 ]

Also hit the similar issue between 2.1.5 client and 2.4 server in bdfilter-survey

https://maloo.whamcloud.com/test_sets/777db7c8-a78b-11e2-b3cc-52540035b04c

OST console:

10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO
10:02:39:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages
10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
10:02:39:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages
10:07:50:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0
10:07:52:LustreError: 32677:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages
10:12:17:Lustre: lustre-OST0003: Client lustre-OST0003_osc_UUID (at 10.10.4.117@tcp) reconnecting
10:12:17:Lustre: Skipped 349 previous similar messages
10:12:40:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) lustre-OST0003: client 10.10.4.117@tcp sent bad object 0x49a62:2: rc = -EPROTO
10:12:41:LustreError: 31434:0:(ost_handler.c:123:ost_validate_obdo()) Skipped 2799 previous similar messages
10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) invalid object ids
10:12:41:LustreError: 31434:0:(ost_handler.c:2107:ost_io_hpreq_handler()) Skipped 2799 previous similar messages
10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0006: cli lustre-OST0006_osc_UUID/ffff88007ab45c00 left 4009984 < tot_grant 4750380 unstable 0 pending 0
10:17:50:LustreError: 32686:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 239 previous similar messages
Comment by Andreas Dilger [ 07/May/13 ]

This was introduced with the LU-2684 landing to swap o_id/o_seq with f_seq/f_oid.

Bumping to be a blocker until we decide whether interop of echo_client and obdfilter-survey is a requirement for 2.4.0. I don't think this could easily be fixed after 2.4.0 anymore.

Possible options for fixing this include:

  • increase LUSTRE_FID_INIT_OID to be "2", so that the first allocated OID is "3" and we can ensure that o_seq=2 is not f_oid=2.
  • check on the server for older clients, but this won't fix new clients on older servers
  • introduce a new OBD_CONNECT_OSTFID flag that decides what order the fields are in, with a fixup at the OST level so that lower layers of the code do not need to handle this everywhere. Sadly, the obvious OBD_CONNECT_FID flag is useless, since older clients send these to the OSTs even if they mean nothing.

If we don't consider this fix itself to be a blocker, it probably makes sense to have the clients and MDS send OBD_CONNECT_OSTFID and the servers accept this, so that it is at least possible to detect and fix this situation later.

Comment by Peter Jones [ 07/May/13 ]

Di

Could you please comment?

Thanks

Peter

Comment by Di Wang [ 08/May/13 ]

Hmm, echo client does not use OBD_CONNECT_FID yet, so we can still add OBD_CONNECT_FID for 2.4 echo client. Then ost_validate_obdo can tell new and old echo client. I will cook a patch.

Comment by Di Wang [ 08/May/13 ]

http://review.whamcloud.com/6287

Comment by Andreas Dilger [ 10/May/13 ]

Minor fixup for b2_1: http://review.whamcloud.com/6319

Comment by Jodi Levi (Inactive) [ 13/May/13 ]

Patch landed to master.

Generated at Sat Feb 10 01:31:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.