[LU-521] lnet-selftest add_test failure Created: 20/Jul/11  Updated: 24/Oct/12  Resolved: 24/Oct/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL6.1 between a PPC64 node and an x86_64 node, QDR infiniband, o2iblnd


Attachments: File ion.log     File server1.log    
Severity: 3
Rank (Obsolete): 4453

 Description   

Running the simple lnet-selftest script below, the "lst add_test" line fails with:

add test RPC failed on 12345-172.20.203.24@o2ib1: Unknown error 18446744073709551506

I am the script on the "server1" node, which is an x86_64 architecture RHEL6.1 system. The "ion" node is a ppc64 architecture with RHEL6.1. Note that ppc64 has a 64k page size now by default.

server1 has this message on the console:

LustreError: 21942:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.20.203.24@o2ib1, match 11222407321460450620 length 65536 too big: 4096 left, 4096 allowed

and the ion has this on the console:

LustreError: 14210:0:(framework.c:1298:sfw_bulk_ready()) Bulk transfer failed for RPC: service test service, peer 12345-172.20.250.1@o2ib1, status -61

I have attached lustre kernel logs with "+ net rpctrace" added.

Script:

lst new_session read/write
lst add_group ion 172.20.203.24@o2ib1
lst add_group server1 172.20.250.1@o2ib1

lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 16 --from ion --to server1 brw write size=1M
lst run bulk_rw
lst stat ion & sleep 30; kill $!
lst end_session


 Comments   
Comment by Peter Jones [ 21/Jul/11 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Peter Jones [ 25/Jul/11 ]

Lai

Could you please provide an update on this ticket?

Thanks

Peter

Comment by Liang Zhen (Inactive) [ 25/Jul/11 ]

I think LST is assuming PAGE_SIZE on both side are same, and this should be reason we got this failure.

Comment by Christopher Morrone [ 01/Aug/11 ]

Where is that assumption made, exactly? I've started looking at the code, but I am not at all clear about what is going on.

An lst write from ion to server fails, but lst write from server to ion test works fine.

Digging into it a bit, it looks like the ion tries to do srpc_post_active_rdma(), where the "len" parameter is 1. It is doing an LNetGet to portal 52 (SRPC_RDMA_PORTAL), and sets up a KIOV with only a single 64K page.

It looks like a single page is the smallest that CAN be set up, is that correct? So is the receiver (server1) setting up the md on its side with only 1 page incorrectly somewhere? Does it just need to allocate a larger buffer to support peers with larger page size?

Comment by Liang Zhen (Inactive) [ 08/Aug/11 ]

Chris, I think you are right, LST assume a single page is the smallest that can be set up, so for the add_test command, the console will setup one 4K page to transfer servers' NIDs to the test client, but the test client has 64K page size and it will try to LNetGet one(64K) page from the console, so we will see this mismatch error on the console:
(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.20.203.24@o2ib1, match 11222407321460450620 length 65536 too big: 4096 left, 4096 allowed

I'm afraid this is a design/protocol bug and it's not just in add_test command, it will impact on bulk_test too.

  • add_test command RPC
    I think we can fix it without protocol change, because we do transfer srpc_test_reqst_t::tsr_ndest so we can calculate real buffer size on test client
  • bulk test
    Unfortunately, console wouldn't send buffer size to remote side, instead it just sends number of pages to test client/server (test_bulk_req_t), . I think there is no easy fix for this, because we don't even have protocol handshake in selftest so we will get compatibility issue if we changed wired-data.
    I think one option is we can create a new BRW test case (i.e: advanced brw), so we can define a new frame RPC for it (instead of), this is a meaningful thing to have, since long time ago we planed to have advanced brw_test to support not-page-aligned brw, so we can fix this issue in advanced brw.
    Of course, this will require some other changes to selftest/rpc.c to make some functions to support not-aligned bulks.

We will need a little more time to estimate the efforts for fixing both issues.

Comment by Peter Jones [ 09/Aug/11 ]

Liang is going to finish this one off.

Comment by Liang Zhen (Inactive) [ 29/Aug/11 ]

I plan to post the patch by end of last week, but I didn't finish it, though I have done most code but need a few more days to code review and do some local tests.

It's a little more complex than I thought. I originally want to fix this by adding a new command but it's more like a workaround so I made a more complete patch which can support multiple session versions, here is the description:

  • we use one of those reserved bytes in srpc_msg_t to transfer "session version number" between nodes
  • session version of unpatched LST is V0
  • session version of patched LST is V1
  • user can set preferred session version on console by env LST_VERSION, this version number will be sent to test nodes while running "add_group" command on console
  • remote test node can accept / deny this session version:
    • unpatched test node (V0) will create a session with version 0 because it doesn't aware existence of "session version" in message
    • patched test node (V1) will create a session with version information from console
      • unpatched console (V0) always set 0 for the "session version" of srpc_msg_t (reserved __u32)
      • patched console (V1) set "session version" of srpc_msg_t to number specified by user (env LST_SESSION), it's 1 by default
  • remote test node will send back version information back to console
  • console can learn about real session version number from the reply
    • session created by V0 console is always V0
    • session created by V1 console is V1 if all test nodes are V1
    • session created by V1 console is V0 if all test nodes are V0
    • V1 console will complain to user if there are mixed version numbers on test nodes, i.e: some of them are V0 and some of them are V1, in this case, user should set LST_VERSION=0 and retry.
Comment by Christopher Morrone [ 30/Aug/11 ]

Sounds good. Thank you for the update!

Comment by Liang Zhen (Inactive) [ 05/Sep/11 ]

Patch is ready for review: http://review.whamcloud.com/#change,1338

Comment by James A Simmons [ 21/Sep/11 ]

LU-445 wanted to add a timestamp to sfw_session_t. This change looks like we could add the timestamp support safely since you can test for different session version.

Comment by Isaac Huang (Inactive) [ 10/Jan/12 ]

Liang, I think your design for LST version compatibility would work. But I'd rather not to use the msg_reserved[0-2] fields unless there's no alternative. We already have the msg_version field that carries a version #. I think alternatively it could be done via msg_version:

1. Console has newer LST version: If console sends a make-session command to a older test node, upon lst add_group. The test node would reply with reply->status == EPROTO, and msg_version set to its older version #. Now, the console knows about the test node version. Then the console can just resend the make-session command with msg_version set to the test node's older version. And everything else goes on: test node believes the console has a same version, console knows about test node version and how to cope with it.

2. Test node:

  • When console sends me a make-session command with a lower version #, I just reply using console's version and behave like that version.
  • When console sends me a make-session command with a higher version #, I reply with EPROTO and my own version #, to tell the console what my version is.

I haven't gone through all the details, but it seems like this way the version negotiation could be done without using any new message fields. What do you think?

BTW, as to setting a preferred session version on console, I don't see why that's necessary if we have version negotiation implemented. If it's needed for some reason, it appears to me that it should be an option to the new_session command, rather that an environment variable.

Comment by Liang Zhen (Inactive) [ 11/Jan/12 ]

Isaac, I didn't use srpc_msg_t::msg_version because message version is something we check/set in RPC module, not test-framework module, so test-node shouldn't change RPC version for test-framework command make_session. I think msg_version can be used only if we are going to change header of srpc_msg_t, session_version is kind of thing belongs to upper layer and could have more chance to be changed than msg_version. I admit putting session_version in header of srpc_msg_t is not very clean, but we can minimize changes by doing this.

Also, I didn't implement resend in console, because it will make the patch more complex, we can just let administrator to set version and retry or make small change to scripts, which should be very rare and easy.

I don't have strong reason to use environment variable instead of option of new_session, but just because:

  • session_key is env variable already, it's make me feel consistent to use env for session_version as well
  • if the console is new version and test nodes are old, we don't have to set version again and again while make sessions.
Comment by Christopher Morrone [ 17/Jan/12 ]

We need a final ruling on this very soon. Some BGQ hardware is now on the floor, and I will need to start testing it soon.

Comment by Liang Zhen (Inactive) [ 19/Jan/12 ]

If Isaac didn't object my previous comment, I tend to keep our current implementation. I just made some small fixes due to suggestions from Andreas. patch is still here: http://review.whamcloud.com/#change,1338
I will test framework version negotiation again after autotest.

Because Andreas suggested to use old version as default value, so you will need to set env variable LST_VERSION=1 on console node before testing.

Comment by Andreas Dilger [ 19/Jan/12 ]

Just an idea that came to mind - if the protocol change only depends on the number of pages vs. the number of bytes, it would be possible to handle this in a forward compatible manner by assuming values below 4096 are page counts, and values 4096+ are byte counts. This could be added to 2.2.0 and 2.1.1 still, and would allow some future version to auto-detect which format is being used, and eventually start using byte counts in an interoperable manner.

Comment by Liang Zhen (Inactive) [ 19/Jan/12 ]

Andreas, I'm thinking about adding some new command/advanced testing options for long time, so we would need this anyway and I think it's a good time for adding version compatibility code.

Comment by Brian Behlendorf [ 08/May/12 ]

We've been testing this patch for severals months now can be please come to some conclusion of whether it will be merged in its current form or not.

Comment by Liang Zhen (Inactive) [ 14/May/12 ]

Brian, good to hear this patch can work fine, I will land it to 2.3.

Liang

Comment by Christopher Morrone [ 14/May/12 ]

We've had the patch in our tree, I don't think that anyone has actually tested it though.

Comment by Ned Bass [ 18/Jun/12 ]

We still see failures running lnet selftest between x86_64 and ppc64 with this patch. lst add_group produces this:

# lst add_group servers 172.21.1.201@o2ib200
create session RPC failed on 12345-172.21.1.201@o2ib200: Unknown error 18446744073709551542

and the following console message:

Lustre: 2735:0:(rpc.c:1102:srpc_send_rpc()) Bad message from 12345-172.21.1.201@o2ib200: type 1 (1 expected), magic
233877742 (-290394099 expected).

Note that it appears the received magic value was not correctly byte-swapped:

-bash-4.1# printf "%016x\n%016x\n" -290394099 233877742
ffffffffeeb0f00d
000000000df0b0ee

Here is a table of the combinations for which we have or have not seen the error.

lst node arch lst node lustre version remote node arch remote node lustre version add_group works?
x86_64 2.1.1-11chaos ppc64 orion-2_2_49_56_7-38chaos
x86_64 orion-2_3_49_54-52chaos ppc64 orion_2_3_49_54-53chaos
ppc64 orion_2_3_49_54-53chaos x86_64 orion-2_3_49_54-52chaos
ppc64 orion_2_3_49_54-53chaos ppc64 orion_2_3_49_54-53chaos
ppc64 orion-2_2_49_56_7-38chaos x86_64 2.1.1-11chaos
Comment by Ned Bass [ 18/Jun/12 ]

Also note that setting LST_VERSION seems to have no effect on the above results.

Comment by Doug Oucharek (Inactive) [ 19/Jun/12 ]

The LNet header fields are all put on the wire in little endian format. This patch forgot to swap between little endian and host byte order on the new fields so PPC systems are not seeing the new fields properly.

I am updating the patch now.

Comment by Doug Oucharek (Inactive) [ 19/Jun/12 ]

My mistake: only the magic number itself is not being swapped in the routine: srpc_unpack_msg_hdr(). Fix is being done.

Comment by Doug Oucharek (Inactive) [ 05/Jul/12 ]

The fix for this issue is now complete at: http://review.whamcloud.com/#change,1338

On top of what LLNL has currently for their patch, the following has been done:

  • The endian issue with the magic number (covered above) has been resolved.
  • Rather than having a version number set with LST_VERSION for backwards compatibility, it has been changed to be a "features flag" set with LST_FEATURES. Currently this is the only feature setting: Variable Page Size (value of 1). This feature is off by default.

The change to being a features flag has been done such that it is backwards compatible to previous versions (i.e. 2.1) and to the patch LLNL is currently running with (using LST_VERSION). Using a flag rather than a version number will be much more flexible for turning on/off features and for dealing with backwards compatibility issues.

It is recommended to use Git to upgrade to the latest patch. Please advise if a delta-patch from the previous patch to this one is required.

Comment by Christopher Morrone [ 06/Jul/12 ]

Ok, sounds good!

No delta should be necessary. When we update patches, we revert the old one and apply the new. That makes it easy to drop the out-of-date patch and its reversion when we rebase onto newer upstream tags. And if the newer patch has been applied, then it often drops out of the rebase cleanly.

We'll need the lustre manual updated to document LST_FEATURES.

Comment by Ned Bass [ 18/Jul/12 ]

We're still unable to get lnet_selftest working between x86_64 and ppc64 nodes. With latest patch we're getting ENETDOWN errors on lst add_group or lst add_test. With +net debugging, I see

00000001:00000200:0.0:1342633584.088489:2976:5998:0:(conrpc.c:1126:lstcon_rpc_trans_ndlist()) Condition error while creating RPC for transaction 4114: -100

Comment by Doug Oucharek (Inactive) [ 31/Jul/12 ]

Ned,

Are you running the script at the top of this ticket? If so, it is adding ion or server1 which triggers the error message?

Comment by Christopher Morrone [ 15/Aug/12 ]

Doug, we are trying to get a ppc node set up with infiniband in hyperion. If you don't already have an account there, it would be good to start the paperwork for an account on hyperion. With the right architectures, it will probably be easier to debug the problems.

Comment by Doug Oucharek (Inactive) [ 15/Aug/12 ]

Thanks Chris. I'm starting the process to get an account. Pretending I'm PPC while looking at the code has not been helping :^).

Comment by Doug Oucharek (Inactive) [ 30/Aug/12 ]

After staring at the code for awhile again, I think I have figure out what is going on here. The change for this ticket introduced a new routine, srpc_unpack_msg_hdr(), which byte swaps the message header fields as needed. We did not do this before.

Where this becomes a problem is srpc_unpack_msg_hdr() flips the magic number. That is used later to determine if we need to flip the fields of the body. Since it has been flipped already, we never correct the message bodies for endianess. As I indicated, this was not done before and would explain why issues have cropped up with this change.

Now, since this change has landed already, I assume I cannot redo it and will need to create a new patch to master and the 2.3 branch to address this new bug.

Comment by Doug Oucharek (Inactive) [ 30/Aug/12 ]

The new patch to fix the issue introduced by change 1338 is:

For master: http://review.whamcloud.com/#change,3831
For 2.3: http://review.whamcloud.com/#change,3832

Comment by Peter Jones [ 05/Sep/12 ]

Can LLNL please confirm whether this fix works?

Comment by Christopher Morrone [ 05/Sep/12 ]

I'll get it in our tree. Not sure when we'll get to testing it though.

Comment by Prakash Surya (Inactive) [ 05/Sep/12 ]

Chris, If you let me know when you pull it in I'll try to get it tested on the PPC64 nodes.

Comment by Oleg Drokin [ 07/Sep/12 ]

I tried the master patch attached and it still does not work with a different error.

When I am trying to add the ppc node, I get this:

[root@rhel6 utils]# ./lst add_group ion 192.168.1.171@tcp
create session RPC failed on 12345-192.168.1.171@tcp: Unknown error 18446744073709551506
Comment by Oleg Drokin [ 07/Sep/12 ]

Actually, it turns out that the error comes from firewall that I forgot to disable. Once the firewall is stopped the test seems to be able to proceeed.

We definitely seem to have a problem with readability of messages from lnet selftest. The unisgned printing of errors + I totally love the message that's printed if lnet selftest module is not loaded.

Comment by Oleg Drokin [ 07/Sep/12 ]

Ok, so further update.

I can now do lnet selftest as per the script to ppc node with 4k pagesize.

64k pagesized-node still gives me this on x86 server:

[ 3247.485042] LNetError: 1664:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.1.171@tcp, match 1227793848411881528 length 65536 too big: 4096 left, 4096 allowed

no errors on ppc side.

Comment by Jodi Levi (Inactive) [ 10/Sep/12 ]

Oleg will retest today.

Comment by Oleg Drokin [ 10/Sep/12 ]

Ok,testing with export LST_FEATURES=1 works even between x86 and ppc64 with 64k pages.

Comment by Peter Jones [ 10/Sep/12 ]

Landed for 2.3 and 2.4

Comment by Christopher Morrone [ 24/Sep/12 ]

Still cannot run lnet selftest using our orion branch using patch http://review.whamcloud.com/#change,3831.

This is running the control from the x86_64 side (4k pages) with one ppc64 node (64k pages).

Adding nodes to the group appears to work, but add_test fails with:

add test RPC failed on 12345-172.20.12.1@o2ib500: Unknown error 18446744073709551506

and the console on the ppc64 node says:

LNetError: 3756:0:(framework.c:1366:sfw_bulk_ready()) Bulk transfer failed for RPC: service test service, peer 12345-172.20.5.1@o2ib500, status -61
Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

Does it work when running the control from the ppc64 node?

Comment by Christopher Morrone [ 24/Sep/12 ]

Here's my script:

export LST_FEATURES=1
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 172.20.1.1@o2ib500
lst add_group ions 172.20.12.1@o2ib500
lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 16 --from ions --to servers brw write size=1M
lst run bulk_rw
lst stat ions & sleep 10; kill $!
lst end_session

This does not work, as mentioned in my previous comment. I found though, that if I reverse the from/to fields to be "--from servers --to ions" things DO work.

So we've still got a bug in the system.

Comment by Christopher Morrone [ 24/Sep/12 ]

Only partially fixed.

Comment by Christopher Morrone [ 24/Sep/12 ]

Good call, Prakash, it works when I run control on ppc64. All combinatinos of to/from and read/write work when control is on the 64k page ppc64 node.

Comment by Doug Oucharek (Inactive) [ 25/Sep/12 ]

Can you confirm what version of Lustre you are running? I have looked at both b2_3 and master and neither have the error message you are getting at line 1366 of framework.c. Rather it seems to be at line 1383. I'd just like to understand this difference to make sure it is not a critical code difference for this feature.

Is there a source tree you build from I can look at (on hyperion)?

Comment by Ned Bass [ 25/Sep/12 ]

Doug, that message appears to be from our llnl orion branch prior to the recent rebase. The tree is up on github:

https://github.com/chaos/lustre/tree/orion-2_3_49_54_2-69chaos

Tags newer than 69chaos are post-rebase and contain the line number discrepancy you saw.

Comment by Doug Oucharek (Inactive) [ 25/Sep/12 ]

As I look along the branch leading up to 69chaos, I do find the original change (1338) for this ticket (LU-521) but it does not appear to the be the latest patch set. There are 13 patch sets to change 1338 as it was being "evolved". The patch set applied to the 69chaos branch looks like an early patch set, possibly patch set 1, where we were still using a version number to determine supported features (LST_VERSION). That was changed later on to be LST_FEATURES along with other fixes.

I suspect that the missing code changes from 1338 may be the cause of the current problems.

Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Ah, yes we're definitely running an older version of that patch; and of course we left out the "Revision-Id" field so it's hard to tell which one we applied.

Generated at Sat Feb 10 01:07:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.