[LU-521] lnet-selftest add_test failure Created: 20/Jul/11 Updated: 24/Oct/12 Resolved: 24/Oct/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6.1 between a PPC64 node and an x86_64 node, QDR infiniband, o2iblnd |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4453 |
| Description |
|
Running the simple lnet-selftest script below, the "lst add_test" line fails with: add test RPC failed on 12345-172.20.203.24@o2ib1: Unknown error 18446744073709551506 I am the script on the "server1" node, which is an x86_64 architecture RHEL6.1 system. The "ion" node is a ppc64 architecture with RHEL6.1. Note that ppc64 has a 64k page size now by default. server1 has this message on the console: LustreError: 21942:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.20.203.24@o2ib1, match 11222407321460450620 length 65536 too big: 4096 left, 4096 allowed and the ion has this on the console: LustreError: 14210:0:(framework.c:1298:sfw_bulk_ready()) Bulk transfer failed for RPC: service test service, peer 12345-172.20.250.1@o2ib1, status -61 I have attached lustre kernel logs with "+ net rpctrace" added. Script: lst new_session read/write lst add_group ion 172.20.203.24@o2ib1 lst add_group server1 172.20.250.1@o2ib1 lst add_batch bulk_rw lst add_test --batch bulk_rw --concurrency 16 --from ion --to server1 brw write size=1M lst run bulk_rw lst stat ion & sleep 30; kill $! lst end_session |
| Comments |
| Comment by Peter Jones [ 21/Jul/11 ] | ||||||||||||||||||||||||||||||
|
Lai Could you please look into this one? Thanks Peter | ||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 25/Jul/11 ] | ||||||||||||||||||||||||||||||
|
Lai Could you please provide an update on this ticket? Thanks Peter | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 25/Jul/11 ] | ||||||||||||||||||||||||||||||
|
I think LST is assuming PAGE_SIZE on both side are same, and this should be reason we got this failure. | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 01/Aug/11 ] | ||||||||||||||||||||||||||||||
|
Where is that assumption made, exactly? I've started looking at the code, but I am not at all clear about what is going on. An lst write from ion to server fails, but lst write from server to ion test works fine. Digging into it a bit, it looks like the ion tries to do srpc_post_active_rdma(), where the "len" parameter is 1. It is doing an LNetGet to portal 52 (SRPC_RDMA_PORTAL), and sets up a KIOV with only a single 64K page. It looks like a single page is the smallest that CAN be set up, is that correct? So is the receiver (server1) setting up the md on its side with only 1 page incorrectly somewhere? Does it just need to allocate a larger buffer to support peers with larger page size? | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 08/Aug/11 ] | ||||||||||||||||||||||||||||||
|
Chris, I think you are right, LST assume a single page is the smallest that can be set up, so for the add_test command, the console will setup one 4K page to transfer servers' NIDs to the test client, but the test client has 64K page size and it will try to LNetGet one(64K) page from the console, so we will see this mismatch error on the console: I'm afraid this is a design/protocol bug and it's not just in add_test command, it will impact on bulk_test too.
We will need a little more time to estimate the efforts for fixing both issues. | ||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 09/Aug/11 ] | ||||||||||||||||||||||||||||||
|
Liang is going to finish this one off. | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 29/Aug/11 ] | ||||||||||||||||||||||||||||||
|
I plan to post the patch by end of last week, but I didn't finish it, though I have done most code but need a few more days to code review and do some local tests. It's a little more complex than I thought. I originally want to fix this by adding a new command but it's more like a workaround so I made a more complete patch which can support multiple session versions, here is the description:
| ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 30/Aug/11 ] | ||||||||||||||||||||||||||||||
|
Sounds good. Thank you for the update! | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 05/Sep/11 ] | ||||||||||||||||||||||||||||||
|
Patch is ready for review: http://review.whamcloud.com/#change,1338 | ||||||||||||||||||||||||||||||
| Comment by James A Simmons [ 21/Sep/11 ] | ||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||
| Comment by Isaac Huang (Inactive) [ 10/Jan/12 ] | ||||||||||||||||||||||||||||||
|
Liang, I think your design for LST version compatibility would work. But I'd rather not to use the msg_reserved[0-2] fields unless there's no alternative. We already have the msg_version field that carries a version #. I think alternatively it could be done via msg_version: 1. Console has newer LST version: If console sends a make-session command to a older test node, upon lst add_group. The test node would reply with reply->status == EPROTO, and msg_version set to its older version #. Now, the console knows about the test node version. Then the console can just resend the make-session command with msg_version set to the test node's older version. And everything else goes on: test node believes the console has a same version, console knows about test node version and how to cope with it. 2. Test node:
I haven't gone through all the details, but it seems like this way the version negotiation could be done without using any new message fields. What do you think? BTW, as to setting a preferred session version on console, I don't see why that's necessary if we have version negotiation implemented. If it's needed for some reason, it appears to me that it should be an option to the new_session command, rather that an environment variable. | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 11/Jan/12 ] | ||||||||||||||||||||||||||||||
|
Isaac, I didn't use srpc_msg_t::msg_version because message version is something we check/set in RPC module, not test-framework module, so test-node shouldn't change RPC version for test-framework command make_session. I think msg_version can be used only if we are going to change header of srpc_msg_t, session_version is kind of thing belongs to upper layer and could have more chance to be changed than msg_version. I admit putting session_version in header of srpc_msg_t is not very clean, but we can minimize changes by doing this. Also, I didn't implement resend in console, because it will make the patch more complex, we can just let administrator to set version and retry or make small change to scripts, which should be very rare and easy. I don't have strong reason to use environment variable instead of option of new_session, but just because:
| ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 17/Jan/12 ] | ||||||||||||||||||||||||||||||
|
We need a final ruling on this very soon. Some BGQ hardware is now on the floor, and I will need to start testing it soon. | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 19/Jan/12 ] | ||||||||||||||||||||||||||||||
|
If Isaac didn't object my previous comment, I tend to keep our current implementation. I just made some small fixes due to suggestions from Andreas. patch is still here: http://review.whamcloud.com/#change,1338 Because Andreas suggested to use old version as default value, so you will need to set env variable LST_VERSION=1 on console node before testing. | ||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 19/Jan/12 ] | ||||||||||||||||||||||||||||||
|
Just an idea that came to mind - if the protocol change only depends on the number of pages vs. the number of bytes, it would be possible to handle this in a forward compatible manner by assuming values below 4096 are page counts, and values 4096+ are byte counts. This could be added to 2.2.0 and 2.1.1 still, and would allow some future version to auto-detect which format is being used, and eventually start using byte counts in an interoperable manner. | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 19/Jan/12 ] | ||||||||||||||||||||||||||||||
|
Andreas, I'm thinking about adding some new command/advanced testing options for long time, so we would need this anyway and I think it's a good time for adding version compatibility code. | ||||||||||||||||||||||||||||||
| Comment by Brian Behlendorf [ 08/May/12 ] | ||||||||||||||||||||||||||||||
|
We've been testing this patch for severals months now can be please come to some conclusion of whether it will be merged in its current form or not. | ||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 14/May/12 ] | ||||||||||||||||||||||||||||||
|
Brian, good to hear this patch can work fine, I will land it to 2.3. Liang | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 14/May/12 ] | ||||||||||||||||||||||||||||||
|
We've had the patch in our tree, I don't think that anyone has actually tested it though. | ||||||||||||||||||||||||||||||
| Comment by Ned Bass [ 18/Jun/12 ] | ||||||||||||||||||||||||||||||
|
We still see failures running lnet selftest between x86_64 and ppc64 with this patch. lst add_group produces this: # lst add_group servers 172.21.1.201@o2ib200 create session RPC failed on 12345-172.21.1.201@o2ib200: Unknown error 18446744073709551542 and the following console message: Lustre: 2735:0:(rpc.c:1102:srpc_send_rpc()) Bad message from 12345-172.21.1.201@o2ib200: type 1 (1 expected), magic 233877742 (-290394099 expected). Note that it appears the received magic value was not correctly byte-swapped: -bash-4.1# printf "%016x\n%016x\n" -290394099 233877742 ffffffffeeb0f00d 000000000df0b0ee Here is a table of the combinations for which we have or have not seen the error.
| ||||||||||||||||||||||||||||||
| Comment by Ned Bass [ 18/Jun/12 ] | ||||||||||||||||||||||||||||||
|
Also note that setting LST_VERSION seems to have no effect on the above results. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 19/Jun/12 ] | ||||||||||||||||||||||||||||||
|
The LNet header fields are all put on the wire in little endian format. This patch forgot to swap between little endian and host byte order on the new fields so PPC systems are not seeing the new fields properly. I am updating the patch now. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 19/Jun/12 ] | ||||||||||||||||||||||||||||||
|
My mistake: only the magic number itself is not being swapped in the routine: srpc_unpack_msg_hdr(). Fix is being done. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 05/Jul/12 ] | ||||||||||||||||||||||||||||||
|
The fix for this issue is now complete at: http://review.whamcloud.com/#change,1338 On top of what LLNL has currently for their patch, the following has been done:
The change to being a features flag has been done such that it is backwards compatible to previous versions (i.e. 2.1) and to the patch LLNL is currently running with (using LST_VERSION). Using a flag rather than a version number will be much more flexible for turning on/off features and for dealing with backwards compatibility issues. It is recommended to use Git to upgrade to the latest patch. Please advise if a delta-patch from the previous patch to this one is required. | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 06/Jul/12 ] | ||||||||||||||||||||||||||||||
|
Ok, sounds good! No delta should be necessary. When we update patches, we revert the old one and apply the new. That makes it easy to drop the out-of-date patch and its reversion when we rebase onto newer upstream tags. And if the newer patch has been applied, then it often drops out of the rebase cleanly. We'll need the lustre manual updated to document LST_FEATURES. | ||||||||||||||||||||||||||||||
| Comment by Ned Bass [ 18/Jul/12 ] | ||||||||||||||||||||||||||||||
|
We're still unable to get lnet_selftest working between x86_64 and ppc64 nodes. With latest patch we're getting ENETDOWN errors on lst add_group or lst add_test. With +net debugging, I see 00000001:00000200:0.0:1342633584.088489:2976:5998:0:(conrpc.c:1126:lstcon_rpc_trans_ndlist()) Condition error while creating RPC for transaction 4114: -100 | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 31/Jul/12 ] | ||||||||||||||||||||||||||||||
|
Ned, Are you running the script at the top of this ticket? If so, it is adding ion or server1 which triggers the error message? | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 15/Aug/12 ] | ||||||||||||||||||||||||||||||
|
Doug, we are trying to get a ppc node set up with infiniband in hyperion. If you don't already have an account there, it would be good to start the paperwork for an account on hyperion. With the right architectures, it will probably be easier to debug the problems. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 15/Aug/12 ] | ||||||||||||||||||||||||||||||
|
Thanks Chris. I'm starting the process to get an account. Pretending I'm PPC while looking at the code has not been helping :^). | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 30/Aug/12 ] | ||||||||||||||||||||||||||||||
|
After staring at the code for awhile again, I think I have figure out what is going on here. The change for this ticket introduced a new routine, srpc_unpack_msg_hdr(), which byte swaps the message header fields as needed. We did not do this before. Where this becomes a problem is srpc_unpack_msg_hdr() flips the magic number. That is used later to determine if we need to flip the fields of the body. Since it has been flipped already, we never correct the message bodies for endianess. As I indicated, this was not done before and would explain why issues have cropped up with this change. Now, since this change has landed already, I assume I cannot redo it and will need to create a new patch to master and the 2.3 branch to address this new bug. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 30/Aug/12 ] | ||||||||||||||||||||||||||||||
|
The new patch to fix the issue introduced by change 1338 is: For master: http://review.whamcloud.com/#change,3831 | ||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 05/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Can LLNL please confirm whether this fix works? | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 05/Sep/12 ] | ||||||||||||||||||||||||||||||
|
I'll get it in our tree. Not sure when we'll get to testing it though. | ||||||||||||||||||||||||||||||
| Comment by Prakash Surya (Inactive) [ 05/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Chris, If you let me know when you pull it in I'll try to get it tested on the PPC64 nodes. | ||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 07/Sep/12 ] | ||||||||||||||||||||||||||||||
|
I tried the master patch attached and it still does not work with a different error. When I am trying to add the ppc node, I get this: [root@rhel6 utils]# ./lst add_group ion 192.168.1.171@tcp create session RPC failed on 12345-192.168.1.171@tcp: Unknown error 18446744073709551506 | ||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 07/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Actually, it turns out that the error comes from firewall that I forgot to disable. Once the firewall is stopped the test seems to be able to proceeed. We definitely seem to have a problem with readability of messages from lnet selftest. The unisgned printing of errors + I totally love the message that's printed if lnet selftest module is not loaded. | ||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 07/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Ok, so further update. I can now do lnet selftest as per the script to ppc node with 4k pagesize. 64k pagesized-node still gives me this on x86 server: [ 3247.485042] LNetError: 1664:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.1.171@tcp, match 1227793848411881528 length 65536 too big: 4096 left, 4096 allowed no errors on ppc side. | ||||||||||||||||||||||||||||||
| Comment by Jodi Levi (Inactive) [ 10/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Oleg will retest today. | ||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 10/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Ok,testing with export LST_FEATURES=1 works even between x86 and ppc64 with 64k pages. | ||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 10/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Landed for 2.3 and 2.4 | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 24/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Still cannot run lnet selftest using our orion branch using patch http://review.whamcloud.com/#change,3831. This is running the control from the x86_64 side (4k pages) with one ppc64 node (64k pages). Adding nodes to the group appears to work, but add_test fails with: add test RPC failed on 12345-172.20.12.1@o2ib500: Unknown error 18446744073709551506 and the console on the ppc64 node says: LNetError: 3756:0:(framework.c:1366:sfw_bulk_ready()) Bulk transfer failed for RPC: service test service, peer 12345-172.20.5.1@o2ib500, status -61 | ||||||||||||||||||||||||||||||
| Comment by Prakash Surya (Inactive) [ 24/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Does it work when running the control from the ppc64 node? | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 24/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Here's my script: export LST_FEATURES=1 export LST_SESSION=$$ lst new_session read/write lst add_group servers 172.20.1.1@o2ib500 lst add_group ions 172.20.12.1@o2ib500 lst add_batch bulk_rw lst add_test --batch bulk_rw --concurrency 16 --from ions --to servers brw write size=1M lst run bulk_rw lst stat ions & sleep 10; kill $! lst end_session This does not work, as mentioned in my previous comment. I found though, that if I reverse the from/to fields to be "--from servers --to ions" things DO work. So we've still got a bug in the system. | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 24/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Only partially fixed. | ||||||||||||||||||||||||||||||
| Comment by Christopher Morrone [ 24/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Good call, Prakash, it works when I run control on ppc64. All combinatinos of to/from and read/write work when control is on the 64k page ppc64 node. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 25/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Can you confirm what version of Lustre you are running? I have looked at both b2_3 and master and neither have the error message you are getting at line 1366 of framework.c. Rather it seems to be at line 1383. I'd just like to understand this difference to make sure it is not a critical code difference for this feature. Is there a source tree you build from I can look at (on hyperion)? | ||||||||||||||||||||||||||||||
| Comment by Ned Bass [ 25/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Doug, that message appears to be from our llnl orion branch prior to the recent rebase. The tree is up on github: https://github.com/chaos/lustre/tree/orion-2_3_49_54_2-69chaos Tags newer than 69chaos are post-rebase and contain the line number discrepancy you saw. | ||||||||||||||||||||||||||||||
| Comment by Doug Oucharek (Inactive) [ 25/Sep/12 ] | ||||||||||||||||||||||||||||||
|
As I look along the branch leading up to 69chaos, I do find the original change (1338) for this ticket ( I suspect that the missing code changes from 1338 may be the cause of the current problems. | ||||||||||||||||||||||||||||||
| Comment by Prakash Surya (Inactive) [ 26/Sep/12 ] | ||||||||||||||||||||||||||||||
|
Ah, yes we're definitely running an older version of that patch; and of course we left out the "Revision-Id" field so it's hard to tell which one we applied. |