[LU-6808] Interop 2.5.3<->master sanity test_224c: Bulk IO write error Created: 07/Jul/15 Updated: 01/Feb/18 Resolved: 25/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | server: 2.5.3 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ff285e72-24dd-11e5-bf7b-5254006e85c2.

The sub-test test_224c failed with the following error: test failed to respond and timed out

OST log:

16:17:08:Lustre: DEBUG MARKER: == sanity test 224c: Don't hang if one of md lost during large bulk RPC ============================== 16:16:38 (1436199398)
16:17:08:LustreError: 8997:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(4194304) req@ffff88003db95000 x1505956611442792/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199422 ref 1 fl Interpret:/0/0 rc 0/0
16:17:08:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:17:08:Lustre: lustre-OST0004: Client 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp) reconnecting
16:17:08:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(3805696) req@ffff88004901f800 x1505956611442980/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199427 ref 1 fl Interpret:/2/0 rc 0/0
16:17:08:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) Skipped 1 previous similar message
16:17:08:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:17:08:Lustre: Skipped 1 previous similar message
16:21:30:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(3805696) req@ffff88003596d000 x1505956611443196/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199432 ref 1 fl Interpret:/2/0 rc 0/0
16:21:30:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) Skipped 1 previous similar message
16:21:30:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:21:30:Lustre: Skipped 1 previous similar message |
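For context on reading those lines: in "truncated bulk GET 1048576(4194304)" the numbers are transferred(expected) bytes, i.e. only 1MB of a 4MB bulk arrived before the deadline, so the OST fails the bulk with -ETIMEDOUT (the "rc -110" the client then retries on). A minimal standalone sketch of that final size comparison is below; it only illustrates the kind of check target_bulk_io() performs and is not the actual Lustre code, and the model_* names are invented:

```c
/*
 * Illustration of the size check behind "truncated bulk GET
 * 1048576(4194304)" -- NOT the actual target_bulk_io() code; all
 * model_* names are invented for this sketch.
 */
#include <errno.h>
#include <stdio.h>

struct model_bulk_desc {
	int nob;		/* bytes the bulk was supposed to move */
	int nob_transferred;	/* bytes that actually arrived */
};

/* Returns 0 if the bulk completed, -ETIMEDOUT (-110) if truncated. */
static int model_check_bulk(const struct model_bulk_desc *desc)
{
	if (desc->nob_transferred != desc->nob) {
		/* Same transferred(expected) layout as the console log. */
		printf("truncated bulk GET %d(%d)\n",
		       desc->nob_transferred, desc->nob);
		return -ETIMEDOUT;
	}
	return 0;
}

int main(void)
{
	/* Numbers from the first error line above: 1MB of 4MB arrived. */
	struct model_bulk_desc desc = { 4194304, 1048576 };

	return model_check_bulk(&desc) ? 1 : 0;
}
```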
| Comments |
| Comment by Saurabh Tandan (Inactive) [ 09/Oct/15 ] |
|
https://testing.hpdd.intel.com/test_sets/588f0ce8-6b5d-11e5-94a7-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 24/Nov/15 ] |
|
Instance also found for the 2.7.63 tag with the following config: |
| Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ] |
|
Another instance found for the following interop config |
| Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ] |
|
Server: 2.5.5, b2_5_fe/62 |
| Comment by Saurabh Tandan (Inactive) [ 23/Dec/15 ] |
|
Another instance found for : |
| Comment by Saurabh Tandan (Inactive) [ 24/Dec/15 ] |
|
Another instance found for the following config: |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop : 2.7.1 Server/EL7 Client |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop : 2.7.1 Server/EL6.7 Client |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL7 Client, build# 3316
Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL6.7 Client, build# 3316
Another instance found for interop tag 2.7.66 - 2.5.5 Server/EL6.7 Client, build# 3316
Another instance found for interop tag 2.7.66 - 2.5.5 Server/EL7 Client, build# 3316 |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found for interop - 2.7.1 Server/EL7 Client, tag 2.7.90. |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
This issue has been seen almost 30 times in the past 30 days. |
| Comment by Joseph Gmitter (Inactive) [ 24/Feb/16 ] |
|
Hi Emoly, |
| Comment by Åke Sandgren [ 23/Aug/16 ] |
|
Is there any progress on this problem? We are trying to use a 2.8.55 client (we need support for a 4.4 kernel) against our 2.5.41 servers and are hitting Bulk IO write errors (and Bulk IO read errors) repeatedly during normal client activity. |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
Can you reproduce this? If so, can you run lctl set_param debug=+"net neterror nettrace" and collect debug logs? |
| Comment by Åke Sandgren [ 23/Aug/16 ] |
|
I can reproduce it easily (at least the IO read error). But I haven't used set_param debug before, so if I need to do anything else you will need to tell me exactly what. |
| Comment by Åke Sandgren [ 25/Aug/16 ] |
|
Collected debug logs from both client and server. |
| Comment by Jian Yu [ 25/Aug/16 ] |
|
Hi Ake, Please refer to https://wiki.hpdd.intel.com/display/WC/Uploading+Large+Logs. Thank you. |
| Comment by Peter Jones [ 25/Aug/16 ] |
|
Jian, Ake will not be able to access that page on the wiki.

Ake, it looks like you are using an Intel EE release rather than a community release, so you should report this issue through your support channels.

Peter |
| Comment by Åke Sandgren [ 25/Aug/16 ] |
|
The client is taken from git://git.hpdd.intel.com/fs/lustre-release.git using tag 2.8.55. So, what route should I take? |
| Comment by Jian Yu [ 28/Aug/16 ] |
|
Hi Ake,

Sorry for the late reply, and thank you for collecting the debug logs. I reproduced the failure on the following combination and gathered full debug logs:

Maloo report: https://testing.hpdd.intel.com/test_sets/8cc945de-6d71-11e6-8afd-5254006e85c2 |
| Comment by Jian Yu [ 03/Sep/16 ] |
|
Hi Nasf,

The server-side patch http://review.whamcloud.com/14399 was included in build https://build.hpdd.intel.com/job/lustre-b2_7_fe/95/. Could you please look into the issue? Thank you. |
| Comment by nasf (Inactive) [ 08/Sep/16 ] |
|
The root cause is an interoperability issue introduced by the patch http://review.whamcloud.com/15421. That patch replaced "rq_xid" with "rq_mbits" as the match bits for bulk data. To stay interoperable with old servers, it introduced a new connection flag, OBD_CONNECT_BULK_MBITS; if the server does not support this feature, "rq_xid" is set to the same value as "rq_mbits". Unfortunately, the patch forgot to handle multiple bulk operations per RPC, for example 4MB IO. When a new client (such as master or b2_8_fe) performs 4MB IO, it may send too small an "rq_xid" to an old server (such as b2_7_fe or IEEL2), so the old server regards the request as a 1MB or 2MB IO. The data transfer then never completes, because only part of the data is transferred, and the client hits a timeout failure and retries again and again. |
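To make the failure mode concrete, here is a small standalone model of the mismatch (all names are invented for this sketch; it is not the real ptlrpc code). It treats each 1MB chunk of a 4MB write as one bulk transfer tagged with consecutive match bits, and shows how an advertised xid that does not line up with the first posted chunk leaves an old server matching only part of the data, reproducing the 1048576(4194304) "truncated bulk" pattern from the description:

```c
/*
 * Conceptual model only -- invented names, NOT the actual ptlrpc code.
 * New clients tag each 1MB bulk chunk with consecutive match bits; an
 * old server (no OBD_CONNECT_BULK_MBITS) derives the tags it GETs from
 * the advertised rq_xid instead.  If the two ranges do not line up,
 * only part of the data matches and the transfer is truncated.
 */
#include <stdio.h>

#define CHUNK (1 << 20)	/* one bulk transfer moves at most 1MB */

/* Bytes that arrive: the overlap between the tags the client posted
 * its chunks under and the tags the server tries to GET. */
static int bytes_transferred(unsigned long long posted_first,
			     unsigned long long server_first, int nob)
{
	int n = (nob + CHUNK - 1) / CHUNK;	/* number of 1MB chunks */
	int got = 0, i;

	for (i = 0; i < n; i++) {
		unsigned long long tag = server_first + i;

		if (tag >= posted_first && tag < posted_first + n)
			got += CHUNK;		/* this GET finds a buffer */
	}
	return got;
}

int main(void)
{
	int nob = 4 * CHUNK;		  /* a 4MB write: four bulk chunks */
	unsigned long long mbits = 1000;  /* tag of the first posted chunk */

	/* Bug: the compat shim advertises an xid that does not point at
	 * the first posted chunk, so only one of the server's four GETs
	 * matches -- compare "truncated bulk GET 1048576(4194304)". */
	printf("buggy shim: %d(%d)\n",
	       bytes_transferred(mbits, mbits + 3, nob), nob);

	/* Fix: advertise an xid the old server can map onto every chunk. */
	printf("fixed shim: %d(%d)\n",
	       bytes_transferred(mbits, mbits, nob), nob);
	return 0;
}
```

In that spirit, the fix below (http://review.whamcloud.com/22373) adjusts how the client derives "rq_xid" from "rq_mbits" for multi-bulk RPCs when the server lacks OBD_CONNECT_BULK_MBITS.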
| Comment by Gerrit Updater [ 08/Sep/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22373 |
| Comment by Åke Sandgren [ 08/Sep/16 ] |
|
That patch fixes our problem at least. |
| Comment by Gerrit Updater [ 13/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22373/ |
| Comment by Peter Jones [ 13/Sep/16 ] |
|
Landed for 2.9 |
| Comment by Andreas Dilger [ 04/Oct/16 ] |
|
Reopen to get fix http://review.whamcloud.com/22906 landed for 2.9.0. |
| Comment by Gerrit Updater [ 05/Oct/16 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22957 |
| Comment by Gerrit Updater [ 10/Oct/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/23048 |
| Comment by Gerrit Updater [ 25/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23048/ |
| Comment by Peter Jones [ 25/Oct/16 ] |
|
Landed for 2.9 |
| Comment by Tyson Whitehead [ 31/Aug/17 ] |
|
Just finished running into this in the in-tree lustre client driver for the 4.12 kernel. That is, larger read/write operations on a mount would hang any further file reading/writing on that mount, and the system would endlessly spit ...

Applying the two merged patches from this ticket resolved the issue: https://review.whamcloud.com/#/c/22373/

It would be great if these could both see their way into the in-tree lustre client driver.

Thanks! -Tyson |
| Comment by Peter Jones [ 31/Aug/17 ] |
|
simmonsja are these patches on your radar? |
| Comment by James A Simmons [ 31/Aug/17 ] |
|
No. I missed those. Will port them to the upstream client. |
| Comment by Andreas Dilger [ 01/Sep/17 ] |
|
twhitehead if you have already ported those patches to the upstream kernel, please feel free to submit the patches directly to the mailing list. This has the advantage that:
See https://wiki.hpdd.intel.com/display/PUB/Commit+Comments#Additional_commit_tags for labels to include in patches ported from the Lustre master tree to the upstream kernel, and run the script scripts/get_maintainer.pl drivers/staging/lustre to get a list of email addresses to send patches to. |
| Comment by Tyson Whitehead [ 02/Sep/17 ] |
|
Will do. The script gives me 7 addresses, 3 of which are lists. Do I send it to all of them?

With regard to the commit message, I just did a git format-patch on the two commits above, then applied them with git am and fixed up the issues. This retained the entire original commit message. It references the issue number, but it doesn't include any info about the exact original commit. Is this sufficient, or should I be trying to add something more?

Thanks! -Tyson |
| Comment by Andreas Dilger [ 05/Sep/17 ] |
|
twhitehead, thanks for following up on this. The https://wiki.hpdd.intel.com/display/PUB/Commit+Comments page has a section Patch porting examples that lists the commit comment tags that should be used when porting a patch to the upstream kernel. The patch should be sent to at least Greg KH, CC Oleg, James, and myself, along with lustre-devel and devel. |
| Comment by Tyson Whitehead [ 06/Sep/17 ] |
|
Ok. I have it ready to go, but am holding off because I would like to know everything is stable first, and we are still experiencing instabilities that I believe are Lustre related. That is, bulk operations seem okay now, but within 24hrs the machine falls off the network and the kernel spits soft lockup messages with osc_io_unplug0 showing up in the backtrace: https://photos.app.goo.gl/zcbG6GGmB0eBgo0G3

I believe it is possibly |
| Comment by James A Simmons [ 06/Sep/17 ] |
|
I already push

Feel free to send me the patches for the upstream client. I will look at them. My email is in the MAINTAINERS file for lustre (James Simmons). |
| Comment by Tyson Whitehead [ 14/Sep/17 ] |
|
Just a quick follow-up on this to let you know I believe things are now stable in 4.12 with this and the LU-8680 patch. The machine has been up under load for 8 days now without issue. Prior to these patches it wouldn't last 24hrs. Very happy about this.

I've ported the patch to staging (and, IIRC, emailed it to James), but am not in a position to test as we really need to avoid any further downtime on this system.

Thanks! -Tyson |
| Comment by James A Simmons [ 19/Sep/17 ] |
|
Sorry, I have been busy with sysfs porting/fixing. Looking at the latest staging tree, I see one of the patches for |
| Comment by Tyson Whitehead [ 01/Feb/18 ] |
|
Rebuilt for the Spectre/Meltdown bugs and discovered this hasn't fully landed upstream yet. Specifically, if I try to do something like a copy of a larger file, it just hangs while the kernel endlessly spits messages like these:

[60679.700856] Lustre: covework-OST0017-osc-ffff888d70ebf800: Connection to covework-OST0017 (at 10.18.26.8@tcp) was lost; in progress operations using this service will wait for recovery to complete

Dug through the prior patches and compared against the current master kernel. Looks like part of the patch set never landed upstream: https://review.whamcloud.com/#/c/22373/5/lustre/ptlrpc/client.c

Thanks! -Tyson |