[LU-6808] Interop 2.5.3<->master sanity test_224c: Bulk IO write error Created: 07/Jul/15  Updated: 01/Feb/18  Resolved: 25/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

server: 2.5.3
client: lustre-master build # 3092 EL7


Issue Links:
Related
is related to LU-6441 OST problems following router node cr... Resolved
is related to LU-484 LASSERT(inode->i_nlink > 0) failed in... Resolved
Severity: 3

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ff285e72-24dd-11e5-bf7b-5254006e85c2.

The sub-test test_224c failed with the following error:

test failed to respond and timed out

OST log (the "truncated bulk GET 1048576(4194304)" messages mean the OST received only 1 MB of an expected 4 MB bulk transfer):

16:17:08:Lustre: DEBUG MARKER: == sanity test 224c: Don't hang if one of md lost during large bulk RPC ============================== 16:16:38 (1436199398)
16:17:08:LustreError: 8997:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(4194304)  req@ffff88003db95000 x1505956611442792/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199422 ref 1 fl Interpret:/0/0 rc 0/0
16:17:08:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:17:08:Lustre: lustre-OST0004: Client 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp) reconnecting
16:17:08:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(3805696)  req@ffff88004901f800 x1505956611442980/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199427 ref 1 fl Interpret:/2/0 rc 0/0
16:17:08:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) Skipped 1 previous similar message
16:17:08:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:17:08:Lustre: Skipped 1 previous similar message
16:21:30:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ truncated bulk GET 1048576(3805696)  req@ffff88003596d000 x1505956611443196/t0(0) o4->8e3c8004-e713-9f15-13d2-22232628f874@10.1.4.159@tcp:0/0 lens 608/448 e 0 to 0 dl 1436199432 ref 1 fl Interpret:/2/0 rc 0/0
16:21:30:LustreError: 8990:0:(ldlm_lib.c:2718:target_bulk_io()) Skipped 1 previous similar message
16:21:30:Lustre: lustre-OST0004: Bulk IO write error with 8e3c8004-e713-9f15-13d2-22232628f874 (at 10.1.4.159@tcp), client will retry: rc -110
16:21:30:Lustre: Skipped 1 previous similar message


 Comments   
Comment by Saurabh Tandan (Inactive) [ 09/Oct/15 ]

https://testing.hpdd.intel.com/test_sets/588f0ce8-6b5d-11e5-94a7-5254006e85c2
server:2.5.5
client: 2.7.60/ build# 3203

Comment by Saurabh Tandan (Inactive) [ 24/Nov/15 ]

Instance also found for the 2.7.63 tag with the following config:
Server: 2.7.0, b2_7/29
Client: Master, build# 3251, RHEL 7
https://testing.hpdd.intel.com/test_sets/fd0cce2c-90a1-11e5-a833-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ]

Another instance for the following interop config:
Server: 2.5.5, b2_5_fe/62
Client: Master, Build# 3266, Tag 2.7.64
https://testing.hpdd.intel.com/test_sets/8a0f2bb4-a04a-11e5-a33d-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ]

Server: 2.5.5, b2_5_fe/62
Client: Master, Build# 3266, Tag 2.7.64 , RHEL 7
https://testing.hpdd.intel.com/test_sets/fd8752d8-a05e-11e5-90cc-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 23/Dec/15 ]

Another instance found for:
Server: b2_7_fe/34
Client: Master , Build# 3276
https://testing.hpdd.intel.com/test_sets/79afc16e-a5e2-11e5-a028-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Dec/15 ]

Another instance found for the following config:
Server: 2.7.1 , b2_7_fe/34
Client: Master, build# 3276, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/610eff92-a602-11e5-a14c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop: 2.7.1 Server/EL7 Client
Server: 2.7.1, b2_7_fe/34
Client: master, build# 3303, RHEL 7
https://testing.hpdd.intel.com/test_sets/23af3494-bb03-11e5-9137-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop: 2.7.1 Server/EL6.7 Client
Server: 2.7.1, b2_7_fe/34
Client: master, build# 3303, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/04afed48-bb05-11e5-9137-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/5e49e47c-ccde-11e5-8b0e-5254006e85c2

Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL6.7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/c23f2696-ccdd-11e5-b80c-5254006e85c2

Another instance found for interop tag 2.7.66 - 2.5.5 Server/EL6.7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/fad4bcda-ccf8-11e5-b1fa-5254006e85c2

Another instance found for interop tag 2.7.66 - 2.5.5 Server/EL7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/f385c93c-ccc7-11e5-b80c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found for interop - 2.7.1 Server/EL7 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/3b9722f8-d2f8-11e5-bf08-5254006e85c2
Another instance found for interop - 2.7.1 Server/EL6.7 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/f371534e-d573-11e5-bc47-5254006e85c2
Another instance found for interop - 2.5.5 Server/EL6.7 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/e16b8c0c-d634-11e5-82a0-5254006e85c2
Another instance found for interop - 2.5.5 Server/EL7 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/ba9d84fe-d300-11e5-be5c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

This issue has been seen almost 30 times in the past 30 days.

Comment by Joseph Gmitter (Inactive) [ 24/Feb/16 ]

Hi Emoly,
Can you investigate this one?
Thanks.
Joe

Comment by Åke Sandgren [ 23/Aug/16 ]

Is there any progress on this problem?

We are trying to use a 2.8.55 client (we need support for a 4.4 kernel) against our 2.5.41 servers and are hitting Bulk IO write errors (and Bulk IO read errors) very frequently during normal client activity.

Comment by James A Simmons [ 23/Aug/16 ]

Can you reproduce this? If so, can you run lctl set_param debug=+"net neterror nettrace" and collect debug logs?
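
For reference, a minimal collection sequence on the affected client and OSS nodes might look like the following (a sketch only; the debug_mb size and dump path are examples — lctl dk dumps and clears the kernel debug buffer):

lctl set_param debug=+"net neterror nettrace"
lctl set_param debug_mb=512        # optional: enlarge the debug buffer first
# ... reproduce the failure ...
lctl dk /tmp/lustre-debug.log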

Comment by Åke Sandgren [ 23/Aug/16 ]

I can reproduce it easily (at least the IO read error).

But I haven't used set_param debug before, so if I need to do anything beyond the above, you will need to tell me exactly what.
Meanwhile I'll run just the above on the client and the OSSes where the problem shows up.

Comment by Åke Sandgren [ 25/Aug/16 ]

Collected debug logs from both client and server.
Where do you want them? The client log is ~100M and the server log ~500M, uncompressed.

Comment by Jian Yu [ 25/Aug/16 ]

Hi Ake,

Please refer to https://wiki.hpdd.intel.com/display/WC/Uploading+Large+Logs. Thank you.

Comment by Peter Jones [ 25/Aug/16 ]

Jian

Ake will not be able to access that page on the wiki

Ake

It looks like you are using an Intel EE release rather than a community release so you should report this issue through your support channels.

Peter

Comment by Åke Sandgren [ 25/Aug/16 ]

The client is taken from git://git.hpdd.intel.com/fs/lustre-release.git using tag 2.8.55
The server is from DDN.

So, what route should I take?

Comment by Jian Yu [ 28/Aug/16 ]

Hi Ake,

Sorry for the late reply and thank you for collecting the debug logs.

I reproduced the failure on the following combination and gathered full debug logs:
Lustre client build: https://build.hpdd.intel.com/job/lustre-master/3425/
Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_7_fe/95/

Maloo report: https://testing.hpdd.intel.com/test_sets/8cc945de-6d71-11e6-8afd-5254006e85c2

Comment by Jian Yu [ 03/Sep/16 ]

Hi Nasf,

The server side patch http://review.whamcloud.com/14399 was included in build https://build.hpdd.intel.com/job/lustre-b2_7_fe/95/. Could you please look into the issue? Thank you.

Comment by nasf (Inactive) [ 08/Sep/16 ]

The root cause is an interoperability issue introduced by the patch http://review.whamcloud.com/15421, which replaced "rq_xid" with "rq_mbits" as the matchbits for bulk data. To remain interoperable with old servers, it introduced a new connection flag, OBD_CONNECT_BULK_MBITS: if the server does not support this feature, "rq_xid" is set to the same value as "rq_mbits". Unfortunately, the patch forgot to handle multi-bulk operations such as 4MB IO, where the transfer is split into several 1MB bulk fragments. When a new client (such as master or b2_8_fe) issues a 4MB IO to an old server (such as b2_7_fe or IEEL2), it may send an "rq_xid" that is too small, so the old server treats the request as a 1MB or 2MB IO. The transfer then never completes, because only part of the data is transferred, and the client times out and retries again and again.
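
In outline, the fix is to advance "rq_xid" together with "rq_mbits" when a request carries multiple bulk fragments and the server lacks OBD_CONNECT_BULK_MBITS. A simplified C sketch of that logic (illustrative only, not the verbatim ptlrpc code; the struct layout, set_bulk_mbits(), bulk_count and server_has_bulk_mbits are stand-ins for the real fields and connection-flag check):

#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the real request structure; only the two
 * fields relevant to this bug are shown. */
struct ptlrpc_request {
        uint64_t rq_xid;   /* identifier old servers match bulk data on */
        uint64_t rq_mbits; /* bulk matchbits used by BULK_MBITS servers */
};

/* A 4MB write is sent as several 1MB bulk fragments; the matchbits of
 * the last fragment let the server work out how many fragments to
 * prepare for. */
static void set_bulk_mbits(struct ptlrpc_request *req, int bulk_count,
                           bool server_has_bulk_mbits)
{
        /* Matchbits start from the request xid ... */
        req->rq_mbits = req->rq_xid;

        /* ... and are advanced to the value of the LAST fragment. */
        req->rq_mbits += bulk_count - 1;

        /* An old server (no OBD_CONNECT_BULK_MBITS) matches bulk data
         * against rq_xid rather than rq_mbits.  If rq_xid is left at
         * the first fragment's value, the server prepares too few
         * fragments and the transfer is truncated, which is exactly
         * the failure seen in the logs above.  So rq_xid must track
         * rq_mbits for such servers. */
        if (!server_has_bulk_mbits)
                req->rq_xid = req->rq_mbits;
}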

Comment by Gerrit Updater [ 08/Sep/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22373
Subject: LU-6808 ptlrpc: properly set "rq_xid" for 4MB IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 56766651d8a124156c615aef60c850cf41d9d5e6

Comment by Åke Sandgren [ 08/Sep/16 ]

That patch fixes our problem, at least.
I.e., I can't trigger it the way I could before, so I'm very happy.
Thanks a lot.

Comment by Gerrit Updater [ 13/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22373/
Subject: LU-6808 ptlrpc: properly set "rq_xid" for 4MB IO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb75072cb679bf52e00537c19e42f8e4e95255b6

Comment by Peter Jones [ 13/Sep/16 ]

Landed for 2.9

Comment by Andreas Dilger [ 04/Oct/16 ]

Reopen to get fix http://review.whamcloud.com/22906 landed for 2.9.0.

Comment by Gerrit Updater [ 05/Oct/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22957
Subject: LU-6808 ptlrpc: BULK_BITS recovery for old servers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d58ad7550497e5766fe46582272ca3d2636bd0f5

Comment by Gerrit Updater [ 10/Oct/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/23048
Subject: LU-6808 ptlrpc: no need to reassign mbits for replay
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 487586d8d40506f1e9980ecd8f8aad5b5277f519

Comment by Gerrit Updater [ 25/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23048/
Subject: LU-6808 ptlrpc: no need to reassign mbits for replay
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 48488fa271e4444e4a2ab4f3babb6c91b779634e

Comment by Peter Jones [ 25/Oct/16 ]

Landed for 2.9

Comment by Tyson Whitehead [ 31/Aug/17 ]

Just finished running into this in the in-tree lustre client driver for the 4.12 kernel.

That is, larger read/write operations on a mount would hang all further file reads/writes on that mount, and the system would endlessly spit messages like these:

...
Connection to <OST> was lost; in progress operations using this service will wait for recovery to complete
Connection restored to <OST>
Connection to <OST> was lost; in progress operations using this service will wait for recovery to complete
Connection restored to <OST>
...

Applying the two merged patches from this ticket resolved the issue:

https://review.whamcloud.com/#/c/22373/
https://review.whamcloud.com/#/c/23048/

It would be great if these could both make their way into the in-tree lustre client driver.

Thanks! -Tyson

Comment by Peter Jones [ 31/Aug/17 ]

simmonsja are these patches on your radar?

Comment by James A Simmons [ 31/Aug/17 ]

No. I missed those. Will port them to the upstream client.

Comment by Andreas Dilger [ 01/Sep/17 ]

twhitehead if you have already ported those patches to the upstream kernel, please feel free to submit the patches directly to the mailing list. This has several advantages:

  • James doesn't have to do more work to port and test the patches himself
  • it shows the kernel maintainers that people are interested and using the upstream kernel code
  • you become familiar with the process yourself and can potentially help with other patches in the future

See https://wiki.hpdd.intel.com/display/PUB/Commit+Comments#Additional_commit_tags for labels to include in patches ported from the Lustre master tree to the upstream kernel, and run the script scripts/get_maintainer.pl drivers/staging/lustre to get a list of email addresses to send patches to.

Comment by Tyson Whitehead [ 02/Sep/17 ]

Will do. The script gives me 7 addresses, 3 of which are lists. Do I send it to all of them?

With regard to the commit message, I just did a git format-patch on the two commits above, applied them with git am, and fixed up the issues. This retained the entire original commit message. It references the issue number, but it doesn't include any info about the exact original commit.

Is this sufficient, or should I be trying to add something more?

Thanks! -Tyson

Comment by Andreas Dilger [ 05/Sep/17 ]

twhitehead, thanks for following up on this. The https://wiki.hpdd.intel.com/display/PUB/Commit+Comments page has a section Patch porting examples that lists the commit comment tags that should be used when porting a patch to the upstream kernel. The patch should be sent to at least Greg KH, CC Oleg, James, and myself, along with lustre-devel and devel.

Comment by Tyson Whitehead [ 06/Sep/17 ]

Ok. I have it ready to go, but am holding off because I would like to know everything is stable first, and we are still experiencing instabilities that I believe are Lustre related.

That is, bulk operations seem okay now, but within 24 hours the machine falls off the network and the kernel spits soft-lockup messages with osc_io_unplug0 showing up in the backtrace.

https://photos.app.goo.gl/zcbG6GGmB0eBgo0G3

I believe it is possibly LU-8680, so I've ported that patch as well. Just rebooted it into the new kernel last night. So far it is still up with no signs of problems. Fingers crossed!

Comment by James A Simmons [ 06/Sep/17 ]

I already pushed LU-8680, but it won't show up in Linus' tree until the 4.14-rc1 merge window. Luckily 4.14 is an LTS kernel, so the client will be usable for general users. The tree to work against is the staging tree, specifically the staging-next branch. Normally the staging tree is about 100+ patches ahead of Linus' tree at any time. If you need help moving to the staging tree, let me know. I will gladly help.

Feel free to send me the patches for the upstream client. I will look at them. My email is in the MAINTAINERS file for lustre (James Simmons)

Comment by Tyson Whitehead [ 14/Sep/17 ]

Just a quick follow-up on this to let you know I believe things are now stable in 4.12 with this and the LU-8680 patch.

The machine has been up under load for 8 days now without issue. Prior to these patches it wouldn't last 24 hours. Very happy about this.

I've ported the patch to staging (and, IIRC, emailed it to James), but am not in a position to test as we really need to avoid any further downtime on this system.

Thanks! -Tyson

Comment by James A Simmons [ 19/Sep/17 ]

Sorry, I have been busy with sysfs porting/fixing. I looked at the latest staging tree and see that one of the patches for LU-6808 already landed; I missed the other one. I just got the latest staging-next tree to build and I'm looking to port stuff, so I will include your other patch. Thanks.

Comment by Tyson Whitehead [ 01/Feb/18 ]

Rebuilt for the Spectre/Meltdown bugs and discovered this hasn't fully landed upstream yet. Specifically, if I try to do something like a copy of a larger file, it just hangs while the kernel endlessly spits messages like these:

[60679.700856] Lustre: covework-OST0017-osc-ffff888d70ebf800: Connection to covework-OST0017 (at 10.18.26.8@tcp) was lost; in progress operations using this service will wait for recovery to complete
[60679.719316] Lustre: Skipped 49 previous similar messages
[60679.740153] Lustre: covework-OST0017-osc-ffff888d70ebf800: Connection restored to covework-OST0017 (at 10.18.26.8@tcp)
[60679.751843] Lustre: Skipped 49 previous similar messages
[61279.724063] Lustre: 1144:0:(client.c:2068:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1517498129/real 1517498129] req@ffff888d3c128c00 x1591147922869136/t0(0) o3->covework-OST0017-osc-ffff888d70ebf800@10.18.26.8@tcp:6/4 lens 608/432 e 0 to 1 dl 1517498141 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
[61279.756551] Lustre: 1144:0:(client.c:2068:ptlrpc_expire_one_request()) Skipped 49 previous similar messages

I dug through the prior patches and compared against the current master kernel. It looks like part of the patch set never landed upstream:

https://review.whamcloud.com/#/c/22373/5/lustre/ptlrpc/client.c
https://github.com/torvalds/linux/blob/255442c93843f52b6891b21d0b485bf2c97f93c3/drivers/staging/lustre/lustre/ptlrpc/client.c#L3133
https://github.com/torvalds/linux/blob/255442c93843f52b6891b21d0b485bf2c97f93c3/drivers/staging/lustre/lustre/ptlrpc/client.c#L3163

Thanks! -Tyson
