[LU-9242] Applications are failing to complete due to connection loss with OSS servers Created: 22/Mar/17  Updated: 26/Apr/17  Resolved: 25/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Jian Yu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 2.9.54 running on RHEL7 servers using ldiskfs. The client side is Cray SLES11SP4, also running Lustre 2.9.54.


Attachments: File dump.log.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With my testing of the latest master branch I see jobs failing due to the loss of communication with the OSS servers. I see a reconnect storm, but in the end the applications error out.



 Comments   
Comment by Peter Jones [ 22/Mar/17 ]

Jian

Could you please advise on this one?

Thanks

Peter

Comment by Jian Yu [ 22/Mar/17 ]

Hi James,

From the attached debug log, I saw a lot of the following error messages:

00000100:00000040:20.0:1490199768.961594:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff880775b17800 x1562585786929952/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:20.0:1490199768.961736:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff880775b11800 x1562585786930480/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:20.0:1490199768.961887:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff880775b14200 x1562585786930880/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:19.0:1490199768.961965:0:20412:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff88077b6f8f00 x1562585786934704/t0(0) o104->sultan-OST0018@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:3.0:1490199768.962021:0:20501:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff88006cc7d700 x1562585786931872/t0(0) o104->sultan-OST0028@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:6.0:1490199768.962058:0:20511:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff880068527800 x1562585786934880/t0(0) o104->sultan-OST0014@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:19.0:1490199768.962109:0:20412:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22  req@ffff88077b6f9e00 x1562585786936128/t0(0) o104->sultan-OST0018@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
......

And the context of one of the above error messages is something like:

00000100:00000001:20.0:1490199768.961593:0:20430:0:(client.c:1251:ptlrpc_check_status()) Process entered
......
00000800:00000200:30.0:1490199768.961594:0:16000:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff880fed77f600] (200)++
00000100:00000040:20.0:1490199768.961594:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880775b17800 x1562585786929952/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
......
00000100:00000001:1.0:1490199768.961594:0:15994:0:(events.c:91:reply_in_callback()) Process entered
00000400:00000200:31.0:1490199768.961595:0:27379:0:(lib-move.c:2762:LNetPut()) LNetPut -> 12345-3@gni1
......
00000020:00000001:25.0:1490199768.961595:0:27333:0:(tgt_handler.c:1203:tgt_blocking_ast()) Process entered
......
00000800:00000200:8.0:1490199768.961595:0:15999:0:(o2iblnd_cb.c:787:kiblnd_post_tx_locked()) 10.37.202.61@o2ib1: no credits
00000100:00000200:1.0:1490199768.961595:0:15994:0:(events.c:93:reply_in_callback()) @@@ type 2, status 0 req@ffff880068525100 x1562585786933744/t0(0) o104->sultan-OST0014@28@gni1:15/16 lens 296/224 e 0 to 0 dl 1490199781 ref 1 fl Rpc:/2/ffffffff rc 0/-1
00010000:00010000:26.0:1490199768.961596:0:27337:0:(ldlm_lockd.c:541:ldlm_del_waiting_lock()) ### removed ns: filter-sultan-OST0028_UUID lock: ffff88079d7d4e00/0x43103176140b8536 lrc: 4/0,0 mode: PR/PR res: [0x1204:0x0:0x0].0x0 rrc: 78 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x50000400010020 nid: 2@gni1 remote: 0xeadfeaea960a0c6d expref: 25 pid: 20407 timeout: 4296946798 lvb_type: 1
00010000:00000001:25.0:1490199768.961596:0:27333:0:(ldlm_lockd.c:847:ldlm_server_blocking_ast()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:18.0:1490199768.961596:0:20474:0:(lustre_net.h:2451:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff88077b61c200 x1562585786932288/t0(0) o104->sultan-OST0008@28@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Interpret:R/2/0 rc 0/0
00000100:00000200:15.0:1490199768.961596:0:17866:0:(niobuf.c:85:ptl_send_buf()) Sending 192 bytes to portal 18, xid 1562506803021520, offset 192
00000800:00000200:8.0:1490199768.961596:0:15999:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff880fed77f600] (201)--
00000800:00000200:30.0:1490199768.961597:0:16000:0:(o2iblnd_cb.c:787:kiblnd_post_tx_locked()) 10.37.202.61@o2ib1: no credits
00000020:00000001:25.0:1490199768.961597:0:27333:0:(tgt_handler.c:1258:tgt_blocking_ast()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:24.0:1490199768.961597:0:27378:0:(lustre_net.h:2451:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff880fb7677050 x1562506803021552/t0(0) o103->2d014be5-e7a1-c537-05ce-72f61bb4bd8d@3@gni1:-1/-1 lens 328/192 e 0 to 0 dl 1490200085 ref 1 fl Interpret:H/0/0 rc 0/0
00000800:00000200:8.0:1490199768.961597:0:15999:0:(o2iblnd_cb.c:3652:kiblnd_scheduler()) conn[ffff880fed77f600] (200)--
00000100:00000001:31.0:1490199768.961598:0:27379:0:(niobuf.c:100:ptl_send_buf()) Process leaving (rc=0 : 0 : 0)
00000800:00000200:30.0:1490199768.961598:0:16000:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff880fed77f600] (200)--
00000400:00000010:15.0:1490199768.961598:0:17866:0:(lib-lnet.h:444:lnet_msg_alloc()) alloc '(msg)': 400 at ffff88077a3e9600 (tot -230794364).
00000400:00000100:9.0:1490199768.961598:0:15997:0:(lib-move.c:2024:lnet_parse_put()) Dropping PUT from 12345-3@gni1 portal 16 match 1562585786931952 offset 192 length 192: 4
00000100:00000001:31.0:1490199768.961599:0:27379:0:(connection.c:91:ptlrpc_connection_put()) Process entered
00000800:00000200:30.0:1490199768.961599:0:16000:0:(o2iblnd_cb.c:3652:kiblnd_scheduler()) conn[ffff880fed77f600] (198)--
00000020:00000001:26.0:1490199768.961599:0:27337:0:(tgt_handler.c:1203:tgt_blocking_ast()) Process entered
00010000:00000001:25.0:1490199768.961599:0:27333:0:(ldlm_lock.c:411:ldlm_lock_destroy_nolock()) Process entered
00000100:00000001:20.0:1490199768.961599:0:20430:0:(client.c:1277:ptlrpc_check_status()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
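
For reference, a quick way to tally the -22 statuses per OST target from the attached dump is something like this (illustrative only; assumes GNU grep and bzgrep are available):

bzgrep 'status is -22' dump.log.bz2 | grep -o 'sultan-OST[0-9a-f]*' | sort | uniq -c | sort -rn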

Are there any error messages in the syslog/console logs on the Client and OSS nodes? Could you please also gather the debug log on the OSS node? Thank you.
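
If it helps, something along these lines on the OSS node should capture it (a minimal sketch assuming the standard lctl debug facilities; adjust the debug mask and output path as needed):

lctl set_param debug=-1     # full debug; a narrower mask such as +rpctrace +dlmtrace also works
lctl clear                  # empty the current debug buffer
# ... reproduce the reconnect storm ...
lctl dk /tmp/oss-debug.log  # dump the kernel debug buffer to a file (path is just an example)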

Comment by James A Simmons [ 24/Mar/17 ]

The attached debug logs are from the OSS nodes. I'm seeing these console messages on the OSS server:

[16278.899599] Lustre: sultan-OST0000: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16278.899639] Lustre: sultan-OST000c: Connection restored to bde05182-67f0-f31e-0491-5a81dc20c8d4 (at 30@gni1)
[16278.899640] Lustre: Skipped 106 previous similar messages
[16278.927976] Lustre: Skipped 17 previous similar messages
[16315.958925] Lustre: sultan-OST0000: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16315.969803] Lustre: Skipped 21 previous similar messages
[16352.961881] Lustre: sultan-OST0008: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16352.972685] Lustre: Skipped 17 previous similar messages
[16446.989954] Lustre: sultan-OST0000: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16446.989955] Lustre: sultan-OST0004: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16446.989974] Lustre: Skipped 34 previous similar messages
[16447.018332] Lustre: Skipped 8 previous similar messages
[16517.552863] LNet: 17652:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out tx for 10.37.202.59@o2ib1: 0 seconds
[16715.012794] Lustre: sultan-OST0004: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[16715.023784] Lustre: Skipped 61 previous similar messages
[16986.035621] Lustre: sultan-OST0018: Connection restored to bde05182-67f0-f31e-0491-5a81dc20c8d4 (at 30@gni1)
[16986.046729] Lustre: Skipped 207 previous similar messages
[17385.169047] Lustre: sultan-OST0008: Client b0eed11e-57d2-c40f-9e0b-eadc461d30f5 (at 26@gni1) reconnecting
[17385.179909] Lustre: Skipped 145 previous similar messages
[17627.146131] Lustre: sultan-OST0024: Connection restored to bde05182-67f0-f31e-0491-5a81dc20c8d4 (at 30@gni1)
[17627.157290] Lustre: Skipped 191 previous similar messages
[17994.177003] Lustre: sultan-OST0010: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
[17994.187967] Lustre: Skipped 195 previous similar messages

On the client node I see in the console logs:

Lustre: Skipped 55 previous similar messages
Lustre: 6048:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 411 previous similar messages
Lustre: sultan-OST0026-osc-ffff8803bd927800: Connection restored to 10.37.248.83@o2ib1 (at 10.37.248.83@o2ib1)
Lustre: Skipped 55 previous similar messages
Lustre: sultan-OST0032-osc-ffff8803bd927800: Connection to sultan-OST0032 (at 10.37.248.83@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 54 previous similar messages
Lustre: 6050:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1490308346/real 1490308346] req@ffff8803b4dedcc0 x1562697286159168/t0(0) o400->sultan-OST0026-osc-ffff8803bd927800@10.37.248.83@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1490308353 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 6050:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 460 previous similar messages
Lustre: sultan-OST0026-osc-ffff8803bd927800: Connection to sultan-OST0026 (at 10.37.248.83@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 110 previous similar messages
Lustre: sultan-OST0026-osc-ffff8803bd927800: Connection restored to 10.37.248.83@o2ib1 (at 10.37.248.83@o2ib1)
Lustre: Skipped 167 previous similar messages

I will get the client logs as well tomorrow.

Comment by Jian Yu [ 25/Mar/17 ]

Hi James,
Could you please also run "lctl set_param printk=+neterror" on both Client and OSS nodes to see if there are any LNet network errors on console?
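
For example (a rough sketch; the kernel log location may differ on the Cray client):

lctl set_param printk=+neterror                               # on both the Client and OSS nodes
tail -f /var/log/messages | grep -i -e LNet -e LustreError    # watch for LNet errors while reproducing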

Comment by Peter Jones [ 17/Apr/17 ]

James

Do you still have this test environment set up? Is this simply due to Multi-Rail LNET not being able to work in this configuration?

Peter

Comment by Peter Jones [ 25/Apr/17 ]

James

It does not seem that this is still a live issue for you, so I will close it for the time being. If you do see this issue on more current builds and can provide enough data to debug, then we'll of course reopen it.

Peter

Comment by James A Simmons [ 26/Apr/17 ]

I would say it's safe to keep this closed. I have updated my test bed to the latest master and I'm not seeing the application failures.

Comment by Peter Jones [ 26/Apr/17 ]

Thanks for confirming, James.
