[LU-5072] Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: Created: 16/May/14  Updated: 04/Oct/19  Resolved: 01/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Rajeshwaran Ganesan Assignee: Hongchao Zhang
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

CentOS release 6.2 (Final)
Lustre 2.1.2


Attachments: Text File 2014-05-11-es_lustre_showall_2014-05-10_223732_mds02.tar.bz2     Text File 2014-05-12-es_lustre_showall_2014-05-12_151412_SR32706.tar.bz2    
Severity: 3
Rank (Obsolete): 14003

 Description   

One of the OSS servers (oss03) reported the following errors and got rebooted.

May 9 23:52:19 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672333/real 1399672333] req@ffff88022be1ac00 x1459215404517419/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck?
May 9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck?
May 9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672358/real 1399672358] req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target)
May 9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages
May 9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck?
May 9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672383/real 1399672383] req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672408/real 1399672408] req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target)
May 9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages
May 9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck?
May 9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message
May 9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672433/real 1399672433] req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672483/real 1399672483] req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May 9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target)
May 9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages
May 9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck?
May 9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message



 Comments   
Comment by Peter Jones [ 16/May/14 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 20/May/14 ]

It could be related to the network; the bulk I/O can't be aborted:

LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 11 previous similar messages
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.200.24@o2ib (153): c: 6, oc: 0, rc: 8
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 11 previous similar messages
Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 12. Is it stuck?
Lustre: 5910:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1390835719/real 0]  req@ffff880115b66800 x1457743842346619/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1390835735 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 32 previous similar messages
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.201.5@o2ib (159): c: 6, oc: 0, rc: 8
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 32 previous similar messages
...
Lustre: Service thread pid 6164 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 27 16:16:20 oss03 kernel: : Pid: 6164, comm: ll_ost_io_110
Jan 27 16:16:20 oss03 kernel: :
Jan 27 16:16:20 oss03 kernel: : Call Trace:
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c24c>] ? lock_timer_base+0x3c/0x70
Jan 27 16:16:20 oss03 kernel: : [<ffffffff814ee1d2>] schedule_timeout+0x192/0x2e0
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c360>] ? process_timeout+0x0/0x10
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa04276d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d0cff>] ptlrpc_abort_bulk+0x1bf/0x2f0 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07a30b5>] target_bulk_io+0x575/0x930 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa0427a13>] ? cfs_alloc+0x63/0x90 [libcfs]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06f6854>] ost_brw_write+0x1614/0x1850 [ost]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07979c0>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06fa6e5>] ost_handle+0x3325/0x4ba0 [ost]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d5e74>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6bd9>] ptlrpc_main+0xbb9/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c14a>] child_rip+0xa/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c140>] ? child_rip+0x0/0x20

Did you stop the OSS03 while doing heavy IO?

Comment by Rajeshwaran Ganesan [ 20/May/14 ]

The issue occurred on May 9th, when oss03 got rebooted. The error above is much older, from Jan 27.

Comment by Hongchao Zhang [ 23/May/14 ]

Sorry, but it could be the same issue:

May  9 23:52:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.11@o2ib (no target)
May  9 23:52:03 oss03 kernel: : LustreError: Skipped 56 previous similar messages
May  9 23:52:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 8. Is it stuck?
May  9 23:52:13 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 16. Is it stuck?
May  9 23:52:19 oss03 kernel: : LustreError: 137-5: scratch-OST0002: Not available for connect from 10.243.200.32@o2ib (stopping)
May  9 23:52:19 oss03 kernel: : LustreError: Skipped 631 previous similar messages
May  9 23:52:19 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672333/real 1399672333]  req@ffff88022be1ac00 x1459215404517419/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck?
May  9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck?
May  9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672358/real 1399672358]  req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target)
May  9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages
May  9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck?
May  9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672383/real 1399672383]  req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672408/real 1399672408]  req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target)
May  9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages
May  9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck?
May  9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message
May  9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672433/real 1399672433]  req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672483/real 1399672483]  req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May  9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target)
May  9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages
May  9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck?
May  9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message
May  9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672558/real 1399672558]  req@ffff88009a249000 x1459215404517427/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May  9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May  9 23:57:05 oss03 syslog-ng[2142]: Log statistics; processed='src.internal(s_sys#2)=13445', stamp='src.internal(s_sys#2)=1399672025', processed='destination(d_ha)=875062', processed='destination(d_ha_useful)=54291', processed='destination(d_mesg)=16138', processed='destination(d_auth)=34', processed='source(s_sys)=905384', processed='global(payload_reallocs)=26836', processed='destination(d_mail)=2', processed='destination(d_kern)=3148', processed='destination(d_mlal)=3', processed='global(msg_clones)=0', processed='destination(d_spol)=0', processed='destination(d_boot)=0', processed='global(sdata_updates)=0', processed='center(received)=0', processed='destination(d_cron)=13411', processed='center(queued)=0'

The timeout of the MGC connection request (opcode 250) should not be related to the issue, since the OST is shutting down and waiting for its exports to be cleaned up.

Is the Lustre debug log available from when the issue occurred? It will contain information about the exports that could not be cleaned up in time, which will be helpful in tracking down the problem.
Thanks
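
If the hang recurs, the debug log requested above can typically be captured with lctl on the affected OSS. This is only a sketch, assuming a standard lustre-utils install; the debug flags and output path here are illustrative, not the exact values used at this site.

```shell
# Sketch: capture the Lustre kernel debug log on the OSS node (run as root).
# The debug flags and LOGFILE path below are illustrative assumptions.
set -eu

LOGFILE="/tmp/lustre-debug.$(date +%s).log"

if command -v lctl >/dev/null 2>&1; then
    # Widen the debug mask to include RPC and lock traces before the
    # hang recurs, then dump the in-kernel debug buffer to a file.
    lctl set_param debug=+rpctrace+dlmtrace
    lctl debug_kernel "$LOGFILE"
    echo "debug log written to $LOGFILE"
else
    echo "lctl not found; run this on the OSS node itself" >&2
fi
```

The dumped file can then be attached to the ticket alongside the syslog excerpts.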

Comment by Nathaniel Clark [ 20/Oct/14 ]

Ran into this on review-zfs on master:
https://testing.hpdd.intel.com/test_sets/eeb23958-56bd-11e4-851f-5254006e85c2

Comment by John Fuchs-Chesney (Inactive) [ 19/May/16 ]

Hello Rajesh,

Is there any more work for us to do on this ticket? If not, can we go ahead and mark it as 'Resolved/Won't Fix'?

Thanks,
~ jfc.

Comment by Hongchao Zhang [ 01/Jul/16 ]

Marked the issue as "Resolved/Won't Fix"; please reopen it if more work is needed.

Generated at Sat Feb 10 01:48:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.