[LU-5072] Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: Created: 16/May/14 Updated: 04/Oct/19 Resolved: 01/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Rajeshwaran Ganesan | Assignee: | Hongchao Zhang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: | CentOS release 6.2 (Final) |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14003 |
| Description |
|
One of the OSS servers (OSS3) reported the following error and was rebooted: MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 |
| Comments |
| Comment by Peter Jones [ 16/May/14 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 20/May/14 ] |
|
It could be related to the network; the bulk I/O can't be aborted:

LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 11 previous similar messages
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.200.24@o2ib (153): c: 6, oc: 0, rc: 8
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 11 previous similar messages
Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 12. Is it stuck?
Lustre: 5910:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1390835719/real 0] req@ffff880115b66800 x1457743842346619/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1390835735 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 32 previous similar messages
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.201.5@o2ib (159): c: 6, oc: 0, rc: 8
LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 32 previous similar messages
...
Lustre: Service thread pid 6164 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 27 16:16:20 oss03 kernel: : Pid: 6164, comm: ll_ost_io_110
Jan 27 16:16:20 oss03 kernel: :
Jan 27 16:16:20 oss03 kernel: : Call Trace:
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c24c>] ? lock_timer_base+0x3c/0x70
Jan 27 16:16:20 oss03 kernel: : [<ffffffff814ee1d2>] schedule_timeout+0x192/0x2e0
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c360>] ? process_timeout+0x0/0x10
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa04276d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d0cff>] ptlrpc_abort_bulk+0x1bf/0x2f0 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07a30b5>] target_bulk_io+0x575/0x930 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa0427a13>] ? cfs_alloc+0x63/0x90 [libcfs]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06f6854>] ost_brw_write+0x1614/0x1850 [ost]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07979c0>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06fa6e5>] ost_handle+0x3325/0x4ba0 [ost]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d5e74>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6bd9>] ptlrpc_main+0xbb9/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c14a>] child_rip+0xa/0x20
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c140>] ? child_rip+0x0/0x20

Did you stop OSS03 while it was doing heavy I/O? |
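For reference, a minimal sketch of how LNet reachability of the peers reporting the RDMA timeouts might be checked from the OSS; the NIDs are copied from the log above, and the commands are illustrative rather than steps requested in this ticket:

# Illustrative diagnostic sketch: ping the o2ib peers that reported "Timed out RDMA".
lctl ping 10.243.200.24@o2ib
lctl ping 10.243.201.5@o2ib
# Read-only view of LNet peer and interface state on a 2.1.x-era node.
cat /proc/sys/lnet/peers
cat /proc/sys/lnet/nis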
| Comment by Rajeshwaran Ganesan [ 20/May/14 ] |
|
The issue occurred on May 9th, and OSS3 was rebooted. The error above is much older, from Jan 27. |
| Comment by Hongchao Zhang [ 23/May/14 ] |
|
Sorry, but it could be the same issue:

May 9 23:52:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.11@o2ib (no target)
May 9 23:52:03 oss03 kernel: : LustreError: Skipped 56 previous similar messages
May 9 23:52:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 8. Is it stuck?
May 9 23:52:13 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 16. Is it stuck?
May 9 23:52:19 oss03 kernel: : LustreError: 137-5: scratch-OST0002: Not available for connect from 10.243.200.32@o2ib (stopping)
May 9 23:52:19 oss03 kernel: : LustreError: Skipped 631 previous similar messages
May 9 23:52:19 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672333/real 1399672333] req@ffff88022be1ac00 x1459215404517419/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck?
May 9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck?
May 9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672358/real 1399672358] req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target)
May 9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages
May 9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck?
May 9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672383/real 1399672383] req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672408/real 1399672408] req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target)
May 9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages
May 9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck?
May 9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message
May 9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672433/real 1399672433] req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672483/real 1399672483] req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May 9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target)
May 9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages
May 9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck?
May 9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message
May 9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672558/real 1399672558] req@ffff88009a249000 x1459215404517427/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May 9 23:57:05 oss03 syslog-ng[2142]: Log statistics; processed='src.internal(s_sys#2)=13445', stamp='src.internal(s_sys#2)=1399672025', processed='destination(d_ha)=875062', processed='destination(d_ha_useful)=54291', processed='destination(d_mesg)=16138', processed='destination(d_auth)=34', processed='source(s_sys)=905384', processed='global(payload_reallocs)=26836', processed='destination(d_mail)=2', processed='destination(d_kern)=3148', processed='destination(d_mlal)=3', processed='global(msg_clones)=0', processed='destination(d_spol)=0', processed='destination(d_boot)=0', processed='global(sdata_updates)=0', processed='center(received)=0', processed='destination(d_cron)=13411', processed='center(queued)=0'

The timeout of the MGC connection request (op=250) should not be related to the issue, because the OST is shutting down and waiting for its exports to be cleaned up. Is the Lustre debug log from when the issue occurred available? It will contain information about the exports that could not be cleaned up in time, which will be helpful for tracking down the problem. |
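For reference, a minimal sketch of one way such a debug log might be captured the next time the problem reproduces; the debug mask, buffer size, and output path below are illustrative, not values requested in this ticket:

# Illustrative sketch: enlarge the debug buffer and enable RPC/network tracing beforehand.
lctl set_param debug_mb=256
lctl set_param debug="+rpctrace +dlmtrace +net"
# ... reproduce the OST shutdown that hangs on obd_unlinked_exports ...
# Dump the in-kernel debug buffer to a file that can be attached to this ticket.
lctl dk /tmp/lu-5072-debug.log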
| Comment by Nathaniel Clark [ 20/Oct/14 ] |
|
Ran into this on review-zfs on master: |
| Comment by John Fuchs-Chesney (Inactive) [ 19/May/16 ] |
|
Hello Rajesh, is there any more work for us to do on this ticket? If not, can we go ahead and mark it as 'Resolved/Won't Fix'? Thanks, |
| Comment by Hongchao Zhang [ 01/Jul/16 ] |
|
Marking the issue as "Resolved/Won't Fix"; please reopen it if more work is needed. |