Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5072

Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply:

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 2.1.2
    • None
    • CentOS release 6.2 (Final)
      Lustre 2.1.2
    • 3
    • 14003

    Description

      One of the OSS server (Oss3) had the following error. and it got rebooted.

      MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck?
      May 9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck?
      May 9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672358/real 1399672358] req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target)
      May 9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages
      May 9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck?
      May 9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672383/real 1399672383] req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672408/real 1399672408] req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target)
      May 9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages
      May 9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck?
      May 9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message
      May 9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672433/real 1399672433] req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672483/real 1399672483] req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      May 9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target)
      May 9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages
      May 9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck?
      May 9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message

      Attachments

        Activity

          [LU-5072] Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply:

          Mark the issue as "Resolve/Won't Fix", and please reopen it if more work is needed.

          hongchao.zhang Hongchao Zhang added a comment - Mark the issue as "Resolve/Won't Fix", and please reopen it if more work is needed.

          Hello Rajesh,

          Is there any more work for us to do on this ticket? If not, can we go ahead and mark it as 'Resolved/Won't Fix'?

          Thanks,
          ~ jfc.

          jfc John Fuchs-Chesney (Inactive) added a comment - - edited Hello Rajesh, Is there any more work for us to do on this ticket? If not, can we go ahead and mark it as 'Resolved/Won't Fix'? Thanks, ~ jfc.
          utopiabound Nathaniel Clark added a comment - Ran into this on review-zfs on master: https://testing.hpdd.intel.com/test_sets/eeb23958-56bd-11e4-851f-5254006e85c2

          sorry, but it could be the same issue,

          May  9 23:52:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.11@o2ib (no target)
          May  9 23:52:03 oss03 kernel: : LustreError: Skipped 56 previous similar messages
          May  9 23:52:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 8. Is it stuck?
          May  9 23:52:13 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 16. Is it stuck?
          May  9 23:52:19 oss03 kernel: : LustreError: 137-5: scratch-OST0002: Not available for connect from 10.243.200.32@o2ib (stopping)
          May  9 23:52:19 oss03 kernel: : LustreError: Skipped 631 previous similar messages
          May  9 23:52:19 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672333/real 1399672333]  req@ffff88022be1ac00 x1459215404517419/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck?
          May  9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck?
          May  9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672358/real 1399672358]  req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target)
          May  9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages
          May  9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck?
          May  9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672383/real 1399672383]  req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672408/real 1399672408]  req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target)
          May  9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages
          May  9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck?
          May  9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message
          May  9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672433/real 1399672433]  req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672483/real 1399672483]  req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
          May  9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target)
          May  9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages
          May  9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck?
          May  9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message
          May  9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1399672558/real 1399672558]  req@ffff88009a249000 x1459215404517427/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          May  9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message
          May  9 23:57:05 oss03 syslog-ng[2142]: Log statistics; processed='src.internal(s_sys#2)=13445', stamp='src.internal(s_sys#2)=1399672025', processed='destination(d_ha)=875062', processed='destination(d_ha_useful)=54291', processed='destination(d_mesg)=16138', processed='destination(d_auth)=34', processed='source(s_sys)=905384', processed='global(payload_reallocs)=26836', processed='destination(d_mail)=2', processed='destination(d_kern)=3148', processed='destination(d_mlal)=3', processed='global(msg_clones)=0', processed='destination(d_spol)=0', processed='destination(d_boot)=0', processed='global(sdata_updates)=0', processed='center(received)=0', processed='destination(d_cron)=13411', processed='center(queued)=0'
          

          the timeout of MGC connection request(op=250) should be not related to the issue, for the OST is shutting down and waiting for the exports to be cleaned up.

          Is the Lustre debug log available when the issue occurs? for it will contain the info about the exports which can't be cleanup in time and will be helpful to track the problem.
          Thanks

          hongchao.zhang Hongchao Zhang added a comment - sorry, but it could be the same issue, May 9 23:52:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.11@o2ib (no target) May 9 23:52:03 oss03 kernel: : LustreError: Skipped 56 previous similar messages May 9 23:52:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 8. Is it stuck? May 9 23:52:13 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 16. Is it stuck? May 9 23:52:19 oss03 kernel: : LustreError: 137-5: scratch-OST0002: Not available for connect from 10.243.200.32@o2ib (stopping) May 9 23:52:19 oss03 kernel: : LustreError: Skipped 631 previous similar messages May 9 23:52:19 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672333/real 1399672333] req@ffff88022be1ac00 x1459215404517419/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:52:28 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 8. Is it stuck? May 9 23:52:29 oss03 kernel: : Lustre: scratch-OST0003 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 16. Is it stuck? May 9 23:52:49 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672358/real 1399672358] req@ffff8802ee5cac00 x1459215404517420/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:52:51 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.243.200.25@o2ib (no target) May 9 23:52:51 oss03 kernel: : LustreError: Skipped 445 previous similar messages May 9 23:53:00 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 8. Is it stuck? May 9 23:53:14 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672383/real 1399672383] req@ffff8802f2b72400 x1459215404517421/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672394 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:53:44 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672408/real 1399672408] req@ffff8802f2db2c00 x1459215404517422/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.16@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672424 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:53:55 oss03 kernel: : LustreError: 137-5: home-OST0002: Not available for connect from 10.243.201.4@o2ib (no target) May 9 23:53:55 oss03 kernel: : LustreError: Skipped 1562 previous similar messages May 9 23:54:04 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 8. Is it stuck? May 9 23:54:04 oss03 kernel: : Lustre: Skipped 1 previous similar message May 9 23:54:09 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672433/real 1399672433] req@ffff8802ef55bc00 x1459215404517423/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672483/real 1399672483] req@ffff88007b818800 x1459215404517425/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672504 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:55:04 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message May 9 23:56:03 oss03 kernel: : LustreError: 137-5: home-OST0003: Not available for connect from 10.248.2.21@tcp1 (no target) May 9 23:56:03 oss03 kernel: : LustreError: Skipped 3114 previous similar messages May 9 23:56:12 oss03 kernel: : Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 8. Is it stuck? May 9 23:56:12 oss03 kernel: : Lustre: Skipped 1 previous similar message May 9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1399672558/real 1399672558] req@ffff88009a249000 x1459215404517427/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1399672584 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 9 23:56:24 oss03 kernel: : Lustre: 7036:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 1 previous similar message May 9 23:57:05 oss03 syslog-ng[2142]: Log statistics; processed='src.internal(s_sys#2)=13445', stamp='src.internal(s_sys#2)=1399672025', processed='destination(d_ha)=875062', processed='destination(d_ha_useful)=54291', processed='destination(d_mesg)=16138', processed='destination(d_auth)=34', processed='source(s_sys)=905384', processed='global(payload_reallocs)=26836', processed='destination(d_mail)=2', processed='destination(d_kern)=3148', processed='destination(d_mlal)=3', processed='global(msg_clones)=0', processed='destination(d_spol)=0', processed='destination(d_boot)=0', processed='global(sdata_updates)=0', processed='center(received)=0', processed='destination(d_cron)=13411', processed='center(queued)=0' the timeout of MGC connection request(op=250) should be not related to the issue, for the OST is shutting down and waiting for the exports to be cleaned up. Is the Lustre debug log available when the issue occurs? for it will contain the info about the exports which can't be cleanup in time and will be helpful to track the problem. Thanks

          The issue occurred on May 9th. The OSS3 got rebooted, Above error is very older Jan 27.

          rganesan@ddn.com Rajeshwaran Ganesan added a comment - The issue occurred on May 9th. The OSS3 got rebooted, Above error is very older Jan 27.

          It could be related the network, the bulk io can't be aborted,

          LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 11 previous similar messages
          LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.200.24@o2ib (153): c: 6, oc: 0, rc: 8
          LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 11 previous similar messages
          Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 12. Is it stuck?
          Lustre: 5910:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1390835719/real 0]  req@ffff880115b66800 x1457743842346619/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1390835735 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 32 previous similar messages
          LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.201.5@o2ib (159): c: 6, oc: 0, rc: 8
          LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 32 previous similar messages
          ...
          Lustre: Service thread pid 6164 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
          Jan 27 16:16:20 oss03 kernel: : Pid: 6164, comm: ll_ost_io_110
          Jan 27 16:16:20 oss03 kernel: :
          Jan 27 16:16:20 oss03 kernel: : Call Trace:
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c24c>] ? lock_timer_base+0x3c/0x70
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff814ee1d2>] schedule_timeout+0x192/0x2e0
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c360>] ? process_timeout+0x0/0x10
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa04276d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d0cff>] ptlrpc_abort_bulk+0x1bf/0x2f0 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07a30b5>] target_bulk_io+0x575/0x930 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa0427a13>] ? cfs_alloc+0x63/0x90 [libcfs]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06f6854>] ost_brw_write+0x1614/0x1850 [ost]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07979c0>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06fa6e5>] ost_handle+0x3325/0x4ba0 [ost]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d5e74>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6bd9>] ptlrpc_main+0xbb9/0x1990 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c14a>] child_rip+0xa/0x20
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
          Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c140>] ? child_rip+0x0/0x20
          

          Did you stop the OSS03 while doing heavy IO?

          hongchao.zhang Hongchao Zhang added a comment - It could be related the network, the bulk io can't be aborted, LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 11 previous similar messages LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.200.24@o2ib (153): c: 6, oc: 0, rc: 8 LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 11 previous similar messages Lustre: scratch-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 12. Is it stuck? Lustre: 5910:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1390835719/real 0] req@ffff880115b66800 x1457743842346619/t0(0) o250->MGC10.243.12.16@o2ib@10.243.12.17@o2ib:26/25 lens 368/512 e 0 to 1 dl 1390835735 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds LustreError: 5892:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 32 previous similar messages LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.243.201.5@o2ib (159): c: 6, oc: 0, rc: 8 LustreError: 5892:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 32 previous similar messages ... Lustre: Service thread pid 6164 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jan 27 16:16:20 oss03 kernel: : Pid: 6164, comm: ll_ost_io_110 Jan 27 16:16:20 oss03 kernel: : Jan 27 16:16:20 oss03 kernel: : Call Trace: Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c24c>] ? lock_timer_base+0x3c/0x70 Jan 27 16:16:20 oss03 kernel: : [<ffffffff814ee1d2>] schedule_timeout+0x192/0x2e0 Jan 27 16:16:20 oss03 kernel: : [<ffffffff8107c360>] ? process_timeout+0x0/0x10 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa04276d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d0cff>] ptlrpc_abort_bulk+0x1bf/0x2f0 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07a30b5>] target_bulk_io+0x575/0x930 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa0427a13>] ? cfs_alloc+0x63/0x90 [libcfs] Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06f6854>] ost_brw_write+0x1614/0x1850 [ost] Jan 27 16:16:20 oss03 kernel: : [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07979c0>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa06fa6e5>] ost_handle+0x3325/0x4ba0 [ost] Jan 27 16:16:20 oss03 kernel: : [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20 Jan 27 16:16:20 oss03 kernel: : [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07d5e74>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6bd9>] ptlrpc_main+0xbb9/0x1990 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c14a>] child_rip+0xa/0x20 Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffffa07e6020>] ? ptlrpc_main+0x0/0x1990 [ptlrpc] Jan 27 16:16:20 oss03 kernel: : [<ffffffff8100c140>] ? child_rip+0x0/0x20 Did you stop the OSS03 while doing heavy IO?
          pjones Peter Jones added a comment -

          Hongchao

          Could you please look into this one?

          Thanks

          Peter

          pjones Peter Jones added a comment - Hongchao Could you please look into this one? Thanks Peter

          People

            hongchao.zhang Hongchao Zhang
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: