Lustre / LU-6269

Unable to mount /nobackupp8

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.5.0, Lustre 2.7.0, Lustre 2.4.3
    • None
    • nbp8-mds running Lustre 2.4.3, CentOS 6.5
    • 1
    • 17570

    Description

      There appear to be InfiniBand (IB) congestion problems that are destabilizing our Lustre servers. The various OSS and MDS systems either stop communicating over IB, or metadata traffic on the file system slows to the point where an ls -l can take minutes to return. The problems come in waves: everything can be fine for a few hours, then a file system becomes slow; after an hour or two it can recover and be fine for a few hours, and then the same or a different Lustre file system has problems. This has been happening for a few days. Last night, /nobackupp8 went down. When the system is brought up and we mount the MDT, we get a lot of traffic to the MDS, which locks the system up. The file system has been down since last night. We would like to talk to a Lustre engineer about this problem.

      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds
      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 31 seconds
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (181): c: 0, oc: 0, rc: 8
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 43 seconds
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (193): c: 0, oc: 0, rc: 8
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 21 seconds
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (171): c: 0, oc: 0, rc: 8
      service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:43:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457798/real 1424457798] req@ffff881f42990000 x1493641412488128/t0(0) o38->nbp8-MDT0000-lwp-OST0080@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424457903 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:53:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458398/real 1424458398] req@ffff881f4758f400 x1493641412488828/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424458503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 11:03:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458998/real 1424458998] req@ffff881f64bad000 x1493641412489500/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459103 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 09:51:43 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424454598/real 0] req@ffff881f46825400 x1493641412486068/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454703 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:02:08 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455223/real 0] req@ffff881f405d9400 x1493641412486328/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455328 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:12:33 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455848/real 0] req@ffff881f48c69c00 x1493641412486588/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455953 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:22:58 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456473/real 0] req@ffff881f4568e000 x1493641412486848/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456578 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:41:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457699/real 1424457699] req@ffff880ff2c3d400 x1493641414590864/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:52:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458324/real 1424458324] req@ffff881028c58400 x1493641414591540/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424458429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 11:02:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458924/real 1424458924] req@ffff880e3c2cb800 x1493641414592216/t0(0) o38->nbp8-MDT0000-lwp-OST011e@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459029 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:31:19 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456974/real 0] req@ffff880e669c6000 x1493641414590140/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457079 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 09:49:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424454474/real 1424454474] req@ffff880ff480b000 x1493641414589100/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454579 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:00:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455099/real 1424455099] req@ffff880ff4584c00 x1493641414589360/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455204 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:10:29 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455724/real 1424455724] req@ffff880ff310c000 x1493641414589620/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:20:54 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424456349/real 1424456349] req@ffff880db2b0d000 x1493641414589880/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456454 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
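
      For anyone triaging the excerpt above, here is a rough sketch of how the messages could be tallied per server and per peer NID. It is not part of the original report: the function name summarize, the regular expressions, and the list of reason strings are my own assumptions, derived only from the line format shown in this ticket. The sample lines are copied verbatim from the excerpt.

```python
import re
from collections import Counter

# Assumed syslog prefix, based only on the lines in this ticket:
#   "<console>: <Mon DD HH:MM:SS> kern:<level>:<host> <message>"
LINE_RE = re.compile(
    r"^\s*\S+:\s+\w{3}\s+\d+\s+[\d:]{8}\s+kern:\w+:(?P<host>\S+)\s+(?P<msg>.*)$"
)
# An LNet NID on an o2ib network, e.g. "10.151.27.60@o2ib".
NID_RE = re.compile(r"\d+\.\d+\.\d+\.\d+@o2ib\d*")

# Reason substrings that appear in the excerpt (not an exhaustive list).
REASONS = (
    "Timed out tx",
    "Timed out RDMA",
    "failed due to network error",
    "timed out for sent delay",
    "timed out for slow reply",
)

SAMPLE = """
service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds
service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8
service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 09:49:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424454474/real 1424454474] req@ffff880ff480b000 x1493641414589100/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454579 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
"""


def summarize(lines):
    """Count messages per (host, reason) and per peer NID."""
    by_reason = Counter()
    by_peer = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # blank or unrecognized line
        msg = m.group("msg")
        reason = next((r for r in REASONS if r in msg), None)
        if reason is None:
            continue
        by_reason[(m.group("host"), reason)] += 1
        nids = NID_RE.findall(msg)
        if nids:
            # In the RPC lines the target name comes first and the peer NID
            # last, so the last NID on the line is taken as the peer.
            by_peer[nids[-1]] += 1
    return by_reason, by_peer


if __name__ == "__main__":
    reasons, peers = summarize(SAMPLE.splitlines())
    for (host, reason), count in sorted(reasons.items()):
        print(f"{host:12s} {reason:30s} {count}")
    for nid, count in peers.most_common():
        print(f"peer {nid}: {count}")
```

      Run against the full excerpt, a tally like this makes it easier to see whether the timeouts from different OSS nodes all name the same peer NID, which in the lines above is 10.151.27.60@o2ib.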

      Attachments

        1. service200.ps.gz
          123 kB
        2. service200.log.gz
          440 kB
        3. service200.LBUG.gz
          108 kB
        4. nbp8-mds1.syslog
          108 kB

            People

              emoly.liu Emoly Liu
              hyeung Herbert Yeung
              Votes: 0
              Watchers: 11