  1. Lustre
  2. LU-6269

Unable to mount /nobackupp8

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Lustre 2.5.0, Lustre 2.7.0, Lustre 2.4.3
    • None
    • nbp8-mds running Lustre 2.4.3, CentOS 6.5
    • 1
    • 17570

      There appear to be some IB-related congestion problems that are destabilizing our Lustre servers. The various OSS and MDS systems will stop communicating over IB, or the metadata traffic on the file system slows down to the point where it can take minutes for an ls -l to return information. The problems come in waves: everything can be okay for a few hours, then we get a slow file system; after an hour or two it can recover and be fine for a few hours, then the same or a different Lustre file system has problems. This has been happening for a few days. Last night, /nobackupp8 went down. When the system is brought up and we mount the MDT, we get a lot of traffic to the MDS, which locks the system up. The file system has been down since last night. We'd like to talk to a Lustre engineer about this problem.

      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds
      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 31 seconds
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (181): c: 0, oc: 0, rc: 8
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 43 seconds
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (193): c: 0, oc: 0, rc: 8
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 21 seconds
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (171): c: 0, oc: 0, rc: 8
      service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:43:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457798/real 1424457798] req@ffff881f42990000 x1493641412488128/t0(0) o38->nbp8-MDT0000-lwp-OST0080@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424457903 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:53:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458398/real 1424458398] req@ffff881f4758f400 x1493641412488828/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424458503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 11:03:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458998/real 1424458998] req@ffff881f64bad000 x1493641412489500/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459103 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 09:51:43 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424454598/real 0] req@ffff881f46825400 x1493641412486068/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454703 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:02:08 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455223/real 0] req@ffff881f405d9400 x1493641412486328/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455328 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:12:33 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455848/real 0] req@ffff881f48c69c00 x1493641412486588/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455953 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:22:58 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456473/real 0] req@ffff881f4568e000 x1493641412486848/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456578 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:41:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457699/real 1424457699] req@ffff880ff2c3d400 x1493641414590864/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:52:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458324/real 1424458324] req@ffff881028c58400 x1493641414591540/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424458429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 11:02:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458924/real 1424458924] req@ffff880e3c2cb800 x1493641414592216/t0(0) o38->nbp8-MDT0000-lwp-OST011e@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459029 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:31:19 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456974/real 0] req@ffff880e669c6000 x1493641414590140/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457079 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 09:49:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424454474/real 1424454474] req@ffff880ff480b000 x1493641414589100/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454579 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:00:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455099/real 1424455099] req@ffff880ff4584c00 x1493641414589360/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455204 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:10:29 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455724/real 1424455724] req@ffff880ff310c000 x1493641414589620/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:20:54 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424456349/real 1424456349] req@ffff880db2b0d000 x1493641414589880/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456454 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
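
      For triage, the excerpt above can be summarized with a short script. This is only a sketch, not part of the original report: the regular expressions are inferred from the lines pasted in this ticket, the names summarize, RDMA_RE, EXPIRE_RE and SAMPLE are made up for illustration, and it assumes the "sent" and "dl" fields are epoch seconds (send time and request deadline). Other Lustre versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern inferred from the kiblnd_check_conns() lines in this ticket (assumption).
RDMA_RE = re.compile(
    r"kern:\w+:(?P<host>\S+) LNetError: .*Timed out RDMA with (?P<peer>\S+) \("
)

# Pattern inferred from the ptlrpc_expire_one_request() lines in this ticket (assumption).
EXPIRE_RE = re.compile(
    r"kern:\w+:(?P<host>\S+) Lustre: .*ptlrpc_expire_one_request\(\)\) @@@ "
    r"Request sent has (?P<reason>[^:]+): \[sent (?P<sent>\d+)/real \d+\].* dl (?P<dl>\d+) "
)

# Two lines copied verbatim from the excerpt above.
SAMPLE = [
    "service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8",
    "service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1",
]


def summarize(lines):
    """Count RDMA timeouts per peer NID and expired requests per (host, reason).

    Also returns the set of (dl - sent) values seen, in seconds.
    """
    rdma_peers = Counter()
    expire_reasons = Counter()
    windows = set()
    for line in lines:
        m = RDMA_RE.search(line)
        if m:
            rdma_peers[m.group("peer")] += 1
            continue
        m = EXPIRE_RE.search(line)
        if m:
            expire_reasons[(m.group("host"), m.group("reason"))] += 1
            windows.add(int(m.group("dl")) - int(m.group("sent")))
    return rdma_peers, expire_reasons, windows


if __name__ == "__main__":
    peers, reasons, windows = summarize(SAMPLE)
    print("RDMA timeouts by peer:", dict(peers))
    print("Expired requests by (host, reason):", dict(reasons))
    print("dl - sent (seconds):", sorted(windows))
```

      In the lines pasted above, every RDMA timeout names the same peer, 10.151.27.60@o2ib, and every expired request has a dl value 105 seconds after its sent value. Feeding full syslogs from each OSS through the same function would show whether that holds outside this excerpt.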

        1. nbp8-mds1.syslog
          108 kB
        2. service200.LBUG.gz
          108 kB
        3. service200.log.gz
          440 kB
        4. service200.ps.gz
          123 kB

            emoly.liu Emoly Liu
            hyeung Herbert Yeung