Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0, Lustre 2.7.0, Lustre 2.4.3
    • Labels: None
    • Environment: nbp8-mds running Lustre 2.4.3, CentOS 6.5
    • Severity: 1
    • Rank: 17570

    Description

      There appear to be some IB-related congestion problems that are destabilizing our Lustre servers. The various OSS and MDS systems will stop communicating over IB, or metadata traffic on the file system slows down to the point where an ls -l can take minutes to return. The problems come in waves: everything can be fine for a few hours, then we get a slow file system; after an hour or two it can recover and be fine for a few hours, then the same or a different Lustre file system has problems. This has been happening for a few days. Last night /nobackupp8 went down. When the system was brought back up and we mounted the MDT, we got a flood of traffic to the MDS, which locked the system up. The file system has been down since last night. We'd like to talk to a Lustre engineer about this problem.

      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds
      service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 31 seconds
      service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (181): c: 0, oc: 0, rc: 8
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 43 seconds
      service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (193): c: 0, oc: 0, rc: 8
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 21 seconds
      service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (171): c: 0, oc: 0, rc: 8
      service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:43:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457798/real 1424457798] req@ffff881f42990000 x1493641412488128/t0(0) o38->nbp8-MDT0000-lwp-OST0080@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424457903 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:53:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458398/real 1424458398] req@ffff881f4758f400 x1493641412488828/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424458503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 11:03:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458998/real 1424458998] req@ffff881f64bad000 x1493641412489500/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459103 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 09:51:43 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424454598/real 0] req@ffff881f46825400 x1493641412486068/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454703 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:02:08 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455223/real 0] req@ffff881f405d9400 x1493641412486328/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455328 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:12:33 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455848/real 0] req@ffff881f48c69c00 x1493641412486588/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455953 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service226: Feb 20 10:22:58 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456473/real 0] req@ffff881f4568e000 x1493641412486848/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456578 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:41:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457699/real 1424457699] req@ffff880ff2c3d400 x1493641414590864/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:52:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458324/real 1424458324] req@ffff881028c58400 x1493641414591540/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424458429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 11:02:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458924/real 1424458924] req@ffff880e3c2cb800 x1493641414592216/t0(0) o38->nbp8-MDT0000-lwp-OST011e@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459029 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:31:19 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456974/real 0] req@ffff880e669c6000 x1493641414590140/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457079 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 09:49:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424454474/real 1424454474] req@ffff880ff480b000 x1493641414589100/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454579 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:00:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455099/real 1424455099] req@ffff880ff4584c00 x1493641414589360/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455204 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:10:29 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455724/real 1424455724] req@ffff880ff310c000 x1493641414589620/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      service202: Feb 20 10:20:54 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424456349/real 1424456349] req@ffff880db2b0d000 x1493641414589880/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456454 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
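
      For context, the kiblnd_check_txs_locked()/kiblnd_check_conns() lines above come from LNet's periodic connection scan, which walks each connection's transmit queues and tears down connections whose transmits have passed their deadline. A simplified sketch of that pattern follows; this is illustrative only, not the actual o2iblnd source, and all types and names here are stand-ins.

          /* Illustrative sketch of the periodic timeout scan behind the
           * "Timed out tx" messages above. Types and helpers are stand-ins,
           * not the actual Lustre o2iblnd code. */
          #include <stddef.h>
          #include <time.h>

          struct tx {
              struct tx *next;
              time_t     deadline;   /* when this transmit must have completed */
          };

          /* Return nonzero if any queued transmit has exceeded its deadline.
           * The real code then logs the peer NID ("Timed out RDMA with ...")
           * and closes the connection so it can be re-established. */
          static int check_txs(struct tx *queue, time_t now)
          {
              for (struct tx *t = queue; t != NULL; t = t->next)
                  if (now > t->deadline)
                      return 1;   /* "Timed out tx: tx_queue, N seconds" */
              return 0;
          }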

      Attachments

        1. nbp8-mds1.syslog
          108 kB
        2. service200.ps.gz
          123 kB
        3. service200.LBUG.gz
          108 kB
        4. service200.log.gz
          440 kB

        Issue Links

          Activity

            [LU-6269] Unable to mount /nobackupp8

            jaylan Jay Lan (Inactive) added a comment -

            Thank you for the clarification, Peter!
            pjones Peter Jones added a comment -

            Jay

            Review #14806 has been deprecated in favor of the approach being taken under LU-5297. This is a long-term improvement and need not be something you rush to pick up as a fix. You only ran into this issue in the first place due to confusion about patch versions that should no longer be a factor.

            Peter

            jaylan Jay Lan (Inactive) added a comment - edited

            Peter, do we need LU-5297 in 2.4.3 and 2.5.3? And I guess I need Review #14806?
            pjones Peter Jones added a comment -

            The remaining work here will be tracked under LU-5297


            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/14806
            Subject: LU-6269 osp: decrease opd_syn_changes for an invalid rec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1ecf3aff71b3ab3b6c6432d033535e31243878dc
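
            The patch subject suggests the OSP sync thread was leaving its pending-changes counter (opd_syn_changes, named in the subject above) undecremented when it skipped an invalid llog record, so the counter never drained. A schematic sketch of that idea, under assumed types and helpers (this is not the actual patch):

                /* Schematic of "decrease opd_syn_changes for an invalid rec":
                 * when the sync thread skips a record it cannot process, it
                 * must still decrement the pending-changes counter. All types
                 * and helpers here are hypothetical stand-ins. */
                struct osp_device {
                    int opd_syn_changes;   /* records still awaiting sync */
                };

                enum rec_status { REC_OK, REC_INVALID };

                static void osp_sync_process_record(struct osp_device *d,
                                                    enum rec_status status)
                {
                    if (status == REC_INVALID) {
                        /* Before the fix, bailing out here left
                         * opd_syn_changes inflated forever; the skipped
                         * record must be accounted for too. */
                        d->opd_syn_changes--;
                        return;
                    }
                    /* ... dispatch the valid record to the OST, decrementing
                     * once its commit callback fires ... */
                }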

            niu Niu Yawei (Inactive) added a comment -

            The OSP sync thread can't handle an invalid record properly; see LU-5297.
            pjones Peter Jones added a comment -

            Emoly

            Could you please turn Oleg's suggestion here - https://jira.hpdd.intel.com/browse/LU-6269?focusedCommentId=107620&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-107620 - into a proper patch? Oleg has also suggested that you use a fail_loc to trigger this condition.

            Thanks

            Peter

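            For readers unfamiliar with fail_loc: it is Lustre's fault-injection mechanism, where a test sets a fail_loc value from userspace (e.g. lctl set_param fail_loc=0x...) and kernel code checks it to force a rare error path; the real kernel-side check is the OBD_FAIL_CHECK() macro. A minimal sketch of the pattern, with a made-up fail_loc number and stand-in helper names:

                /* Sketch of Lustre-style fault injection. The constant and
                 * helper below are made up for illustration; the real code
                 * uses the OBD_FAIL_CHECK() macro against values set via
                 * "lctl set_param fail_loc=...". */
                #define FAIL_OSP_INVALID_REC  0x2001   /* hypothetical value */

                static unsigned long fail_loc;          /* set from userspace */

                static int fail_check(unsigned long id)
                {
                    return fail_loc == id;
                }

                static int record_is_valid(void)
                {
                    /* Test hook: pretend the record is corrupted so the
                     * invalid-record handling gets exercised. */
                    if (fail_check(FAIL_OSP_INVALID_REC))
                        return 0;
                    return 1;   /* real validity checks would go here */
                }
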
            green Oleg Drokin added a comment -

            The b2_5 version of LU-4345 http://review.whamcloud.com/11435 already includes the LU-5188 patch as well, so you should be fine.


            jaylan Jay Lan (Inactive) added a comment -

            Oleg, I just updated nas-2.5.3 to today's b2_5. It already has LU-4345, but b2_5 does not have LU-5188. Do I need to cherry-pick LU-5188 from master to nas-2.5.3?
            green Oleg Drokin added a comment -

            Well, once you have LU-4345 patch set 7, the only reason for my patch would be to handle on-disk corruption of an llog record - i.e. if an llog record somehow becomes corrupted in such a way that lsr_valid becomes invalid and triggers the check, you'll get a crash exactly like the one you had after applying the updated patch from LU-5188. The chance of that is much smaller than before, so I imagine it's OK not to carry it until we hash out a final version.

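            The defensive handling Oleg describes might look like the sketch below: reject a record whose lsr_valid mask carries unknown bits, skipping it instead of asserting. lsr_valid is the field he names; the struct, flag values, and helper are stand-ins, not the actual patch.

                /* Sketch of skipping a corrupted llog record instead of
                 * crashing (LBUG) when its lsr_valid mask has unknown bits.
                 * Everything except the lsr_valid field name is hypothetical. */
                #define VALID_KNOWN_FLAGS  0x3u   /* hypothetical: bits we understand */

                struct setattr_rec {
                    unsigned int lsr_valid;       /* which attributes the record carries */
                };

                /* Return 0 to process the record, -1 to skip it as corrupted. */
                static int check_rec(const struct setattr_rec *rec)
                {
                    if (rec->lsr_valid & ~VALID_KNOWN_FLAGS)
                        return -1;   /* corrupted on disk: skip, don't assert */
                    return 0;
                }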

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: hyeung Herbert Yeung
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: