Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4875

We have 2 OSS server in HA and two MDS in HA , On each OSS 12 OSTs are mounted per OSS with faiover. OSS servers get reboots while working

Details

    • Bug
    • Resolution: Incomplete
    • Critical
    • None
    • Lustre 2.2.0
    • None
    • 3
    • 13484

    Description

      We have 2 OSS servers in HA with corosync. Each OSS has 12 OSTs mounted in failover. While working intermittently OSS server get reboots frequently, This is affecting the availability of file system badly.

      Attachments

        Activity

          [LU-4875] We have 2 OSS server in HA and two MDS in HA , On each OSS 12 OSTs are mounted per OSS with faiover. OSS servers get reboots while working

          I'm marking this as resolved/incomplete because we have not been able to get sufficient appropriate information to debug the problem any further and the ticket is now more than six months old.

          ~ jfc.

          jfc John Fuchs-Chesney (Inactive) added a comment - I'm marking this as resolved/incomplete because we have not been able to get sufficient appropriate information to debug the problem any further and the ticket is now more than six months old. ~ jfc.

          I will watch this.
          ~ jfc.

          jfc John Fuchs-Chesney (Inactive) added a comment - I will watch this. ~ jfc.

          Just wanted to clarify that this is NOT DDN+Intel supported Lustre and neither it is on DDN hardware. IIT Kanpur is DDN+Intel supported customer but for another system they procured. Not this one.

          atulvid Atul Vidwansa (Inactive) added a comment - Just wanted to clarify that this is NOT DDN+Intel supported Lustre and neither it is on DDN hardware. IIT Kanpur is DDN+Intel supported customer but for another system they procured. Not this one.

          Hi Andreas,

          Please update.

          Regards,

          Pankaj

          psharma Pankaj Sharma (Inactive) added a comment - Hi Andreas, Please update. Regards, Pankaj

          Pls have a look at above facts and help us

          psharma Pankaj Sharma (Inactive) added a comment - Pls have a look at above facts and help us

          As per the logs of 12 April , system rebooted at 2:28:07 prior to this at 02:07 it shows overload but what exactly this mean we don't know as CPU utilization was not high.

          psharma Pankaj Sharma (Inactive) added a comment - As per the logs of 12 April , system rebooted at 2:28:07 prior to this at 02:07 it shows overload but what exactly this mean we don't know as CPU utilization was not high.

          Uploaded log messages from OSS1 shows overloaded and hung threads as well. Can this be cause of reboot.

          psharma Pankaj Sharma (Inactive) added a comment - Uploaded log messages from OSS1 shows overloaded and hung threads as well. Can this be cause of reboot.

          Also during last reboot on 12 April 14, we have observed following error messages, I f we can relate something with this

          Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248435/real

          Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 73468 previous similar messages
          Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248439/real 1397248439] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248446 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 162386 previous similar messages
          Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248443/real 1397248443] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248450 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 296701 previous similar messages

          1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:04:27 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 1301167 previous similar messages
          Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248499/real 1397248499] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248506 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 2602376 previous similar messages
          Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248563/real

          1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 5199842 previous similar messages
          Apr 12 02:07:14 homeoss1 kernel: Lustre: Service thread pid 7390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

          Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
          Apr 12 02:07:14 homeoss1 kernel:
          Apr 12 02:07:14 homeoss1 kernel: LustreError: dumping log to /tmp/lustre-log.1397248634.7390
          Apr 12 02:07:25 homeoss1 kernel: Lustre: home-OST0003: haven't heard from client c0bee620-b606-8adf-dadd-0d895330a3fd (at 10.2.1.252@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88090ff17000, cur 1397248645 expire 1397248495 last 1397248418
          Apr 12 02:07:25 homeoss1 kernel: LustreError: 7390:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248652 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
          Apr 12 02:07:25 homeoss1 kernel: LustreError: 138-a: home-OST0005: A client on nid 10.2.1.252@o2ib was evicted due to a lock glimpse callback time out: rc -4
          Apr 12 02:07:25 homeoss1 kernel: Lustre: Service thread pid 7390 completed after 210.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

          Apr 12 02:28:07 homeoss1 crmd: [5195]: notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete
          Apr 12 02:28:07 homeoss1 crmd: [5195]: info: te_graph_trigger: Transition 256 is now complete
          Apr 12 02:28:07 homeoss1 crmd: [5195]: info: notify_crmd: Transition 256 status: done - <null>
          Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
          Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: Starting PEngine Recheck Timer
          Apr 12 02:31:51 homeoss1 cib: [5191]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
          Apr 12 02:38:25 homeoss1 kernel: imklog 4.6.2, log source = /proc/kmsg started.
          Apr 12 02:38:25 homeoss1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="7609" x-info="http://www.rsyslog.com"] (re)start

          psharma Pankaj Sharma (Inactive) added a comment - Also during last reboot on 12 April 14, we have observed following error messages, I f we can relate something with this Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248435/real Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 73468 previous similar messages Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248439/real 1397248439] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248446 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 162386 previous similar messages Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248443/real 1397248443] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248450 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 296701 previous similar messages 1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:27 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 1301167 previous similar messages Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248499/real 1397248499] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248506 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 2602376 previous similar messages Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248563/real 1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 5199842 previous similar messages Apr 12 02:07:14 homeoss1 kernel: Lustre: Service thread pid 7390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Apr 12 02:07:14 homeoss1 kernel: Apr 12 02:07:14 homeoss1 kernel: LustreError: dumping log to /tmp/lustre-log.1397248634.7390 Apr 12 02:07:25 homeoss1 kernel: Lustre: home-OST0003: haven't heard from client c0bee620-b606-8adf-dadd-0d895330a3fd (at 10.2.1.252@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88090ff17000, cur 1397248645 expire 1397248495 last 1397248418 Apr 12 02:07:25 homeoss1 kernel: LustreError: 7390:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248652 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:07:25 homeoss1 kernel: LustreError: 138-a: home-OST0005: A client on nid 10.2.1.252@o2ib was evicted due to a lock glimpse callback time out: rc -4 Apr 12 02:07:25 homeoss1 kernel: Lustre: Service thread pid 7390 completed after 210.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 12 02:28:07 homeoss1 crmd: [5195] : notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete Apr 12 02:28:07 homeoss1 crmd: [5195] : info: te_graph_trigger: Transition 256 is now complete Apr 12 02:28:07 homeoss1 crmd: [5195] : info: notify_crmd: Transition 256 status: done - <null> Apr 12 02:28:07 homeoss1 crmd: [5195] : info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Apr 12 02:28:07 homeoss1 crmd: [5195] : info: do_state_transition: Starting PEngine Recheck Timer Apr 12 02:31:51 homeoss1 cib: [5191] : info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min Apr 12 02:38:25 homeoss1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Apr 12 02:38:25 homeoss1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="7609" x-info="http://www.rsyslog.com"] (re)start

          Also just to reduce the load we were trying to reduce the thread count on OSS, but while setting thread count to 128 on OSS2 we are getting following error
          [root@homeoss2 ~]# lctl set_param ost.OSS.ost.threads_max=128
          ost.OSS.ost.threads_max=128
          error: set_param: writing to file /proc/fs/lustre/ost/OSS/ost/threads_max: Numerical result out of range

          psharma Pankaj Sharma (Inactive) added a comment - Also just to reduce the load we were trying to reduce the thread count on OSS, but while setting thread count to 128 on OSS2 we are getting following error [root@homeoss2 ~] # lctl set_param ost.OSS.ost.threads_max=128 ost.OSS.ost.threads_max=128 error: set_param: writing to file /proc/fs/lustre/ost/OSS/ost/threads_max: Numerical result out of range

          Hi Andreas,

          At reboot following error appears at console
          "Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
          kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) ASSERTION( lock->l_req_mode == lock->l_granted_mode ) failed:

          Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
          kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) LBUG

          Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
          kernel:Kernel panic - not syncing: LBUG
          "

          psharma Pankaj Sharma (Inactive) added a comment - Hi Andreas, At reboot following error appears at console "Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) ASSERTION( lock->l_req_mode == lock->l_granted_mode ) failed: Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) LBUG Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:Kernel panic - not syncing: LBUG "

          People

            jfc John Fuchs-Chesney (Inactive)
            psharma Pankaj Sharma (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: