[LU-4875] We have 2 OSS server in HA and two MDS in HA , On each OSS 12 OSTs are mounted per OSS with faiover. OSS servers get reboots while working - Whamcloud Community JIRA

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:14 AM

Pls have a look at above facts and help us

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:14 AM Pls have a look at above facts and help us

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:14 AM

As per the logs of 12 April , system rebooted at 2:28:07 prior to this at 02:07 it shows overload but what exactly this mean we don't know as CPU utilization was not high.

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:14 AM As per the logs of 12 April , system rebooted at 2:28:07 prior to this at 02:07 it shows overload but what exactly this mean we don't know as CPU utilization was not high.

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:11 AM

Uploaded log messages from OSS1 shows overloaded and hung threads as well. Can this be cause of reboot.

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:11 AM Uploaded log messages from OSS1 shows overloaded and hung threads as well. Can this be cause of reboot.

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:10 AM

Also during last reboot on 12 April 14, we have observed following error messages, I f we can relate something with this

Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248435/real

Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 73468 previous similar messages
Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248439/real 1397248439] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248446 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 162386 previous similar messages
Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248443/real 1397248443] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248450 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 296701 previous similar messages

1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:27 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 1301167 previous similar messages
Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248499/real 1397248499] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248506 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 2602376 previous similar messages
Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248563/real

1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 5199842 previous similar messages
Apr 12 02:07:14 homeoss1 kernel: Lustre: Service thread pid 7390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Apr 12 02:07:14 homeoss1 kernel:
Apr 12 02:07:14 homeoss1 kernel: LustreError: dumping log to /tmp/lustre-log.1397248634.7390
Apr 12 02:07:25 homeoss1 kernel: Lustre: home-OST0003: haven't heard from client c0bee620-b606-8adf-dadd-0d895330a3fd (at 10.2.1.252@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88090ff17000, cur 1397248645 expire 1397248495 last 1397248418
Apr 12 02:07:25 homeoss1 kernel: LustreError: 7390:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248652 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:07:25 homeoss1 kernel: LustreError: 138-a: home-OST0005: A client on nid 10.2.1.252@o2ib was evicted due to a lock glimpse callback time out: rc -4
Apr 12 02:07:25 homeoss1 kernel: Lustre: Service thread pid 7390 completed after 210.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Apr 12 02:28:07 homeoss1 crmd: [5195]: notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: te_graph_trigger: Transition 256 is now complete
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: notify_crmd: Transition 256 status: done - <null>
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 12 02:31:51 homeoss1 cib: [5191]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
Apr 12 02:38:25 homeoss1 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Apr 12 02:38:25 homeoss1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="7609" x-info="http://www.rsyslog.com"] (re)start

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:10 AM Also during last reboot on 12 April 14, we have observed following error messages, I f we can relate something with this Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248435/real Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 73468 previous similar messages Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248439/real 1397248439] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248446 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 162386 previous similar messages Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248443/real 1397248443] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248450 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 296701 previous similar messages 1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:27 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 1301167 previous similar messages Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248499/real 1397248499] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248506 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 2602376 previous similar messages Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248563/real 1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 5199842 previous similar messages Apr 12 02:07:14 homeoss1 kernel: Lustre: Service thread pid 7390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Apr 12 02:07:14 homeoss1 kernel: Apr 12 02:07:14 homeoss1 kernel: LustreError: dumping log to /tmp/lustre-log.1397248634.7390 Apr 12 02:07:25 homeoss1 kernel: Lustre: home-OST0003: haven't heard from client c0bee620-b606-8adf-dadd-0d895330a3fd (at 10.2.1.252@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88090ff17000, cur 1397248645 expire 1397248495 last 1397248418 Apr 12 02:07:25 homeoss1 kernel: LustreError: 7390:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248652 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Apr 12 02:07:25 homeoss1 kernel: LustreError: 138-a: home-OST0005: A client on nid 10.2.1.252@o2ib was evicted due to a lock glimpse callback time out: rc -4 Apr 12 02:07:25 homeoss1 kernel: Lustre: Service thread pid 7390 completed after 210.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 12 02:28:07 homeoss1 crmd: [5195] : notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete Apr 12 02:28:07 homeoss1 crmd: [5195] : info: te_graph_trigger: Transition 256 is now complete Apr 12 02:28:07 homeoss1 crmd: [5195] : info: notify_crmd: Transition 256 status: done - <null> Apr 12 02:28:07 homeoss1 crmd: [5195] : info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Apr 12 02:28:07 homeoss1 crmd: [5195] : info: do_state_transition: Starting PEngine Recheck Timer Apr 12 02:31:51 homeoss1 cib: [5191] : info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min Apr 12 02:38:25 homeoss1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Apr 12 02:38:25 homeoss1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="7609" x-info="http://www.rsyslog.com"] (re)start

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:02 AM

Also just to reduce the load we were trying to reduce the thread count on OSS, but while setting thread count to 128 on OSS2 we are getting following error
[root@homeoss2 ~]# lctl set_param ost.OSS.ost.threads_max=128
ost.OSS.ost.threads_max=128
error: set_param: writing to file /proc/fs/lustre/ost/OSS/ost/threads_max: Numerical result out of range

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:02 AM Also just to reduce the load we were trying to reduce the thread count on OSS, but while setting thread count to 128 on OSS2 we are getting following error [root@homeoss2 ~] # lctl set_param ost.OSS.ost.threads_max=128 ost.OSS.ost.threads_max=128 error: set_param: writing to file /proc/fs/lustre/ost/OSS/ost/threads_max: Numerical result out of range

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:00 AM

Hi Andreas,

At reboot following error appears at console
"Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) ASSERTION( lock->l_req_mode == lock->l_granted_mode ) failed:

Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) LBUG

Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
kernel:Kernel panic - not syncing: LBUG
"

Pankaj Sharma (Inactive) added a comment - 15/Apr/14 9:00 AM Hi Andreas, At reboot following error appears at console "Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) ASSERTION( lock->l_req_mode == lock->l_granted_mode ) failed: Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:LustreError: 7571:0:(ost_handler.c:1689:ost_prolong_lock_one()) LBUG Message from syslogd@homeoss1 at Apr 15 12:46:34 ... kernel:Kernel panic - not syncing: LBUG "

Andreas Dilger added a comment - 11/Apr/14 1:01 PM

The above information does not include the journal size, but if you used the default then it is 400MB and 12 journals are 4800MB and not enough use up the 32GB of RAM as I thought.

You really need to post the serial console logs from the time of a crash. /var/log/messages and dmesg from a running system do not contain the information from the time of failure.

Andreas Dilger added a comment - 11/Apr/14 1:01 PM The above information does not include the journal size, but if you used the default then it is 400MB and 12 journals are 4800MB and not enough use up the 32GB of RAM as I thought. You really need to post the serial console logs from the time of a crash. /var/log/messages and dmesg from a running system do not contain the information from the time of failure.

Pankaj Sharma (Inactive) added a comment - 11/Apr/14 11:24 AM

Hi Andreas, any update further on the logs and inputs we provided. Pls help we are in difficult position here.

Regards,

Pankaj

Pankaj Sharma (Inactive) added a comment - 11/Apr/14 11:24 AM Hi Andreas, any update further on the logs and inputs we provided. Pls help we are in difficult position here. Regards, Pankaj

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:59 PM

we have not given any specific journal size while formatting, its default, have shared above OST info from which if you can make out.

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:59 PM we have not given any specific journal size while formatting, its default, have shared above OST info from which if you can make out.

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:30 PM

Please find OST detail below

[root@homeoss1 ~]# tune2fs -l /dev/mapper/mpathg
tune2fs 1.42.7.wc2 (07-Nov-2013)
device /dev/dm-6 mounted by lustre per /proc/fs/lustre/obdfilter/home-OST0006/mntdev
Filesystem volume name: home-OST0006
Last mounted on: /
Filesystem UUID: 5a3ea3b2-568e-4062-a13b-ec5f121c0bd1
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 11435008
Block count: 731811520
Reserved block count: 36590576
Free blocks: 583168103
Free inodes: 11013734
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 848
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 512
Inode blocks per group: 32
RAID stripe width: 256
Flex block group size: 256
Filesystem created: Fri Mar 28 01:16:44 2014
Last mount time: Wed Apr 9 16:22:44 2014
Last write time: Wed Apr 9 16:22:44 2014
Mount count: 51
Maximum mount count: -1
Last checked: Fri Mar 28 01:16:44 2014
Check interval: 0 (<none>)
Lifetime writes: 593 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 5cd731f7-67c3-4db6-9e7c-21db7e829749
Journal backup: inode blocks
MMP block number: 9734
MMP update interval: 5

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:30 PM Please find OST detail below [root@homeoss1 ~] # tune2fs -l /dev/mapper/mpathg tune2fs 1.42.7.wc2 (07-Nov-2013) device /dev/dm-6 mounted by lustre per /proc/fs/lustre/obdfilter/home-OST0006/mntdev Filesystem volume name: home-OST0006 Last mounted on: / Filesystem UUID: 5a3ea3b2-568e-4062-a13b-ec5f121c0bd1 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 11435008 Block count: 731811520 Reserved block count: 36590576 Free blocks: 583168103 Free inodes: 11013734 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 848 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 512 Inode blocks per group: 32 RAID stripe width: 256 Flex block group size: 256 Filesystem created: Fri Mar 28 01:16:44 2014 Last mount time: Wed Apr 9 16:22:44 2014 Last write time: Wed Apr 9 16:22:44 2014 Mount count: 51 Maximum mount count: -1 Last checked: Fri Mar 28 01:16:44 2014 Check interval: 0 (<none>) Lifetime writes: 593 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 Default directory hash: half_md4 Directory Hash Seed: 5cd731f7-67c3-4db6-9e7c-21db7e829749 Journal backup: inode blocks MMP block number: 9734 MMP update interval: 5

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:30 PM

uploaded the last 2 days sar file from OSS2 server

Pankaj Sharma (Inactive) added a comment - 10/Apr/14 1:30 PM uploaded the last 2 days sar file from OSS2 server

We have 2 OSS server in HA and two MDS in HA , On each OSS 12 OSTs are mounted per OSS with faiover. OSS servers get reboots while working

Details

Description

Attachments

Activity

People

Dates