[LU-4875] We have 2 OSS servers in HA and two MDS in HA. On each OSS, 12 OSTs are mounted with failover. OSS servers reboot while working Created: 09/Apr/14 Updated: 29/Nov/17 Resolved: 19/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Pankaj Sharma (Inactive) | Assignee: | John Fuchs-Chesney (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 13484 |
| Description |
|
We have 2 OSS servers in HA with corosync. Each OSS has 12 OSTs mounted in failover. While working, the OSS servers reboot intermittently and frequently. This is badly affecting the availability of the file system. |
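For context, each OST in a corosync/Pacemaker setup like this is typically managed as an ocf:heartbeat:Filesystem resource; a minimal sketch follows (the resource name, device and mount point are made-up examples, not taken from this cluster):

crm configure primitive resOST0001 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/mpathX" directory="/mnt/home/ost0001" fstype="lustre" \
    op monitor interval="120s" timeout="60s" \
    op start timeout="300s" op stop timeout="300s"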
| Comments |
| Comment by Pankaj Sharma (Inactive) [ 09/Apr/14 ] |
|
The SOS report from the system, which contains all configuration and log files, can be downloaded from https://ftp.usa.hp.com/hprc or ftp.usa.hp.com. The username is lustre and the password is 7ui$HY6x.
| Comment by Pankaj Sharma (Inactive) [ 09/Apr/14 ] |
|
/var/log/messages from both OSS servers are attached.
| Comment by Pankaj Sharma (Inactive) [ 09/Apr/14 ] |
|
dmesg output from both OSS servers is attached.
| Comment by Pankaj Sharma (Inactive) [ 09/Apr/14 ] |
|
ost.threads_max is set to 256 and ost_io.threads_max is set to 256 as well. The monitor timeout is 60 sec and dc-deadtime is 30 sec. Thanks
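For reference, a sketch of how these values could be double-checked on the OSS nodes (the lctl parameter paths below assume the usual Lustre 2.x naming; adjust if they differ on this build):

# Lustre OSS service thread limits (runtime values)
lctl get_param ost.OSS.ost.threads_max ost.OSS.ost.threads_started
lctl get_param ost.OSS.ost_io.threads_max ost.OSS.ost_io.threads_started
# Pacemaker timers as they actually appear in the CIB
crm configure show | grep -E 'dc-deadtime|op monitor'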
| Comment by Andreas Dilger [ 10/Apr/14 ] |
|
I don't know what .xz files are, so I cannot look at them. The dmesg and messages files do not list how much RAM is on these nodes, nor what type of RAID you are using. Is it MD software RAID? My first guess would be that with 12 very large OSTs (I see 180 disks) on the node, it is just running out of memory.
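A sketch of commands that would answer the RAM and RAID questions on each OSS (all standard tools, nothing Lustre-specific):

free -g                   # total and used RAM on the node
cat /proc/mdstat          # MD software RAID arrays, if any
cat /proc/partitions      # block devices and their sizes
multipath -ll             # device-mapper multipath topology, if in use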
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
Thanks, Andreas, for the prompt reply.
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
Are there any parameters in Lustre through which we can prevent it from running out of memory? We have already reduced ost_io.threads_max to 256.
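There is no single Lustre knob that caps server memory use; the main lever is the service thread count. A sketch, assuming the usual ost module parameter name (verify with modinfo ost before relying on it):

# Runtime change (does not survive a reboot or failover):
lctl set_param ost.OSS.ost_io.threads_max=256
# Persistent alternative via the ost module option (parameter name assumed; check modinfo ost):
echo "options ost oss_num_threads=256" >> /etc/modprobe.d/lustre.conf
# Watch memory pressure while under load:
free -m; slabtop -o | head -20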
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
HA failover configuration file |
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
sar files from OSS1 for the last 3 days are uploaded, which can give some idea of CPU utilization, I/O wait, etc.
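For anyone reviewing the uploaded sar data, a sketch of how the binary files can be replayed with sysstat (the file name sa12 below is only an example, standing for 12 April):

sar -u -f sa12   # CPU utilization including %iowait
sar -q -f sa12   # run queue length and load average
sar -r -f sa12   # memory and swap usage
sar -d -f sa12   # per-device I/O activity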
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
We have noticed the following in /var/log/messages: "max_child_count reached, postponing execution of operation monitor on ocf::Filesystem". Does this have some relation to the reboots, and if so, what exactly does it mean?
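That message appears to come from the local resource manager daemon when it already has its maximum number of resource operations in flight, so further monitor operations are queued; on its own it should not reboot a node, but it is worth checking whether it clusters around the reboot times. A quick sketch:

grep -c 'max_child_count reached' /var/log/messages
grep 'max_child_count reached' /var/log/messages | awk '{print $1, $2, $3}' | uniq -c | tail -20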
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
HA configuration files corosync.conf.txt and oos1-cibxml.txt are uploaded.
| Comment by Andreas Dilger [ 10/Apr/14 ] |
|
You should collect the messages from the console, which is best done by connecting via serial port to the servers. That will hopefully tell you exactly what is going wrong at the time of failure. What is the size of the journal on each OST? |
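A sketch of two common ways to capture console output at crash time, assuming a RHEL-style grub setup and a reachable log receiver (the IP and MAC below are placeholders):

# /boot/grub/grub.conf kernel line addition for a serial console at 115200 baud:
console=tty0 console=ttyS0,115200n8
# Alternatively, stream kernel messages to another host with netconsole:
modprobe netconsole netconsole=@/eth0,514@192.168.1.100/00:11:22:33:44:55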
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
Uploaded the last 2 days of sar files from the OSS2 server.
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
Please find the OST details below:
[root@homeoss1 ~]# tune2fs -l /dev/mapper/mpathg
| Comment by Pankaj Sharma (Inactive) [ 10/Apr/14 ] |
|
We have not given any specific journal size while formatting; it is the default. We have shared the OST info above, from which you may be able to make it out.
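The journal size can also be read off an existing OST without reformatting; a sketch, using the same device quoted above (on ldiskfs/ext4 the journal is inode 8):

debugfs -R 'stat <8>' /dev/mapper/mpathg 2>/dev/null | grep -i size
# Newer dumpe2fs versions also print it directly:
dumpe2fs -h /dev/mapper/mpathg 2>/dev/null | grep -i journal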
| Comment by Pankaj Sharma (Inactive) [ 11/Apr/14 ] |
|
Hi Andreas, any further update on the logs and inputs we provided? Please help, we are in a difficult position here. Regards, Pankaj
| Comment by Andreas Dilger [ 11/Apr/14 ] |
|
The above information does not include the journal size, but if you used the default then it is 400MB, and 12 journals are 4800MB, which is not enough to use up the 32GB of RAM as I thought. You really need to post the serial console logs from the time of a crash. /var/log/messages and dmesg from a running system do not contain the information from the time of failure.
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
Hi Andreas, at reboot the following error appears on the console:
Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
Message from syslogd@homeoss1 at Apr 15 12:46:34 ...
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
Also, just to reduce the load, we were trying to reduce the thread count on the OSS, but while setting the thread count to 128 on OSS2 we are getting the following error:
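The exact error is not captured here, but one possible cause (an assumption, not confirmed from these logs) is that threads_max cannot be lowered below the number of threads already started at runtime; a quick way to compare the two:

lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
# If threads_started already exceeds 128, the lower maximum may be rejected or only take
# effect gradually as threads exit; setting the module option and remounting applies it cleanly.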
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
Also, during the last reboot on 12 April 14, we observed the following error messages, in case we can relate something to this:
Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Apr 12 02:28:07 homeoss1 crmd: [5195]: notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete
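Those o2ib network errors suggest checking LNet and InfiniBand health between the OSS and 10.2.1.252; a sketch of basic connectivity checks:

lctl list_nids                  # local NIDs on this OSS
lctl ping 10.2.1.252@o2ib       # LNet reachability of the peer that timed out
ibstat | grep -E 'State|Rate'   # InfiniBand port state and rate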
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
The uploaded log messages from OSS1 show overloaded and hung threads as well. Can this be the cause of the reboots?
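When the watchdog reports hung service threads, the Lustre debug buffer and full task stacks are the most useful follow-up data; a sketch (the sysrq output goes to the console, so the serial console logging requested above matters here):

lctl dk /tmp/lustre-debug.$(date +%s).log   # dump the Lustre kernel debug buffer
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger                # dump all task stacks to the console/dmesg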
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
As per the logs of 12 April, the system rebooted at 02:28:07. Prior to this, at 02:07, it shows overload, but we do not know what exactly this means, as CPU utilization was not high.
| Comment by Pankaj Sharma (Inactive) [ 15/Apr/14 ] |
|
Please have a look at the above facts and help us.
| Comment by Pankaj Sharma (Inactive) [ 16/Apr/14 ] |
|
Hi Andreas, Please update. Regards, Pankaj |
| Comment by Atul Vidwansa (Inactive) [ 17/Apr/14 ] |
|
Just wanted to clarify that this is NOT DDN+Intel supported Lustre, nor is it on DDN hardware. IIT Kanpur is a DDN+Intel supported customer, but for another system they procured, not this one.
| Comment by John Fuchs-Chesney (Inactive) [ 18/Sep/14 ] |
|
I will watch this. |
| Comment by John Fuchs-Chesney (Inactive) [ 19/May/15 ] |
|
I'm marking this as resolved/incomplete because we have not been able to get sufficient appropriate information to debug the problem any further and the ticket is now more than six months old. ~ jfc. |