Lustre / LU-4875

We have 2 OSS servers in HA and two MDS in HA; 12 OSTs are mounted per OSS with failover. OSS servers reboot while working

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.2.0
    • None
    • Severity: 3
    • 13484

    Description

      We have 2 OSS servers in HA with Corosync. Each OSS has 12 OSTs mounted with failover. While in operation, the OSS servers intermittently and frequently reboot, which is badly affecting the availability of the file system.

      Attachments

        Activity

          [LU-4875] We have 2 OSS servers in HA and two MDS in HA; 12 OSTs are mounted per OSS with failover. OSS servers reboot while working

          We did not specify a journal size when formatting; it is the default. We have shared the OST info above, from which you may be able to work it out.

          psharma Pankaj Sharma (Inactive) added a comment
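
          (For reference, a minimal sketch of how the size of an internal ldiskfs journal can be read, assuming the journal is the internal one at inode 8 as the tune2fs output below reports; the device path is simply the one from that output:)

              # Size of the journal inode (the internal journal is inode <8>)
              debugfs -R "stat <8>" /dev/mapper/mpathg 2>/dev/null | grep -i size
              # Newer e2fsprogs may also print a "Journal size:" line directly
              dumpe2fs -h /dev/mapper/mpathg 2>/dev/null | grep -i journal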

          Please find OST detail below

          [root@homeoss1 ~]# tune2fs -l /dev/mapper/mpathg
          tune2fs 1.42.7.wc2 (07-Nov-2013)
          device /dev/dm-6 mounted by lustre per /proc/fs/lustre/obdfilter/home-OST0006/mntdev
          Filesystem volume name: home-OST0006
          Last mounted on: /
          Filesystem UUID: 5a3ea3b2-568e-4062-a13b-ec5f121c0bd1
          Filesystem magic number: 0xEF53
          Filesystem revision #: 1 (dynamic)
          Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink
          Filesystem flags: signed_directory_hash
          Default mount options: user_xattr acl
          Filesystem state: clean
          Errors behavior: Continue
          Filesystem OS type: Linux
          Inode count: 11435008
          Block count: 731811520
          Reserved block count: 36590576
          Free blocks: 583168103
          Free inodes: 11013734
          First block: 0
          Block size: 4096
          Fragment size: 4096
          Reserved GDT blocks: 848
          Blocks per group: 32768
          Fragments per group: 32768
          Inodes per group: 512
          Inode blocks per group: 32
          RAID stripe width: 256
          Flex block group size: 256
          Filesystem created: Fri Mar 28 01:16:44 2014
          Last mount time: Wed Apr 9 16:22:44 2014
          Last write time: Wed Apr 9 16:22:44 2014
          Mount count: 51
          Maximum mount count: -1
          Last checked: Fri Mar 28 01:16:44 2014
          Check interval: 0 (<none>)
          Lifetime writes: 593 GB
          Reserved blocks uid: 0 (user root)
          Reserved blocks gid: 0 (group root)
          First inode: 11
          Inode size: 256
          Required extra isize: 28
          Desired extra isize: 28
          Journal inode: 8
          Default directory hash: half_md4
          Directory Hash Seed: 5cd731f7-67c3-4db6-9e7c-21db7e829749
          Journal backup: inode blocks
          MMP block number: 9734
          MMP update interval: 5

          psharma Pankaj Sharma (Inactive) added a comment

          Uploaded the last 2 days' sar files from the OSS2 server.

          psharma Pankaj Sharma (Inactive) added a comment

          You should collect the messages from the console, which is best done by connecting via serial port to the servers. That will hopefully tell you exactly what is going wrong at the time of failure.

          What is the size of the journal on each OST?

          adilger Andreas Dilger added a comment
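
          (As a sketch of how console messages could be captured when no physical serial connection is available: netconsole can forward kernel messages to another host over UDP. The addresses, interface name, and MAC below are placeholders:)

              # On the rebooting OSS: forward kernel/console messages to a log host
              modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:11:22:33:44:55
              # On the log host: record them (netcat option syntax varies by flavor)
              nc -u -l 6666 > oss-console.log
              # Alternatively, append "console=ttyS0,115200 console=tty0" to the kernel
              # line in grub.conf and attach a serial console or IPMI SOL session.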

          The HA configuration files corosync.conf.txt and oos1-cibxml.txt are uploaded.

          psharma Pankaj Sharma (Inactive) added a comment

          We have noticed the following in /var/log/messages: "max_child_count reached, postponing execution of operation monitor on ocf::Filesystem".

          Does this have any relation to the reboots? If yes, what exactly does it mean?

          psharma Pankaj Sharma (Inactive) added a comment
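
          (Background on that message: the cluster's local resource manager only runs a limited number of resource-agent operations at once (max_child_count); with 12 Filesystem resources per node, monitor operations can be queued and "postponed". By itself that only delays the monitor, but if a postponed operation later exceeds its timeout the cluster may treat the resource as failed and fence, i.e. reboot, the node. A hedged example of giving each OST resource a more generous monitor operation in the crm shell; the resource name, device, and mount point are placeholders, not this system's actual configuration:)

              # Example only - widen the monitor interval/timeout on each OST resource
              crm configure primitive home-OST0006 ocf:heartbeat:Filesystem \
                  params device="/dev/mapper/mpathg" directory="/lustre/home-OST0006" fstype="lustre" \
                  op monitor interval="120s" timeout="120s"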

          sar files from OSS1 for the last 3 days are uploaded, which can give us some idea of CPU utilization, I/O wait, etc.

          psharma Pankaj Sharma (Inactive) added a comment
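
          (For anyone looking at the uploaded data, the binary sar files can be read with sysstat; the file name here is a placeholder:)

              sar -u -f sa09    # CPU utilization, including %iowait
              sar -r -f sa09    # memory and swap usage
              sar -q -f sa09    # run queue length and load average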

          HA failover configuration file

          psharma Pankaj Sharma (Inactive) added a comment

          Are there any parameters in Lustre through which we can keep it from running out of memory? We have already reduced ost_io.threads_max to 256.

          psharma Pankaj Sharma (Inactive) added a comment
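
          (A sketch of OSS-side parameters that are commonly inspected when memory is tight on a Lustre 2.x ldiskfs OSS; the set_param value is only an example, not a recommendation for this system:)

              # Service thread counts for the ost_io service
              lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max ost.OSS.ost_io.threads_started
              # OSS read cache behaviour on the OSTs
              lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable
              lctl get_param obdfilter.*.readcache_max_filesize
              # Example: only cache reads of files up to 32 MB
              lctl set_param obdfilter.*.readcache_max_filesize=32M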

          Thanks, Andreas, for the prompt reply.
          Each OSS server has 32 GB RAM.
          We are using hardware RAID 5. Each OST consists of 11 x 300 GB SAS disks in RAID 5. We have 12 such OSTs on each OSS.
          You may well be right that they are running out of memory, but how can we make sure of that? Is there anything in the logs, or some debug option in Lustre we can monitor, that shows it is running out of memory? If you need any other logs, do let me know.
          I will extract the .xz files and upload the HA configuration files as .tar/.zip.

          psharma Pankaj Sharma (Inactive) added a comment
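
          (A few generic checks that would show whether the node is actually running out of memory before the reboots; log paths follow RHEL defaults. A crash dump captured via kdump would also distinguish a kernel panic from an OOM-triggered reboot.)

              # Did the kernel OOM killer run before any of the reboots?
              grep -iE "out of memory|oom-killer" /var/log/messages*
              # Current memory and slab usage while the OSS is busy
              cat /proc/meminfo
              slabtop -o | head -30
              # Memory pressure over time from sysstat, if enabled
              sar -r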

          I don't know what .xz files are, so I cannot look at them. The dmesg and messages files do not list how much RAM is on these nodes, nor what type of RAID you are using. Is it MD software RAID?

          My first guess would be that with 12 very large OSTs (I see 180 disks) on the node, it is just running out of memory.

          adilger Andreas Dilger added a comment
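
          (For reference, .xz is LZMA2 compression from xz-utils; the attachments can be unpacked or repackaged as below, with placeholder file names:)

              xz -d messages.xz                          # unpack in place, leaving "messages"
              xz -dc sar-oss2.xz | gzip > sar-oss2.gz    # repack as gzip for upload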

          People

            Assignee:
            jfc John Fuchs-Chesney (Inactive)
            Reporter:
            psharma Pankaj Sharma (Inactive)
            Votes:
            0
            Watchers:
            7
