
MDS ( 2 node DNE) running out of memory and crash

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 2.7.0
    • 4

    Description

      2 node DNE MDS
      16 OSS
      2K clients

      An MDS node randomly runs out of memory and hangs.
      We have watched the MDS drain its memory in a matter of a few minutes, often right after recovering from a previous hang.

      Clients are generating a huge number of Lustre errors containing the string "ptlrpc_expire_one_request". The counts range from several hundred thousand to several million per node. Here are the error counts from some nodes:

      comet-12-31 662616
      comet-10-06 690764
      comet-12-24 720396
      comet-12-25 735659
      comet-12-14 778073
      comet-12-33 840302
      comet-10-10 928322
      comet-12-33 945614
      comet-12-25 992288
      comet-10-15 1131711
      comet-12-25 1147043
      comet-10-07 1160876
      comet-12-30 1180270
      comet-10-03 1387072
      comet-10-02 2515764
      comet-10-02 3371128
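
      Per-node counts like those above could be gathered with something along these lines. This is a minimal sketch, not the procedure actually used for this report: the function name `count_expired` and the assumption of one uncompressed log file per node are illustrative only.

```shell
# Count lines mentioning ptlrpc_expire_one_request in each given log file.
# Prints "<file> <count>" per file, sorted by ascending count.
count_expired() {
    for f in "$@"; do
        # grep -c prints the number of matching lines (0 if none);
        # "|| true" ignores grep's non-zero exit status when nothing matches.
        n=$(grep -c 'ptlrpc_expire_one_request' "$f" || true)
        printf '%s %s\n' "$f" "$n"
    done | LC_ALL=C sort -k2,2n
}
```

      It would be called with one log file per client, for example `count_expired node1.log node2.log` (hypothetical file names).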

      I am attaching logs from both client and server for one such incident.

      Attachments

        1. clients_log.gz
          622 kB
        2. dmesg_mds.gz
          21 kB
        3. dmesg.out
          396 kB
        4. lustre-log.tgz
          9.35 MB
        5. messages-19-6.gz
          92 kB
        6. slabinfo.txt
          27 kB

        Activity

          [LU-6607] MDS ( 2 node DNE) running out of memory and crash

          Hi WangDi,

          We are running CentOS 6.6 with Linux kernel 3.10.73 from elrepo.
          Lustre and ZFS are built as DKMS modules.

          Filesystem has 16 OSS and each has 6 OSTs.

          Haisong

          haisong Haisong Cai (Inactive) added a comment

          Ah, it is a ZFS environment (ZFS + DNE)? A few questions:

          1. I saw this in your MDS console messages (dmesg_mds.gz). The kernel version is definitely not EL6. Is it EL7? We do not support EL7 servers for the MDS yet. Could you please confirm which kernel you are running on the MDS?

          Linux version 3.10.73-1.el6.elrepo.x86_64 (mockbuild@Build64R6) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Thu Mar 26 16:28:30 EDT 2015
          

          2. In the slab info

          kmalloc-8192      9033431 9033431   8192    1    2 : tunables    8    4    0 : slabdata 9033431 9033431      0
          

          The 8192-byte slab is consuming far too much memory: 9033431 objects of 8192 bytes each is roughly 69 GiB (about 74 GB). Btw, how many OSTs does each OSS have?
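
          The footprint of a slab cache can be estimated from its /proc/slabinfo line as num_objs × objsize (the third and fourth fields). For the kmalloc-8192 line quoted above, 9033431 × 8192 bytes comes to roughly 69 GiB. The sketch below applies that estimate to every cache; the function name `slab_usage` is made up for illustration, and the estimate ignores per-slab overhead.

```shell
# Print "<cache> <GiB>" for every cache in slabinfo-format input, largest first.
# Fields per line: name active_objs num_objs objsize objperslab pagesperslab ...
# Reads the named files, or standard input if none are given.
slab_usage() {
    LC_ALL=C awk '
        # Skip the "slabinfo - version" banner and the "# name ..." header.
        $1 == "slabinfo" || $1 == "#" { next }
        # Estimated footprint: num_objs * objsize, converted to GiB.
        NF >= 4 { printf "%s %.1f\n", $1, $3 * $4 / (1024 * 1024 * 1024) }
    ' "$@" | LC_ALL=C sort -k2,2nr
}
```

          For example, `slab_usage slabinfo.txt | head` would list the ten largest caches in a saved copy.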

          di.wang Di Wang (Inactive) added a comment

          Files collected between two MDS crashes.

          haisong Haisong Cai (Inactive) added a comment

          WangDi,

          We ran into this problem again today on one of the MDS nodes (mdt0, the master).
          I have collected the information you asked for by issuing the following commands:

          echo t > /proc/sysrq-trigger
          dmesg > /state/partition1/tmp/dmesg.out
          cat /proc/slabinfo > /state/partition1/tmp/slabinfo.txt

          dmesg.out & slabinfo.txt will be uploaded separately.

          Haisong

          haisong Haisong Cai (Inactive) added a comment

          WangDi,

          We had 2 incidents recently, and both times I failed to collect the needed info.
          One time I simply forgot, and the other time we had no chance because the MDS node was hung.

          Haisong

          haisong Haisong Cai (Inactive) added a comment

          Hello, Cai

          Oh, I only need the output of 1) when MDT1 is busy. But if you can get both at the same time, that would be great.

          Thanks
          WangDi

          di.wang Di Wang (Inactive) added a comment

          Hi WangDi,

          I understand when to run 2).
          Do you want the output of 1) now, or at the same time as I run 2)?

          Haisong

          haisong Haisong Cai (Inactive) added a comment

          Hello, Cai

          I checked the debug log and dmesg, and MDT0001 seems to have been very slow at that moment, though I cannot figure out why from these messages. So:

          1. Could you please post a stack trace from MDT0001 (panda-mds-19-6) here? It will help us understand what MDT0001 was busy with. Something like

          echo t > /proc/sysrq-trigger
          dmesg > /tmp/dmesg.out
          

          2. Could you please post the output of "cat /proc/slabinfo" here when the OOM happens?
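
          Because memory drains within minutes, catching slabinfo by hand at the right moment is hard. One possible approach is a small watcher that saves a copy once free memory falls below a threshold. The sketch below is only an illustration: the function names, the use of MemFree as the trigger, and any threshold value are assumptions, not something taken from this ticket.

```shell
# Read MemFree (in kB) from a meminfo-format file (default: /proc/meminfo).
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Copy slabinfo to a timestamped file once MemFree drops below a threshold.
# Arguments: threshold in kB, output directory, then optional meminfo and
# slabinfo paths (defaulting to the /proc files) so the logic can be tried
# against saved copies. Returns 0 if a snapshot was written, 1 otherwise.
snapshot_if_low() {
    threshold_kb=$1
    outdir=$2
    meminfo=${3:-/proc/meminfo}
    slabinfo=${4:-/proc/slabinfo}
    free_kb=$(mem_free_kb "$meminfo")
    if [ "$free_kb" -lt "$threshold_kb" ]; then
        cat "$slabinfo" > "$outdir/slabinfo.$(date +%Y%m%d-%H%M%S).txt"
        return 0
    fi
    return 1
}

# Possible usage as root (threshold and directory are placeholders):
#   while sleep 10; do snapshot_if_low 4194304 /tmp || true; done
```

          MemFree is a crude trigger, since it excludes reclaimable cache, so the threshold would need tuning for the node in question.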

          Thanks
          WangDi

          di.wang Di Wang (Inactive) added a comment

          Hi Lai,

          Any update?

          thanks,
          Haisong

          haisong Haisong Cai (Inactive) added a comment

          I would just like to highlight these messages on the server (they should also be in the messages-19-6.gz file):

          May 15 06:35:19 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 06:45:05 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:17:59 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:18:53 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:18:54 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:18:56 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:19:00 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:19:08 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:19:37 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:20:09 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:21:13 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:23:25 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:27:44 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 07:55:17 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:08:07 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:08:07 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:08:08 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:08:10 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:11:04 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:11:12 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
          May 15 08:11:28 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
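
          To see how such messages cluster over time, they can be bucketed by hour. The following is a minimal sketch assuming the syslog layout shown above (month, day, HH:MM:SS, host); the function name `cpu_bound_per_hour` is hypothetical.

```shell
# Count "ldlm_canceld ... not able to keep up" messages per hour.
# Reads syslog-format lines from the named files, or from standard input.
cpu_bound_per_hour() {
    awk '
        /ldlm_canceld: This server is not able to keep up/ {
            # Fields: month, day, HH:MM:SS, host, ...
            split($3, t, ":")
            count[$1 " " $2 " " t[1] ":00"]++
        }
        END { for (k in count) print k, count[k] }
    ' "$@" | LC_ALL=C sort
}
```

          The final sort is a plain string sort, so the output is in time order only within a single month.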

          haisong Haisong Cai (Inactive) added a comment
          pjones Peter Jones added a comment -

          Lai

          Could you please advise on this issue?

          Thanks

          Peter


          People

            laisiyao Lai Siyao
            haisong Haisong Cai (Inactive)
            Votes: 1
            Watchers: 6