[LU-6607] MDS (2-node DNE) running out of memory and crashing Created: 15/May/15  Updated: 24/Mar/18  Resolved: 24/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Haisong Cai (Inactive) Assignee: Lai Siyao
Resolution: Won't Fix Votes: 1
Labels: sdsc
Environment:

Linux panda-mds-19-6.sdsc.edu 3.10.73-1.el6.elrepo.x86_64 #1 SMP Thu Mar 26 16:28:30 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

lustre-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-osd-zfs-mount-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-iokit-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-source-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-osd-zfs-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-modules-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-tests-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64


Attachments: File clients_log.gz     File dmesg.out     File dmesg_mds.gz     File lustre-log.tgz     File messages-19-6.gz     Text File slabinfo.txt    
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

2 node DNE MDS
16 OSS
2K clients

An MDS node randomly runs out of memory and hangs.
We have watched the MDS drain its memory in a matter of minutes, many times right after recovery from a previous hang.
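
One way to watch the drain as it happens is to poll the kernel's memory counters; a minimal sketch (the polling interval and field selection are illustrative, not part of the original report):

# poll free memory and kernel slab usage on the MDS every 5 seconds
watch -n 5 'grep -E "^(MemFree|Slab|SReclaimable|SUnreclaim):" /proc/meminfo'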

Clients are generating a huge number of Lustre errors containing the string "ptlrpc_expire_one_request", from several hundred thousand to several million per node. Here are the error counts from some nodes (a counting sketch follows the list):

comet-12-31 662616
comet-10-06 690764
comet-12-24 720396
comet-12-25 735659
comet-12-14 778073
comet-12-33 840302
comet-10-10 928322
comet-12-33 945614
comet-12-25 992288
comet-10-15 1131711
comet-12-25 1147043
comet-10-07 1160876
comet-12-30 1180270
comet-10-03 1387072
comet-10-02 2515764
comet-10-02 3371128
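
Counts like these can be produced with a simple grep over each client's syslog; a sketch, assuming the errors land in /var/log/messages (the path and the pdsh host list are illustrative):

# per-node count on a single client
grep -c ptlrpc_expire_one_request /var/log/messages
# or across a set of clients in one pass
pdsh -w 'comet-10-[01-15],comet-12-[01-33]' 'grep -c ptlrpc_expire_one_request /var/log/messages'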

I am attaching logs from both a client and the server for one such incident.



 Comments   
Comment by Peter Jones [ 15/May/15 ]

Lai

Could you please advise on this issue?

Thanks

Peter

Comment by Haisong Cai (Inactive) [ 15/May/15 ]

I'd just like to highlight these messages on the server (they should also be in the messages-19-6.gz file):

May 15 06:35:19 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 06:45:05 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:17:59 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:18:53 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:18:54 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:18:56 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:19:00 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:19:08 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:19:37 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:20:09 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:21:13 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:23:25 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:27:44 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 07:55:17 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:08:07 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:08:07 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:08:08 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:08:10 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:11:04 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:11:12 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
May 15 08:11:28 panda-mds-19-6 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).

Comment by Haisong Cai (Inactive) [ 19/May/15 ]

Hi Lai,

Any update?

thanks,
Haisong

Comment by Di Wang [ 19/May/15 ]

Hello, Cai

I checked the debug log and dmesg, and I can see that MDT0001 seems very slow at that moment, though I cannot figure out why from these messages. So:

1. Could you please post the stack traces from MDT0001 (panda-mds-19-6) here? That will help us understand what MDT0001 was busy with. Something like:

echo t > /proc/sysrq-trigger
dmesg > /tmp/dmesg.out

2. Could you please post the output of "cat /proc/slabinfo" here when the OOM happens? (A combined collection sketch follows below.)
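
For convenience, a combined collection sketch covering both requests (the output paths are illustrative; sysrq may need to be enabled first):

# allow the sysrq trigger, dump all task stack traces into the kernel ring buffer,
# then capture dmesg and the slab usage together
sysctl -w kernel.sysrq=1
echo t > /proc/sysrq-trigger
dmesg > /tmp/dmesg.out
cat /proc/slabinfo > /tmp/slabinfo.txt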

Thanks
WangDi

Comment by Haisong Cai (Inactive) [ 19/May/15 ]

Hi WangDi,

I understand when to run 2).
Do you want the output of 1) now, or at the same time as when I run 2)?

Haisong

Comment by Di Wang [ 19/May/15 ]

Hello, Cai

Oh, I only need the output of 1) while MDT0001 is busy. But if you can get both at the same time, that would be great.

Thanks
WangDi

Comment by Haisong Cai (Inactive) [ 13/Jul/15 ]

WangDi,

We had two incidents recently, and both times I failed to collect the needed info.
One time I simply forgot, and the other time we had no chance since the MDS node was hung.

Haisong

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

WangDi,

We ran into this problem again today on one of the MDSes (mdt0, the master).
I have collected the information you asked for by issuing the following commands:

echo t > /proc/sysrq-trigger
dmesg > /state/partition1/tmp/dmesg.out
cat /proc/slabinfo > /state/partition1/tmp/slabinfo.txt

dmesg.out & slabinfo.txt will be uploaded separately.

Haisong

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

Files collected between two MDS crashes.

Comment by Di Wang [ 01/Sep/15 ]

Ah, so it is a ZFS environment (ZFS + DNE)? A few questions here:

1. I saw this in your MDS console messages (dmesg_mds.gz); the kernel version is definitely not the stock EL6 kernel. Is it EL7? But we do not support EL7 servers on the MDS yet. Could you please confirm which kernel you are using on the MDS?

Linux version 3.10.73-1.el6.elrepo.x86_64 (mockbuild@Build64R6) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Thu Mar 26 16:28:30 EDT 2015

2. In the slab info

kmalloc-8192      9033431 9033431   8192    1    2 : tunables    8    4    0 : slabdata 9033431 9033431      0

The 8192-byte slab cache is costing far too much memory, 941G! That is too much. Btw: how many OSTs are there per OSS?
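
As a sanity check on that slab line (a hedged sketch, assuming the standard /proc/slabinfo v2.1 column order of name, active_objs, num_objs, objsize, ...): the memory pinned by a cache is roughly num_objs * objsize, which for the values pasted above works out to about 69 GiB.

# memory pinned by the kmalloc-8192 cache: num_objs ($3) * objsize ($4)
awk '$1 == "kmalloc-8192" { printf "%s: %.1f GiB\n", $1, $3 * $4 / (1024*1024*1024) }' /proc/slabinfo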

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

Hi WangDi,

We are running CentOS 6.6 with Linux kernel 3.10.73 from elrepo.
Lustre and ZFS are built as DKMS modules.

The filesystem has 16 OSSes, and each has 6 OSTs.

Haisong

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

On one of the 2 MDS servers:

[root@panda-mds-19-6 panda-mds-19-6]# sysctl -a | grep slab
kernel.spl.kmem.slab_kmem_alloc = 92736
kernel.spl.kmem.slab_kmem_max = 92736
kernel.spl.kmem.slab_kmem_total = 172032
kernel.spl.kmem.slab_vmem_alloc = 407675904
kernel.spl.kmem.slab_vmem_max = 490480640
kernel.spl.kmem.slab_vmem_total = 485459072
vm.min_slab_ratio = 5

Comment by Di Wang [ 01/Sep/15 ]

Is it possible for you to upgrade the MDS to 2.7.58? There have been quite a few fixes in this area since 2.7.51.

Btw: we are currently testing ZFS with DNE in LU-7009; please follow along there.

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

LU-6584 is about an OSS crashing problem. The OSS servers belong to the same filesystem as these MDS servers; it is all one filesystem.

We are about to apply a new patch related to LU-6584. It is built from http://review.whamcloud.com/#/c/14926/

Will that satisfy your recommendation?

Haisong

Comment by Di Wang [ 01/Sep/15 ]

Hmm, I think LU-6584 is a different issue. This ticket is about MDS OOM during failover? Do you happen to know an easy way to reproduce the problem?
Btw: is it possible for you to add "log_buf_len=10M" to your boot command line? The dmesg you posted here only contains half of the stack traces. Thanks.
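
A minimal sketch of adding that boot parameter on an EL6-style setup (whether you use grubby or edit grub.conf by hand depends on the environment; adjust to the boot loader actually in use):

# append log_buf_len=10M to the boot line of all installed kernels
grubby --update-kernel=ALL --args="log_buf_len=10M"
# equivalently, add log_buf_len=10M to the kernel line in /boot/grub/grub.conf by hand,
# then reboot so the larger kernel ring buffer takes effect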

Comment by Haisong Cai (Inactive) [ 01/Sep/15 ]

Hi Wang Di,

I understand LU-6584 is a different problem, an OSS issue rather than an MDS memory issue.

What I said earlier was that, in order to work on the LU-6584 problem, we have to apply a patch soon, because they are the same filesystem. That patch is built from http://review.whamcloud.com/#/c/14926/

Is that equivalent to 2.7.58?

Haisong

Comment by Di Wang [ 02/Sep/15 ]

Ah, it is. You can use that build. Thanks.

Comment by Haisong Cai (Inactive) [ 02/Sep/15 ]

Hi WangDi,

You stated that 2.7.58 has a lot of fixes, but it may still not fix our problem, correct?
Can you elaborate on the slab situation? You indicated that 941G (or 94G) was too big; why is that? Is it because of a default setting or some configuration mistake?

thanks,
Haisong

Comment by Di Wang [ 03/Sep/15 ]

Hello, Haisong

Yes, I do not know the exact reason why this 8192-byte slab consumed so much memory here. No, I do not think this is related to any default setting. Did you do a lot of cross-MDT operations here, like creating remote or striped directories? (Unfortunately, there is not enough stack trace information here.) Btw: was this stack trace collected when the OOM happened, before it, or when it was about to happen? Right now, I would suggest:

1. Use 2.7.58 plus the patch you need (http://review.whamcloud.com/#/c/14926/), and maybe also include http://review.whamcloud.com/#/c/16161/.
2. Please add "log_buf_len=10M" to your boot command line, so we can see more of the stack traces when the error happens.
3. Please help me find an easy way to reproduce the problem. Thanks!

Even if 2.7.58 does not help with this particular issue, it is way better than 2.7.51 for DNE.

Comment by Peter Jones [ 24/Mar/18 ]

SDSC has moved on to more current releases, so I do not think any further work is needed here.
