Lustre / LU-6607

MDS (2-node DNE) running out of memory and crash

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Severity: 4

    Description

      2-node DNE MDS
      16 OSS
      2K clients

      An MDS node randomly runs out of memory and hangs.
      We watch the MDS drain its memory in a matter of minutes, many times right after recovery from a previous hang.

      Clients are generating a huge number of Lustre errors containing the string "ptlrpc_expire_one_request", ranging from several hundred thousand to several million per node. Here are error counts from some nodes:

      comet-12-31 662616
      comet-10-06 690764
      comet-12-24 720396
      comet-12-25 735659
      comet-12-14 778073
      comet-12-33 840302
      comet-10-10 928322
      comet-12-33 945614
      comet-12-25 992288
      comet-10-15 1131711
      comet-12-25 1147043
      comet-10-07 1160876
      comet-12-30 1180270
      comet-10-03 1387072
      comet-10-02 2515764
      comet-10-02 3371128

      I am attaching logs from both client and server for one such incident.
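The per-node tallies above can be reproduced with a simple grep over each client's syslog. A minimal sketch, using a synthetic log file in place of the real per-node log (the path and log lines are illustrative, not from the attached logs):

```shell
# Count ptlrpc_expire_one_request errors in a client log.
# A tiny synthetic log stands in for the real per-node syslog.
log=$(mktemp)
printf 'Lustre: ptlrpc_expire_one_request timed out\nunrelated line\nLustre: ptlrpc_expire_one_request timed out\n' > "$log"
count=$(grep -c 'ptlrpc_expire_one_request' "$log")
echo "$count"    # prints 2 for this sample
rm -f "$log"
```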

      Attachments

        1. dmesg_mds.gz
          21 kB
        2. lustre-log.tgz
          9.35 MB
        3. messages-19-6.gz
          92 kB
        4. clients_log.gz
          622 kB
        5. dmesg.out
          396 kB
        6. slabinfo.txt
          27 kB

        Activity

          pjones Peter Jones added a comment -

          SDSC have moved on to more current releases, so I do not think any further work is needed here.
          di.wang Di Wang added a comment -

          Hello, Haisong

          Yes, I do not know the exact reason why the size-8192 slab consumed so much memory here. No, I do not think this is related to any default setting. Did you do a lot of cross-MDT operations here, like creating remote directories or striped directories? (Unfortunately, there is not enough stack trace information here.) Btw: was this stack trace collected when the OOM happened, before it, or when it was about to happen? Right now, I would suggest:

          1. Use 2.7.58 plus that patch (http://review.whamcloud.com/#/c/14926/) you need; maybe also include http://review.whamcloud.com/#/c/16161/.
          2. Please add "log_buf_len=10M" to your boot command, so we can see more of the stack trace when the error happens.
          3. Please help me find an easy way to reproduce the problem. Thanks!

          Even though 2.7.58 might not fix this issue, it is way better than 2.7.51 on DNE.
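The boot-command change in item 2 is a kernel-parameter edit. A minimal sketch of appending log_buf_len=10M to a grub kernel line, done here on a throwaway copy (on CentOS 6 the real file is typically /boot/grub/grub.conf; the kernel line shown is illustrative):

```shell
# Append log_buf_len=10M to a kernel boot line.
# Works on a temp copy; nothing real is touched.
cfg=$(mktemp)
printf 'kernel /vmlinuz-3.10.73 ro root=/dev/sda1 quiet\n' > "$cfg"
sed -i 's/^kernel .*/& log_buf_len=10M/' "$cfg"
line=$(grep '^kernel' "$cfg")
echo "$line"
rm -f "$cfg"
```

After the edit the kernel line ends with `log_buf_len=10M`; a reboot is required for the larger dmesg buffer to take effect.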

          haisong Haisong Cai (Inactive) added a comment -

          Hi WangDi,

          You stated that 2.7.58 has a lot of fixes, but it may still not fix our problem, correct?
          Can you elaborate on the slab situation? You indicated 941G (or 94G) was too big; why is that? Is it because of a default setting or some configuration mistake?

          thanks,
          Haisong
          di.wang Di Wang added a comment -

          Ah, it is. You can use that build. Thanks.

          haisong Haisong Cai (Inactive) added a comment -

          Hi Wang Di,

          I understand LU-6584 is a different problem, about OSS memory and not MDS memory.

          What I said earlier was: to work on the LU-6584 problem, we have to apply a patch soon, because they are the same file-system. That patch is built with http://review.whamcloud.com/#/c/14926/

          Is that equivalent to 2.7.58?

          Haisong
          di.wang Di Wang added a comment -

          Hmm, I think LU-6584 is a different issue. This ticket is about MDS OOM during failover? Do you happen to know an easy way to reproduce this problem?
          Btw: is it possible for you to add "log_buf_len=10M" to your boot command? The dmesg you posted here only has half of the stack trace. Thanks.

          haisong Haisong Cai (Inactive) added a comment -

          LU-6584 is about an OSS crashing problem. Those OSS servers are part of the same cluster as these very MDS servers; they are one file-system.

          We are about to apply a new patch related to LU-6584. It is built from http://review.whamcloud.com/#/c/14926/

          Will it satisfy your recommendation?

          Haisong
          di.wang Di Wang added a comment -

          Is it possible for you to upgrade the MDS to 2.7.58? There have been quite a few fixes in this area since 2.7.51.

          Btw: we are currently testing ZFS on DNE in LU-7009; please follow along there.

          haisong Haisong Cai (Inactive) added a comment -

          On one of the 2 MDS servers:

          [root@panda-mds-19-6 panda-mds-19-6]# sysctl -a | grep slab
          kernel.spl.kmem.slab_kmem_alloc = 92736
          kernel.spl.kmem.slab_kmem_max = 92736
          kernel.spl.kmem.slab_kmem_total = 172032
          kernel.spl.kmem.slab_vmem_alloc = 407675904
          kernel.spl.kmem.slab_vmem_max = 490480640
          kernel.spl.kmem.slab_vmem_total = 485459072
          vm.min_slab_ratio = 5
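To see which slab cache is actually eating the memory (e.g. the size-8192 cache discussed above), /proc/slabinfo can be ranked by approximate bytes (num_objs × objsize). A minimal sketch over a small embedded sample; the numbers are invented, and on the MDS the input would be /proc/slabinfo itself:

```shell
# Rank slab caches by approximate memory use (num_objs * objsize).
# Embedded sample stands in for /proc/slabinfo; values are invented.
slab=$(mktemp)
cat > "$slab" <<'EOF'
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
size-8192             120000 130000   8192    1    2
size-1024                500    600   1024    4    1
dentry                  2000   2500    192   20    1
EOF
top=$(awk 'NR>2 { print $1, $3*$4 }' "$slab" | sort -k2,2 -rn | head -1)
echo "$top"    # prints: size-8192 1064960000
rm -f "$slab"
```

Note the sysctl output above only covers SPL (ZFS) slabs; the size-8192 cache lives in the regular kernel slab allocator, which is why /proc/slabinfo (or the attached slabinfo.txt) is the place to look.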

          haisong Haisong Cai (Inactive) added a comment -

          Hi WangDi,

          We are running CentOS 6.6 with Linux kernel 3.10.73 from ELRepo.
          Lustre and ZFS are built as DKMS modules.

          The filesystem has 16 OSS nodes, each with 6 OSTs.

          Haisong

          People

            Assignee: laisiyao Lai Siyao
            Reporter: haisong Haisong Cai (Inactive)
            Votes: 1
            Watchers: 6
