
MDS became unresponsive, clients hanging until MDS fail over

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.2
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 15580

    Description

      This morning some of our clients were hanging (others had not been checked at that time); the active MDS was unresponsive and flooding the console with stack traces. We had to fail over to the second MDS to get the file system back.

      Looking at the system logs, we see a large number of these messages:
      kernel: socknal_sd00_02: page allocation failure. order:2, mode:0x20
      all followed by many stack traces; the full log is attached. Our monitoring shows that memory was mainly used by buffers, but that had been the case for all of last week already and was stable, only slowly increasing. After the restart, the memory used by buffers quickly increased to about 60% and currently seems to be stable around there.
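
      A minimal sketch of the kind of periodic capture that could help correlate the next incident with memory state, using only standard /proc interfaces; the output directory and interval are arbitrary examples and the loop assumes root on the MDS:

        # capture overall memory, per-order free pages and slab usage every 5 minutes
        while true; do
            d=$(date +%Y%m%d-%H%M%S)
            cat /proc/meminfo   > /var/tmp/meminfo.$d     # buffer/cache usage over time
            cat /proc/buddyinfo > /var/tmp/buddyinfo.$d   # free pages per order (the failures above were order:2)
            cat /proc/slabinfo  > /var/tmp/slabinfo.$d    # per-slab counts, e.g. ldlm_locks, ldlm_resources
            sleep 300
        done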

      Just before these page allocation failure messages we noticed a few client reconnect messages, but have not been able to find any network problems so far. Since the restart of the MDT, no unexpected client reconnects have been seen.

      We are running lustre 2.5.2 + 4 patches as recommended in LU-5529 and LU-5514.

      We've been hammering the MDS a bit since the upgrade, creating files, stat'ing many files/directories from many clients, and removing many files, but I would still expect the MDS not to fall over like this.

      Is this a problem/memory leak in Lustre or something else? Could it be related to different compile options when compiling Lustre? We did compile the version on the MDS in house with these patches, and there is always a chance we didn't use quite the same compile time options that the automatic build process would use.

      What can we do to debug this further and avoid it in the future?

      Attachments

        Issue Links

          Activity

            [LU-5585] MDS became unresponsive, clients hanging until MDS fail over

            ferner Frederik Ferner (Inactive) added a comment -

            Depends on your view; we've got just under 300 clients on this file system.

            We'll try limiting the lru_size and will continue to monitor; looking at LU-5727, I'm not sure how much this will give us.

            Considering that we have been cleaning the file system, it is also entirely possible that we hit something similar to LU-5726, i.e. we almost certainly have run 'rm -rf' or similar in parallel on multiple clients. I will try to reproduce this tomorrow.
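
            A rough sketch of what that reproduction attempt might look like, assuming pdsh is available; the client list and the scratch directory are placeholders:

              # each client removes its own throwaway subtree in parallel,
              # mimicking the concurrent 'rm -rf' cleanup described above
              pdsh -w client[01-20] 'rm -rf /mnt/lustre/scratch/$(hostname)'

              # meanwhile, watch the DLM lock counts on the MDS
              watch -n 10 'lctl get_param ldlm.namespaces.*.lock_count'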

            pjones Peter Jones added a comment -

            Bobijam

            Could this be related to the issue reported in LU-5727?

            Peter

            green Oleg Drokin added a comment -

            Do you have many clients on this system?

            It's been a known problem in the past that if you let client LRUs grow uncontrollably, servers become somewhat memory starved.

            One possible workaround is to set lru_size on the clients to something conservative like 100 or 200. Also, if you have mostly non-intersecting jobs on the clients that don't reuse the same files between different jobs, some sites drop lock LRUs (and other caches) forcefully in between job runs.
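
            For reference, a minimal sketch of that workaround as it might be run on the clients; the value and the namespace wildcard are only examples:

              # cap each DLM namespace LRU at a fixed, conservative size
              lctl set_param ldlm.namespaces.*.lru_size=200

              # or, between unrelated job runs, drop lock LRUs and page/dentry caches entirely
              lctl set_param ldlm.namespaces.*.lru_size=clear
              echo 3 > /proc/sys/vm/drop_caches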


            ferner Frederik Ferner (Inactive) added a comment -

            Unfortunately this started to severely affect file system performance, so we had to fail over. I was nearly in time to do a clean unmount, but not quite. By the time I started typing the umount command, the MDS froze completely and I was not able to collect any debug_log.

            Since this is now a recurring feature of this file system, any idea how we could prevent it from re-occurring would be much appreciated. If there is anything we can do to help debug this, let us know; we'll do what we can.

            Frederik


            ferner Frederik Ferner (Inactive) added a comment -

            I've done that; unfortunately it didn't seem to free up much memory.

            During the initial sweep of lctl get_param ldlm.namespaces.*MDT*.lru_size for this file system, adding up the numbers for all reachable clients (a few are currently unresponsive and are being looked at; we assume this is unrelated), we seem to have about 5.3M locks on the clients (corresponding to the most recent slabinfo snapshot showing 5.4M ldlm_locks).

            After the lru_size=clear, both numbers dropped; now, about 20 minutes later, they are back at about 1.5M each.

            Fresh meminfo/slabinfo snapshots taken about 20 minutes after clearing the locks are attached.
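
            For completeness, a sketch of how the server-side numbers quoted above can be pulled on the MDS, using standard slabinfo plus the DLM's own counters (the exact slab names may vary by version):

              # active/total objects for the DLM slabs
              awk '/^ldlm_locks |^ldlm_resources / {print $1, "active:", $2, "total:", $3}' /proc/slabinfo

              # per-namespace lock/resource counters kept by the DLM itself
              lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.resource_count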


            adilger Andreas Dilger added a comment -

            What is a bit odd here is that there are 5.5M in-use ldlm_locks on 2.9M ldlm_resources, yet there are only 190K inodes in memory (166K objects). This implies there is something kind of strange happening in the DLM, since there should only be a single resource per MDT object. There should be at least one ldlm_resource for each ldlm_lock, though having more locks than resources is OK as multiple clients may lock the same resource, or a single client may lock different parts of the same resource.

            One experiment you might do is to run lctl get_param ldlm.namespaces.*MDT*.lru_size to get the count of locks held by all the clients, and then lctl set_param ldlm.namespaces.*MDT*.lru_size=clear on the clients to drop all their DLM locks. The set_param will cancel the corresponding locks on the server and flush the client metadata cache as a result, which may have a short-term negative impact on metadata performance, in case that is a concern.
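
            A sketch of how this experiment could be run across many clients at once, assuming pdsh; the hostlist and the *MDT* wildcard are placeholders that depend on the local setup:

              # 1) sum the per-client lock counts for the MDT namespace(s)
              pdsh -w client[001-300] 'lctl get_param -n ldlm.namespaces.*MDT*.lru_size' 2>/dev/null \
                  | awk '{sum += $2} END {print "total client-held MDT locks:", sum}'

              # 2) drop all client-held DLM locks (this also flushes the client metadata caches)
              pdsh -w client[001-300] 'lctl set_param ldlm.namespaces.*MDT*.lru_size=clear'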

            The cancellation of locks on the clients should result in all of the ldlm_locks structures being freed on the MDS (or at least the sum of the locks on the clients should match the number of ACTIVE ldlm_locks allocated on the MDS). If that isn't the case, it seems we have some kind of leak in the DLM.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 7
