
[LU-4650] contention on ll_inode_size_lock with mmap'ed file

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.6, Lustre 2.4.2
    • Environment: RHEL 6.4, kernel 2.6.32-431

    Description

      Our customer (CEA) is suffering from heavy contention when using a debugging tool (Distributed Debugging Tool) with a binary file located on a Lustre filesystem. The binary file is quite large (~300 MB). The debugging tool launches one gdb instance per core on the client, which reveals high contention on large SMP nodes (32 cores).

      The global launch time is about 3 minutes when the binary file is on Lustre, compared to only 20 seconds on NFS.

      After analyzing the operations performed by gdb, I have created a test program that reproduces the issue (mmaptest.c). It does the following (a minimal sketch in C is shown after the list):

      • opens a file with O_RDONLY
      • mmaps it entirely with PROT_READ, MAP_PRIVATE
      • accesses each page of the memory region, from the first page to the last
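
      The original mmaptest.c is not reproduced in this ticket text; the sketch below only illustrates the three steps above (file handling, names and error paths are assumptions, not the attached source):

      /* Minimal sketch of a reproducer along the lines of mmaptest.c described
       * above: open read-only, map the whole file privately, touch every page. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              struct stat st;
              long page = sysconf(_SC_PAGESIZE);
              volatile char sum = 0;
              size_t off;
              char *map;
              int fd;

              if (argc != 2) {
                      fprintf(stderr, "usage: %s <file>\n", argv[0]);
                      return 1;
              }

              /* open the file O_RDONLY */
              fd = open(argv[1], O_RDONLY);
              if (fd < 0 || fstat(fd, &st) < 0) {
                      perror(argv[1]);
                      return 1;
              }

              /* mmap it entirely PROT_READ, MAP_PRIVATE */
              map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
              if (map == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }

              /* access each page of the region, from first to last, so that
               * every access goes through the page-fault path */
              for (off = 0; off < (size_t)st.st_size; off += page)
                      sum += map[off];

              munmap(map, st.st_size);
              close(fd);
              return 0;
      }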

      Launch command is:

      \cp file1G /dev/null; time ./launch_mmaptest.sh 16 file1G
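
      launch_mmaptest.sh itself is not shown in this ticket; judging from the command line above, it runs N concurrent instances of the reproducer against the same file. A hypothetical C equivalent of such a driver (the script's actual contents are unknown) could look like:

      /* Hypothetical stand-in for launch_mmaptest.sh: fork <instances> copies
       * of ./mmaptest on the same file and wait for all of them. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              int i, n;

              if (argc != 3) {
                      fprintf(stderr, "usage: %s <instances> <file>\n", argv[0]);
                      return 1;
              }
              n = atoi(argv[1]);

              for (i = 0; i < n; i++) {
                      if (fork() == 0) {
                              /* child: run one copy of the mmap reproducer */
                              execl("./mmaptest", "mmaptest", argv[2], (char *)NULL);
                              perror("execl");
                              _exit(1);
                      }
              }

              /* parent: wait for every instance to finish */
              for (i = 0; i < n; i++)
                      wait(NULL);
              return 0;
      }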

      Run time with Lustre 2.1.6:

                          ext4       Lustre
      1 instance          0.339s     2.951s
      32 instances        0.558s     9m20.669s

      Run time with Lustre 2.4.2:

                          ext4       Lustre
      1 instance          0.349s     6.542s
      16 instances        0.373s     45.588s

      With several instances, the processes spend their time waiting on the inode size lock. Here is the stack of most of the instances during the test:

      [<ffffffff810a0371>] down+0x41/0x50
      [<ffffffffa0b5daa2>] ll_inode_size_lock+0x52/0x110 [lustre]
      [<ffffffffa0b97a06>] ccc_prep_size+0x86/0x270 [lustre]
      [<ffffffffa0b9f4a1>] vvp_io_fault_start+0xf1/0xb00 [lustre]
      [<ffffffffa060061a>] cl_io_start+0x6a/0x140 [obdclass]
      [<ffffffffa0604d54>] cl_io_loop+0xb4/0x1b0 [obdclass]
      [<ffffffffa0b827a2>] ll_fault+0x2c2/0x4d0 [lustre]
      [<ffffffff8114a4c4>] __do_fault+0x54/0x540
      [<ffffffff8114aa4d>] handle_pte_fault+0x9d/0xbd0
      [<ffffffff8114b7aa>] handle_mm_fault+0x22a/0x300
      [<ffffffff8104aa68>] __do_page_fault+0x138/0x480
      [<ffffffff8152e2fe>] do_page_fault+0x3e/0xa0
      [<ffffffff8152b6b5>] page_fault+0x25/0x30
      [<ffffffffffffffff>] 0xffffffffffffffff
      

    Attachments

    Issue Links

    Activity


            jay Jinshan Xiong (Inactive) added a comment - duplication of LU-4257

            dmiter Dmitry Eremin (Inactive) added a comment - Jinshan, could you answer this question please?

            pichong Gregoire Pichon added a comment - Could you explain what the new design does? Is there an HLD document available?

            dmiter Dmitry Eremin (Inactive) added a comment - This patch is a temporary solution that should improve the situation right now. We are working on a redesign of this code that avoids this lock entirely. The results are promising, but that patch will come later.

            pichong Gregoire Pichon added a comment - Thanks. The patch might improve performance because it improves lock management, but I think there is still a design/implementation issue. Why does the inode size lock need to be taken, since the file size does not change and accesses are read-only (the file is opened with O_RDONLY and the mmap is done with PROT_READ)?

            dmiter Dmitry Eremin (Inactive) added a comment - Patch http://review.whamcloud.com/9095/ should help with this.

            dmiter Dmitry Eremin (Inactive) added a comment - The root cause is the same as in LU-4257.

    People

      Assignee: dmiter Dmitry Eremin (Inactive)
      Reporter: pichong Gregoire Pichon
      Votes: 0
      Watchers: 6
