[LU-8118] very slow metadata performance with shared striped directory Created: 09/May/16  Updated: 07/Jun/16

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

lustre-2.8.0-2.6.32_573.22.1.1chaos.ch5.4.x86_64.x86_64
clients and servers all on same OS and same Lustre build.
Patch stack on top of lustre 2.8.0 tag is:

cb25ac6 Target building for RHEL7 under Koji.
675a140 LU-7841 doc: stop using python-docutils
b50a29a LU-7893 osd-zfs: calls dmu_objset_disown() with NULL
67fe716 LU-7198 clio: remove mtime check in vvp_io_fault_start()
80b4633 LLNL-0000 llapi: get OST count from proc
e2717c9 LU-5725 ofd: Expose OFD site_stats through proc
8d9a8f2 LU-4009 osd-zfs: Add tunables to disable sync (DEBUG)
699abe4 LU-8073 build: Eliminate lustre-source binary package
71ee38a LU-8072 build: Restore module debuginfo
7fb8959 LU-7962 build: Support builds w/ weak module ZFS
66579d9 LU-7961 build: Fix ldiskfs source autodetect for CentOS 6
52aa718 LU-7643 build: Remove Linux version string from RPM release field
445b063 LU-5614 build: use %kernel_module_package in rpm spec
f5b8fb1 LU-7699 build: Convert lustre_ver.h.in into a static .h file
333612e LU-7699 build: Eliminate lustre_build_version.h
a49b396 LU-7699 build: Replace version_tag.pl with LUSTRE-VERSION-GEN
6948075 LU-7518 build: Remove the Phi accelerator-specific packaging
ea79df5 New tag 2.8.0-RC5

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are intermittently experiencing a severe metadata performance issue. So far it has occurred with a striped directory being altered by many threads on multiple nodes.

Setup:
Multiple MDTs, one MDT per MDS
A directory striped across those MDTs, with its default stripe policy set via lfs setdirstripe -D (see the example after this list)
several processes on each of several nodes making metadata changes in that striped dir
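
For reference, a directory of this kind would typically be created and given a default stripe policy with something like the following (the stripe count and path are illustrative, not taken from this ticket):

lfs setdirstripe -c 10 /p/lustre/faaland1/stripedir        # create the directory striped across 10 MDTs
lfs setdirstripe -D -c 10 /p/lustre/faaland1/stripedir     # make new subdirectories inherit the same striping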

Symptoms:
Create rates in the tens of creates/second total across all the MDTs hosting the shards
getdents() calls taking >1000 seconds each

In one case:
The directory was striped across 10 MDTs and 16 OSTs
Mdtest had been run as follows:

srun -N 10 -n 80 mdtest -d /p/lustre/faaland1/mdtest -n 128000 -F

The create rate started out at about 30,000 creates/second total across all 10 MDTs. After some time it dropped to 10-20 creates/second. On a separate node, which mounted the same filesystem but was not running any of the mdtest processes and was entirely idle, I ran ls and observed the very slow getdents() calls.
On yet another idle node, I created another directory, striped across the same MDTs, and created 10000 files within it. The create rate was good, and listing that directory produced getdents() times of about 0.003 seconds.
There were no indications of network problems within the nodes at the time, nor before or after our test (this is on catalyst and the nodes are normally used for compute jobs and monitored 24x7).

In the other case:
This filesystem has 4 MDTs, each on its own MDS, and 2 OSTs.
The directory is striped across all 4 MDTs and has -D set.

The workload involved 10 clients, each running 96 threads. Shell scripts were randomly invoking mkdir, touch, rmdir, or rm (the latter two having chosen a file or directory to remove).
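
A minimal sketch of what one such per-thread script could have looked like (the actual scripts are not attached to this ticket; the path, naming scheme, and selection logic below are assumptions for illustration only):

#!/bin/bash
# Hypothetical reconstruction of the random metadata workload described above.
DIR=/p/lustre/faaland1/shared_striped_dir   # shared striped directory (illustrative path)
TAG=$$                                      # per-process tag to keep names unique
while true; do
    case $((RANDOM % 4)) in
        0) mkdir "$DIR/d.$TAG.$RANDOM" 2>/dev/null ;;                     # mkdir
        1) touch "$DIR/f.$TAG.$RANDOM" ;;                                 # touch (create a file)
        2) d=$(ls -d "$DIR"/d.* 2>/dev/null | head -n 1) && rmdir "$d" ;; # rmdir a chosen directory
        3) f=$(ls "$DIR"/f.* 2>/dev/null | head -n 1) && rm -f "$f" ;;    # rm a chosen file
    esac
done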

The create rate started at about 10,000/sec, concurrent with thousands of unlinks, mkdirs, rmdirs, and stats per second. All of those operations slowed to single-digit per-second rates. An ls in that common directory, on a node not running the job, also produced >1000-second getdents() calls.

strace -T output:

getdents(3, /* 1061 entries */, 32768)  = 32760 <1887.749809>
getdents(3, /* 1102 entries */, 32768)  = 32760 <1990.174707>
getdents(3, /* 1087 entries */, 32768)  = 32752 <1994.781547>
getdents(3, /* 1056 entries */, 32768)  = 32768 <1907.404333>
brk(0xcf0000)                           = 0xcf0000 <0.000030>
getdents(3, /* 1091 entries */, 32768)  = 32752 <1860.720958>
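
(The exact strace invocation is not preserved in the ticket; -T output of this form is typically produced with something along the lines of:

strace -T ls /p/lustre/faaland1/mdtest > /dev/null

where -T reports the time spent in each system call.)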


 Comments   
Comment by Andreas Dilger [ 10/May/16 ]

Are your workloads creating subdirectories? Have you tested without "setstripe -D" to see if that is the cause of the slowdown?

Comment by Joseph Gmitter (Inactive) [ 10/May/16 ]

Hi Lai,

Can you please advise on this?

Thanks.
Joe

Comment by Olaf Faaland [ 10/May/16 ]

Andreas,
The mdtest workload did not create subdirectories. I have not tested with a shared directory but without setdirstripe -D.
-Olaf

Comment by Andreas Dilger [ 10/May/16 ]

The "setstripe -D" should only affect subdirectory creation. That would appear to affect your second example where you wrote "Shell scripts were randomly invoking mkdir, touch, rmdir, or rm" but not the mdtest run, which is only creating files.

As a starting point, it would be useful to collect full debug logs from one of the clients when it is in the "slow create" mode, to see if it is blocked locally or waiting on the MDS. If possible, collecting debug logs (at least +dlmtrace +rpctrace) from the MDSes during this slowdown would also be useful. Do you have any indication that one MDS is slower than the others (higher CPU load or load average) during this time?
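
For reference, those debug flags would typically be enabled, and the logs dumped, with something like the following on each MDS and on an affected client (the output path is illustrative):

lctl set_param debug=+dlmtrace   # enable DLM (lock) tracing
lctl set_param debug=+rpctrace   # enable RPC tracing
lctl set_param debug_mb=512      # optionally enlarge the in-memory debug buffer (MB)
lctl clear                       # discard anything already in the buffer
# ... reproduce the slow creates / slow getdents() ...
lctl dk /tmp/lustre-debug.log    # dump the kernel debug log to a file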
