[LU-14221] Client hangs when using DoM with a fixed mdc lru_size

Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Affects Version/s: Lustre 2.12.5, Lustre 2.12.6

    Description

      After enabling DoM and beginning to use one of our file systems more heavily recently, we discovered a bug seemingly related to locking.

      Basically, with any fixed `lru_size`, everything works normally until the number of locks hits the `lru_size`. From that point, everything hangs until the `lru_max_age` is reached, at which point the client clears the locks and moves on, until the LRU fills again. We confirmed this by setting the `lru_size` pretty low, setting a low (10s) `lru_max_age`, and kicking off a tar extraction. The tar would extract until the `lock_count` hit our `lru_size` value (basically 1 for 1 with the number of files), then hang for 10s, then continue with another batch after the locks had been cleared. The same behavior can be reproduced by letting it hang and then running `lctl set_param ldlm.namespaces.*mdc*.lru_size=clear`, which frees up the process temporarily as well.
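
      For reference, a minimal reproducer along the lines described above might look like the following. The mdc namespace glob, tarball path, and target directory are illustrative placeholders, and on older clients `lru_max_age` may expect a plain millisecond value rather than the `10s` form shown here:

        # pin the MDC lock LRU to a small fixed size and a short max age
        lctl set_param ldlm.namespaces.*-mdc-*.lru_size=200
        lctl set_param ldlm.namespaces.*-mdc-*.lru_max_age=10s
        # extract enough files into a DoM directory to exceed lru_size
        # (placeholder tarball and target directory)
        tar -xf /tmp/many-small-files.tar -C /mnt/lustre/dom_dir
        # in another shell, watch lock_count climb to lru_size and the extraction stall
        watch -n 1 'lctl get_param ldlm.namespaces.*-mdc-*.lock_count'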

       

      Our current workaround is to set `lru_size` to 0 and set the `lru_max_age` to 30s to keep the number of locks to a manageable level.
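
      Concretely, the workaround amounts to something like this (the mdc namespace glob is again illustrative, and the same millisecond caveat for `lru_max_age` applies on older clients):

        # dynamic LRU sizing, but age unused locks out quickly
        lctl set_param ldlm.namespaces.*-mdc-*.lru_size=0
        lctl set_param ldlm.namespaces.*-mdc-*.lru_max_age=30s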

       

      This appears to only occur on our SLES clients. RHEL clients running the same Lustre version encounter no such problems. This may be due to the kernel version on SLES (4.12.14-197) vs. RHEL (3.10.0-1160).

       

      James believes this may be related to LU-11518.

       

      lru_size and lock_count while it's stuck:

      lctl get_param ldlm.namespaces.*.lru_size
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lru_size=200
      lctl get_param ldlm.namespaces.*.lock_count
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lock_count=201

       

      Process stack while it's stuck:

      [<ffffffffa0ad1932>] ptlrpc_set_wait+0x362/0x700 [ptlrpc]
      [<ffffffffa0ad1d57>] ptlrpc_queue_wait+0x87/0x230 [ptlrpc]
      [<ffffffffa0ab7217>] ldlm_cli_enqueue+0x417/0x8f0 [ptlrpc]
      [<ffffffffa0a6105d>] mdc_enqueue_base+0x3ad/0x1990 [mdc]
      [<ffffffffa0a62e38>] mdc_intent_lock+0x288/0x4c0 [mdc]
      [<ffffffffa0bf29ca>] lmv_intent_lock+0x9ca/0x1670 [lmv]
      [<ffffffffa0cfea99>] ll_layout_intent+0x319/0x660 [lustre]
      [<ffffffffa0d09fe2>] ll_layout_refresh+0x282/0x11d0 [lustre]
      [<ffffffffa0d47c73>] vvp_io_init+0x233/0x370 [lustre]
      [<ffffffffa085d4d1>] cl_io_init0.isra.15+0xa1/0x150 [obdclass]
      [<ffffffffa085d641>] cl_io_init+0x41/0x80 [obdclass]
      [<ffffffffa085fb64>] cl_io_rw_init+0x104/0x200 [obdclass]
      [<ffffffffa0d02c5b>] ll_file_io_generic+0x2cb/0xb70 [lustre]
      [<ffffffffa0d03825>] ll_file_write_iter+0x125/0x530 [lustre]
      [<ffffffff81214c9b>] __vfs_write+0xdb/0x130
      [<ffffffff81215581>] vfs_write+0xb1/0x1a0
      [<ffffffff81216ac6>] SyS_write+0x46/0xa0
      [<ffffffff81002af5>] do_syscall_64+0x75/0xf0
      [<ffffffff8160008f>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [<ffffffffffffffff>] 0xffffffffffffffff
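
      For anyone reproducing this, a stack like the one above can be captured from the hung task with standard kernel facilities; the `tar` process name below is just an example:

        # dump the kernel stack of the stuck process (run as root)
        cat /proc/$(pidof tar)/stack
        # or log all blocked tasks to the kernel ring buffer
        echo w > /proc/sysrq-trigger
        dmesg | tail -n 100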

      I can reproduce and provide any other debug data as necessary.

      Attachments

        Issue Links

          Activity

            [LU-14221] Client hangs when using DoM with a fixed mdc lru_size

            simmonsja James A Simmons added a comment -

            Moved to Lustre 2.15 which has DoM working natively.

            simmonsja James A Simmons added a comment -

            Patch 41008 is ready to land.

            adilger Andreas Dilger added a comment -

            Cory previously asked:

            May I ask what harm there is with the large (default) lru_max_age? You say that it is bad that lots of clients may have lots of locks. Is the server not able to handle the lock pressure? Does back pressure not get applied to the clients? Are the servers unable to revoke locks upon client request in a timely manner? I guess I just don't understand why it is inherently bad to use the defaults. Could you explain more? Thanks!

            I think there are two things going on here. Having a large lru_max_age means that unused locks (and potentially data cached under those locks) may linger on the client for a long time. That consumes memory on the MDS and OSS for every lock that every client holds, which could probably be better used somewhere else. Also, there is more work needed at recovery time if the MDS/OSS crashes to recover those locks. Also, having a large number of locks on the client or server adds some overhead to all lock processing due to having more locks to deal with because of longer hash collision chains.

            There is the "dynamic LRU" code that has existed for many years to try and balance MDS lock memory usage vs. client lock requests, but I've never really been convinced that it works properly (see e.g. LU-7266 and related tickets). I also think that when clients have so much RAM these days, it can cause a large number of locks to stay in memory for a long time until there is a sudden shortage of memory on the server, and the server only has limited mechanisms to revoke locks from the clients. It can reduce the "lock volume" (part of the "dynamic LRU" functionality) but this is at best a "slow burn" that is intended (if working properly) to keep the steady-state locking traffic in check. More recently, there was work done under LU-6529 "Server side lock limits to avoid unnecessary memory exhaustion" to allow more direct reclaim of DLM memory on the server when it is under pressure. We want to avoid the server cancelling locks that are actively in use by the client, but the server has no real idea about which locks the client is reusing, and which ones were only used once, so it does the best job it can with the information it has, but it is better if the client does a better job of keeping the number of locks under control.

            So there is definitely a balance between being able to cache locks and data on the client vs. sending more RPCs to the server and reducing memory usage on both sides. That is why having a shorter lru_max_age is useful, but longer term LU-11509 "LDLM: replace lock LRU with improved cache algorithm" would improve the selection of which locks to keep cached on the client, and which (possibly newer, but use-once locks) should be dropped. That is as much a research task as a development effort.

            tappro Mikhail Pershin added a comment - edited

            For anyone interested, the patch from LU-11518, https://review.whamcloud.com/41008, is the one that solves the problem on 2.12.6 for me. After applying it, untar no longer freezes when lru_size has a fixed size.
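
            For anyone wanting to try the same change on their own 2.12 tree, it can be fetched from Gerrit and cherry-picked. This is only a sketch: the repository path on the review server and the branch name are assumptions, and <N> stands for the patch-set number shown on the review page.

              # clone the Lustre source (or use an existing checkout)
              git clone git://git.whamcloud.com/fs/lustre-release.git
              cd lustre-release
              git checkout b2_12                     # assumed 2.12 maintenance branch name
              # fetch change 41008; replace <N> with the patch-set number from Gerrit
              git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/08/41008/<N>
              git cherry-pick FETCH_HEAD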


            simmonsja James A Simmons added a comment -

            I think the LU-11518 work should resolve the rest of the problems.

            nilesj Jeff Niles added a comment -

            Glad you're able to reproduce on 2.12.5. I do find it a bit odd that we experience problems with 2.12.6 while you don't; perhaps it's the larger dataset, like you mention. I think it would be beneficial to figure out what code changed to fix the issue for you in 2.12.6, as it may reveal why we still see issues. Probably not the highest priority work though.

            tappro Mikhail Pershin added a comment - edited

            I am able to reproduce that issue on the initial 2.12.5 release with a 3.10 kernel RHEL client, and I also checked that everything works with the latest 2.12.6 version. It seems there is a patch in between that fixed the issue. I will run git bisect to find it, if that is what we need.

            With the latest 2.12.6 I have no problems with a fixed lru_size=100, but maybe my test set is just not big enough.
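
            If someone wants to narrow it down the same way, the bisect would look roughly like this. The tag names follow the usual lustre-release convention but should be double-checked, and custom terms are used because we are hunting the commit that fixed the behaviour rather than one that broke it.

              cd lustre-release
              # hunt for the fixing commit: old releases are "broken", new ones "fixed"
              git bisect start --term-old=broken --term-new=fixed
              git bisect broken v2_12_5      # hang reproduces here (assumed tag name)
              git bisect fixed v2_12_6       # hang no longer reproduces (assumed tag name)
              # at each step: build and install the client, rerun the untar reproducer
              # with a fixed lru_size, then mark the result
              git bisect fixed               # or: git bisect broken
              # when finished
              git bisect reset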

            nilesj Jeff Niles added a comment -

            Unfortunately that's where my knowledge ends. I do know that a large number of locks puts memory pressure on the MDSs, but from Andreas' comment above, it seems like it should start applying back pressure to the clients at some point?

            Historically, on our large systems we've had to limit the lru_size to prevent overload issues with the MDS. This was the info that we were operating off of, but maybe that's not the case any more.

            spitzcor Cory Spitz added a comment -

            Thanks for the clarification. Good sleuthing too!
            May I ask what harm there is with the large (default) lru_max_age? You say that it is bad that lots of clients may have lots of locks. Is the server not able to handle the lock pressure? Does back pressure not get applied to the clients? Are the servers unable to revoke locks upon client request in a timely manner? I guess I just don't understand why it is inherently bad to use the defaults. Could you explain more? Thanks!

            nilesj Jeff Niles added a comment -

            Hey Cory,

            When set to a fixed LRU size, a 2.12.6 client will complete write actions in a DoM directory in a time about equal to (number of files to process / lru_size) * lru_max_age. Essentially it completes work as the max age is hit, 200 (or whatever the number is) tasks at a time.
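
            To put rough numbers on that formula, take a hypothetical extraction of 10,000 files with the lru_size=200 and lru_max_age=10s values from the reproducer in the description:

              (10,000 files / 200 locks per batch) * 10s per batch ≈ 500s total

            i.e. the run time is dominated by the hangs between 200-file batches rather than by the I/O itself.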

            When set to a dynamic LRU size (0), a 2.12.6 client will work as expected, except that it will leave every single lock open until it hits the max_age limit (by default 65 minutes). Obviously this is less than ideal for a large-scale system with a bunch of clients all at 50k locks. This is the basis of our workaround: set a dynamic LRU size and set a max_age of 30s or so to time them out quickly. Not ideal, but it'll work for now.

             

            The determination that something was fixed between 2.12.6 and 2.14 was based on our reproducer finishing in a normal amount of time with a fixed LRU size (2000) on 2.14, rather than in (number of files to process / lru_size) * lru_max_age, as we were seeing with 2.12.6. Since I don't think I said it above, even with lru_size=2000 on 2.12.6, we were still seeing issues where it would process about 2000 files, hang until those 2000 locks hit the max_age value, and then proceed. The issue isn't just limited to low lru_size settings.

             

            To be clear, have you run the experiment with default lru_size and lru_max_age? Does the LTS client behave poorly? Or, does it match the non-DoM performance?

            Yes. A 2.12.6 LTS client works great with default (0) lru_size, except that it keeps all the locks open until max_age. This LU is specifically about the bug as it relates to fixed mdc lru_size settings.


            People

              tappro Mikhail Pershin
              nilesj Jeff Niles
              Votes: 0
              Watchers: 11
