[LU-16046] Shared-file I/O performance is poor under group lock Created: 25/Jul/22  Updated: 07/Mar/23  Resolved: 14/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Vitaly Fertman
Assignee: Vitaly Fertman
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9964 > 1 group lock on same file (group lo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LU-9964 fixed the crashes but made unlocks synchronous; as a result, a group unlock may take dozens of seconds. Specifically, the test has N ranks open (with O_CREAT) the same file, take a group lock, write 1MB segments for a total of 128GB (each rank writes 128GB/rank count) at a stride of (1MB*rank count), then call fsync() and release the group lock. Timings are similar with or without MPI barriers after each phase.

The ticket is to make group unlock asynchronous.
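
For illustration only, the access pattern of the test described above can be sketched as follows. This is a scaled-down, single-process approximation, not the reporter's actual test: the ranks run one after another instead of under MPI, the sizes are tiny, and the group lock and unlock steps are only marked by comments because they need a Lustre mount. The names rank_offsets, run_rank and demo are hypothetical.

```python
import os
import tempfile

MIB = 1 << 20


def rank_offsets(rank, nranks, total_bytes, seg=MIB):
    """Offsets written by one rank: seg-sized pieces at a stride of seg * nranks."""
    per_rank = total_bytes // nranks      # each rank writes total / rank count
    stride = seg * nranks                 # distance between a rank's segments
    return [rank * seg + i * stride for i in range(per_rank // seg)]


def run_rank(path, rank, nranks, total_bytes, seg=MIB):
    """One rank's phases: open(O_CREAT), strided writes, fsync()."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # On Lustre, the group lock would be taken here (omitted in this sketch).
        buf = bytes([rank % 256]) * seg
        for off in rank_offsets(rank, nranks, total_bytes, seg):
            os.pwrite(fd, buf, off)
        os.fsync(fd)
        # On Lustre, the group unlock would follow here; this is the phase
        # the ticket reports as slow.
    finally:
        os.close(fd)


def demo(nranks=4, total_bytes=16 * 4096, seg=4096):
    """Run all ranks sequentially against one temp file; return its final size."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "shared")
        for r in range(nranks):
            run_rank(path, r, nranks, total_bytes, seg)
        return os.path.getsize(path)
```

With 4 ranks and a 16MB file, rank 1 would write at 1MB, 5MB, 9MB and 13MB; taken together, the ranks cover every segment of the file exactly once.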



 Comments   
Comment by Gerrit Updater [ 25/Jul/22 ]

"Vitaly Fertman <vitaly.fertman@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48037
Subject: LU-16046 revert: "LU-9964 llite: prevent mulitple group locks"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0209c56134919eed6e81c006e3eb45e77656bab9

Comment by Gerrit Updater [ 25/Jul/22 ]

"Vitaly Fertman <vitaly.fertman@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48038
Subject: LU-16046 ldlm: group lock fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 74fd9699bb25e6e607b4c7401d05e69c27594e20

Comment by Gerrit Updater [ 15/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48037/
Subject: LU-16046 revert: "LU-9964 llite: prevent mulitple group locks"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bc37f89a81ea0a2fae8668e21247552e8894bfd8

Comment by Gerrit Updater [ 15/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48038/
Subject: LU-16046 ldlm: group lock fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ffcb5b700ebfd68dba4daca4192fdacaf7fd541

Comment by Peter Jones [ 16/Oct/22 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 17/Oct/22 ]

Bisection points to this patch (3ffcb5b700, "LU-16046 ldlm: group lock fix"):

[  106.147323] Lustre: DEBUG MARKER: == sanity test 184e: Recreate layout after stripeless layout swaps ========================================================== 13:37:47 (1666013867)
[  106.450660] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:934
[  106.450907] in_atomic(): 1, irqs_disabled(): 0, pid: 9400, name: lfs
[  106.451024] INFO: lockdep is turned off.
[  106.451101] CPU: 0 PID: 9400 Comm: lfs Tainted: G        W  O     --------- -  - 4.18.0 #1
[  106.451235] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  106.451346] Call Trace:
[  106.451403]  dump_stack+0x5c/0x80
[  106.451480]  ___might_sleep.cold.21+0x9b/0xa8
[  106.451550]  __mutex_lock+0x41/0x930
[  106.451628]  ? osc_grouplock_dec+0x28/0x1a0 [osc]
[  106.451717]  osc_grouplock_dec+0x28/0x1a0 [osc]
[  106.451805]  osc_object_ast_clear+0x1ba/0x370 [osc]
[  106.451882]  ? osc_attr_get+0x30/0x30 [osc]
[  106.451988]  ldlm_resource_foreach+0xcc/0x280 [ptlrpc]
[  106.452081]  ? osc_attr_get+0x30/0x30 [osc]
[  106.452175]  ldlm_resource_iterate+0x122/0x180 [ptlrpc]
[  106.452273]  osc_object_prune+0x50/0x90 [osc]
[  106.452377]  cl_object_prune+0x50/0x130 [obdclass]
[  106.452521]  lov_delete_composite+0xfc/0x490 [lov]
[  106.452611]  lov_conf_set+0x654/0xb10 [lov]
[  106.452695]  cl_conf_set+0x58/0x130 [obdclass]
[  106.452788]  ll_layout_conf+0x120/0x400 [lustre]
[  106.452872]  ? ll_layout_refresh+0x6e3/0x1440 [lustre]
[  106.452956]  ll_layout_refresh+0x6e3/0x1440 [lustre]
[  106.453039]  vvp_io_init+0x209/0x360 [lustre]
[  106.453135]  __cl_io_init.isra.2+0x7f/0x150 [obdclass]
[  106.453224]  cl_setattr_ost+0x19c/0x2f0 [lustre]
[  106.453308]  ll_setattr_raw+0x10a3/0x1340 [lustre]

COMMIT	TESTED	PASSED	FAILED	RESULT	COMMIT DESCRIPTION
3ffcb5b700	4	3	1	BAD	LU-16046 ldlm: group lock fix
bc37f89a81	10	10	0	GOOD	LU-16046 revert: "LU-9964 llite: prevent mulitple group locks"
59f0d69168	10	10	0	GOOD	LU-15721 llite: only statfs for projid if PROJINHERIT set
a41ee518f0	10	10	0	GOOD	LU-16219 tests: syntax error fix
e174717923	10	10	0	GOOD	LU-16198 tests: increase margin for sanity/33hh
af0ce0ca76	10	10	0	GOOD	LU-16200 tests: test_32[f,g]: specify blocksize explicitly
d3074511f3	10	10	0	GOOD	LU-16180 ptlrpc: reduce lock contention in ptlrpc_free_committed
f5ca6853b8	10	10	0	GOOD	LU-16076 utils: enhance 'lfs check' command
0bb491b2ec	10	10	0	GOOD	LU-16044 osd: discard pagecache in truncate's declaration

Comment by Vitaly Fertman [ 26/Oct/22 ]

Alex, a link to a failure, please?

Comment by Alex Zhuravlev [ 27/Oct/22 ]

Vitaly Fertman wrote: "Alex, a link to a failure please ?"

This is a local setup. AT doesn't hit this because AT's kernel has no debugging enabled (e.g. CONFIG_DEBUG_ATOMIC_SLEEP).

Comment by Cory Spitz [ 09/Nov/22 ]

https://review.whamcloud.com/c/fs/lustre-release/+/49008

Comment by Gerrit Updater [ 14/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49008/
Subject: LU-16046 ldlm: group lock unlock fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3dc261c06434eceee3ba9ef86d1f376954b2d234

Comment by Peter Jones [ 14/Nov/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 03/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50201
Subject: LU-16046 ldlm: group lock unlock fix
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3aed6087019b12bbaaeeaa0fdec37650242ee41b

Comment by Gerrit Updater [ 07/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50226
Subject: LU-16046 ldlm: group lock unlock fix
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 200a29391b5d78ef964e958ed5f6dd19e322f2f0
