[LU-14513] sanity-quota test_1g: OST hangs Created: 11/Mar/21  Updated: 22/Jun/21  Resolved: 22/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14370 sanity-quota test_12b: @@@@@@ FAIL: ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for John Hammond <jhammond@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2781340b-ff26-425a-beb3-20ba863a3ba7

test_1g failed with the following error:

Timeout occurred after 82 mins, last suite running was sanity-quota
[ 1323.357288] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-quota test 1g: Quota pools: Block hard limit with wide striping ============================ 08:29:48 \(1615451388\)
[ 1323.755734] Lustre: DEBUG MARKER: == sanity-quota test 1g: Quota pools: Block hard limit with wide striping ============================ 08:29:48 (1615451388)
[ 1330.827990] Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
[ 1333.538285] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[ 1334.188893] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0000.quota_slave.enabled
[ 1334.855615] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[ 1335.517492] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0001.quota_slave.enabled
[ 1336.294698] Lustre: DEBUG MARKER: /usr/sbin/lctl mark User quota \(block hardlimit:40 MB\)
[ 1336.722983] Lustre: DEBUG MARKER: User quota (block hardlimit:40 MB)
[ 1352.988976] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
[ 1354.015613] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Write...
[ 1354.410882] Lustre: DEBUG MARKER: Write...
[ 1356.817584] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Write out of block quota ...
[ 1357.214762] Lustre: DEBUG MARKER: Write out of block quota ...
[ 1400.932206] Lustre: ll_ost_io00_001: service thread pid 11830 was inactive for 42.752 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1400.932210] Pid: 46929, comm: ll_ost00_004 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 22 19:46:31 UTC 2021
[ 1400.935288] Lustre: Skipped 1 previous similar message
[ 1400.936834] Call Trace TBD:
[ 1400.937039] [<0>] cv_wait_common+0xaf/0x130 [spl]
[ 1400.939186] [<0>] txg_wait_synced_impl+0xc2/0x110 [zfs]
[ 1400.940043] [<0>] txg_wait_synced+0xc/0x40 [zfs]
[ 1400.940821] [<0>] osd_sync+0xd6/0x150 [osd_zfs]
[ 1400.941607] [<0>] qsd_acquire+0x4c8/0xd60 [lquota]
[ 1400.942372] [<0>] qsd_op_begin+0x594/0xd60 [lquota]
[ 1400.943139] [<0>] osd_declare_attr_set+0x2da/0xb20 [osd_zfs]
[ 1400.944081] [<0>] ofd_attr_set+0x298/0xff0 [ofd]
[ 1400.944804] [<0>] ofd_setattr_hdl+0x43d/0x8a0 [ofd]
[ 1400.945974] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[ 1400.946893] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[ 1400.947914] [<0>] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[ 1400.948711] [<0>] kthread+0x112/0x130
[ 1400.949318] [<0>] ret_from_fork+0x35/0x40
[ 1400.949961] Pid: 15269, comm: ll_ost00_003 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 22 19:46:31 UTC 2021
[ 1400.951523] Call Trace TBD:
[ 1400.951989] [<0>] cv_wait_common+0xaf/0x130 [spl]
[ 1400.952768] [<0>] txg_wait_synced_impl+0xc2/0x110 [zfs]
[ 1400.953625] [<0>] txg_wait_synced+0xc/0x40 [zfs]
[ 1400.954358] [<0>] osd_sync+0xd6/0x150 [osd_zfs]
[ 1400.955072] [<0>] qsd_acquire+0x4c8/0xd60 [lquota]
[ 1400.955817] [<0>] qsd_op_begin+0x594/0xd60 [lquota]
[ 1400.956583] [<0>] osd_declare_attr_set+0x2da/0xb20 [osd_zfs]
[ 1400.957473] [<0>] ofd_attr_set+0x298/0xff0 [ofd]
[ 1400.958205] [<0>] ofd_setattr_hdl+0x43d/0x8a0 [ofd]
[ 1400.959002] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[ 1400.959891] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[ 1400.960897] [<0>] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[ 1400.961655] [<0>] kthread+0x112/0x130

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-quota test_1g - Timeout occurred after 82 mins, last suite running was sanity-quota



 Comments   
Comment by Alex Zhuravlev [ 11/Mar/21 ]

It looks like LU-14370 exposes an out-of-order locking issue:
osd_declare_attr_set() grabs the o_guard semaphore and then waits for the current TXG to complete (to return reserved quota), while in the write path (ofd_write_commit() and below) we grab the TX first and then the o_guard semaphore.
I think osd_declare_attr_set() can release o_guard before calling qsd_transfer().

Comment by Gerrit Updater [ 12/Mar/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42018
Subject: LU-14513 osd: release o_guard before quota acquisition
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8842e1a98e7518dc5787eca12d57f2471a23a861

Comment by Gerrit Updater [ 21/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42018/
Subject: LU-14513 osd: release o_guard before quota acquisition
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 356248c8a14571049d0ae9eb09c5107397b92d65

Comment by Peter Jones [ 22/Jun/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:10:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.