[LU-16780] zfs's osd_sync() doesn't wait for commit callbacks Created: 27/Apr/23  Updated: 28/Apr/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ZFS's osd_sync() (implementing dt_sync()) can return before all related commit callbacks have been processed. This results in an inconsistent quota state: quota "usage" (read in lquota_disk_read()) returns the actual number, but "pending" is out of date, because it is only updated from the commit callback. The same space can therefore be counted in both "usage" and "pending".
As a result, qsd_acquire_local() eventually returns -EDQUOT:

	/* use latest usage */
	usage = lqe->lqe_usage;
	/* take pending write into account */
	usage += lqe->lqe_pending_write;
	if (space + usage <= lqe->lqe_granted - lqe->lqe_pending_rel) {
		lqe->lqe_pending_write += space;
		lqe->lqe_waiting_write -= space;
		rc = 0;
	} else if (lqe->lqe_edquot &&
		   (lqe->lqe_edquot_time > ktime_get_seconds() - 5)) {
		rc = -EDQUOT;
	} else {
		rc = -EAGAIN;
	}

This is a snippet from the log confirming the problem:

00040000:04000000:1.0:1682597449.976673:0:27241:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 0  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:952 waiting:1 req:0 usage: 0 qunit:1024 qtune:512 edquot:1 default:no
00040000:04000000:1.0:1682597449.994977:0:7285:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 219  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:953 waiting:1 req:0 usage: 219 qunit:1024 qtune:512 edquot:1 default:no
00040000:04000000:1.0:1682597450.084402:0:6415:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 879  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:299 waiting:1 req:0 usage: 879 qunit:1024 qtune:512 edquot:1 default:no
00040000:04000000:1.0:1682597450.094358:0:6415:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 879  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:74 waiting:1 req:0 usage: 879 qunit:1024 qtune:512 edquot:1 default:no
00040000:04000000:1.0:1682597450.186265:0:7285:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 953  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:74 waiting:1 req:0 usage: 953 qunit:1024 qtune:512 edquot:1 default:no
...
00040000:04000000:1.0:1682597450.186948:0:7285:0:(qsd_handler.c:774:qsd_op_begin0()) $$$ acquire quota failed:-122  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:74 waiting:1 req:0 usage: 953 qunit:1024 qtune:512 edquot:1 default:no
00040000:00000001:1.0:1682597450.186950:0:7285:0:(qsd_handler.c:830:qsd_op_begin0()) Process leaving (rc=18446744073709551494 : -122 : ffffffffffffff86)
...
00040000:04000000:1.0:1682597450.310321:0:6415:0:(qsd_entry.c:253:qsd_refresh_usage()) $$$ disk usage: 953  qsd:lustre-MDT0001 qtype:usr id:60000 enforced:1 granted: 1024 pending:0 waiting:0 req:0 usage: 953 qunit:1024 qtune:512 edquot:1 default:no
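
To make the arithmetic behind the -122 (-EDQUOT) result explicit, here is a minimal user-space sketch of the check quoted in the description. The struct lqe_sketch and acquire_local_sketch() names are hypothetical stand-ins, not Lustre code. It assumes that "pending" in the log corresponds to lqe_pending_write, that lqe_pending_rel is 0 (it is not printed in the log), that the request is for 1 unit of space, and that the edquot timestamp is recent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the quota entry; only the
 * fields used by the quoted check are kept. */
struct lqe_sketch {
	uint64_t granted;
	uint64_t pending_rel;
	uint64_t usage;
	uint64_t pending_write;
	uint64_t waiting_write;
	int      edquot;
	int64_t  edquot_time;
};

/* Same decision as the quoted fragment of qsd_acquire_local(). The
 * current time is passed in as "now" instead of calling
 * ktime_get_seconds(), so the sketch runs outside the kernel. */
static int acquire_local_sketch(struct lqe_sketch *lqe, uint64_t space,
				int64_t now)
{
	uint64_t usage;

	assert(lqe != NULL);

	/* use latest usage, then take pending writes into account */
	usage = lqe->usage + lqe->pending_write;

	if (space + usage <= lqe->granted - lqe->pending_rel) {
		lqe->pending_write += space;
		lqe->waiting_write -= space;
		return 0;
	}
	if (lqe->edquot && lqe->edquot_time > now - 5)
		return -EDQUOT;
	return -EAGAIN;
}
```

With the values from the failing log line (granted 1024, usage 953, pending 74) and a request of 1, the check is 1 + 953 + 74 = 1028 > 1024, so the request is refused. Once pending drops to 0, as in the last log line, the same request fits: 1 + 953 = 954 <= 1024.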


 Comments   
Comment by Gerrit Updater [ 28/Apr/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50790
Subject: LU-16780 osd: check all commit callbacks done
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb664bd74f601eacf91e420ce632f9de562c978b
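
The patch itself is only available in the Gerrit review above. As an illustration of the general idea in its subject line (a sync that does not return until all commit callbacks are done), here is a minimal user-space sketch using pthreads. It is not the actual patch and not Lustre or ZFS code; struct cb_tracker and all function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical tracker for outstanding commit callbacks. */
struct cb_tracker {
	pthread_mutex_t lock;
	pthread_cond_t  idle;     /* signalled when pending drops to 0 */
	unsigned int    pending;  /* callbacks registered but not yet run */
};

static void cb_tracker_init(struct cb_tracker *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->idle, NULL);
	t->pending = 0;
}

/* Called when a commit callback is attached to a transaction. */
static void cb_registered(struct cb_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	t->pending++;
	pthread_mutex_unlock(&t->lock);
}

/* Called at the end of a commit callback, after it has updated its
 * state (for quota, after "pending" has been adjusted). */
static void cb_done(struct cb_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	assert(t->pending > 0);
	if (--t->pending == 0)
		pthread_cond_broadcast(&t->idle);
	pthread_mutex_unlock(&t->lock);
}

/* Called by the sync path after the data is on disk: block until
 * every registered callback has finished. */
static void sync_wait_callbacks(struct cb_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->pending > 0)
		pthread_cond_wait(&t->idle, &t->lock);
	pthread_mutex_unlock(&t->lock);
}
```

In this sketch a caller would invoke sync_wait_callbacks() as the last step of its sync routine, so that anything derived from callback state is up to date by the time sync returns.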
