[LU-8135] sanity test_101g fails with 'not all RPCs are 16 MiB BRW rpcs' Created: 12/May/16  Updated: 06/Nov/18  Resolved: 21/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

autotest review-zfs


Issue Links:
Related
is related to LU-8139 sanity test_101g: 'max_pages_per_rpc ... Open
is related to LU-10239 Lustre crash (client): The first exte... Resolved
is related to LU-8632 Address of function 'page_count' is u... Resolved
is related to LU-7181 Submitting random writes using 4MB RPC Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 101g fails with

'not all RPCs are 16 MiB BRW rpcs' 

In the test log on the client, we can see the ‘File too large’ error returned when dd tries to write to the file system:

== sanity test 101g: Big bulk(4/16 MiB) readahead ==================================================== 11:11:42 (1462903902)
CMD: onyx-50vm8 /usr/sbin/lctl get_param obdfilter.*.brw_size |
			while read s; do echo ost1 \$s; done
CMD: onyx-50vm8 /usr/sbin/lctl get_param obdfilter.*.brw_size |
			while read s; do echo ost2 \$s; done
CMD: onyx-50vm8 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.brw_size=16M 		osd-*.lustre-OST*.brw_size=16M 2>&1
remount client to enable large RPC size
CMD: onyx-50vm1.onyx.hpdd.intel.com grep -c /mnt/lustre' ' /proc/mounts
Stopping client onyx-50vm1.onyx.hpdd.intel.com /mnt/lustre (opts:)
CMD: onyx-50vm1.onyx.hpdd.intel.com lsof -t /mnt/lustre
CMD: onyx-50vm1.onyx.hpdd.intel.com umount  /mnt/lustre 2>&1
Starting client: onyx-50vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-50vm7@tcp:/lustre /mnt/lustre
CMD: onyx-50vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-50vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-50vm7@tcp:/lustre /mnt/lustre
dd: error writing '/mnt/lustre/f101g.sanity': File too large
2+0 records in
1+0 records out
16777216 bytes (17 MB) copied, 0.978339 s, 17.1 MB/s
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00599718 s, 0.0 kB/s
0 RPCs
 sanity test_101g: @@@@@@ FAIL: not all RPCs are 16 MiB BRW rpcs 

We’ve had three occurrences of this failure since the test was introduced by http://review.whamcloud.com/19368 a little over a week ago; so far, it has only been seen on review-zfs-part-1:

2016-05-09 - https://testing.hpdd.intel.com/test_sets/145be8f2-1661-11e6-b5f1-5254006e85c2
2016-05-10 - https://testing.hpdd.intel.com/test_sets/f2b9b50c-1701-11e6-b5f1-5254006e85c2
2016-05-11 - https://testing.hpdd.intel.com/test_sets/73c7a9a4-179f-11e6-b5f1-5254006e85c2



 Comments   
Comment by Peter Jones [ 12/May/16 ]

Gu Zheng

Could you please advise on this failure?

Thanks

Peter

Comment by Li Xi (Inactive) [ 13/May/16 ]

It seems the following code returns -EFBIG:

	LASSERT(obj->oo_sa_hdl != NULL);
	LASSERT(oh->ot_tx != NULL);
	dmu_tx_hold_sa(oh->ot_tx, obj->oo_sa_hdl, 0);
	if (oh->ot_tx->tx_err != 0)
		GOTO(out, rc = -oh->ot_tx->tx_err);
00080000:00000001:0.0:1462903904.567380:0:13254:0:(osd_object.c:861:osd_declare_attr_set()) Process entered
00080000:00000001:0.0:1462903904.567381:0:13254:0:(osd_object.c:877:osd_declare_attr_set()) Process leaving via out (rc=18446744073709551589 : -27 : 0xffffffffffffffe5)
00080000:00000001:0.0:1462903904.567382:0:13254:0:(osd_object.c:918:osd_declare_attr_set()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00002000:00000001:0.0:1462903904.567383:0:13254:0:(ofd_io.c:1079:ofd_commitrw_write()) Process leaving via out_stop (rc=18446744073709551589 : -27 : 0xffffffffffffffe5)
00080000:00000001:0.0:1462903904.567384:0:13254:0:(osd_handler.c:282:osd_trans_stop()) Process entered
00040000:00000001:0.0:1462903904.567386:0:13254:0:(qsd_handler.c:1073:qsd_op_end()) Process entered
00040000:00000001:0.0:1462903904.567387:0:13254:0:(qsd_handler.c:1101:qsd_op_end()) Process leaving
00080000:00000010:0.0:1462903904.567387:0:13254:0:(osd_handler.c:296:osd_trans_stop()) kfreed 'oh': 384 at ffff880021666a00.
00080000:00000001:0.0:1462903904.567388:0:13254:0:(osd_handler.c:297:osd_trans_stop()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1462903904.567949:0:13254:0:(lustre_fid.h:740:fid_flatten32()) Process leaving (rc=252707909 : 252707909 : f100445)
00000020:00000001:0.0:1462903904.567951:0:13254:0:(lustre_fid.h:740:fid_flatten32()) Process leaving (rc=252707909 : 252707909 : f100445)
00000020:00000002:0.0:1462903904.567952:0:13254:0:(lu_object.c:162:lu_object_put()) Add ffff8800461c7178 to site lru. hash: ffff8800494dc900, bkt: ffff88004ced5118, lru_len: 2
00002000:00000001:0.0:1462903904.567954:0:13254:0:(ofd_grant.c:1268:ofd_grant_commit()) Process entered
00002000:00000001:0.0:1462903904.567955:0:13254:0:(ofd_grant.c:1323:ofd_grant_commit()) Process leaving
00002000:00000001:0.0:1462903904.567956:0:13254:0:(ofd_io.c:1138:ofd_commitrw_write()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00002000:00000001:0.0:1462903904.567957:0:13254:0:(ofd_io.c:1255:ofd_commitrw()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00000020:00000001:0.0:1462903904.567958:0:13254:0:(obd_class.h:1129:obd_commitrw()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
Comment by Jinshan Xiong (Inactive) [ 13/May/16 ]

Alex may provide some feedback on this issue.

Comment by Alex Zhuravlev [ 16/May/16 ]

this is because of the limit on a single tx:

	if (txh->txh_space_towrite + txh->txh_space_tooverwrite >
	    2 * DMU_MAX_ACCESS)
		err = SET_ERROR(EFBIG);

#define	DMU_MAX_ACCESS (64 * 1024 * 1024) /* 64MB */

ZFS is known for hugely overestimating space requirements... I guess we hit this limit with large writes.

Comment by Jinshan Xiong (Inactive) [ 16/May/16 ]

I remember that Matt once mentioned it's pretty safe to reduce spa_asize_inflation, which defaults to 24, to a more reasonable value; we could try that and see whether it fixes this problem. spa_asize_inflation is a module parameter that can be changed on the fly via '/sys/module/zfs/parameters/spa_asize_inflation'.
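For illustration, checking and lowering the parameter at runtime could look like the following (the value 6 is only an example, not a recommended setting):

	# read the current inflation factor (defaults to 24)
	cat /sys/module/zfs/parameters/spa_asize_inflation
	# lower it on the fly; 6 is an arbitrary example value
	echo 6 > /sys/module/zfs/parameters/spa_asize_inflation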

Comment by Jinshan Xiong (Inactive) [ 06/Sep/16 ]

After taking a further look, I realized that this ticket reveals a severe problem with space reservation in ZFS, especially when a large block size is used.

In the current client implementation, discontiguous pages can be sent in the same RPC for optimal throughput. However, if the pages in one RPC land in different blocks of the underlying ZFS, it has to reserve a huge amount of space in the transaction. With a 1MB ZFS block size and a 16MB RPC size, the worst case requires reserving 4GB in a single transaction, whereas current ZFS reports an error when a transaction size exceeds 64MB.
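As a rough sketch of the worst-case arithmetic (assuming 4 KiB pages):

	# a 16 MiB RPC split into 4 KiB pages
	echo $(( (16 << 20) / (4 << 10) ))        # 4096 pages
	# worst case: every page lands in a different 1 MiB block, so the
	# transaction must reserve 4096 * 1 MiB = 4 GiB, far beyond the
	# DMU_MAX_ACCESS = 64 MiB (2 * DMU_MAX_ACCESS check) quoted above
	echo $(( 4096 * (1 << 20) >> 30 ))GiB     # 4 GiB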

This potential issue exists even with a 1MB RPC size and a 1MB ZFS block size. I'm going to create a test case to demonstrate the problem.

To fix this problem, I propose a new parameter, sent from the OFD to the client, that limits the maximum number of chunks in a single RPC.

Comment by Patrick Farrell (Inactive) [ 07/Sep/16 ]

Jinshan -

Since you asked me to take a look... Nasty problem, an interesting side effect of the ZFS design. Your solution sounds good to me; it should solve the problem while still allowing the more common case (one or just a few chunks) to benefit from the large RPC size. I'll be curious to see the exact patch - I'm not sure how the server would communicate this limit to the client.

Comment by Gerrit Updater [ 07/Sep/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22369
Subject: LU-8135 osc: limits the number of chunks in write RPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a490d624c492cf6856623b2fe3e9c0afb1dbdde

Comment by Gerrit Updater [ 21/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22369/
Subject: LU-8135 osc: limits the number of chunks in write RPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f2aae8d80a73de7408668bbe569d5f4d8553efe

Comment by Peter Jones [ 21/Sep/16 ]

Landed for 2.9
