[LU-2085] sanityn test_16 (fsx) ran over its Autotest time Created: 29/Jan/12 Updated: 15/Sep/15 Resolved: 15/Sep/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Li Wei (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | performance, zfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 2884 |
| Description |
|
Commit: 9337ac981608767e9d15c9cc51cb43ad42ffb42a (Jan 25, 2012)

== sanityn test 16: 2500 iterations of dual-mount fsx =================== 15:50:32 (1327535432)
Chance of close/open is 1 in 50
Seed set to 5432
fd 0: /mnt/lustre/fsxfile
fd 1: /mnt/lustre2/fsxfile
skipping zero size read
skipping zero size read
skipping zero size read
truncating to largest ever: 0x790c8c
truncating to largest ever: 0x7a6801
truncating to largest ever: 0x7dc953
truncating to largest ever: 0x81487d
truncating to largest ever: 0x9c7da8
000100 1327535647.851342 0 write 0x2b3f91 thru 0x2c0de1 (0xce51 bytes)
000200 1327535848.080671 1 write 0x14fa62 thru 0x15ef47 (0xf4e6 bytes)
truncating to largest ever: 0x9e4443
000300 1327535971.751220 1 read 0x1fdb8c thru 0x208c53 (0xb0c8 bytes)
000400 1327536143.120379 0 mapwrite 0x73beca thru 0x73f9aa (0x3ae1 bytes)
000500 1327536343.371779 1 mapread 0x03c3ab thru 0x045b62 (0x97b8 bytes)
000600 1327536576.819509 0 trunc from 0x8a8fa4 to 0xd6e090 (0x4c50ed bytes)
000700 1327536779.280336 0 trunc from 0x6b9883 to 0xaa9d52 (0x3f04d0 bytes)
truncating to largest ever: 0x9fb613
000800 1327537034.643260 0 mapread 0x62396f thru 0x62f99b (0xc02d bytes)
000900 1327537212.838845 0 read 0x32c1de thru 0x32ee26 (0x2c49 bytes)
001000 1327537438.173226 0 read 0x0b27aa thru 0x0be90d (0xc164 bytes)
001100 1327537653.305146 0 read 0x361731 thru 0x363077 (0x1947 bytes)
001200 1327537802.126101 1 write 0x214dcb thru 0x21ae71 (0x60a7 bytes)
001300 1327538007.965572 1 mapwrite 0x251482 thru 0x261322 (0xfea1 bytes)
001400 1327538174.912600 1 mapread 0x139172 thru 0x140b9f (0x7a2e bytes)
001500 1327538353.902998 1 mapwrite 0x613416 thru 0x61faec (0xc6d7 bytes)
001600 1327538611.778969 1 mapwrite 0x0a1504 thru 0x0ad80b (0xc308 bytes)
001700 1327538895.176592 0 trunc from 0x0bdf9c to 0x1ef680 (0x1316e5 bytes)
001800 1327539112.389673 0 trunc from 0x9dce84 to 0x10ac2e8 (0x6cf465 bytes) |
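For scale: the 100-operation milestones above are roughly 200s apart (000100 at 1327535647 to 001800 at 1327539112 is 3465s for 1700 operations, about 2s each), so the full 2500 iterations extrapolate to well over 5000s even if the pace held steady, and the truncate ranges grow with the file, so the later operations are unlikely to get faster. That is how a single fsx run can exhaust the Autotest time budget.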
| Comments |
| Comment by Li Wei (Inactive) [ 02/Feb/12 ] |
|
Commit: faefc49f0854987d29639437064e81bbc4556774 (Feb 1, 2012) |
| Comment by Li Wei (Inactive) [ 07/Feb/12 ] |
|
Commit: faefc49f0854987d29639437064e81bbc4556774 (Feb 1, 2012) |
| Comment by Li Wei (Inactive) [ 16/Feb/12 ] |
|
Commit: 04b51cd1011a74f81cc37264e76a2a5ff92d6f7b (Feb 8, 2012) |
| Comment by Li Wei (Inactive) [ 16/Feb/12 ] |
|
Commit: 0d7658e5cf6cded47cee838c9b659f640e4febce (Feb 9, 2012) |
| Comment by Li Wei (Inactive) [ 19/Feb/12 ] |
|
Commit: f42d375dfb2f30f64440ee8bc9f78a9a3e9a9adc (Feb 17, 2012)

In this session, sanityn test 16 ran for 3834s. |
| Comment by Li Wei (Inactive) [ 01/Mar/12 ] |
|
Commit: 92fdb8f552eb3f4133f1de651c8ae6c1a35c51d2 (Feb 23, 2012) |
| Comment by Andreas Dilger [ 04/Mar/12 ] |
|
I've modified test_16() to run fewer operations for review testing, so that it completes in "only" about 1000s instead of timing out after 7200s; otherwise all of the subtests after this one are skipped. The verbosity of the test has also been increased so that a timestamp is printed for every operation. This will show whether one operation type takes far more time than the others, and should help debug this problem more quickly. |
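For illustration, a minimal sketch of what such a change could look like in sanityn.sh. The FSXNUM default, the variable names, and the exact flag set are assumptions for the sketch, not the actual patch; the fsx here is Lustre's dual-file build, which takes a file on each mount as shown in the log above.

test_16() {
    local file1=$DIR1/$tfile
    local file2=$DIR2/$tfile

    # Hypothetical cap: 2500 operations overran the 7200s Autotest
    # limit on ZFS, so default to a count that finishes in ~1000s.
    local numops=${FSXNUM:-500}

    # -c 50: 1-in-50 chance of close/open (matches the log above)
    # -S 5432: fixed seed so runs are reproducible
    # -p 1: print a timestamped progress line for every operation,
    #       so an unusually slow operation type stands out
    fsx -c 50 -p 1 -S 5432 -N $numops $file1 $file2 ||
        error "fsx failed"
}
run_test 16 "dual-mount fsx with reduced operation count"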
| Comment by Andreas Dilger [ 14/Mar/12 ] |
|
Looking at the test results (which at least did not time out):

000034 1331655558.536392 1 mapread 0x140d20 thru 0x149017 (0x82f8 bytes)
000035 1331655560.687850 1 mapwrite 0x332aba thru 0x33f074 (0xc5bb bytes)
000036 1331655561.719549 0 write 0x9cfd1e thru 0x9da5c0 (0xa8a3 bytes)
000037 1331655561.730190 1 trunc from 0x9da5c1 to 0xe69187 (0x48ebc7 bytes)
000038 1331655570.962937 1 mapread 0x222fca thru 0x22c098 (0x90cf bytes)
000039 1331655571.012619 1 trunc from 0x48ebc7 to 0xace5fa (0x63fa34 bytes)
000040 1331655571.029985 1 mapread 0x476e02 thru 0x478953 (0x1b52 bytes)
000041 1331655571.036898 0 trunc from 0x63fa34 to 0x7a4e44 (0x165411 bytes)
000042 1331655579.471972 0 mapread 0x067b42 thru 0x076009 (0xe4c8 bytes)
000043 1331655579.511741 1 mapwrite 0x5f7798 thru 0x605ff1 (0xe85a bytes)
000044 1331655590.064530 1 mapwrite 0x5c79cb thru 0x5d718f (0xf7c5 bytes)
000045 1331655590.088253 1 write 0x4ea98b thru 0x4f1700 (0x6d76 bytes)

we can see that when the DLM lock switches from one client to the other (presumably forcing a cache flush and a transaction commit due to commit-on-cancel), the operation latency is very high, over 10s in some cases, and this happens frequently. It would be useful to know whether IOR SSF has similar performance problems with ZFS, or whether it is fine because the DLM lock gets split into disjoint extents and the threads can run without contention.

This would likely benefit from a ZIL, but we'd need to figure out how to handle out-of-order ZIL operation commit within the Lustre ordered transaction model. Possibly we can change osd-zfs to still return "success" to the BRW RPC, along with a ZIL record (possibly combined with the multi-slot client llog idea?) that allows the RPC to be replayed locally even if the client fails, and not bump the last_committed value until the proper transaction commit callbacks have run. That way the client's write is persistent on disk before the data is returned to the other client (without waiting for a full commit), but we don't risk data inconsistency if the writing client crashes and is unable to replay its write.

As a first step, it makes sense to disable commit-on-cancel and see whether that fixes the performance problem. |
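As a rough way of quantifying those stalls from the per-operation output, the log could be post-processed with awk. A sketch only: the file name fsx.log is hypothetical and the 5s threshold is arbitrary; the field layout (operation number, timestamp, fd, operation type) is the one shown above.

awk '$2 ~ /^[0-9]+\.[0-9]+$/ {
    # fields: $1=op number, $2=timestamp, $3=fd, $4=operation type
    if (prev != "" && $2 - prev > 5)
        printf "%.1fs stall before op %s (%s on fd %s)\n",
            $2 - prev, $1, $4, $3
    prev = $2
}' fsx.log

On the excerpt above this flags the 9.2s gap before op 000038, the 8.4s gap before 000042, and the 10.6s gap before 000044, each shortly after the active mount (fd) changes.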
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Li Wei (Inactive) [ 17/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/af0adc66-fedf-11e1-b4cd-52540035b04c This was b2_3 with OFD and ZFS OSTs. |
| Comment by Alex Zhuravlev [ 15/Sep/15 ] |
|
This will be solved by ZIL support in LU-4009. |