[LU-2304] Test failure sanityn test_16: dual-mount fsx data read error Created: 08/Nov/12 Updated: 14/Dec/12 Resolved: 08/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | NFBlocker | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5513 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9c6f1590-2978-11e2-8600-52540035b04c.
The sub-test test_16 failed with the following error in the test output:
Chance of close/open is 1 in 50
Seed set to 2417
fd 0: /mnt/lustre/f.sanityn.16
fd 1: /mnt/lustre2/f.sanityn.16
1: 1352342417.730543 MAPWRITE 0x32aa5e thru 0x33283c (0x7ddf bytes)
2: 1352342417.751680 READ 0x1f84c9 thru 0x200ba0 (0x86d8 bytes)
3: 1352342417.872135 WRITE 0x1b3674 thru 0x1bed98 (0xb725 bytes)
4: 1352342417.881507 MAPREAD 0xf1843 thru 0xf5284 (0x3a42 bytes)
5: 1352342417.893255 READ 0x224e35 thru 0x230af5 (0xbcc1 bytes)
6: 1352342417.900307 TRUNCATE UP from 0x33283d to 0x6979fe
7: 1352342417.918382 WRITE 0x5f4a1 thru 0x6e924 (0xf484 bytes)
8: 1352342417.967363 WRITE 0x4b29fa thru 0x4c0ff5 (0xe5fc bytes)
9: 1352342417.977098 MAPREAD 0x4164fe thru 0x4207e6 (0xa2e9 bytes)
10: 1352342418.100538 WRITE 0x2fad28 thru 0x303bb0 (0x8e89 bytes)
11: 1352342418.201485 TRUNCATE DOWN from 0x6979fe to 0x258c7a
12: 1352342418.219849 MAPREAD 0xfa3d6 thru 0x100b0f (0x673a bytes)
13: 1352342418.347736 WRITE 0x303e61 thru 0x30d252 (0x93f2 bytes) HOLE
14: 1352342418.353891 MAPWRITE 0x4e321 thru 0x5ce11 (0xeaf1 bytes)
15: 1352342418.394662 WRITE 0x896fe thru 0x93ead (0xa7b0 bytes) ***WWWW
16: 1352342418.400602 MAPWRITE 0x44df8f thru 0x452e00 (0x4e72 bytes)
17: 1352342418.419486 WRITE 0x25ac40 thru 0x25c836 (0x1bf7 bytes)
18: 1352342418.423533 WRITE 0x45f04d thru 0x4698a9 (0xa85d bytes) HOLE
19: 1352342418.483128 TRUNCATE DOWN from 0x4698aa to 0x17f453
20: 1352342418.725636 MAPWRITE 0x9302ea thru 0x93fc06 (0xf91d bytes)
21: 1352342418.747719 MAPREAD 0x222ff1 thru 0x232993 (0xf9a3 bytes)
22: 1352342418.800646 MAPREAD 0x4a069d thru 0x4a6124 (0x5a88 bytes)
23: 1352342418.826136 WRITE 0x20567a thru 0x20c51b (0x6ea2 bytes)
24: 1352342418.885348 WRITE 0x2e5f90 thru 0x2e63cd (0x43e bytes)
25: 1352342418.893594 MAPREAD 0x93b057 thru 0x93fc06 (0x4bb0 bytes)
26: 1352342418.954895 MAPWRITE 0x3ed692 thru 0x3f6eaa (0x9819 bytes)
27: 1352342418.998428 MAPREAD 0x32aa46 thru 0x32ef20 (0x44db bytes)
28: 1352342419.095306 READ 0x917b3 thru 0x97e3f (0x668d bytes) ***RRRR***
Info required for matching: sanityn 16 |
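To narrow down which earlier write a failing read depends on, the fsx operation log can be parsed and checked for overlapping byte ranges. The following is a minimal sketch, not part of fsx or the Lustre test suite; `parse_ops` and `writes_overlapping` are hypothetical helper names, and the regular expression assumes the line format shown in the log above. It ignores TRUNCATE entries, so it only gives a first approximation. For this log, the failing READ in entry 28 (0x917b3 thru 0x97e3f) overlaps the WRITE in entry 15 (0x896fe thru 0x93ead), the entry flagged with ***WWWW.

```python
import re

# Assumed line format, taken from the log above, e.g.
#   "15: 1352342418.394662 WRITE 0x896fe thru 0x93ead (0xa7b0 bytes)"
# TRUNCATE entries do not match this pattern and are skipped.
OP_RE = re.compile(
    r"(\d+): (\d+\.\d+) (MAPWRITE|MAPREAD|WRITE|READ) "
    r"(0x[0-9a-f]+) thru (0x[0-9a-f]+) \((0x[0-9a-f]+) bytes\)"
)


def parse_ops(log_text):
    """Return read/write operations as dicts with integer offsets."""
    ops = []
    for m in OP_RE.finditer(log_text):
        start, end, size = (int(m.group(i), 16) for i in (4, 5, 6))
        ops.append({"seq": int(m.group(1)), "op": m.group(3),
                    "start": start, "end": end, "size": size})
    return ops


def writes_overlapping(ops, read_seq):
    """Sequence numbers of earlier writes whose range overlaps the given read."""
    read = next(o for o in ops if o["seq"] == read_seq)
    return [
        o["seq"] for o in ops
        if o["seq"] < read_seq
        and o["op"] in ("WRITE", "MAPWRITE")
        and o["start"] <= read["end"] and read["start"] <= o["end"]
    ]
```

Both offsets in each entry are inclusive, so `end - start + 1` should equal the reported byte count; this is a cheap consistency check on the parse.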
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 08/Nov/12 ] |
|
I will take a look at this. |
| Comment by Andreas Dilger [ 08/Nov/12 ] |
|
Also failed in: |
| Comment by Andreas Dilger [ 08/Nov/12 ] |
|
Debugging patch to print the fd number in the log dump for multi-fd fsx: http://review.whamcloud.com/4498 |
| Comment by Bob Glossman (Inactive) [ 19/Nov/12 ] |
|
Also failed in: |
| Comment by Jinshan Xiong (Inactive) [ 20/Nov/12 ] |
|
From the log, it seems the lock was canceled but NO write RPC was issued. I'm trying to reproduce this issue on toro; if I can't, I will work out a debug patch for this problem. |
| Comment by Jinshan Xiong (Inactive) [ 20/Nov/12 ] |
|
It turns out this problem is due to cl_lock again: a [0,EOF) truncate lock was matched, so LDLM_FL_DISCARD_DATA was (wrongly) transmitted to cancel a write mode lock, which then caused data corruption. This problem is easier to see on wide-stripe files. I will work out a patch soon. |
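The effect described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not Lustre code: `ToyClient`, `cancel_lock` and the dict standing in for the server are all hypothetical, and the only thing modeled is that canceling a write lock with a discard flag drops dirty cached data without sending any write, which matches the earlier observation that the lock was canceled but no write RPC was issued.

```python
class ToyClient:
    """Hypothetical client cache holding dirty data under a write lock."""

    def __init__(self):
        self.dirty = {}  # offset -> bytes not yet written to the server

    def write(self, offset, data):
        # Data is only cached here; nothing reaches the server yet.
        self.dirty[offset] = data

    def cancel_lock(self, server, discard_data):
        """Cancel the write lock; return the number of simulated write RPCs."""
        rpcs = 0
        if not discard_data:
            # Normal cancel: flush dirty data before giving up the lock.
            for offset, data in sorted(self.dirty.items()):
                server[offset] = data
                rpcs += 1
        # With discard_data set (the toy stand-in for LDLM_FL_DISCARD_DATA),
        # the dirty data is dropped without being written.
        self.dirty.clear()
        return rpcs
```

In this model, a second client reading from the server after a cancel with `discard_data=True` sees stale contents, which is the kind of read mismatch fsx reports on the dual mount.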
| Comment by Jinshan Xiong (Inactive) [ 21/Nov/12 ] |
|
Patch is at: http://review.whamcloud.com/4651 |
| Comment by Peter Jones [ 08/Dec/12 ] |
|
Landed for 2.4 |