[LU-4201] Test failure sanityn test_51b: file size is 4096, should be 1024 Created: 04/Nov/13  Updated: 03/Mar/14  Resolved: 17/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Duplicate
is duplicated by LU-4358 conf-sanity test_32c: sha1sum verific... Resolved
Related
is related to LU-3321 2.x single thread/process throughput ... Resolved
is related to LU-4358 conf-sanity test_32c: sha1sum verific... Resolved
Severity: 3
Rank (Obsolete): 11420

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/dbf3795a-4447-11e3-9bd9-52540035b04c.

The sub-test test_51b failed with the following error:

== sanityn test 51b: layout lock: glimpse should be able to restart if layout changed == 12:45:37 (1383421537)
file size is 4096, should be 1024

Info required for matching: sanityn 51b



 Comments   
Comment by Andreas Dilger [ 27/Nov/13 ]

Looks like some kind of i_size/glimpse corruption if it is returning 4096 for the size?

Comment by Nathaniel Clark [ 04/Dec/13 ]

client debug_log:

00000080:00200000:0.0:1386141829.191212:0:30848:0:(file.c:1042:ll_merge_lvb()) [0x200002b10:0xa4:0x0] updating i_size 0
00000020:00010000:0.0:1386141829.191216:0:30848:0:(cl_lock.c:151:cl_lock_trace0()) wait lock: ffff88007d3353d8@(2 ffff880037908080 1 3 0 1 1 0)(ffff88007d1ba1e8/0/1) at cl_wait():1541
00000080:00200000:0.0:1386141829.191225:0:30848:0:(vvp_io.c:817:vvp_io_write_start()) write: [0, 1024)
00000080:00200000:0.0:1386141829.191230:0:30848:0:(xattr.c:452:ll_getxattr()) VFS Op:inode=[0x200002b10:0xa4:0x0](ffff88006be951b8), xattr security.capability
00000080:00200000:0.0:1386141829.191234:0:30848:0:(xattr.c:293:ll_getxattr_common()) VFS Op:inode=[0x200002b10:0xa4:0x0](ffff88006be951b8)
00000080:00200000:0.0:1386141829.191242:0:30848:0:(rw26.c:563:ll_write_begin()) Writing 0 of 0 to 1024 bytes
00000080:00200000:0.0:1386141829.191279:0:30848:0:(rw26.c:686:ll_write_end()) page@ffff88007014a400[3 ffff88006ce7ad78:0 ^(null)_ffff88007015ba00 1 0 1 ffff88006ec02088 (null) 0x0]
00000080:00200000:0.0:1386141829.191283:0:30848:0:(rw26.c:686:ll_write_end()) page@ffff88007015ba00[1 ffff88006f3eadc0:0 ^ffff88007014a400_(null) 1 0 1 ffff88007a03f0b0 (null) 0x0]
00000080:00200000:0.0:1386141829.191288:0:30848:0:(rw26.c:686:ll_write_end()) vvp-page@ffff88007014a4b0(0:0:0) vm@ffffea00017c1d30 2000000000080d 3:0 ffff88007014a400 0 lru
00000080:00200000:0.0:1386141829.191290:0:30848:0:(rw26.c:686:ll_write_end()) lov-page@ffff88007014a500
00000080:00200000:0.0:1386141829.191299:0:30848:0:(rw26.c:686:ll_write_end()) osc-page@ffff88007015bad8: 1< 0x845fed 0 0 - - > 2< 0 0 0 0x0 0x100 | (null) ffff88007ba9f618 ffff88007be00658 > 3< - (null) 0 0 0 > 4< 0 0 8 0 - | - - - - > 5< - - - - | 0 - | 0 - ->
00000080:00200000:0.0:1386141829.191301:0:30848:0:(rw26.c:686:ll_write_end()) end page@ffff88007014a400
00000080:00200000:0.0:1386141829.191302:0:30848:0:(rw26.c:686:ll_write_end()) queued page: 1.
00000080:00200000:0.0:1386141829.191305:0:30848:0:(vvp_io.c:727:vvp_io_write_commit()) commit async pages: 1, from 0, to 1024
...
00000080:00200000:0.0:1386141829.682137:0:30848:0:(file.c:1042:ll_merge_lvb()) [0x200002b10:0xa4:0x0] updating i_size 4096
00000080:00200000:0.0:1386141829.682141:0:30848:0:(vvp_io.c:831:vvp_io_write_start()) write: nob 1024, result: 1024
...
00000100:00100000:1.0:1386141830.186238:0:3477:0:(service.c:1925:ptlrpc_server_handle_req_in()) got req x1453467811029384
00000100:00080000:1.0:1386141830.186244:0:3477:0:(service.c:1084:ptlrpc_update_export_timer()) updating export LOV_OSC_UUID at 1386141830 exp ffff88006e627000
00000100:00100000:1.0:1386141830.186259:0:3477:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.10.17.196@tcp, seq: 2104
00000100:00100000:1.0:1386141830.186264:0:3477:0:(service.c:2068:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb01_000:LOV_OSC_UUID+4:31958:x1453467811029384:12345-10.10.17.196@tcp:106
00010000:00010000:1.0:1386141830.186273:0:3477:0:(ldlm_lockd.c:1841:ldlm_handle_gl_callback()) ### client glimpse AST callback handler ns: lustre-OST0000-osc-ffff88003767c800 lock: ffff88007d515500/0x9dfdb35531fe83b6 lrc: 3/0,0 mode: PW/PW res: [0x581f:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0xa186550ff6a5c654 expref: -99 pid: 30848 timeout: 0 lvb_type: 1
00000020:00010000:1.0:1386141830.186297:0:3477:0:(cl_object.c:298:cl_object_glimpse()) header@ffff88006ce7acf8[0x0, 3, [0x200002b10:0xa4:0x0] hash]
00000020:00010000:1.0:1386141830.186299:0:3477:0:(cl_object.c:298:cl_object_glimpse()) size: 4096 mtime: 1386141828 atime: 1386141828 ctime: 1386141828 blocks: 1
00000020:00010000:1.0:1386141830.186303:0:3477:0:(cl_object.c:298:cl_object_glimpse()) header@ffff88006f3ead40[0x0, 3, [0x100000000:0x581f:0x0] hash]
00000020:00010000:1.0:1386141830.186305:0:3477:0:(cl_object.c:298:cl_object_glimpse()) size: 4096 mtime: 1386141828 atime: 1386141828 ctime: 1386141828 blocks: 1
00000100:00100000:1.0:1386141830.186330:0:3477:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb01_000:LOV_OSC_UUID+4:31958:x1453467811029384:12345-10.10.17.196@tcp:106 Request procesed in 66us (118us total) trans 0 rc 0/0
00000100:00100000:1.0:1386141830.186336:0:3477:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.17.196@tcp, seq: 2104
...
00000080:00200000:0.0:1386141830.187693:0:30845:0:(file.c:1042:ll_merge_lvb()) [0x200002b10:0xa4:0x0] updating i_size 4096
Comment by Nathaniel Clark [ 06/Dec/13 ]

I believe this bas been fixed by http://review.whamcloud.com/7893 (LU-3321) by Jinshan

Comment by Andreas Dilger [ 09/Dec/13 ]

It looks like this bug is not fixed by the LU-3321 patch, since there are still failures seen with that patch applied:
https://maloo.whamcloud.com/test_sets/82292a72-6103-11e3-9f9b-52540035b04c

The test run for http://review.whamcloud.com/#/c/8507/1 is based on http://review.whamcloud.com/#/c/7383/33, which is based on "LU-4283 test: auster "-s" flag does not run SLOW tests" (http://review.whamcloud.com/8351 commit v2_5_52_0-14-gb0f4543), but the 7893 patch was landed back on 2013-12-02 as commit v2_5_51_0-97-g4fcbd1a.

Indeed, there are many sanityn test failures since 2013-12-02 even though one would expect that newly-submitted patches have been rebased to include this fix by now.

Comment by Andreas Dilger [ 09/Dec/13 ]

Has any attempt been made to root cause this problem, by looking at when the bug first appeared, and then trying to isolate it to a patch that introduced this regression? In some cases, this might be a much more efficient method of figuring out the source of the problem than trying to debug the defect from first principles.

Comment by Nathaniel Clark [ 09/Dec/13 ]

The earliest occurrence of this failure is actually on the above mentioned patch (http://review.whamcloud.com/7893) on 2013-11-02, but it doesn't land until 2013-12-02. There are several failures in early Nov, but they all seem to be version 7893 or patches that were dependent on it. and then failures start in earnest after it lands. I was looking at the older version of the code and saw that 7893 updated where the bug should be, and made a mistake.

Comment by Andreas Dilger [ 09/Dec/13 ]

Jinshan, it appears patch http://review.whamcloud.com/7893 introduced a regression in the ZFS testing. My first guess is that this has nothing to do with ZFS, and is just a race condition that is larger because the ZFS IO performance is slower than ldiskfs. At least I can't understand how a client-side change would introduce a bug in the ZFS code, unless it is handling the DLM/glimpse locking differently?

Nathaniel, in the debug logs shown above, is the "updating i_size 4096" a result of a glimpse/DLM lock returned from the OST? It is possible that the osd-zfs code is not saving i_size to disk properly and/or replying to the glimpse lock incorrectly.

Comment by Jinshan Xiong (Inactive) [ 10/Dec/13 ]

This should be the same problem as LU-4358. I will take a look.

Comment by Jinshan Xiong (Inactive) [ 10/Dec/13 ]

patch is at: http://review.whamcloud.com/8531

Comment by Nathaniel Clark [ 17/Dec/13 ]

Patch landed to master

Generated at Sat Feb 10 01:40:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.