HSM _not only_ small fixes and to do list goes here
(LU-3647)
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Technical task | Priority: | Blocker |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HSM, zfs |
| Environment: | Patches submitted to autotest |
| Issue Links: | |
| Rank (Obsolete): | 9548 |
| Description |
|
From https://maloo.whamcloud.com/test_sets/0afc2c56-fc86-11e2-8ce2-52540035b04c: sanity-hsm test_21 seems to be failing a lot right now.
Error: 'wrong block number'
Failure Rate: 33.00% of last 100 executions [all branches]
== sanity-hsm test 21: Simple release tests == 23:18:20 (1375510700)
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.353933 s, 5.9 MB/s
sanity-hsm test_21: @@@@@@ FAIL: wrong block number
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4202:error_noexit() |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 20/Aug/13 ] |
|
"stat -c %b <file>" failed to return 0 after "lfs hsm_set --archived --exist <file>". I am currently investigating test's lustre debug-logs, but it misses HSM debug traces ... May be currently modified sanity-hsm tests to mimic future copytool behavior need to also use commands to wait for HSM actions to complete ? Like "wait_request_state $fid ARCHIVE SUCCEED" in case of concerned "lfs hsm_set --archived --exist <file>" when used as a replacement for "lfs hsm_archive <file>". Will try to setup a platform to reproduce problem, with HSM debug traces enabled on Client/MDS VMs, and running sanity-hsm tests in a loop. |
| Comment by Bruno Faccini (Inactive) [ 30/Aug/13 ] |
|
I am not able to reproduce the problem with current master, even by running sanity-hsm/test_21 in a loop. BTW, according to the Maloo reports, test_21 failures for 'wrong block number' stopped around Aug. 13th, and this seems to match the landing of the related patch. So my strong assumption is that this ticket can be closed because it is no longer relevant. |
| Comment by Bruno Faccini (Inactive) [ 02/Sep/13 ] |
|
To be re-opened in case of recurrence. |
| Comment by Nathaniel Clark [ 27/Nov/13 ] |
|
This has been happening on ZFS pretty regularly: |
| Comment by Andreas Dilger [ 27/Nov/13 ] |
|
Failed again on ZFS. https://maloo.whamcloud.com/test_sessions/58a20a92-5759-11e3-8d5c-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ] |
|
Could it be that there is some timing delay, more likely with ZFS, between hsm_release and st_blocks becoming 1? |
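If there is such a delay, one way to separate a timing issue from a real failure would be to poll st_blocks for a few seconds after the release instead of checking it once; a rough sketch (the 10-second budget is an arbitrary assumption):

$LFS hsm_release $f || error "could not release $f"
# retry for up to ~10s to see whether st_blocks eventually drops to 1
for i in $(seq 1 10); do
	blocks=$(stat -c "%b" $f)
	[ "$blocks" -eq 1 ] && break
	sleep 1
done
echo "st_blocks after release: $blocks (checked $i times)"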
| Comment by Andreas Dilger [ 02/Dec/13 ] |
|
It would make sense to me that "lfs hsm_release" would cause all of the DLM locks to be revoked from the client, so any stat from the client would return st_blocks == 1. This should be visible in the debug logs if the test is run with at least +dlmtrace enabled. In the short term I think it makes sense to fix sanity-hsm.sh test_21 to enable full debug for this test (using debugsave() and debugrestore(); a sketch of that toggle follows the proposed test body below) and to print the actual block number that is returned. Maybe it is as simple as ZFS returning 2 blocks for an external xattr or something, which might even happen with ldiskfs? It probably also makes sense to allow some small margin, like 5 blocks or so. This test is also bad because there are two places that print "wrong block number", and it isn't even clear which one is failing. I think it makes sense to submit a patch immediately that changes it to the following:
local fid=$(make_small $f)
local orig_size=$(stat -c "%s" $f)
local orig_blocks=$(stat -c "%b" $f)
check_hsm_flags $f "0x00000000"
$LFS hsm_archive $f || error "could not archive file"
wait_request_state $fid ARCHIVE SUCCEED
local blocks=$(stat -c "%b" $f)
[ $blocks -eq $orig_blocks ] || error "$f: wrong blocks after archive: $blocks != $orig_blocks"
local size=$(stat -c "%s" $f)
[ $size -eq $orig_size ] || error "$f: wrong size after archive: $size != $orig_size"
# Release and check states
$LFS hsm_release $f || error "$f: could not release file"
check_hsm_flags $f "0x0000000d"
blocks=$(stat -c "%b" $f)
[ $blocks -le 5 ] || error "$f: too many blocks after release: $blocks > 5"
size=$(stat -c "%s" $f)
[ $size -eq $orig_size ] || error "$f: wrong size after release: $size != $orig_size"
Maybe this will allow ZFS to pass, but even if it doesn't then we will have more information to debug the problem. |
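For reference, the debug toggle mentioned above could look roughly like this; debugsave() and debugrestore() are existing test-framework.sh helpers, but the exact placement inside test_21 is only a sketch:

test_21() {
	# save the current debug mask and enable full debugging for this test
	debugsave
	$LCTL set_param debug=-1

	# ... existing test_21 body goes here ...

	# put the debug mask back the way it was
	debugrestore
}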
| Comment by Nathaniel Clark [ 03/Dec/13 ] |
|
I'll work up and test a patch per Andreas's comment. |
| Comment by Bruno Faccini (Inactive) [ 03/Dec/13 ] |
|
Patch is at http://review.whamcloud.com/8467. |
| Comment by Andreas Dilger [ 04/Dec/13 ] |
|
Nathaniel, can you please submit a separate patch to disable this test for ZFS only? It can be tested, and possibly landed, in parallel with 8467 in case 8467 does not give us any relief. |
| Comment by Nathaniel Clark [ 06/Dec/13 ] |
|
In case 8467 doesn't fix 21 for ZFS, here is a patch to EXCEPT it: http://review.whamcloud.com/8503 |
| Comment by Nathaniel Clark [ 10/Dec/13 ] |
|
There was a failure after #8467 landed: https://maloo.whamcloud.com/test_sets/8d7a2b66-6134-11e3-bd66-52540035b04c. It is, so far, unique (and does not appear to be related to the patch). |
| Comment by Andreas Dilger [ 10/Dec/13 ] |
|
Hmm, looking at the test more closely now that we have a test failure, it seems the test itself is defective. The blocks count == 1 should not be expected after ARCHIVE (as was checked before the 8467 patch), only after RELEASE. The failed test reported:

# error "$f: wrong size after archive: $size != $orig_size"
wrong block number after archive: 4103 != 1

so it is setting "orig_blocks" == 1. That the test passes on ldiskfs is likely just a race: ldiskfs is a bit faster, and the client does not have the actual on-disk blocks count, so it returns "st_blocks=1" to userspace to avoid bugs in tar/rsync that think st_blocks=0 means the file has no data in it. The ZFS test is a bit slower to archive, so the client gets the actual st_blocks value back and (incorrectly) fails this check. I suspect that even with make_small using conv=fsync the client still has the original blocks count cached, so what is needed to fix this bug is to flush the locks on the client before getting orig_blocks:

cancel_lru_locks osc

That way, an accurate st_blocks value is returned for later testing. Nathaniel, can you please submit a patch ASAP so maybe we can land this by tomorrow. |
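A sketch of that change, assuming it goes right after make_small and before the first stat (variable names mirror the proposed test_21 body earlier in this ticket):

local fid=$(make_small $f)
# drop the client's OSC locks so the next stat fetches the real
# on-disk block count from the OST instead of a cached estimate
cancel_lru_locks osc
local orig_size=$(stat -c "%s" $f)
local orig_blocks=$(stat -c "%b" $f)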
| Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ] |
|
Andreas, I am not sure I understand you. Do you mean that on ZFS, to get the correct st_blocks from "stat -c %b" just after make_small(), we need to run "cancel_lru_locks osc" in between? |
| Comment by Nathaniel Clark [ 11/Dec/13 ] |
|
The OSS is definitely returning a different number of blocks after doing hsm_archive. I checked this on my local setup with wireshark, and OST_GETATTR replies with a different number of blocks than before the hsm_archive. It may be that the passing runs are the ones hitting the cached value on the client, and the failures are the ones getting the value from the OST. OST debug log:

1386629604.515513:0:1344:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START
1386629604.515525:0:1344:0:(ldlm_lockd.c:1253:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lustre-OST0001_UUID lock: ffff880072cb7740/0xb2c4c9585d128715 lrc: 2/0,0 mode: --/PR res: [0x21aa:0x0:0x0].0 rrc: 2 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0x59a0a522abcc8a58 expref: -99 pid: 1344 timeout: 0 lvb_type: 0
1386629604.515559:0:1344:0:(client.c:1473:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_002:lustre-OST0001_UUID:1344:1453973284418480:10.10.16.236@tcp:106
1386629604.515582:0:1344:0:(client.c:2123:ptlrpc_set_wait()) set ffff880072921ec0 going to sleep for 6 seconds
1386629604.516784:0:1344:0:(ofd_lvb.c:195:ofd_lvbo_update()) res: [0x21aa:0x0:0x0] updating lvb size: 0 -> 2097152
1386629604.516789:0:1344:0:(ofd_lvb.c:207:ofd_lvbo_update()) res: [0x21aa:0x0:0x0] updating lvb atime: 0 -> 1386629600
1386629604.516799:0:1344:0:(ofd_lvb.c:265:ofd_lvbo_update()) res: [0x100000000:0x21aa:0x0] updating lvb blocks from disk: 1 -> 4103
1386629604.516813:0:1344:0:(client.c:1840:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ll_ost00_002:lustre-OST0001_UUID:1344:1453973284418480:10.10.16.236@tcp:106 |
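One way to confirm the same thing without wireshark, assuming these ofd_lvbo_update messages are emitted under the dlmtrace debug mask, is to widen the mask on the OSS and grep the dumped kernel debug log:

# on the OSS, before running the archive
lctl set_param debug=+dlmtrace
# after the archive has completed, dump the kernel debug buffer
lctl dk /tmp/oss-debug.log
# look for the lvb block-count updates for the object in question
grep "updating lvb blocks" /tmp/oss-debug.log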
| Comment by Nathaniel Clark [ 13/Dec/13 ] |
|
ZFS reports the number of blocks actually written to disk so far, not the number of blocks the file will use once it is fully written, so the value really does change after the write completes in ZFS. It is not a locking or consistency issue in Lustre. |
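This delayed block accounting can be seen on a plain ZFS dataset, independent of Lustre; a rough illustration (the dataset path is made up, and the exact delay depends on the txg sync interval):

# write a 2MB file and watch st_blocks catch up once the txg syncs
dd if=/dev/urandom of=/tank/test/file bs=1M count=2 conv=fsync
stat -c "%b" /tank/test/file   # may still be small right after the write
sleep 10
stat -c "%b" /tank/test/file   # now reflects the blocks actually on disk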
| Comment by Nathaniel Clark [ 13/Dec/13 ] |
|
My previous patch actually skipped the wrong test. |
| Comment by Andreas Dilger [ 14/Dec/13 ] |
|
I didn't see this comment before I inspected the patch. However, this implies a different bug in ZFS. Namely, the "small_file" helper in sanity-hsm creates files with conv=fsync, so the fsync on close should flush all of the blocks from the client cache to disk on the OST. After that point the blocks count for the file should not change. Likewise, cancelling the locks on the client should flush all of the dirty blocks from cache to disk on the OST. This makes me wonder whether ZFS is implementing fsync properly at the server. |
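A quick client-side check of that assertion, assuming $f is a file on a Lustre mount and the usual test-framework helpers are available; if the fsync really forced the data to disk on the OST, the two block counts should match:

dd if=/dev/zero of=$f bs=1M count=2 conv=fsync
# drop cached OSC state so stat asks the OST for the real block count
cancel_lru_locks osc
b1=$(stat -c "%b" $f)
sleep 15   # give the OST time to commit anything still in flight
cancel_lru_locks osc
b2=$(stat -c "%b" $f)
[ "$b1" -eq "$b2" ] || echo "blocks changed after fsync: $b1 -> $b2"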
| Comment by Bruno Faccini (Inactive) [ 16/Dec/13 ] |
|
BTW, since my change #8467 also enables full debug logs and has already landed, we may be able to use the logs gathered during recent failures of sanity-hsm/test_20, which now show the "wrong block number after archive: …" symptom, to troubleshoot. |
| Comment by Nathaniel Clark [ 16/Dec/13 ] |
|
conv=fsync causes an MDS_SYNC to be sent to the MDT (but no syncs are sent to the OSTs); it does not cause the OST_BRW_ASYNC flag to be cleared in the OST_WRITE, so the OST does not think it needs to sync the data to disk. So even after changing conv=fsync to oflag=sync (which does cause OST_SYNCs to be sent), here is the wire traffic:

A stat of the file at this point only shows 2053 (the last OST_WRITE amount). For ldiskfs the wire traffic is:

Looking at the client code, when an OST_SYNC reply is processed the oa is ignored. |
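A quick way to see whether any OST_SYNC RPCs go out at all, without wireshark; this assumes the per-OSC stats counters use the usual lowercase RPC names such as ost_write and ost_sync:

# clear the per-OSC RPC counters, run the write, then check for ost_sync
lctl set_param osc.*.stats=clear
dd if=/dev/zero of=$f bs=1M count=2 oflag=sync
lctl get_param osc.*.stats | grep -E "ost_(write|sync)"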
| Comment by Andreas Dilger [ 17/Dec/13 ] |
|
Looks like two different bugs. The conv=fsync option should indeed result in a single OST_SYNC RPC being sent at the end of the writes. I suspect that this was previously skipped because OST writes were always synchronous (so fsync() was a no-op), and the async journal commit feature was developed on b1_8, so this wasn't fixed in the CLIO code when it landed. It should be noted that the Lustre OST_SYNC allows syncing a range of data on a single object, so the mapping of the VFS sync_page_range() method should pass its range into the RPC, and the server side should extract that range from the RPC (it might already do this). The second problem, the client not updating the blocks count based on the reply values, should also be investigated. I expect that the ZFS block count is not updated by the time the write is submitted, so the write reply does not carry the new block count back to the client. However, the subsequent OST_SYNC should result in the right blocks count being returned to the client and then cached under the DLM lock. |
| Comment by Nathaniel Clark [ 17/Dec/13 ] |
|
No OST_SYNC from fsync() is |
| Comment by Andreas Dilger [ 25/Dec/13 ] |
|
Patch http://review.whamcloud.com/8575 was landed, so hopefully this test will now pass. The related failures (IMHO even more serious) still need to be fixed. |
| Comment by nasf (Inactive) [ 27/Dec/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/459d8ae0-6e86-11e3-b713-52540035b04c |
| Comment by Jodi Levi (Inactive) [ 30/Dec/13 ] |
|
I am closing this ticket per email comments from Andreas:
Cheers, Andreas |