[LU-988] sanity.sh subtest test_39j failed with "mtime is lost on close" Created: 12/Jan/12  Updated: 23/Mar/23  Resolved: 05/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4568

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://maloo.whamcloud.com/test_sets/1ae2fcae-3d32-11e1-9a65-5254004bbbd3
https://maloo.whamcloud.com/sub_tests/fa399b88-3c42-11e1-a222-5254004bbbd3
https://maloo.whamcloud.com/test_sets/28a02056-3cdd-11e1-9a65-5254004bbbd3

The sub-test test_39j failed with the following error:

mtime is lost on close: 1326363129, should be 1326363130

Info required for matching: sanity 39j



 Comments   
Comment by nasf (Inactive) [ 05/Feb/12 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/c1550c2c-4ff6-11e1-9ebc-5254004bbbd3

Comment by nasf (Inactive) [ 09/Feb/12 ]

Another failure instance:
https://maloo.whamcloud.com/test_sets/d62775e6-525c-11e1-aa4b-5254004bbbd3

Comment by Andreas Dilger [ 10/Feb/12 ]

https://maloo.whamcloud.com/test_sets/78e215b0-5259-11e1-aa4b-5254004bbbd3
https://maloo.whamcloud.com/test_sets/e47f8490-542c-11e1-97e1-5254004bbbd3

Comment by Li Wei (Inactive) [ 16/Feb/12 ]

https://maloo.whamcloud.com/test_sets/7a0e2f22-5840-11e1-9df1-5254004bbbd3

Comment by Li Wei (Inactive) [ 26/Feb/12 ]

https://maloo.whamcloud.com/test_sets/2b8d2222-5f16-11e1-ab6b-5254004bbbd3

Comment by Zhenyu Xu [ 08/Mar/12 ]

In a failure instance https://maloo.whamcloud.com/test_sessions/43b80858-68ee-11e1-91b4-5254004bbbd3
As client write the test file f39j, the ctime/mtime from servers wobbled a little bit, rewinded 1 seconds in some cases.

158313:00020000:00000002:0.0:1331173465.105834:0:25793:0:(lov_merge.c:135:lov_merge_lvb()) merged: 2097152 1331173465 1331173463 1331173465 0
167853:00020000:00000002:0.0:1331173465.114431:0:25794:0:(lov_merge.c:135:lov_merge_lvb()) merged: 2097152 1331173465 1331173463 1331173464 2048
170695:00020000:00000002:0.0:1331173465.120790:0:25795:0:(lov_merge.c:135:lov_merge_lvb()) merged: 2097152 1331173465 1331173463 1331173465 2048
240587:00020000:00000002:0.0:1331173465.185512:0:25811:0:(lov_merge.c:135:lov_merge_lvb()) merged: 2097152 1331173464 1331173463 1331173465 4096
Comment by Zhenyu Xu [ 08/Mar/12 ]

patch tracking at http://review.whamcloud.com/2274

Comment by Andreas Dilger [ 13/Mar/12 ]

Hit this again:
https://maloo.whamcloud.com/test_sets/d958bd22-6cd0-11e1-9174-5254004bbbd3

Bobijam, do you know why the time packed into the LVB is going backward? Is it because the clock in the VM itself is going backward, or because the timestamps on the inode are being lost?

Comment by Zhenyu Xu [ 13/Mar/12 ]

I don't have evidence that the inode timestamps on server are lost, it is most likely due to the clock in the VM drifted a little bit backward, I don't have strong evidence either, except for the "lov_merge.c" log I posted in my previous comment.

Comment by Zhenyu Xu [ 13/Mar/12 ]

I'll push a debugging patch to print out inode timestamp on OSTs in lov_merge_lvb.

Comment by Jian Yu [ 29/Mar/12 ]

Again:
https://maloo.whamcloud.com/test_sets/58508926-7a09-11e1-9d2a-5254004bbbd3

Comment by Jinshan Xiong (Inactive) [ 27/Apr/12 ]

another occurrence: https://maloo.whamcloud.com/test_sets/d60a69fc-9040-11e1-98a1-525400d2bfa6

Comment by Xuezhao Liu [ 25/May/12 ]

Another failure case:
https://maloo.whamcloud.com/test_sessions/d3d325bc-a610-11e1-8432-52540035b04c

Comment by nasf (Inactive) [ 28/May/12 ]

another failure:

https://maloo.whamcloud.com/test_sets/4193f5ec-a8ad-11e1-9ad2-52540035b04c

Comment by Ian Colle (Inactive) [ 01/Jun/12 ]

This seems to be happening more and more often:

May 28th - https://maloo.whamcloud.com/sub_tests/08d3c54e-a898-11e1-9ad2-52540035b04c
May 28th - https://maloo.whamcloud.com/sub_tests/6bd2ea3e-a8ad-11e1-9ad2-52540035b04c
May 29th - https://maloo.whamcloud.com/sub_tests/b923d0cc-aa02-11e1-b148-52540035b04c
May 29th - https://maloo.whamcloud.com/sub_tests/6229d398-aa0c-11e1-bd84-52540035b04c
May 30th - https://maloo.whamcloud.com/sub_tests/9dd49fc0-aa40-11e1-971d-52540035b04c
May 31st - https://maloo.whamcloud.com/sub_tests/45e6cf7e-ab60-11e1-8e7f-52540035b04c
June 1st - https://maloo.whamcloud.com/sub_tests/10d71476-ac17-11e1-9b8f-52540035b04c
June 1st - https://maloo.whamcloud.com/sub_tests/f42ea730-ac2a-11e1-a9bb-52540035b04c

Entire list here: https://maloo.whamcloud.com/sub_tests/query?&test_set%5Btest_set_script_id%5D=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test%5Bsub_test_script_id%5D=0831b228-32be-11e0-b685-52540025f9ae&sub_test%5Bstatus%5D=FAIL&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

Comment by Zhenyu Xu [ 03/Jun/12 ]

Got an log from another hit (https://maloo.whamcloud.com/test_sets/b4fc6bd0-acc3-11e1-a542-52540035b04c)

client2 log sanity.test_39j.debug_log.client-26vm2.1338628522.log

[ cancel osc lock ]
240482:00020000:00000002:0.0:1338628522.238524:0:27748:0:(lov_merge.c:103:lov_merge_lvb_kms()) MDT FID [0x200000400:0x287:0x0] on OST[1]: s=2097152 m=1338628520 a=1338628520 c=1338628522 b=0

240484:00020000:00000002:0.0:1338628522.238526:0:27748:0:(lov_merge.c:138:lov_merge_lvb()) merged for FID [0x200000400:0x287:0x0] s=2097152 m=1338628521 a=1338628520 c=1338628522 b=4096

Comment by Zhenyu Xu [ 09/Jun/12 ]

m=1338628521 a=1338628520 c=1338628522 shows that the mtime and ctime are not consistent on OST object, will push another debug patch to show what m/ctime used on OSS side.

Comment by Ian Colle (Inactive) [ 12/Jun/12 ]

Hit it again: https://maloo.whamcloud.com/test_sets/e9396c4a-b493-11e1-bdae-52540035b04c

Comment by Ian Colle (Inactive) [ 15/Jun/12 ]

And again:
https://maloo.whamcloud.com/test_sets/5637ce7a-b6a8-11e1-9045-52540035b04c

Comment by Keith Mannthey (Inactive) [ 18/Jun/12 ]

And again:
https://maloo.whamcloud.com/test_sets/611985be-b7a1-11e1-8e55-52540035b04c

Comment by Peter Jones [ 25/Jun/12 ]

Diagnostic patch landed on June 19th so need a hit after then

Comment by Zhenyu Xu [ 26/Jun/12 ]

from https://maloo.whamcloud.com/sub_tests/c70a1fde-bf65-11e1-acb7-52540035b04c

OST write time extracted from OST debug log

$ grep "write: s=" sanity.test_39j.debug_log.client-27vm4.1340673637.log
32034:00002000:00000002:0.0:1340673636.990185:0:3765:0:(filter_io_26.c:765:filter_commitrw_write()) FID [0x6b:0x0:0x0] to write: s=0 m=1340673636 a=1340673635 c=1340673636 b=1228800
32155:00002000:00000002:0.0:1340673637.038122:0:3765:0:(filter_io_26.c:784:filter_commitrw_write()) FID [0x6b:0x0:0x0] after write: s=1048576 m=1340673636 a=1340673635 c=1340673636 b=2048
32242:00002000:00000002:0.0:1340673637.070452:0:3768:0:(filter_io_26.c:765:filter_commitrw_write()) FID [0x6b:0x0:0x0] to write: s=0 m=1340673636 a=1340673635 c=1340673636 b=2097152
32271:00002000:00000002:0.0:1340673637.096759:0:3768:0:(filter_io_26.c:784:filter_commitrw_write()) FID [0x6b:0x0:0x0] after write: s=2097152 m=1340673636 a=1340673635 c=1340673636 b=4096 ===========> the object write time is 1340673636

let's check client log

*extent lock cl_wait()=>ccc_lock_state()=>cl_object_attr_get()=>lov_attr_get_raid0()
00020000:00000002:0.0:1340673636.893853:0:27817:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=0 m=1340673635 a=1340673635 c=1340673635 b=0
00020000:00000002:0.0:1340673636.893857:0:27817:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=0 m=1340673635 a=1340673635 c=1340673635 b=0

*glimpse lock part 1: cl_glimpse_lock()->cl_lock_request()=>cl_wait()
00020000:00000002:0.0:1340673637.030919:0:27819:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673636 a=1340673635 c=1340673636 b=0 =========> local inode lvb values
00020000:00000002:0.0:1340673637.030924:0:27819:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673637 a=1340673635 c=1340673637 b=0 =========> OST[1] lvb values have been updated to 37/35/37

*glimpse lock part 2: cl_glimpse_lock()>cl_merge_lvb()>ll_merge_lvb()->lov_merge_lvb()
00020000:00000001:0.0:1340673637.030944:0:27819:0:(lov_merge.c:136:lov_merge_lvb()) Process entered
00020000:00000002:0.0:1340673637.030945:0:27819:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673635 a=1340673635 c=1340673635 b=0 =======> ll_merge_lvb() initial values are values set on MDT object
00020000:00000002:0.0:1340673637.030948:0:27819:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673637 a=1340673635 c=1340673637 b=0
00020000:00000002:0.0:1340673637.030950:0:27819:0:(lov_merge.c:144:lov_merge_lvb()) merged for FID [0x200000400:0x263:0x0] s=2097152 m=1340673637 a=1340673635 c=1340673637 b=0

*glimpse lock part 1
00020000:00000002:0.0:1340673637.074620:0:27820:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673637 a=1340673635 c=1340673637 b=0
00020000:00000002:0.0:1340673637.074624:0:27820:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673637 a=1340673635 c=1340673636 b=2048 =========> glimpse lock get a mismatched mtime and ctime

*glimpse lock part 2
00020000:00000002:0.0:1340673637.074638:0:27820:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673635 a=1340673635 c=1340673635 b=0
00020000:00000002:0.0:1340673637.074641:0:27820:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673637 a=1340673635 c=1340673636 b=2048
00020000:00000002:0.0:1340673637.074643:0:27820:0:(lov_merge.c:144:lov_merge_lvb()) merged for FID [0x200000400:0x263:0x0] s=2097152 m=1340673637 a=1340673635 c=1340673636 b=2048

*glimpse lock part 2, glimpse lock state unchanged
00020000:00000002:0.0:1340673637.084563:0:27821:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673635 a=1340673635 c=1340673637 b=2048 =======> MDT object changed ctime
00020000:00000002:0.0:1340673637.084569:0:27821:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673637 a=1340673635 c=1340673636 b=2048
00020000:00000002:0.0:1340673637.084571:0:27821:0:(lov_merge.c:144:lov_merge_lvb()) merged for FID [0x200000400:0x263:0x0] s=2097152 m=1340673637 a=1340673635 c=1340673637 b=2048

*local locks cancelled
00000001:02000400:0.0:1340673637.086290:0:27822:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: cancel_lru_locks osc start

*glimpse lock part 1: request glimpse lock
00020000:00000002:0.0:1340673637.158531:0:27837:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673637 a=1340673635 c=1340673637 b=2048
00020000:00000002:0.0:1340673637.158535:0:27837:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673636 a=1340673635 c=1340673636 b=4096

*glimpse lock part 2
00020000:00000002:0.0:1340673637.158551:0:27837:0:(lov_merge.c:74:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] initial value: s=2097152 m=1340673635 a=1340673635 c=1340673637 b=2048
00020000:00000002:0.0:1340673637.158555:0:27837:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT FID [0x200000400:0x263:0x0] on OST[1]: s=2097152 m=1340673636 a=1340673635 c=1340673636 b=4096
00020000:00000002:0.0:1340673637.158557:0:27837:0:(lov_merge.c:144:lov_merge_lvb()) merged for FID [0x200000400:0x263:0x0] s=2097152 m=1340673636 a=1340673635 c=1340673637 b=4096

Comment by Zhenyu Xu [ 27/Jun/12 ]

patch tracking at http://review.whamcloud.com/3200

LU-988 clio: use OSC object's c/mtime when build write RPC

When building OST write RPC, inode's c/mtime is out-dated until
lov_merge_lvb() is called, so we need use OSC object's c/mtime to
set OST object's c/mtime.

Comment by nasf (Inactive) [ 30/Jun/12 ]

another failure instance:
https://maloo.whamcloud.com/test_sets/0b6327c0-c2fe-11e1-b851-52540035b04c

Comment by Gerrit Updater [ 17/Oct/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48887
Subject: LU-988 doc: add lctl-lcfg man pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8bb54c9c3887a657a460074a85c603e073c556c5

Comment by Gerrit Updater [ 23/Mar/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50387
Subject: LU-988 doc: move lnet pages, remove old docs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d337bfb3917dcd34900d2953c8e6801278b7b3fa

Generated at Sat Feb 10 01:12:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.