[LU-2547] test: recovery-small test_24a, test_24b: multiop didn't fail fsync: rc 0 Created: 28/Dec/12  Updated: 22/Dec/17  Resolved: 02/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: revzfs
Environment:

Server: 2.1.3-1nasS, centos 6.3, 2.6.32_279.2.1.el6
Client: 2.3.0-1nasC, sles11sp2, 3.0.42_0.7.3
1 mds, 2 oss, 2 clients.
Test was run from service331.


Attachments: nas-config.sh.lustre230.sles, nas-make.sh.lustre230.sles, recovery-small.24a.tgz
Severity: 3
Rank (Obsolete): 5971

 Description   

== recovery-small test 24a: fsync error (should return error) ======================================== 23:15:59 (1356678959)
multiop /mnt/nbp0-1/d0.recovery-small/d24/f.recovery-small.24a vOwy_wyc
TMPPIPE=/tmp/multiop_open_wait_pipe.9556
fail_loc=0x0
Connected clients:
service331
service331
service332
recovery-small test_24a: @@@@@@ IGNORE (bz5494): multiop didn't fail fsync: rc 0

test_logs tarball is attached: recovery-small.24a.tgz
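
The check this test makes can be sketched as follows: write to a file, call fsync, and report the return code, which the test expects to be an error. This is only a minimal Python sketch under stated assumptions, not the multiop implementation; the helper name `write_and_fsync` and the file name `f.demo` are hypothetical.

```python
import os
import tempfile


def write_and_fsync(path: str, payload: bytes = b"x") -> int:
    """Write payload to path, fsync it, and return 0 or the errno of the failure.

    Hypothetical helper for illustration; it is not part of the Lustre test suite.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(fd, payload)
        try:
            os.fsync(fd)  # raises OSError if the kernel reports a sync failure
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rc = write_and_fsync(os.path.join(tmp, "f.demo"))
        # The test in this ticket expects fsync to return an error, so an
        # rc of 0 under the test's conditions is the reported failure.
        print(f"fsync rc {rc}")
```

On a healthy local filesystem the helper returns 0, which is the same "rc 0" the test reports above when it expected an error.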



 Comments   
Comment by Peter Jones [ 29/Dec/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 31/Dec/12 ]

Jay, I tested the b2_3 branch and it always succeeds. Could you point me to the exact version that fails, or to a place where I can download it?

BTW, does this test always fail in your test environment?

Comment by Jay Lan (Inactive) [ 31/Dec/12 ]

I tested it three times and all failed in my test environment.

The source is at https://github.com/jlan/lustre-nas,
branch nas-2.3.0. Tag is 2.3.0-1nasC.

Comment by Lai Siyao [ 05/Jan/13 ]

Jay, it looks like the code can't compile against the 3.0 kernel yet. Could you list the patches you've applied?

Comment by Jay Lan (Inactive) [ 06/Jan/13 ]

Attached are the nas-config and nas-make scripts. You need to modify the nas-config script as follows:

  • set kversion_base to the correct kernel version in your environment,
  • set flavor, most likely "default" in your case,
  • make sure kversion and kdir are correct in your environment,
  • point the "--with-o2ib" parameter at your OFED kernel build environment,
  • comment out the module load for the MPI tests, since this test does not involve them.

Comment by Nathaniel Clark [ 04/Feb/13 ]

https://maloo.whamcloud.com/test_sets/f17e7f9c-6c8c-11e2-91d6-52540035b04c
pre-2.4 zfs

Comment by Jay Lan (Inactive) [ 01/Apr/13 ]

Nathaniel, is the link above meant for me to read? I cannot access it.

Comment by Nathaniel Clark [ 03/Apr/13 ]

Jay, sorry, that's a link to a failing autotest run.

Comment by Nathaniel Clark [ 22/Apr/13 ]

This hasn't failed with ldiskfs in 4 weeks, but is failing over 50% of the time with zfs.

Comment by Nathaniel Clark [ 22/Apr/13 ]

EXCEPT this test for zfs
http://review.whamcloud.com/6119

Comment by Keith Mannthey (Inactive) [ 13/May/13 ]

It looks like a patch landed, but I saw a zfs failure today that looks exactly like this.

https://maloo.whamcloud.com/test_sets/f425e1a6-bc12-11e2-b013-52540035b04c

Comment by Nathaniel Clark [ 07/Jun/13 ]

Patch to EXCEPT 24b also for ZFS
http://review.whamcloud.com/6587

Comment by Bruno Faccini (Inactive) [ 04/Jul/13 ]

Got an occurrence with recovery-small/test_24b during https://maloo.whamcloud.com/test_sets/f43c1ffc-e4ad-11e2-a950-52540035b04c.
When will http://review.whamcloud.com/6587 land?

Comment by Peter Jones [ 22/Aug/13 ]

Landed for 2.5

Comment by Nathaniel Clark [ 22/Aug/13 ]

Patch for b2_4 http://review.whamcloud.com/7424

Comment by Andreas Dilger [ 01/Oct/14 ]

recovery-small test_24a and test_24b are being skipped; the problem was not actually fixed.

Comment by Jay Lan (Inactive) [ 19/Oct/15 ]

Please close this ticket since the test was marked "always_except".

Comment by Peter Jones [ 19/Oct/15 ]

ok Jay

Comment by Andreas Dilger [ 20/Oct/15 ]

The whole point of the always_except label is that it means this test is being skipped, but the original bug has not actually been fixed. This ticket shouldn't be closed until the original problem is fixed (lack of error return to userspace on fsync) and the test is removed from the ALWAYS_EXCEPT list in recovery-small.sh. I verified that this test is still being skipped, but only for ZFS MDT.

Comment by Peter Jones [ 12/Aug/16 ]

Niu

Can you please check to see what needs to happen to get this test re-enabled?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 19/Aug/16 ]

Lustre fsync was semantically wrong before the fix in LU-4388: it didn't sync data on fsync at all (it got the semantics of fsync and fdatasync wrong), so it's no surprise that this test fails often.

00000080:00000001:4.0:1356678959.907382:0:21907:0:(obd_class.h:2061:md_sync()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:4.0:1356678959.907382:0:21907:0:(client.c:2323:__ptlrpc_req_finished()) Process entered
00000100:00000040:4.0:1356678959.907383:0:21907:0:(client.c:2335:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff8805bdf93800 x1422578972363063/t0(0) o44->lustre-MDT0000-mdc-ffff8806f504e800@10.151.25.187@o2ib:12/10 lens 448/408 e 0 to 0 dl 1356678997 ref 1 fl Complete:R/0/0 rc 0/0
00000100:00000001:4.0:1356678959.907386:0:21907:0:(client.c:2245:__ptlrpc_free_req()) Process entered
02000000:00000001:4.0:1356678959.907388:0:21907:0:(sec.c:1697:sptlrpc_cli_free_repbuf()) Process entered
02000000:00000010:4.0:1356678959.907388:0:21907:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 1024 at ffff8805d7a85000.
02000000:00000001:4.0:1356678959.907389:0:21907:0:(sec.c:1711:sptlrpc_cli_free_repbuf()) Process leaving
00000020:00000001:4.0:1356678959.907390:0:21907:0:(genops.c:963:class_import_put()) Process entered
00000020:00000040:4.0:1356678959.907390:0:21907:0:(genops.c:970:class_import_put()) import ffff880379630800 refcount=10 obd=lustre-MDT0000-mdc-ffff8806f504e800
00000020:00000001:4.0:1356678959.907391:0:21907:0:(genops.c:979:class_import_put()) Process leaving
02000000:00000010:4.0:1356678959.907392:0:21907:0:(sec_null.c:201:null_free_reqbuf()) kfreed 'req->rq_reqbuf': 512 at ffff8805d8599600.
02000000:00000001:4.0:1356678959.907394:0:21907:0:(sec.c:437:sptlrpc_req_put_ctx()) Process entered
02000000:00000001:4.0:1356678959.907394:0:21907:0:(sec.c:453:sptlrpc_req_put_ctx()) Process leaving
00000100:00000010:4.0:1356678959.907395:0:21907:0:(client.c:2299:__ptlrpc_free_req()) kfreed 'request': 928 at ffff8805bdf93800.
00000100:00000001:4.0:1356678959.907396:0:21907:0:(client.c:2300:__ptlrpc_free_req()) Process leaving
00000100:00000001:4.0:1356678959.907396:0:21907:0:(client.c:2339:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00020000:00000002:4.0:1356678959.907397:0:21907:0:(lov_object.c:787:lov_lsm_addref()) lsm ffff8805b90ed640 addref 2 by ffff880382188280.
00020000:00000002:4.0:1356678959.907398:0:21907:0:(lov_object.c:799:lov_lsm_decref()) lsm ffff8805b90ed640 decref 2 by ffff880382188280.
00000080:00000001:4.0:1356678959.907399:0:21907:0:(file.c:2121:ll_fsync()) Process leaving (rc=0 : 0 : 0)

The log above shows that fsync synced metadata only.
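
To pick the return codes out of a debug log like the one above, a small parser can help. The following is a minimal Python sketch that relies only on the "(file:line:func()) Process leaving (rc=N ..." pattern visible in these lines; the names `LEAVE_RE`, `leaving_return_codes`, and `SAMPLE` are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Matches the "(file:line:func()) Process leaving (rc=N" pattern seen in the log.
LEAVE_RE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"Process leaving \(rc=(?P<rc>-?\d+)"
)


def leaving_return_codes(lines):
    """Return (function, rc) pairs for each 'Process leaving (rc=...)' entry."""
    results = []
    for line in lines:
        match = LEAVE_RE.search(line)
        if match:
            results.append((match.group("func"), int(match.group("rc"))))
    return results


# Three lines copied from the log above; the middle one carries no rc.
SAMPLE = [
    "00000080:00000001:4.0:1356678959.907382:0:21907:0:"
    "(obd_class.h:2061:md_sync()) Process leaving (rc=0 : 0 : 0)",
    "00000100:00000001:4.0:1356678959.907396:0:21907:0:"
    "(client.c:2300:__ptlrpc_free_req()) Process leaving",
    "00000080:00000001:4.0:1356678959.907399:0:21907:0:"
    "(file.c:2121:ll_fsync()) Process leaving (rc=0 : 0 : 0)",
]

if __name__ == "__main__":
    for func, rc in leaving_return_codes(SAMPLE):
        print(f"{func} rc={rc}")
```

For the sample lines this reports md_sync and ll_fsync both leaving with rc 0, which matches the trace: ll_fsync returns success after md_sync completes.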

Given that fsync has been fixed since 2.6 (by LU-4388), I think we can simply re-enable these two tests and see whether they pass.

Comment by Gerrit Updater [ 19/Aug/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/22020
Subject: LU-2547 test: re-enable 24a/b of recovery-small
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8c6cdf073a7ae7bd0d9aafb4af288726f4d3bfe

Comment by Gerrit Updater [ 02/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22020/
Subject: LU-2547 test: re-enable 24a/b of recovery-small
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6912c6f8b1bce483915b3cb386d4695969d29407

Comment by Peter Jones [ 02/Sep/16 ]

Landed for 2.9

Generated at Sat Feb 10 01:26:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.