[LU-1850] replay-vbr test_1b: FAIL: open succeeded unexpectedly Created: 06/Sep/12  Updated: 12/Aug/16  Resolved: 28/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Peter Jones Assignee: Sarah Liu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4426

 Description   

https://maloo.whamcloud.com/test_sessions/a693a99c-f682-11e1-8eb0-52540035b04c

https://maloo.whamcloud.com/test_logs/04f30584-f686-11e1-8eb0-52540035b04c/show_text

[HOST:client-29vm6] [old_quota_type:3] [new_quota_type:ug3]
open /mnt/lustre failed: Stale NFS file handle (116)
quotacheck failed: Stale NFS file handle
metadata-updates : @@@@@@ FAIL: quotacheck has failed
 Trace dump:
 = /usr/lib64/lustre/tests/test-framework.sh:3638:error_noexit()
 = /usr/lib64/lustre/tests/test-framework.sh:3660:error()
 = /usr/lib64/lustre/tests/test-framework.sh:989:quota_save_version()
 = /u



 Comments   
Comment by Peter Jones [ 06/Sep/12 ]

Niu

Could you please look at this failure and see if you can work out why this has happened?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 10/Sep/12 ]

The error message indicates that /mnt/lustre isn't accessible (we got -ESTALE when trying to open it); it's not related to quota.

I think it was probably caused by the unhealthy testing environment left behind after replay-vbr failed.

We can see from the Maloo report:
1. replay-vbr failed;
2. /mnt/lustre became inaccessible, and the tests following replay-vbr failed with -ESTALE when accessing /mnt/lustre (from insanity through performance-sanity);
3. the testing environment recovered after node reinitialization (after performance-sanity).

I think this is a test script issue (we should make sure the testing environment is clean after each failed test), not a Lustre bug.
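For example, a post-test health check along these lines could catch the stale mount before the next test runs (just a minimal sketch, not actual test-framework code; the function name is made up, and $MOUNT is assumed to be the usual client mount point variable):

# sketch: verify the client mount is still usable after a failed sub-test
check_client_mount() {
    local mnt=${MOUNT:-/mnt/lustre}
    if ! stat "$mnt" >/dev/null 2>&1; then
        echo "$mnt is not accessible (likely ESTALE), environment is not clean"
        return 1
    fi
    return 0
}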

Comment by Peter Jones [ 10/Sep/12 ]

Thanks, Niu. Sarah will look at making the necessary test script improvements.

Comment by Sarah Liu [ 11/Sep/12 ]

Try unmounting and remounting the client to make sure the system is ready for the following tests.
The patch for b2_3 is here: http://review.whamcloud.com/#change,3929
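Roughly, the idea is shown below (this is only an illustration of the approach, not the contents of the patch; the function name is made up, and $MGSNID/$FSNAME are placeholders for however the client is normally mounted):

# sketch: recover a stale client mount before running the next test
remount_client_if_stale() {
    local mnt=${MOUNT:-/mnt/lustre}
    stat "$mnt" >/dev/null 2>&1 && return 0      # mount still healthy, nothing to do
    umount -f "$mnt" 2>/dev/null                 # drop the stale mount
    mount -t lustre "$MGSNID:/$FSNAME" "$mnt"    # remount the client
}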

Comment by Jian Yu [ 11/Sep/12 ]

The initial issue is that replay-vbr test_1b failed as follows:

CMD: client-29vm6 checkstat -v -a /mnt/lustre/d0.replay-vbr/d1/f.replay-vbr.1b
Can't lstat /mnt/lustre/d0.replay-vbr/d1/f.replay-vbr.1b: Stale NFS file handle
 replay-vbr test_1b: @@@@@@ FAIL: open succeeded unexpectedly 

Dmesg on Client client-29vm6:

[33466.651998] Lustre: DEBUG MARKER: checkstat -v -a /mnt/lustre/d0.replay-vbr/d1/f.replay-vbr.1b
[33466.659487] LustreError: 3642:0:(file.c:2328:ll_inode_revalidate_fini()) failure -116 inode 524300
[33466.724390] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_1b: @@@@@@ FAIL: open succeeded unexpectedly 

Dmesg on MDS client-29vm3:

LustreError: 8705:0:(llog_lvfs.c:610:llog_lvfs_create()) error looking up logfile 0x60:0x3f5bbfdd: rc -116
LustreError: 8705:0:(llog_cat.c:168:llog_cat_id2handle()) error opening log id 0x60:3f5bbfdd: rc -116
LustreError: 8705:0:(llog_obd.c:314:cat_cancel_cb()) Cannot find handle for log 0x60
LustreError: 8676:0:(llog_obd.c:387:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
LustreError: 8676:0:(llog_obd.c:214:llog_setup_named()) obd lustre-OST0003-osc-MDT0000 ctxt 2 lop_setup=ffffffffa03cbdf0 failed -116
LustreError: 8676:0:(osc_request.c:3281:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
LustreError: 8676:0:(osc_request.c:3298:__osc_llog_init()) osc 'lustre-OST0003-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' catid ffff880065dd78d0 rc=-116
LustreError: 8676:0:(osc_request.c:3300:__osc_llog_init()) logid 0x58:0x3f5bbfd2
LustreError: 8676:0:(osc_request.c:3328:osc_llog_init()) rc: -116
LustreError: 8676:0:(lov_log.c:243:lov_llog_init()) error osc_llog_init idx 3 osc 'lustre-OST0003-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' (rc=-116)
LustreError: 8676:0:(llog_lvfs.c:610:llog_lvfs_create()) error looking up logfile 0x59:0x3f5bbfd3: rc -116
LustreError: 8676:0:(llog_obd.c:214:llog_setup_named()) obd lustre-OST0004-osc-MDT0000 ctxt 2 lop_setup=ffffffffa03cbdf0 failed -116
LustreError: 8676:0:(osc_request.c:3281:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
LustreError: 8676:0:(osc_request.c:3298:__osc_llog_init()) osc 'lustre-OST0004-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' catid ffff880065dd78d0 rc=-116
LustreError: 8676:0:(osc_request.c:3300:__osc_llog_init()) logid 0x59:0x3f5bbfd3
LustreError: 8676:0:(osc_request.c:3328:osc_llog_init()) rc: -116
LustreError: 8676:0:(lov_log.c:243:lov_llog_init()) error osc_llog_init idx 4 osc 'lustre-OST0004-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' (rc=-116)
LustreError: 8706:0:(lov_log.c:155:lov_llog_origin_connect()) error osc_llog_connect tgt 3 (-107)
LustreError: 8706:0:(mds_lov.c:873:__mds_lov_synchronize()) lustre-OST0003_UUID failed at llog_origin_connect: -107
LustreError: 8706:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST0003_UUID sync failed -107, deactivating
LustreError: 8707:0:(lov_log.c:155:lov_llog_origin_connect()) error osc_llog_connect tgt 4 (-107)
LustreError: 8707:0:(mds_lov.c:873:__mds_lov_synchronize()) lustre-OST0004_UUID failed at llog_origin_connect: -107
LustreError: 8707:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST0004_UUID sync failed -107, deactivating
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_1b: @@@@@@ FAIL: open succeeded unexpectedly 
Lustre: DEBUG MARKER: replay-vbr test_1b: @@@@@@ FAIL: open succeeded unexpectedly

It was really a Lustre issue. We need to fix this Lustre issue first.

Comment by Niu Yawei (Inactive) [ 11/Sep/12 ]

Don't we have a ticket for the replay-vbr test_1b failure? If not, I think we should open another ticket for it, or change the description/summary of this ticket and fix it here.

Comment by Jian Yu [ 12/Sep/12 ]

Don't we have a ticket for the replay-vbr test_1b failure? If not, I think we should open another ticket for it, or change the description/summary of this ticket and fix it here.

I just searched for this failure on Jira and Maloo, and found that it had not occurred in the past. So I changed the summary of this ticket.

For the test script, I'm not sure whether trying to mount /mnt/lustre again will succeed after the above Lustre failure occurs. Li Wei, could you please review the above patch created by Sarah to see whether it will reduce the impact on the subsequent sub-tests/tests after the Lustre failure occurs? Thanks.

Comment by Li Wei (Inactive) [ 12/Sep/12 ]

OK, will do the review in a couple of hours.

Comment by Niu Yawei (Inactive) [ 12/Sep/12 ]

I highly suspect the replay-vbr test_1b failure is another instance of the LVM/VM flush problem, like LU-1012. From the log, it looks like the llog object creation wasn't committed before replay_barrier set the filesystem read-only.
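To illustrate the suspected race (a conceptual sketch only, not the real replay_barrier code in test-framework.sh; do_facet and $LCTL are the existing framework helpers, and $svc stands in for the MDT service name):

# sketch of the suspected race around the replay barrier
replay_barrier_sketch() {
    local facet=$1
    do_facet $facet "sync; sync; sync"               # flush whatever is already dirty
    # an llog object created after this point, or still sitting in the VM/LVM
    # cache, never reaches the disk image frozen below, so recovery later
    # fails with -116 (-ESTALE) as in the MDS log above
    do_facet $facet "$LCTL --device $svc readonly"   # barrier: no further commits
}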

Comment by Jian Yu [ 15/Sep/12 ]

Here are the historical Maloo reports for replay-vbr test 1b failures:
http://tinyurl.com/8tybkum

As we can see, the failure in this ticket does not occur regularly. Per Niu's analysis, this is very likely another instance of the LVM/VM flush issue, so we can lower the priority of this ticket.

We need to land the test script patch created by Sarah to reduce the impact on the subsequent sub-tests/tests after the failure occurs.

Comment by Sarah Liu [ 20/Sep/12 ]

Patch for master: http://review.whamcloud.com/#change,4063

Comment by Jodi Levi (Inactive) [ 28/Sep/12 ]

Reopening ticket to add Fix Version 2.3
