[LU-7190] sanity-lfsck test_18a: FAIL: (6.1) Expect 1 fixed on mds1, but got: 0
Created: 22/Sep/15  Updated: 12/May/16  Resolved: 06/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Minor
Reporter: Jian Yu
Assignee: nasf (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre Build: https://build.hpdd.intel.com/job/lustre-master/3179
Distro/Arch: RHEL7.1/x86_64


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-lfsck test 18a failed as follows:

CMD: shadow-17vm12 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
 sanity-lfsck test_18a: @@@@@@ FAIL: (6.1) Expect 1 fixed on mds1, but got: 0 

Maloo report: https://testing.hpdd.intel.com/test_sets/bf6d7fc0-5a65-11e5-9147-5254006e85c2



 Comments   
Comment by Joseph Gmitter (Inactive) [ 22/Sep/15 ]

Hi Fan Yong,
Could you have a look at this?
Thanks.
Joe

Comment by nasf (Inactive) [ 29/Sep/15 ]

The failure is caused by an unexpected MDT-OST connection issue. During the second-phase scan, the layout LFSCK slave engine on the OST periodically queries the MDT for the master engine's status. Sometimes the query RPC fails, possibly because of network trouble or a problem on the MDS node. So that LFSCK can make progress, the slave engine does not wait forever; instead, it assumes that the master engine has exited without notifying (or after failing to notify) the slave engine. The slave engine then exits as well and cleans up the LFSCK environment on the OST, including the OST-object access bitmap that is used to find orphan OST-objects.

However, the assumption that the master engine has exited may be wrong. If the master engine has not exited, and the network between the MDS and OSS recovers after the slave engine has exited, the master engine will still try to find orphan OST-objects during its second-phase scan. But because the slave engine has already exited and released the OST-object access bitmap, the master engine has no way to find them. That is why test_18a failed.
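
The race described above can be illustrated with a small toy model. This is a sketch only, not Lustre code: the class `SlaveEngine`, the knob `max_failed_queries`, and the helpers `find_orphans` and `run` are hypothetical names, and the bitmap is simplified to a Python set of object IDs that were referenced during the first-phase scan. Tolerating a few failed queries is shown as one possible mitigation; it is not necessarily what the patch for this ticket does.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SlaveEngine:
    """Toy stand-in for the layout LFSCK slave engine on one OST."""
    max_failed_queries: int  # hypothetical: consecutive failures tolerated before exiting
    # Simplified "access bitmap": IDs of OST-objects referenced during phase 1.
    accessed: Optional[Set[int]] = field(default_factory=lambda: {1, 2, 3})
    running: bool = True
    failed_queries: int = 0

    def poll_master(self, query_ok: bool, master_running: bool = True) -> None:
        """One periodic status query from the slave to the master engine."""
        if not self.running:
            return
        if query_ok:
            self.failed_queries = 0
            if not master_running:
                self._exit()  # master really has finished
        else:
            self.failed_queries += 1
            if self.failed_queries >= self.max_failed_queries:
                self._exit()  # assume the master has exited

    def _exit(self) -> None:
        self.running = False
        self.accessed = None  # the bitmap is released on exit


def find_orphans(slave: SlaveEngine, all_objects: Set[int]) -> Set[int]:
    """Master's phase-2 orphan lookup against the slave's bitmap."""
    if slave.accessed is None:
        return set()  # bitmap already released: nothing can be reported
    return all_objects - slave.accessed


def run(max_failed_queries: int) -> int:
    """Replay: query OK, one transient failure, query OK, then orphan lookup."""
    slave = SlaveEngine(max_failed_queries=max_failed_queries)
    for ok in (True, False, True):
        slave.poll_master(query_ok=ok)
    # Object 4 exists on the OST but was never referenced, so it is the orphan.
    return len(find_orphans(slave, {1, 2, 3, 4}))


if __name__ == "__main__":
    print("exit on first failure:", run(1))
    print("tolerate transient failures:", run(3))
```

In this model, a slave that gives up on the first failed query reports no orphans even though the master is still running, which corresponds to the "Expect 1 fixed on mds1, but got: 0" symptom; a slave that rides out the transient failure still holds its bitmap and reports the single orphan.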

Comment by Gerrit Updater [ 29/Sep/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16667
Subject: LU-7190 lfsck: tolerate MDT-OST communication failures
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dac0ed16ab2493fd3ae4b53c8a30b356e9de5873

Comment by Gerrit Updater [ 06/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16667/
Subject: LU-7190 lfsck: tolerate MDT-OST communication failures
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 978458e05db4cad21e3ee32384168f53fd3e2d72

Comment by Peter Jones [ 06/Oct/15 ]

Landed for 2.8
