[LU-5833] sanity-lfsck test_6b: namespace lfsck completed unexpectedly Created: 31/Oct/14  Updated: 19/Nov/14  Resolved: 19/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 16356

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/29838f2c-60fb-11e4-a66b-5254006e85c2.

The sub-test test_6b failed with the following error:

(6) Expect 'scanning-phase1', but got 'completed'


Info required for matching: sanity-lfsck 6b



 Comments   
Comment by nasf (Inactive) [ 31/Oct/14 ]

According to the MDS log, when the namespace LFSCK was last started (resuming from the former run), the lower-layer iteration appears to have returned no more objects, so the injected failure stub (OBD_FAIL_LFSCK_DELAY2) was not triggered as expected. With no delay, the LFSCK completed quickly. I have seen this situation before. Although I have not found the root cause, it should not be related to the patch http://review.whamcloud.com/#/c/11848/14, because it has also happened without this patch.

00000004:00000080:0.0:1414753244.871791:0:9874:0:(mdt_handler.c:5682:mdt_iocontrol()) handling ioctl cmd 0xc00866e6
00100000:10000000:1.0:1414753245.981101:0:9876:0:(lfsck_engine.c:1620:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_namespace LFSCK assistant thread start
00100000:10000000:1.0:1414753245.981150:0:9875:0:(lfsck_namespace.c:3966:lfsck_namespace_prep()) lustre-MDT0000-osd: namespace LFSCK prep done, start pos [732, [0x200000bd4:0xdf:0x0], 0xa6f862b9510000]: rc = 0
00100000:10000000:1.0:1414753245.981685:0:9875:0:(lfsck_namespace.c:4181:lfsck_namespace_post()) lustre-MDT0000-osd: namespace LFSCK post done: rc = 0
00100000:10000000:1.0:1414753245.981695:0:9876:0:(lfsck_engine.c:1691:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_namespace LFSCK assistant thread post
Comment by nasf (Inactive) [ 31/Oct/14 ]

Here is another failure instance without the patch 11848:
https://testing.hpdd.intel.com/test_sets/4ce9f344-5ca4-11e4-b9ce-5254006e85c2

Comment by nasf (Inactive) [ 03/Nov/14 ]

Another failure instance with the patch 11848:
https://testing.hpdd.intel.com/test_sets/5a28e144-626e-11e4-b9a7-5254006e85c2

Comment by nasf (Inactive) [ 03/Nov/14 ]

Inside lfsck_prep(), the value returned by lfsck_open_dir() was not handled properly before being passed back to the caller. For example, if the LFSCK had already reached the end of the current directory when lfsck_open_dir() was called, lfsck_open_dir() returns a positive number. If lfsck_prep() passes that value on to its caller unchanged, the whole LFSCK first-stage scanning is wrongly treated as done.

Here is the patch to fix that:
http://review.whamcloud.com/#/c/12533
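
For readers following the analysis above, here is a minimal standalone sketch of the return-value problem. It is not the Lustre code and not the content of patch 12533: mock_open_dir(), prep_unfixed(), prep_fixed(), phase_after_prep() and the scan_phase enum are hypothetical stand-ins, and clamping the positive value to 0 is only one assumed way to keep it from reaching the caller. The actual change may handle the end-of-directory case differently.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical phases; only the names echo the test output in this ticket. */
enum scan_phase { PHASE_FAILED, PHASE_SCANNING_PHASE1, PHASE_COMPLETED };

/*
 * Hypothetical stand-in for lfsck_open_dir():
 *   < 0  error
 *   = 0  directory opened, entries remain
 *   > 0  already at the end of the current directory
 */
static int mock_open_dir(int at_dir_end, int io_error)
{
	if (io_error)
		return -EIO;
	return at_dir_end ? 1 : 0;
}

/* Problem shape: the positive "end of directory" value leaks to the caller. */
static int prep_unfixed(int at_dir_end, int io_error)
{
	return mock_open_dir(at_dir_end, io_error);
}

/*
 * Assumed fix shape: a positive value only describes the current directory,
 * so it is not forwarded as the result of the whole prep step.
 */
static int prep_fixed(int at_dir_end, int io_error)
{
	int rc = mock_open_dir(at_dir_end, io_error);

	if (rc > 0)
		rc = 0;
	return rc;
}

/* Hypothetical caller: a positive rc is read as "first-stage scan is done". */
static enum scan_phase phase_after_prep(int rc)
{
	if (rc < 0)
		return PHASE_FAILED;
	if (rc > 0)
		return PHASE_COMPLETED;
	return PHASE_SCANNING_PHASE1;
}

/* Walks through the three cases discussed in the comment above. */
static void demo_check(void)
{
	/* Resume at end of a directory: unfixed prep reports "completed". */
	assert(phase_after_prep(prep_unfixed(1, 0)) == PHASE_COMPLETED);
	/* Same situation with the assumed fix: scanning continues. */
	assert(phase_after_prep(prep_fixed(1, 0)) == PHASE_SCANNING_PHASE1);
	/* Real errors are still propagated. */
	assert(phase_after_prep(prep_fixed(0, 1)) == PHASE_FAILED);
}
```

In this sketch the unfixed path reproduces the symptom reported by test_6b (the scan is seen as completed where scanning-phase1 was expected), while negative error codes still reach the caller in both variants.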

Comment by Gerrit Updater [ 19/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12533/
Subject: LU-5833 lfsck: handle lfsck_open_dir() return-value properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f935a36c035a20433669997f7d70b35073dff5f2
