[LU-5848] sanity-lfsck test_18e: MDS is not the expected 'completed' Created: 03/Nov/14  Updated: 07/Jul/15  Resolved: 07/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 16378

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/81b27534-6234-11e4-8055-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9037c036-5d26-11e4-ae10-5254006e85c2

The sub-test test_18e failed with the following error:

(4) MDS4 is not the expected 'completed'


Info required for matching: sanity-lfsck 18e



 Comments   
Comment by nasf (Inactive) [ 05/Nov/14 ]

According to the test logs, the LFSCK status on MDT4 is 'scanning-phase2', not the expected 'completed'. This means the LFSCK on MDT4 should have moved on to the second-stage scanning. Normally, before the second-stage scanning starts, the LFSCK assistant thread generates the log message "LFSCK assistant phase2 scan start", but I cannot find that message in the MDT4 debug_log, which is abnormal. The most likely reason is that the LFSCK assistant thread was blocked inside dt_sync() before generating that message. Currently, we do not know what happened during dt_sync(). I will make a debug patch to verify that.
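
For illustration, the status check that fails here can be sketched as a small shell filter. This is a minimal sketch, not the code from sanity-lfsck: the function names are hypothetical, and it assumes the LFSCK parameter output contains a line of the form "status: <value>".

```shell
#!/bin/bash
# Hypothetical helpers (not part of the Lustre test framework).
# Assumption: the LFSCK parameter text read on stdin contains a line
# of the form "status: <value>".

# Print the value of the first "status:" line found on stdin.
lfsck_status_from_text() {
    awk '/^status:/ { print $2; exit }'
}

# Succeed only if the status read from stdin is "completed".
lfsck_is_completed() {
    [ "$(lfsck_status_from_text)" = "completed" ]
}
```

On an MDS, the input would typically come from something like "lctl get_param -n mdd.<target>.lfsck_layout", where <target> is a placeholder for the MDT device name. In the failure above, such a check would report 'scanning-phase2' for MDT4.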

Comment by nasf (Inactive) [ 05/Nov/14 ]

Here is the debug patch:
http://review.whamcloud.com/#/c/12573/

Comment by Gerrit Updater [ 23/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12573/
Subject: LU-5848 debug: more debug log for dt_sync
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ec37d78caf6cbd8118cfe1ff012c828950356e7a

Comment by Andreas Dilger [ 25/Nov/14 ]

Landed to master for 2.7.0

Comment by Bruno Faccini (Inactive) [ 03/Mar/15 ]

+1 at "https://testing.hpdd.intel.com/test_sets/f720aa60-c121-11e4-b948-5254006e85c2". Maybe there is something new to learn from the added debug logs?

Comment by nasf (Inactive) [ 03/Mar/15 ]

According to the log, the layout LFSCK on MDT1 did NOT get stuck in dt_sync(); instead, the layout LFSCK assistant thread had not yet reached the dt_sync() point. Two possible cases:
1) The layout LFSCK assistant thread was blocked inside lfsck_assistant_notify_others().
2) There is a deadlock between the master engine and the assistant thread.

Unfortunately, we do not have the kernel stack traces for the failure instances, so we cannot know exactly what happened. Is there any way to get the stack trace?

Comment by Bruno Faccini (Inactive) [ 03/Mar/15 ]

Hello Nasf, after trying to debug my own failure case in more depth, I agree with you that there is not much of interest in the Lustre debug logs (other than the expected but missing log entries you pointed out!).
You may want to add an "echo t > /proc/sysrq-trigger" command, sent to all nodes/MDSs running MDT facets, in sanity-lfsck/test_18e upon error.
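
A minimal sketch of that suggestion, assuming a generic remote-execution hook rather than the test framework's own helpers. RUN_ON_NODE and dump_stacks_on_nodes are hypothetical names introduced here for illustration; only the "echo t > /proc/sysrq-trigger" command itself comes from the comment above.

```shell
#!/bin/bash
# Hypothetical helper: ask every listed node to dump all task stacks
# to its kernel log via SysRq 't'.
# RUN_ON_NODE is an assumed hook: a command invoked as
#   <RUN_ON_NODE> <node> <command string>
# It defaults to ssh; a test framework would substitute its own
# remote-execution helper.
RUN_ON_NODE=${RUN_ON_NODE:-ssh}

dump_stacks_on_nodes() {
    local node rc=0
    for node in "$@"; do
        # Writing to /proc/sysrq-trigger requires root on the target node.
        "$RUN_ON_NODE" "$node" "echo t > /proc/sysrq-trigger" || rc=1
    done
    # Non-zero if the command failed on any node.
    return $rc
}
```

Called from the error path of the test with the list of MDS nodes, this would leave the stack of every task, including the LFSCK assistant thread, in each node's kernel log for later collection.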

Comment by nasf (Inactive) [ 03/Mar/15 ]

As I remember, Maloo collects the kernel stack trace automatically, but I do not know why it is not doing so now.

Comment by Gerrit Updater [ 03/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13950
Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 249163ca2d2e605be1d7617ca1b0a778ccb05e5e

Comment by Gerrit Updater [ 11/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13950/
Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb597d302fe0259ce60691677a6e79c3ff19bbb2

Generated at Sat Feb 10 01:55:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.