[LU-5848] sanity-lfsck test_18e: MDS is not the expected 'completed' Created: 03/Nov/14 Updated: 07/Jul/15 Resolved: 07/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 16378 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: The sub-test test_18e failed with the following error: (4) MDS4 is not the expected 'completed' Please provide additional information about the failure here. Info required for matching: sanity-lfsck 18e |
| Comments |
| Comment by nasf (Inactive) [ 05/Nov/14 ] |
|
According to the test logs, the LFSCK status on MDT4 is 'scanning-phase2', not the expected 'completed'. This means the LFSCK on MDT4 should have moved on to the second-stage scanning. Normally, before the second-stage scanning starts, the LFSCK assistant thread generates the LFSCK log message "LFSCK assistant phase2 scan start", but I cannot find that message in the MDT4 debug_log, which is abnormal. The most likely reason is that the LFSCK assistant thread was blocked inside dt_sync() before generating that log message. Currently, we do not know what happened during dt_sync(). I will make a debug patch to verify that. |
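For readers unfamiliar with this kind of check, the following is a minimal sketch of a "wait until the status is 'completed'" test, which is the check that failed here with 'scanning-phase2'. It is not the sanity-lfsck test code. The helper names (`parse_lfsck_status`, `wait_for_status`) and the `read_status` callback are hypothetical, and the sketch assumes the status text contains a line of the form `status: <value>`.

```python
import time
from typing import Callable, Optional, Tuple


def parse_lfsck_status(text: str) -> Optional[str]:
    """Return the value of the first 'status:' line, or None if absent.

    Assumption: the status dump contains a line such as
    'status: scanning-phase2'.
    """
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip()
    return None


def wait_for_status(
    read_status: Callable[[], str],
    expected: str = "completed",
    timeout: float = 32.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, Optional[str]]:
    """Poll read_status() until the parsed status equals `expected`.

    Returns (True, status) on success, or (False, last_seen_status)
    once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        last = parse_lfsck_status(read_status())
        if last == expected:
            return True, last
        if clock() >= deadline:
            return False, last
        sleep(interval)
```

With a reader that keeps returning `status: scanning-phase2`, `wait_for_status` returns `(False, 'scanning-phase2')` after the timeout, which corresponds to the failure reported in this ticket.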
| Comment by nasf (Inactive) [ 05/Nov/14 ] |
|
Here is the debug patch: |
| Comment by Gerrit Updater [ 23/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12573/ |
| Comment by Andreas Dilger [ 25/Nov/14 ] |
|
Landed to master for 2.7.0 |
| Comment by Bruno Faccini (Inactive) [ 03/Mar/15 ] |
|
+1 at "https://testing.hpdd.intel.com/test_sets/f720aa60-c121-11e4-b948-5254006e85c2". Maybe there is something new to learn from the added debug logs? |
| Comment by nasf (Inactive) [ 03/Mar/15 ] |
|
According to the log, the layout LFSCK on MDT1 did NOT enter dt_sync(); instead, the layout LFSCK assistant thread had not yet reached the dt_sync() point. Two possible cases: Unfortunately, we do not have the kernel stack trace for all of the failure instances, so we cannot know exactly what happened. Is there any way to get the stack trace? |
| Comment by Bruno Faccini (Inactive) [ 03/Mar/15 ] |
|
Hello Nasf, after trying to debug my own failure case further and in more depth, I agree with you that there is not much of interest in the Lustre debug logs (other than the expected but missing log entries, as you pointed out!). |
| Comment by nasf (Inactive) [ 03/Mar/15 ] |
|
As I remember, Maloo should collect the kernel stack trace automatically, but I do not know why it is not doing so now. |
| Comment by Gerrit Updater [ 03/Mar/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13950 |
| Comment by Gerrit Updater [ 11/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13950/ |