[LU-4885] sanity-lfsck test 18d failed: (3) MDS1 is not the expected 'completed' Created: 11/Apr/14  Updated: 11/Apr/14  Resolved: 11/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: lfsck
Environment:

Toro Autotest


Severity: 3
Rank (Obsolete): 13520

 Description   

sanity-lfsck test 18d has failed a few times during autotesting. One such failure is at https://maloo.whamcloud.com/test_sets/01266ea2-c171-11e3-9451-52540035b04c .

The test expected LFSCK to end with status "completed", but it ended with status "failed". From the client test_log:

Update not seen after 32s: wanted 'completed' got 'failed'
 sanity-lfsck test_18d: @@@@@@ FAIL: (3) MDS1 is not the expected 'completed' 

From the client console, we see:

02:01:39:LustreError: 11-0: lustre-OST0000-osc-ffff88007a5d8c00: Communicating with 10.10.4.191@tcp, operation ldlm_enqueue failed with -12.

The MDS1 console log shows:

02:00:14:LustreError: 17569:0:(lfsck_layout.c:1422:lfsck_layout_master_notify_others()) lustre-MDT0000-osd: fail to notify OST 7 for layout start: rc = -95
02:00:14:LustreError: 17569:0:(lfsck_layout.c:1551:lfsck_layout_master_notify_others()) lustre-MDT0000-osd: fail to notify MDT 7 for layout phase1 done: rc = -95
02:00:14:LustreError: 17569:0:(lfsck_layout.c:1343:lfsck_layout_master_query_others()) lustre-MDT0000-osd: fail to query MDT 7 for layout: rc = -95
02:00:14:LustreError: 17569:0:(lfsck_layout.c:1504:lfsck_layout_master_notify_others()) lustre-MDT0000-osd: fail to notify MDT 7 for layout stop/phase2: rc = -95

This failure looks a lot like LU-4879. The difference is that in LU-4879 LFSCK ended with status "partial", whereas in this case it ended with status "failed".
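For reference when reading the console messages above, the return codes can be mapped to their errno names. The following is a minimal sketch, assuming the nodes run Linux, where errno 12 is ENOMEM and 95 is EOPNOTSUPP. The helper name `decode_rc` and the regular expression are illustrative only and are not part of Lustre or its test framework.

```python
import re

# Linux errno names for the codes seen in this ticket (assumes Linux nodes).
LINUX_ERRNO = {12: "ENOMEM", 95: "EOPNOTSUPP"}

# Matches the two forms used in the logs: "rc = -95" and "failed with -12".
RC_PATTERN = re.compile(r"(?:rc = |failed with )-(\d+)")


def decode_rc(line):
    """Return (code, name) for the first negative return code in a log line.

    Hypothetical helper for illustration; returns None if no code is found.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO.get(code, "unknown")


if __name__ == "__main__":
    lines = [
        "lustre-MDT0000-osd: fail to notify OST 7 for layout start: rc = -95",
        "operation ldlm_enqueue failed with -12.",
    ]
    for line in lines:
        print(decode_rc(line), "<-", line)
```

Under that assumption, the MDS1 messages report EOPNOTSUPP (operation not supported) and the client message reports ENOMEM (out of memory).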



 Comments   
Comment by James Nunez (Inactive) [ 11/Apr/14 ]

Note that there are other test 18d failures with the same error message, but where the status after 32 seconds is "scanning-phase2". See https://maloo.whamcloud.com/test_sets/d1a9ae82-ab04-11e3-bd80-52540035b04c and https://maloo.whamcloud.com/test_sets/e4b24d3e-b7f9-11e3-987f-52540035b04c .

For these cases, we may just need to increase the wait time for LFSCK to complete.
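The distinction between the two outcomes can be sketched as a polling loop. This is a rough illustration, not the test framework's actual wait logic: `wait_for_status`, `get_status`, and the choice of "failed" as the only terminal status are all assumptions made for the example. A status of "scanning-phase2" at the deadline means LFSCK was still running, so a longer timeout might help. A status of "failed" is terminal, so waiting longer would not change the result.

```python
import time


def wait_for_status(get_status, wanted, timeout_s, interval_s=1.0,
                    terminal=("failed",)):
    """Poll get_status() until it returns `wanted`.

    Hypothetical helper: get_status is any callable returning the current
    LFSCK status string. Returns (ok, last_status). Stops early on a
    terminal status, or when timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status != wanted:
        # A terminal status will never become `wanted`; a timeout may
        # simply mean the scan needed more time.
        if status in terminal or time.monotonic() >= deadline:
            return False, status
        time.sleep(interval_s)
        status = get_status()
    return True, status


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status query.
    states = iter(["scanning-phase1", "scanning-phase2", "completed"])
    print(wait_for_status(lambda: next(states), "completed",
                          timeout_s=5, interval_s=0))
```

With this shape, raising `timeout_s` only affects the runs that were still in "scanning-phase2" when the wait expired.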

Comment by nasf (Inactive) [ 11/Apr/14 ]

This is another instance of the LU-4879 failure.

Generated at Sat Feb 10 01:46:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.