[LU-6915] sanity-lfsck test 31h fail: “(3) unexpected status” Created: 27/Jul/15  Updated: 11/Feb/16  Resolved: 11/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: lfsck
Environment:

review-dne-part-2 in autotest


Issue Links:
Duplicate
duplicates LU-5911 sanity-lfsck test_31g: update not see... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-lfsck test 31h fails with “(3) unexpected status”. Logs are at: https://testing.hpdd.intel.com/test_sets/ef98233e-3293-11e5-8214-5254006e85c2

From the LFSCK namespace output, we see:

20:24:57:status: partial
20:24:57:flags: incomplete
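
For context, the status check that trips this error is roughly of the following form (a minimal sketch, not the exact sanity-lfsck test code; the mdd.<fsname>-MDT0000.lfsck_namespace parameter path and the expected "completed" value are assumptions based on the usual Lustre layout):

  # Hypothetical check, not copied from sanity-lfsck: read the namespace LFSCK
  # state from the first MDT and fail if it is not the expected value.
  FSNAME=${FSNAME:-lustre}
  STATUS=$(lctl get_param -n mdd.${FSNAME}-MDT0000.lfsck_namespace |
           awk '/^status:/ { print $2 }')
  if [ "$STATUS" != "completed" ]; then
          echo "(3) unexpected status: $STATUS"
          exit 1
  fi

In the failure above the reported status was "partial" with the "incomplete" flag set, so a check of this form fails.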


 Comments   
Comment by James Nunez (Inactive) [ 02/Nov/15 ]

Another failure on master:
2015-10-31 04:03:03 - https://testing.hpdd.intel.com/test_sets/0248405c-7fbc-11e5-bf12-5254006e85c2

Another failure on master for sanity-lfsck test_31g:
2015-11-02 19:16:01 - https://testing.hpdd.intel.com/test_sets/fdb229ae-81cd-11e5-af7b-5254006e85c2

Comment by Jian Yu [ 02/Dec/15 ]

Another instance on master:
https://testing.hpdd.intel.com/test_sets/79ea4116-9784-11e5-b72a-5254006e85c2

Comment by nasf (Inactive) [ 10/Feb/16 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/fdb5d7b8-cb18-11e5-be8d-5254006e85c2

Comment by nasf (Inactive) [ 11/Feb/16 ]

00000020:00000080:0.0:1448881260.165609:0:29625:0:(class_obd.c:229:class_handle_ioctl()) cmd = c00866e6
00000004:00000080:0.0:1448881260.165616:0:29625:0:(mdt_handler.c:5587:mdt_iocontrol()) handling ioctl cmd 0xc00866e6
00100000:10000000:0.0:1448881260.166859:0:29625:0:(lfsck_namespace.c:3798:lfsck_namespace_reset()) lustre-MDT0000-osd: namespace LFSCK reset: rc = 0
00100000:10000000:1.0:1448881260.167039:0:29627:0:(osd_scrub.c:652:osd_scrub_prep()) lustre-MDT0000: OI scrub prep, flags = 0x46
00100000:10000000:1.0:1448881260.167043:0:29627:0:(osd_scrub.c:278:osd_scrub_file_reset()) lustre-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x0
00100000:10000000:1.0:1448881260.167157:0:29628:0:(lfsck_engine.c:1562:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_namespace LFSCK assistant thread start
00100000:10000000:1.0:1448881260.167179:0:29626:0:(lfsck_namespace.c:4041:lfsck_namespace_prep()) lustre-MDT0000-osd: namespace LFSCK prep done, start pos [1, [0x0:0x0:0x0], 0x0]: rc = 0
00100000:10000000:1.0:1448881260.167185:0:29627:0:(osd_scrub.c:1498:osd_scrub_main()) lustre-MDT0000: OI scrub start, flags = 0x46, pos = 12
00100000:10000000:1.0:1448881260.167673:0:29626:0:(lfsck_namespace.c:3940:lfsck_namespace_checkpoint()) lustre-MDT0000-osd: namespace LFSCK checkpoint at the pos [12, [0x0:0x0:0x0], 0x0]: rc = 0
00100000:10000000:1.0:1448881260.167676:0:29626:0:(lfsck_engine.c:1046:lfsck_master_engine()) LFSCK entry: oit_flags = 0x60000, dir_flags = 0x8006, oit_cookie = 12, dir_cookie = 0x0, parent = [0x0:0x0:0x0], pid = 29626
00000100:00100000:0.0:1448881260.167737:0:29625:0:(client.c:1530:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lctl:lustre-MDT0000-mdtlov_UUID:29625:1519247244731748:10.2.4.167@tcp:1101
00000100:00100000:0.0:1448881260.167775:0:29625:0:(client.c:1530:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lctl:lustre-MDT0000-mdtlov_UUID:29625:1519247244731752:10.2.4.167@tcp:1101
00000100:00100000:0.0:1448881260.167784:0:29625:0:(client.c:1530:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lctl:lustre-MDT0000-mdtlov_UUID:29625:1519247244731756:10.2.4.167@tcp:1101
00000100:00100000:0.0:1448881260.167790:0:29625:0:(client.c:2210:ptlrpc_set_wait()) set ffff880059e146c0 going to sleep for 6 seconds
00100000:10000000:0.0:1448881260.170033:0:29625:0:(lfsck_lib.c:2031:lfsck_async_interpret_common()) lustre-MDT0000-osd: fail to notify MDT 3 for lfsck_namespace start: rc = -114
...

The logs show that a former LFSCK instance had not yet finished when the new LFSCK started (the notification to MDT 3 failed with rc = -114, i.e. -EALREADY). As a result, only some of the MDTs joined the current LFSCK run, so the final LFSCK status was "partial" rather than "completed".

We should make sure that all LFSCK instances have completed before the next LFSCK run starts. We already have a solution in the patch http://review.whamcloud.com/#/c/17406/
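
The idea behind the fix can be sketched as follows (hypothetical only; the real change is in the patch referenced above). Before launching a new LFSCK run, poll every MDT until no namespace LFSCK is still scanning, so that all targets can join the next run. do_facet and MDSCOUNT are assumed from Lustre's test-framework.sh, and the mds$((idx+1)) facet mapping assumes one MDT per MDS:

  FSNAME=${FSNAME:-lustre}
  for idx in $(seq 0 $((MDSCOUNT - 1))); do
          mdt=$(printf "%s-MDT%04x" "$FSNAME" "$idx")
          # Wait while this MDT's previous namespace LFSCK is still running.
          while :; do
                  status=$(do_facet mds$((idx + 1)) \
                           lctl get_param -n mdd.${mdt}.lfsck_namespace |
                           awk '/^status:/ { print $2 }')
                  case "$status" in
                  scanning-phase1|scanning-phase2) sleep 1 ;;
                  *) break ;;
                  esac
          done
  done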

Comment by nasf (Inactive) [ 11/Feb/16 ]

This is another failure instance of LU-7256.
