[LU-2628] replay-single test_40: mdc_enqueue() ldlm_cli_enqueue: -4 Created: 16/Jan/13  Updated: 27/Aug/19  Resolved: 27/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6150

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7179c526-600d-11e2-84d4-52540035b04c.

The sub-test test_40 failed with the following error:

test_40 failed with 4

Info required for matching: replay-single 40

I hit this during DNE testing, but DNE does not touch this part of the code at all. I investigated the debug log
a bit, and the failure seems to be related to the recent layout lock landing. Jinshan, could you please have a look?

00000100:00100000:0.0:1358254377.783158:0:23931:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48ae00 going to sleep for 0 seconds
00000100:00100000:0.0:1358254378.202028:0:23930:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48a980 going to sleep for 0 seconds
00000100:00100000:0.0:1358254378.783171:0:23931:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48ae00 going to sleep for 0 seconds
00000100:00100000:0.0:1358254379.202097:0:23930:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48a980 going to sleep for 0 seconds
00000100:00100000:0.0:1358254379.217287:0:23931:0:(client.c:1957:ptlrpc_interrupted_set()) INTERRUPTED SET ffff88006c48ae00
00000100:00100000:0.0:1358254379.217291:0:23931:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48ae00 going to sleep for 0 seconds
00000100:00100000:0.0:1358254379.217298:0:23931:0:(client.c:1774:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc writeme:ba5d0671-fa8b-7c21-6711-f632712f4bd4:23931:1424228991737390:10.10.4.211@tcp:101
00010000:00010000:0.0:1358254379.217308:0:23931:0:(ldlm_request.c:547:ldlm_cli_enqueue_fini()) ### client-side enqueue END (FAILED) ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 4/1,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217315:0:23931:0:(ldlm_request.c:496:failed_lock_cleanup()) ### setting FL_LOCAL_ONLY | LDLM_FL_FAILED | LDLM_FL_ATOMIC_CB | LDLM_FL_CBPENDING ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 4/1,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217320:0:23931:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 4/1,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217326:0:23931:0:(ldlm_lock.c:914:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 3/0,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217332:0:23931:0:(ldlm_lockd.c:1667:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 4/0,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217336:0:23931:0:(ldlm_lockd.c:1680:ldlm_handle_bl_callback()) Lock ffff88006bea0200 already unused, calling callback (ffffffffa09c8f90)
00010000:00010000:0.0:1358254379.217339:0:23931:0:(ldlm_request.c:1095:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 5/0,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0x20e400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217361:0:23931:0:(ldlm_request.c:1108:ldlm_cli_cancel_local()) not sending request (at caller's instruction)
00010000:00010000:0.0:1358254379.217364:0:23931:0:(ldlm_lockd.c:1689:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 3/0,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0xa0f400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00010000:00010000:0.0:1358254379.217369:0:23931:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88006ccbac00 lock: ffff88006bea0200/0xb20e505c2e7d4ca0 lrc: 0/0,0 mode: --/CR res: 8589959616/1603 bits 0x8 rrc: 1 type: IBT flags: 0xa0f400000000 nid: local remote: 0x0 expref: -99 pid: 23931 timeout: 0 lvb_type: 3
00000002:00020000:0.0:1358254379.217379:0:23931:0:(mdc_locks.c:830:mdc_enqueue()) ldlm_cli_enqueue: -4
00000080:00200000:0.0:1358254379.218420:0:23931:0:(vvp_io.c:95:vvp_io_fini()) ignore/verify layout 0/0, layout version -1.
00000080:00200000:0.0:1358254379.218486:0:23931:0:(file.c:306:ll_file_release()) VFS Op:inode=144115607908910659/33554529(ffff88006cbf8238)
00000002:00080000:0.0:1358254379.218516:0:23931:0:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88006c57d000 x1424228991737309/t171798692241(171798692241) o101->lustre-MDT0000-mdc-ffff88006ccbac00@10.10.4.211@tcp:12/10 lens 576/544 e 0 to 0 dl 1358254380 ref 1 fl Complete:RP/4/0 rc 0/301



 Comments   
Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ]

The root cause of this problem is that the MDT was still in recovery when the client on vm1 tried to enqueue the layout lock; the enqueue was eventually interrupted by a signal (hence the -4, i.e. -EINTR, reported by ldlm_cli_enqueue).

The reason the MDT couldn't finish recovery was that the client on vm2 failed to reconnect, because its connect request was dropped by the OBD_FAIL_MDS_CONNECT_NET setting.

Taking a step back, the idea of this test case is to verify that the client can still write to the OSTs while the MDT is not connected. This is no longer valid now that the layout lock has been introduced, because we have to verify the layout is correct before writing to the OSTs.

I'd like to fix this problem by disabling this test case. What do you think?

Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ]

patch is at: http://review.whamcloud.com/5056

Comment by nasf (Inactive) [ 24/Jan/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/3db35304-6645-11e2-a42b-52540035b04c

Comment by nasf (Inactive) [ 29/Jan/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/7581330a-6a28-11e2-9da0-52540035b04c

Comment by Oleg Drokin [ 29/Jan/13 ]

Terrible idea to disable the test case.
Just ensure the client has the layout lock first before killing MDS.
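
A minimal sketch of that suggestion, written in the style of replay-single.sh and assuming the standard test-framework helpers (run_test, fail, error, $DIR/$tfile, $SINGLEMDS); the test name and exact sequence are illustrative only, not the actual fix:

# Hypothetical sketch, not the real test_40: have the client acquire and
# cache the layout lock (via an initial write) before the MDS is failed,
# so the IO under test does not need to enqueue a new layout lock.
test_40_sketch() {
	# initial write: the client enqueues and caches the layout lock here
	dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 conv=fsync ||
		error "initial write failed"

	# background IO that should keep running across the MDS failure,
	# because the layout lock is already cached on the client
	dd if=/dev/zero of=$DIR/$tfile bs=4k count=1024 &
	local pid=$!

	fail $SINGLEMDS

	wait $pid || error "IO did not survive MDS failover"
}
run_test 40_sketch "client IO survives MDS failure with cached layout lock"

Note that, as the later comments point out, ELC (bz 16774) drops unused locks on reconnect, so whether a cached layout lock actually survives the failover depends on whether it is still in use at that point.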

Comment by Jinshan Xiong (Inactive) [ 30/Jan/13 ]

Patch Set 3: I would prefer that you didn't submit this.
I think this is a terrible idea. We still need to ensure clients are able to do IO in the absence of the MDS (otherwise we have a single point of failure for the entire cluster). So this test must remain. It may just need to be changed to ensure the client has the necessary layout lock before killing off the MDS.
The only way we can land this in its current state is if the failure severely affects other testing, and then only on the condition that somebody starts working on a proper test fix right away, for submission very soon.

Though I can do some trick to make this test pass, it can't change the fact that IO now relies on the MDT; otherwise wrong objects could be updated by the client. The layout lock can be lost at any time, so I don't see what's wrong with disabling a test case that doesn't reflect real-world use cases.

Maybe you think IO is just pushing data to OST objects, but things have changed: verifying the layout is also part of IO, IMHO.

Comment by Andreas Dilger [ 30/Jan/13 ]

Oleg, I think the MDS is already a single point of failure. Sure, a client could continue to write to a file for a short time if the MDS has failed, but it will then block as soon as it tries to use the next file.

I agree that if the client has these DLM locks cached it could continue to work while the MDS is down, but this also risks the client going "rogue" and writing to objects that are incorrect, because it has lost its connection to the MDS but doesn't know it yet.

Comment by Keith Mannthey (Inactive) [ 12/Feb/13 ]

Another instance: 3 out of 100 runs hit it.
https://maloo.whamcloud.com/test_sessions/4cb4d830-7579-11e2-bf59-52540035b04c

Comment by Andreas Dilger [ 11/Mar/13 ]

From Oleg's comments it sounds like this is a defect in the test that could be fixed, rather than a reason to simply disable the test. I'm going to reopen this issue and mark the test "always_except" so that we know to look at fixing it when we get a chance.

Comment by Andreas Dilger [ 11/Mar/13 ]

Oleg's comments in the patch were:

OK, we went through the actual failure with Jinshan.

The failure is not because the connection is lost; it is because the MDT is actually alive and the client successfully reconnected, and due to bz 16774 we now drop all unused locks on reconnect.

The test itself has a problem: it expects only a single client to be present, so when the connect request is dropped there is no way to know which client's request will be dropped, the IO client or some other client that happens to be present.

Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ]

This test was wrongly implemented originally. If it really wants to make sure that IO can continue without a connection to the MDS, it shouldn't have set OBD_FAIL_MDS_CONNECT_NET to fail only once; that is the defect of this test case. There is nothing wrong with ELC itself: LLNL used to have a problem where huge numbers of useless locks were replayed during recovery.

Given this situation, I still think it's reasonable to get rid of this test case, because we can't compose a fix for it.
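
For reference, the fail_loc convention being discussed, as it is typically driven from the test framework. This is a sketch only: 0x117 is assumed here to be the value of OBD_FAIL_MDS_CONNECT_NET from lustre/include/obd_support.h and should be verified against the tree; OR-ing in 0x80000000 (CFS_FAIL_ONCE) makes the failure trigger a single time.

# One-shot: drops exactly one MDS connect request, after which connects
# succeed again; per the comments above, the client's reconnect then
# triggers ELC, dropping its unused locks (including the layout lock).
do_facet $SINGLEMDS "$LCTL set_param fail_loc=0x80000117"  # 0x117 assumed

# Persistent: keeps dropping connect requests until explicitly cleared,
# which is what a "client keeps doing IO with no MDS connection" scenario
# would actually need.
do_facet $SINGLEMDS "$LCTL set_param fail_loc=0x117"
# ... run the IO under test ...
do_facet $SINGLEMDS "$LCTL set_param fail_loc=0"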

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hardfailover: EL7 Server/Client - ZFS
build# 3305
https://testing.hpdd.intel.com/test_sets/02982ada-bbc7-11e5-8506-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found on b2_8 for failover testing, build #6.
https://testing.hpdd.intel.com/test_sessions/54ec62da-d99d-11e5-9ebe-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/c5a8e44c-d9c7-11e5-85dd-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

close old tickets

Comment by Andreas Dilger [ 27/Aug/19 ]

Reopen to remove always_except label.
