[LU-2628] replay-single test_40: mdc_enqueue() ldlm_cli_enqueue: -4 Created: 16/Jan/13 Updated: 27/Aug/19 Resolved: 27/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6150 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>.
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7179c526-600d-11e2-84d4-52540035b04c. The sub-test test_40 failed with the following error:
Info required for matching: replay-single 40
I hit this during DNE testing, but DNE does not touch this part of the code at all. I just investigated the debug log:
00000100:00100000:0.0:1358254377.783158:0:23931:0:(client.c:2059:ptlrpc_set_wait()) set ffff88006c48ae00 going to sleep for 0 seconds |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ] |
|
The root cause of this problem is that the MDT was still in recovery when the client on vm1 tried to enqueue the layout lock; the enqueue was eventually interrupted by a signal. The MDT could not complete recovery because the client on vm2 failed to reconnect: its connect request was dropped because OBD_FAIL_MDS_CONNECT_NET was set. Taking a step back, the idea of this test case is to verify that the client can still write to the OST while the MDT is not connected. This is no longer valid now that the layout lock has been introduced, because we have to verify that the layout is correct before writing to OSTs. I'd like to fix this problem by disabling this test case. What do you think? |
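For readers matching the "-4" in the issue title to the "interrupted by a signal" explanation: kernel code conventionally returns negated errno values, and errno 4 is EINTR. The snippet below is only an illustrative sketch for decoding such a return code; the helper name describe_rc is hypothetical and is not part of Lustre or its test framework.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Turn a negative kernel-style return code into 'NAME (message)'.

    describe_rc is a hypothetical helper for illustration only.
    """
    code = abs(rc)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message for the code.
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The rc reported by mdc_enqueue() / ldlm_cli_enqueue in this ticket.
    print(describe_rc(-4))
```

The symbolic name for -4 is EINTR; the message text in parentheses depends on the platform's C library.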
| Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ] |
|
patch is at: http://review.whamcloud.com/5056 |
| Comment by nasf (Inactive) [ 24/Jan/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/3db35304-6645-11e2-a42b-52540035b04c |
| Comment by nasf (Inactive) [ 29/Jan/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/7581330a-6a28-11e2-9da0-52540035b04c |
| Comment by Oleg Drokin [ 29/Jan/13 ] |
|
Terrible idea to disable the test case. |
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/13 ] |
Though I could use some trick to make this test pass, that wouldn't change the fact that IO now relies on the MDT; otherwise the client could update the wrong objects. The layout lock can be lost at any time, so I don't understand what's wrong with disabling a test case that doesn't reflect real-world use cases. Maybe you think that IO is just pushing some data to OST objects, but things have changed. Verifying the layout is also part of IO, IMHO. |
| Comment by Andreas Dilger [ 30/Jan/13 ] |
|
Oleg, I think the MDS is already a single point of failure. Sure, a client could continue to write to a file for a short time if the MDS has failed, but it will then block as soon as it tries to use the next file. I agree that a client with these DLM locks cached could continue to work while the MDS is down, but this also risks the client going "rogue" and writing to incorrect objects because it has lost its connection to the MDS and doesn't know it yet. |
| Comment by Keith Mannthey (Inactive) [ 12/Feb/13 ] |
|
Another instance: It reported 3 out of 100 hit it. |
| Comment by Andreas Dilger [ 11/Mar/13 ] |
|
It sounds from Oleg's comments that this is a defect in the test that could be fixed, rather than just disabling the test? I'm going to reopen this issue, and mark it "always_except" so that we know to look at fixing the test when we get a chance. |
| Comment by Andreas Dilger [ 11/Mar/13 ] |
|
Oleg's comments in the patch were:
|
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ] |
|
This test was implemented incorrectly from the start. If it really wants to make sure that IO can continue without a connection to the MDS, it shouldn't have set OBD_FAIL_MDS_CONNECT_NET to fail once; that is the defect in this test case. There is nothing wrong with ELC, because LLNL used to have a problem where a huge number of useless locks were replayed during recovery. Given this, I still think it's reasonable to get rid of this test case because we can't compose a fix for it. |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hardfailover: EL7 Server/Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found on b2_8 for failover testing, build #6. |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
close old tickets |
| Comment by Andreas Dilger [ 27/Aug/19 ] |
|
Reopen to remove always_except label. |