[LU-10406] sanity-lfsck test_31c: (4) Fail to re-generate master LMV EA: 2 Created: 19/Dec/17  Updated: 27/Feb/19  Resolved: 13/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10005 File creation to slave MDT is much sl... Resolved
is related to LU-11218 osp xattr cache brought by mdd_is_par... Closed
Severity: 3

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite runs:
https://testing.hpdd.intel.com/test_sets/1bf35756-9e3e-11e7-ba27-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a426e9c6-e3da-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/a30b149e-e47b-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/f72002ba-e45d-11e7-9840-52540065bddc
https://testing.hpdd.intel.com/test_sets/b717e1b8-e446-11e7-a066-52540065bddc
https://testing.hpdd.intel.com/test_sets/75350600-e445-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/11a47514-e425-11e7-9840-52540065bddc
https://testing.hpdd.intel.com/test_sets/a58c7e12-e492-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/e357fa00-a919-11e7-b78a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b319cf8e-e330-11e7-a066-52540065bddc

The sub-test test_31c started failing intermittently on 2017-12-17 09:06:23 with the following error:

(4) Fail to re-generate master LMV EA: 3

It looks like this was introduced by patch https://review.hpdd.intel.com/29078 "LU-10005 osp: cache non-exist EA", which landed on master on 2017-12-16 23:21; the test had already failed twice on that patch itself before it landed.
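
For context, the failure mode that a negative EA cache of this sort can introduce is sketched below. This is a minimal, hypothetical model (the names and types are invented for illustration, not the actual OSP code): once "xattr absent" has been cached, a later creation of the EA elsewhere keeps returning a stale -ENODATA until something invalidates the cache entry.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified model of a negative xattr cache like the one
 * patch 29078 added to OSP; invented for illustration, not real Lustre
 * code. */
struct xattr_cache_entry {
        bool negative;          /* cached "no such xattr" result */
        const char *value;
        size_t len;
};

static int cached_getxattr(const struct xattr_cache_entry *ce,
                           const char **val, size_t *len)
{
        if (ce->negative)
                return -ENODATA; /* stale once the EA is created remotely */
        *val = ce->value;
        *len = ce->len;
        return 0;
}

int main(void)
{
        /* LFSCK re-generates the master LMV EA, but a stale negative
         * entry would still answer "absent" until invalidated. */
        struct xattr_cache_entry lmv = { .negative = true };
        const char *val;
        size_t len;

        printf("getxattr rc = %d\n", cached_getxattr(&lmv, &val, &len));
        return 0;
}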

Info required for matching: sanity-lfsck 31c



 Comments   
Comment by Andreas Dilger [ 21/Dec/17 ]

Fan Yong, can you please take a look at this? It is causing a lot of test failures. If there isn't a straightforward fix, then please revert the failing patch.

Comment by Andreas Dilger [ 21/Dec/17 ]

This has failed 27 times in the five days since the patch landed.

Comment by Andreas Dilger [ 22/Dec/17 ]

It looks like there is a revert patch already: https://review.whamcloud.com/30643

Comment by nasf (Inactive) [ 22/Dec/17 ]

adilger, I think you are right. The failure is quite possibly related to patch https://review.whamcloud.com/#/c/29078/. I am on vacation these days with no machine at hand, so it is not convenient to debug the root cause. Please revert that patch; I will prepare a proper fix when I am back.
Thanks!

Comment by Bob Glossman (Inactive) [ 23/Dec/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/174e3f3c-e78f-11e7-8027-52540065bddc

Comment by Gerrit Updater [ 02/Jan/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30694
Subject: LU-10406 tests: stop running sanity-lfsck 31c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a113d617c0d4b78d3775468d29f51959f1fb5bc4

Comment by James Nunez (Inactive) [ 02/Jan/18 ]

We have a patch (https://review.whamcloud.com/#/c/30643/) to revert the patch for LU-10005, but we are having a hard time getting it to pass testing. Thus, I've uploaded patch https://review.whamcloud.com/#/c/30694/ to temporarily add sanity-lfsck test 31c to the ALWAYS_EXCEPT list.

Looking at the revert patch, 30643 patch set #3, we see that sanity-lfsck passes as part of review-dne-part-2 and review-zfs-part-1 (https://testing.hpdd.intel.com/test_sets/b4da8d66-e76a-11e7-9840-52540065bddc and https://testing.hpdd.intel.com/test_sets/cd79dd76-e766-11e7-9c63-52540065bddc), but hangs in test 8 each time when run in a custom test session created from "Test-Parameters: clientcount=2 osscount=1 ostcount=7 mdscount=2 mdtcount=2 testlist=sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck"; see https://testing.hpdd.intel.com/test_sessions/d29aeac6-05b2-44c5-88d4-560b25dbd1e2.

Comment by Andreas Dilger [ 03/Jan/18 ]

James, is it worthwhile to remove the custom test session request and just run a normal review test session? It may be that sanity-lfsck.sh itself has some problem when running with the requested test config, independent of this patch (e.g. MDTCOUNT=2 instead of the standard MDTCOUNT=4), so it would be good to limit the variables and try to get the reversion landed.

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30694/
Subject: LU-10406 tests: stop running sanity-lfsck 31c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d645c722d73c8c4226ab5695d37a541d39bb0a64

Comment by James Nunez (Inactive) [ 04/Jan/18 ]

Fan Yong -
We submitted a patch at https://review.whamcloud.com/#/c/30643/ to revert the patch for LU-10005, with the following test parameters:
Test-Parameters: clientcount=2 osscount=1 ostcount=7 mdscount=2 mdtcount=2 testlist=sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck

The intention of running sanity-lfsck multiple times was to confirm that reverting the patch for LU-10005 actually fixed the test 31c failure. Yet the test sessions with multiple sanity-lfsck runs failed in the same way, with a hang in test 8. Results are at
https://testing.hpdd.intel.com/test_sessions/e958a3e3-4a3f-4f98-83b7-b19e1513ea95
https://testing.hpdd.intel.com/test_sessions/d29aeac6-05b2-44c5-88d4-560b25dbd1e2

Do you understand why just running sanity-lfsck with the revert of LU-10005 would cause test 8 to hang?

Comment by nasf (Inactive) [ 05/Jan/18 ]

The hang happened at the beginning of sanity-lfsck test_8. According to the test logic, the script reformats the system via 'formatall' and then mounts it via 'setupall'; the hang occurred during 'setupall'. Please note that the system was in a clean state, since it had just been reformatted, so this is not related to your patch.

The logs on OST:

00000020:01000004:0.0:1513957722.318110:0:21014:0:(obd_mount_server.c:1208:server_register_target()) Registration lustre-OST0000, fs=lustre, 10.2.8.87@tcp, index=0000, flags=0x62
10000000:01000000:0.0:1513957722.318112:0:21014:0:(mgc_request.c:1267:mgc_set_info_async()) register_target lustre-OST0000 0x10000062
10000000:01000000:0.0:1513957722.318125:0:21014:0:(mgc_request.c:1206:mgc_target_register()) register lustre-OST0000
00000100:00100000:0.0:1513957722.318134:0:21014:0:(client.c:2320:ptlrpc_set_wait()) set ffff88005b08ee80 going to sleep for 0 seconds
00000100:00100000:0.0:1513957722.318138:0:21014:0:(client.c:2026:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mount.lustre:8e527c78-387c-cf38-1d04-d2dd8fba8f79:21014:1587498878862272:10.2.8.88@tcp:253
00000020:02020000:0.0:1513957722.318145:0:21014:0:(obd_mount_server.c:1230:server_register_target()) 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?

This means the OST failed to register with the MGS because the connection to the MGS was shut down. But from the logs on the MDS:

00000100:00100000:0.0:1513961335.627207:0:3242:0:(service.c:2099:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0001:8e527c78-387c-cf38-1d04-d2dd8fba8f79+16:16679:x1587498878943424:12345-10.2.8.87@tcp:400
00000100:00100000:0.0:1513961335.627235:0:3242:0:(service.c:2145:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0001:8e527c78-387c-cf38-1d04-d2dd8fba8f79+16:16679:x1587498878943424:12345-10.2.8.87@tcp:400 Request procesed in 28us (101us total) trans 0 rc 0/0

That means the MGS received a ping RPC from the OST, but I cannot find any log entries for the MGS_TARGET_REG RPC from the OSTs, which is strange. Without more detailed logs we cannot tell exactly what happened; it looks like connection trouble just during MGS_TARGET_REG. Anyway, it is NOT related to your patch.
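
For readers decoding the "opc" field at the end of the log lines above, a small sketch may help; the opcode values below are from Lustre's lustre_idl.h as I recall them, so treat the exact numbers as an assumption:

#include <stdio.h>

/* Assumed opcode values (from Lustre's lustre_idl.h). In the logs above,
 * the OST's failing request is opc 253 (MGS_TARGET_REG), while the only
 * RPC the MGS is seen handling is opc 400 (OBD_PING). */
enum {
        MGS_TARGET_REG = 253,
        OBD_PING       = 400,
};

int main(void)
{
        printf("MGS_TARGET_REG=%d OBD_PING=%d\n", MGS_TARGET_REG, OBD_PING);
        return 0;
}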

Comment by nasf (Inactive) [ 06/Jan/18 ]

I know why the OST cannot register with the MGS: the connection between the MGC and the MGS had not yet been set up when the OST tried to register with the MGS:

00000100:00080000:0.0:1513957706.461000:0:16677:0:(import.c:1277:ptlrpc_connect_interpret()) ffff88007b1bf800 MGS: changing import state from CONNECTING to DISCONN
...
00000020:00000080:0.0:1513957722.315616:0:21014:0:(genops.c:1404:class_connect()) connect: client lustre-OST0000-osd_UUID, cookie 0xd7729766f2408d04
00000020:01000004:0.0:1513957722.315619:0:21014:0:(obd_mount_server.c:1893:server_fill_super()) Found service lustre-OST0000 on device /dev/mapper/lvm--Role_OSS-P1
00000020:01000004:0.0:1513957722.315658:0:21014:0:(obd_mount.c:331:lustre_start_mgc()) MGC10.2.8.88@tcp: Set MGC reconnect 1
10000000:01000000:0.0:1513957722.315660:0:21014:0:(mgc_request.c:1249:mgc_set_info_async()) InitRecov MGC10.2.8.88@tcp 1/d0:i1:r0:or0:CONNECTING
00000020:01000004:0.0:1513957722.315664:0:21014:0:(obd_mount_server.c:1303:server_start_targets()) starting target lustre-OST0000
00000020:01000004:0.0:1513957722.315670:0:21014:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/mapper/lvm--Role_OSS-P1
00000040:01000000:0.0:1513957722.318097:0:21014:0:(llog_obd.c:210:llog_setup()) obd MGC10.2.8.88@tcp ctxt 0 is initialized
00000020:01000004:0.0:1513957722.318110:0:21014:0:(obd_mount_server.c:1208:server_register_target()) Registration lustre-OST0000, fs=lustre, 10.2.8.87@tcp, index=0000, flags=0x62
10000000:01000000:0.0:1513957722.318112:0:21014:0:(mgc_request.c:1267:mgc_set_info_async()) register_target lustre-OST0000 0x10000062
10000000:01000000:0.0:1513957722.318125:0:21014:0:(mgc_request.c:1206:mgc_target_register()) register lustre-OST0000
00000100:00100000:0.0:1513957722.318134:0:21014:0:(client.c:2320:ptlrpc_set_wait()) set ffff88005b08ee80 going to sleep for 0 seconds
00000100:00100000:0.0:1513957722.318138:0:21014:0:(client.c:2026:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mount.lustre:8e527c78-387c-cf38-1d04-d2dd8fba8f79:21014:1587498878862272:10.2.8.88@tcp:253
00000020:02020000:0.0:1513957722.318145:0:21014:0:(obd_mount_server.c:1230:server_register_target()) 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
00000020:01000004:0.0:1513957722.319303:0:21014:0:(obd_mount_server.c:312:server_mgc_clear_fs()) Unassign mgc disk
...
00000100:00080000:0.0:1513957735.454711:0:16677:0:(import.c:1195:ptlrpc_connect_interpret()) ffff88007b1bf800 MGS: changing import state from CONNECTING to EVICTED
...

Please note that:

10000000:01000000:0.0:1513957722.315660:0:21014:0:(mgc_request.c:1249:mgc_set_info_async()) InitRecov MGC10.2.8.88@tcp 1/d0:i1:r0:or0:CONNECTING

That means import::imp_invalid was 1, which affects the ptlrpc behavior when sending an RPC:

static int ptlrpc_import_delay_req(struct obd_import *imp,
                                   struct ptlrpc_request *req, int *status)
{
        int delay = 0;
...
        /* An invalid import does not queue the request for recovery;
         * the request fails immediately with -ESHUTDOWN (-108). */
        } else if (imp->imp_invalid || imp->imp_obd->obd_no_recov) {
                if (!imp->imp_deactive)
                        DEBUG_REQ(D_NET, req, "IMP_INVALID");
                *status = -ESHUTDOWN; /* bz 12940 */
...
}

That is why server_register_target() failed with -108 (-ESHUTDOWN).
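
To make the timeline concrete, here is a minimal sketch of the race using invented, simplified types (not the real ptlrpc structures): the MGC import went from CONNECTING to DISCONN at 1513957706 and was not evicted until 1513957735, so the OST's registration attempt at 1513957722 fell inside the window where the import was invalid, and the request was failed locally with -ESHUTDOWN (ESHUTDOWN is 108 on Linux, hence rc = -108).

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Invented, simplified model of the imp_invalid check, mirroring the
 * ptlrpc_import_delay_req() branch quoted above; not the real Lustre
 * structures. */
struct import_model {
        bool invalid;           /* models import::imp_invalid */
};

static int send_rpc(const struct import_model *imp)
{
        if (imp->invalid)
                return -ESHUTDOWN;      /* ESHUTDOWN == 108 on Linux */
        return 0;
}

int main(void)
{
        /* The MGC import was still invalid when the OST registered. */
        struct import_model mgc_import = { .invalid = true };

        printf("register rc = %d\n", send_rpc(&mgc_import)); /* -108 */
        return 0;
}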

Comment by Gerrit Updater [ 06/Jan/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30762
Subject: LU-10406 osp: cache xattr properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6725a75fb7bad8ebd51a2595b4766183339e5ad

Comment by Gerrit Updater [ 10/Apr/18 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/31935
Subject: LU-10406 tests: stop running sanity-lfsck 31c
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 78dae657afaa7ea4b78352a159756077d64793bf

Comment by Gerrit Updater [ 11/Apr/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31935/
Subject: LU-10406 tests: stop running sanity-lfsck 31c
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: ebaeb217ed12181a02c833354c15c68ddee21e23

Comment by Gerrit Updater [ 13/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31194/
Subject: LU-10406 mdt: invalidate cache upon LDLM lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7edc970036e118dd6fc63252054c1c9f594ce1e1

Comment by Peter Jones [ 13/Nov/18 ]

Landed for 2.12
