[LU-10406] sanity-lfsck test_31c: (4) Fail to re-generate master LMV EA: 2 Created: 19/Dec/17 Updated: 27/Feb/19 Resolved: 13/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
| Comments |
| Comment by Andreas Dilger [ 21/Dec/17 ] |
|
Fan Yong, can you please take a look at this? It is causing a lot of test failures. If there isn't a straightforward fix, can you please revert the failing patch? |
| Comment by Andreas Dilger [ 21/Dec/17 ] |
|
This has failed 27x in the past 5 days since this patch was landed. |
| Comment by Andreas Dilger [ 22/Dec/17 ] |
|
It looks like there is a revert patch already: https://review.whamcloud.com/30643 |
| Comment by nasf (Inactive) [ 22/Dec/17 ] |
|
adilger, I think you are right. The failure is quite possibly related to the patch https://review.whamcloud.com/#/c/29078/. I am out on vacation these days with no machine at hand, so it is not convenient to debug the root cause. Please revert that patch; I will make a proper patch when I am back. |
| Comment by Bob Glossman (Inactive) [ 23/Dec/17 ] |
|
Another failure on master: |
| Comment by Gerrit Updater [ 02/Jan/18 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30694 |
| Comment by James Nunez (Inactive) [ 02/Jan/18 ] |
|
We have a patch (https://review.whamcloud.com/#/c/30643/) to revert the offending patch. Looking at the revert patch, 30643 patch set #3, we see that sanity-lfsck passes as part of review-dne-part-2 and review-zfs-part-1, https://testing.hpdd.intel.com/test_sets/b4da8d66-e76a-11e7-9840-52540065bddc and https://testing.hpdd.intel.com/test_sets/cd79dd76-e766-11e7-9c63-52540065bddc, but fails, in fact hangs in test 8 each time, when run in a custom test session created from "Test-Parameters: clientcount=2 osscount=1 ostcount=7 mdscount=2 mdtcount=2 testlist=sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck,sanity-lfsck": https://testing.hpdd.intel.com/test_sessions/d29aeac6-05b2-44c5-88d4-560b25dbd1e2. |
| Comment by Andreas Dilger [ 03/Jan/18 ] |
|
James, is it worthwhile to remove the custom test session request, and just run a normal review test session? It may be that sanity-lfsck.sh itself has some problem when running with the requested test config independent of this patch (e.g. MDTCOUNT=2 instead of the standard MDTCOUNT=4), so it would be good to limit the variables and try to get the reversion landed. |
| Comment by Gerrit Updater [ 04/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30694/ |
| Comment by James Nunez (Inactive) [ 04/Jan/18 ] |
|
Fan Yong - The intention of running sanity-lfsck multiple times was to confirm that reverting the patch fixes the failures. Do you understand why just running sanity-lfsck, with the revert applied, hangs in test 8? |
| Comment by nasf (Inactive) [ 05/Jan/18 ] |
|
The hang happened at the beginning of sanity-lfsck test_8. According to the test logic, the script reformats the system via 'formatall' and then mounts it via 'setupall'; the hang occurred during 'setupall'. Please note that the system was in a clean state, since it had just been reformatted, so this is not related to your patch. The logs on the OST:

00000020:01000004:0.0:1513957722.318110:0:21014:0:(obd_mount_server.c:1208:server_register_target()) Registration lustre-OST0000, fs=lustre, 10.2.8.87@tcp, index=0000, flags=0x62
10000000:01000000:0.0:1513957722.318112:0:21014:0:(mgc_request.c:1267:mgc_set_info_async()) register_target lustre-OST0000 0x10000062
10000000:01000000:0.0:1513957722.318125:0:21014:0:(mgc_request.c:1206:mgc_target_register()) register lustre-OST0000
00000100:00100000:0.0:1513957722.318134:0:21014:0:(client.c:2320:ptlrpc_set_wait()) set ffff88005b08ee80 going to sleep for 0 seconds
00000100:00100000:0.0:1513957722.318138:0:21014:0:(client.c:2026:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mount.lustre:8e527c78-387c-cf38-1d04-d2dd8fba8f79:21014:1587498878862272:10.2.8.88@tcp:253
00000020:02020000:0.0:1513957722.318145:0:21014:0:(obd_mount_server.c:1230:server_register_target()) 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?

That means the OST failed to register with the MGS because the connection to the MGS was shut down.
But from the logs on the MDS:

00000100:00100000:0.0:1513961335.627207:0:3242:0:(service.c:2099:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0001:8e527c78-387c-cf38-1d04-d2dd8fba8f79+16:16679:x1587498878943424:12345-10.2.8.87@tcp:400
00000100:00100000:0.0:1513961335.627235:0:3242:0:(service.c:2145:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0001:8e527c78-387c-cf38-1d04-d2dd8fba8f79+16:16679:x1587498878943424:12345-10.2.8.87@tcp:400 Request procesed in 28us (101us total) trans 0 rc 0/0

That means the MGS received a ping RPC from the OST, but I cannot find any log entries for an MGS_TARGET_REG RPC from the OSTs. That is strange. Without more detailed logs we do not know exactly what happened; it looks like connection trouble right at the MGS_TARGET_REG step. Anyway, it is NOT related to your patch. |
| Comment by nasf (Inactive) [ 06/Jan/18 ] |
|
I know why the OST cannot register with the MGS: the connection between the MGC and the MGS had not been set up yet when the OST tried to register with the MGS:

00000100:00080000:0.0:1513957706.461000:0:16677:0:(import.c:1277:ptlrpc_connect_interpret()) ffff88007b1bf800 MGS: changing import state from CONNECTING to DISCONN
...
00000020:00000080:0.0:1513957722.315616:0:21014:0:(genops.c:1404:class_connect()) connect: client lustre-OST0000-osd_UUID, cookie 0xd7729766f2408d04
00000020:01000004:0.0:1513957722.315619:0:21014:0:(obd_mount_server.c:1893:server_fill_super()) Found service lustre-OST0000 on device /dev/mapper/lvm--Role_OSS-P1
00000020:01000004:0.0:1513957722.315658:0:21014:0:(obd_mount.c:331:lustre_start_mgc()) MGC10.2.8.88@tcp: Set MGC reconnect 1
10000000:01000000:0.0:1513957722.315660:0:21014:0:(mgc_request.c:1249:mgc_set_info_async()) InitRecov MGC10.2.8.88@tcp 1/d0:i1:r0:or0:CONNECTING
00000020:01000004:0.0:1513957722.315664:0:21014:0:(obd_mount_server.c:1303:server_start_targets()) starting target lustre-OST0000
00000020:01000004:0.0:1513957722.315670:0:21014:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/mapper/lvm--Role_OSS-P1
00000040:01000000:0.0:1513957722.318097:0:21014:0:(llog_obd.c:210:llog_setup()) obd MGC10.2.8.88@tcp ctxt 0 is initialized
00000020:01000004:0.0:1513957722.318110:0:21014:0:(obd_mount_server.c:1208:server_register_target()) Registration lustre-OST0000, fs=lustre, 10.2.8.87@tcp, index=0000, flags=0x62
10000000:01000000:0.0:1513957722.318112:0:21014:0:(mgc_request.c:1267:mgc_set_info_async()) register_target lustre-OST0000 0x10000062
10000000:01000000:0.0:1513957722.318125:0:21014:0:(mgc_request.c:1206:mgc_target_register()) register lustre-OST0000
00000100:00100000:0.0:1513957722.318134:0:21014:0:(client.c:2320:ptlrpc_set_wait()) set ffff88005b08ee80 going to sleep for 0 seconds
00000100:00100000:0.0:1513957722.318138:0:21014:0:(client.c:2026:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mount.lustre:8e527c78-387c-cf38-1d04-d2dd8fba8f79:21014:1587498878862272:10.2.8.88@tcp:253
00000020:02020000:0.0:1513957722.318145:0:21014:0:(obd_mount_server.c:1230:server_register_target()) 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
00000020:01000004:0.0:1513957722.319303:0:21014:0:(obd_mount_server.c:312:server_mgc_clear_fs()) Unassign mgc disk
...
00000100:00080000:0.0:1513957735.454711:0:16677:0:(import.c:1195:ptlrpc_connect_interpret()) ffff88007b1bf800 MGS: changing import state from CONNECTING to EVICTED
...

Please note this line:

10000000:01000000:0.0:1513957722.315660:0:21014:0:(mgc_request.c:1249:mgc_set_info_async()) InitRecov MGC10.2.8.88@tcp 1/d0:i1:r0:or0:CONNECTING

That means import::imp_invalid was 1, which changes the ptlrpc behavior when sending RPCs:

static int ptlrpc_import_delay_req(struct obd_import *imp,
                                   struct ptlrpc_request *req, int *status)
{
        int delay = 0;
        ...
        } else if (imp->imp_invalid || imp->imp_obd->obd_no_recov) {
                if (!imp->imp_deactive)
                        DEBUG_REQ(D_NET, req, "IMP_INVALID");
                *status = -ESHUTDOWN; /* bz 12940 */
        ...
}

That is why server_register_target() failed with -108. |
| Comment by Gerrit Updater [ 06/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30762 |
| Comment by Gerrit Updater [ 10/Apr/18 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/31935 |
| Comment by Gerrit Updater [ 11/Apr/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31935/ |
| Comment by Gerrit Updater [ 13/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31194/ |
| Comment by Peter Jones [ 13/Nov/18 ] |
|
Landed for 2.12 |