[LU-3009] Interop 2.3.0<->2.4 failure on test suite conf-sanity test_17 Created: 21/Mar/13 Updated: 27/Mar/13 Resolved: 27/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | client: lustre 2.3.0 RHEL6 |
| Severity: | 3 |
| Rank (Obsolete): | 7322 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/af198d4a-9239-11e2-b841-52540035b04c. The sub-test test_17 failed with the following error:
client console:
Lustre: DEBUG MARKER: == conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 22:57:00 (1363845420)
LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: MGC10.10.4.132@tcp: Reactivating import
LustreError: 19749:0:(mgc_request.c:246:do_config_log_add()) failed processing sptlrpc log: -2
LustreError: 11-0: an error occurred while communicating with 10.10.4.132@tcp. The mds_connect operation failed with -11
LustreError: 11-0: an error occurred while communicating with 10.10.4.132@tcp. The mds_connect operation failed with -11
Lustre: Mounted lustre-client
LNet: 19772:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
LustreError: 19814:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 19814:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 1 previous similar message
LustreError: 19814:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 19814:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Lustre: Unmounted lustre-client
LNet: Removed LNI 10.10.4.5@tcp
LNet: 20001:0:(workitem.c:416:cfs_wi_sched_destroy()) waiting for 4 threads of WI sched[cfs_rh] to terminate
LNet: 20267:0:(linux-cpu.c:811:cfs_cpt_table_create()) CPU partition number 2 is larger than suggested value (1), your system may have performance issue or run out of memory while under pressure
LNet: HW CPU cores: 4, npartitions: 2
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
LNet: 20279:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: Lustre: Build Version: 2.3.0-RC6--PRISTINE-2.6.32-279.5.1.el6.x86_64
LNet: Added LNI 10.10.4.5@tcp [8/256/0/180]
LNet: Accept secure, port 7988
Lustre: Echo OBD driver; http://www.lustre.org/
Lustre: DEBUG MARKER: conf-sanity test_17: @@@@@@ FAIL: test_17 failed with 42
conf-sanity.sh returned 0

MDS console:
Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Wed Mar 20 22:55:33 PDT 2013
Lustre: DEBUG MARKER: only running test 17
Lustre: DEBUG MARKER: excepting tests:
LustreError: 2918:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804192f4800 x1430092447242932/t0(0) o13->lustre-OST0001-osc-MDT0000@10.10.4.134@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 2918:0:(client.c:1048:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
LustreError: Skipped 59 previous similar messages
Lustre: 10439:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363845337/real 1363845337] req@ffff880212fa2000 x1430092447242948/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1363845343 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 10439:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
Lustre: server umount lustre-MDT0000 complete
Lustre: Skipped 10 previous similar messages
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.4.134@tcp, operation ost_statfs failed with -107.
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 2915:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88040e894c00 x1430092447243224/t0(0) o13->lustre-OST0000-osc-MDT0000@10.10.4.134@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: 11234:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363845413/real 1363845413] req@ffff88021c120800 x1430092447243236/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1363845419 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: == conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 22:57:00 (1363845420)
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: lustre-MDT0000: used disk, loading
Lustre: 11463:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
Lustre: 11463:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) Skipped 1 previous similar message
LNet: 11555:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
LNet: 11555:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 2 previous similar messages
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: lustre-MDT0000: used disk, loading
Lustre: DEBUG MARKER: conf-sanity test_17: @@@@@@ FAIL: test_17 failed with 42
LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message |
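For context on what this subtest exercises: per the analysis in the comments below, test_17 unlinks the MDT's configuration llog from the MGS device with debugfs and then expects the following MDS start to fail. A simplified sketch of that flow; stop_mds/start_mds/do_facet are conf-sanity.sh / test-framework.sh helpers, and reducing the subtest to these three steps is an assumption drawn from the comments, not the verbatim script:

    # Remove the MDT config llog from the MGS device, then expect the
    # next MDS start to fail (that failure is the behavior under test).
    stop_mds || return $?
    do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $MGSDEV"
    start_mds && error "start_mds succeeded despite the missing config llog"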
| Comments |
| Comment by Oleg Drokin [ 21/Mar/13 ] |
|
From the logs:
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. |
| Comment by Minh Diep [ 22/Mar/13 ] |
|
Oleg, actually I see a similar message every time I mount a new MDS/MGS. Should this be a bug?
LDISKFS-fs (sdc1): mounted filesystem with ordered data mode. quota=on. Opts: |
| Comment by Oleg Drokin [ 22/Mar/13 ] |
|
Well, then the message is a red herring? |
| Comment by Bob Glossman (Inactive) [ 22/Mar/13 ] |
|
This problem doesn't appear to require interop. I can make it happen with client and server both built from current master. The test log looks like:
== conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 12:16:38 (1363979798)
stop mds service on centos2 |
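To reproduce the same way, the subtest can be run on its own; a sketch, assuming a standard in-tree test setup (the MDS console log in the description shows the matching "only running test 17" marker):

    # Run only subtest 17 of conf-sanity; ONLY= is the usual
    # test-framework filter. The tests directory path is an assumption.
    cd lustre/tests
    ONLY=17 sh conf-sanity.sh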
| Comment by Bob Glossman (Inactive) [ 22/Mar/13 ] |
|
I see the following suspicious-looking error in the server log:
20000000:01000000:0.0:1363979874.430338:0:21269:0:(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 501: rc = -2
opc = 501 is LLOG_ORIGIN_HANDLE_CREATE. Not sure if this is expected or not. |
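For context, the rc here is a kernel-style negative errno: -2 is -ENOENT ("No such file or directory"), consistent with the MGS not finding the llog it was asked to open. A quick way to decode such values on a test node, assuming perl is available:

    # Print the strerror() text for errno 2; perl stringifies $! as the
    # corresponding error message ("No such file or directory").
    perl -e '$! = 2; print "$!\n"'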
| Comment by Bob Glossman (Inactive) [ 22/Mar/13 ] |
|
Pretty sure I was looking in the wrong place. There is no real problem in Lustre itself; the problem is in the test script(s). The key fact can be seen in the following test log fragment:

Remove mds config log

The debugfs command is failing to unlink the CONFIGS/lustre-MDT0000 llog file. The missing llog is supposed to generate an error in the following start_mds(). Since the llog unlink failed, the start_mds() which is supposed to fail and return an error instead succeeds, and that causes the reported failure in the subtest. The failing debugfs unlink is in the following conf-sanity.sh line:

do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $MGSDEV || return \$?" || return $?

This line fails because somewhere in the changeover of the test framework scripts and cfg files to multiple MDTs, the environment variable MGSDEV stopped being set by anybody. I will push a patch to fix this subtest by using an available alternative.
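A minimal sketch of the kind of fix described, assuming the "available alternative" is the mgsdevname() helper from test-framework.sh (that specific choice is an assumption, not confirmed in this ticket):

    # Inside test_17(): derive the MGS device from the test framework
    # instead of relying on the no-longer-set $MGSDEV.
    local mgsdev=$(mgsdevname)
    do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $mgsdev || return \$?" || return $?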
| Comment by Bob Glossman (Inactive) [ 22/Mar/13 ] |
| Comment by Peter Jones [ 27/Mar/13 ] |
|
Landed for 2.4 |