[LU-3009] Interop 2.3.0<->2.4 failure on test suite conf-sanity test_17 Created: 21/Mar/13  Updated: 27/Mar/13  Resolved: 27/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: HB
Environment:

client: lustre 2.3.0 RHEL6
server: lustre master build# 1328 RHEL6


Severity: 3
Rank (Obsolete): 7322

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/af198d4a-9239-11e2-b841-52540035b04c.

The sub-test test_17 failed with the following error:

test_17 returned 1

client console:

Lustre: DEBUG MARKER: == conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 22:57:00 (1363845420)
LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: MGC10.10.4.132@tcp: Reactivating import
LustreError: 19749:0:(mgc_request.c:246:do_config_log_add()) failed processing sptlrpc log: -2
LustreError: 11-0: an error occurred while communicating with 10.10.4.132@tcp. The mds_connect operation failed with -11
LustreError: 11-0: an error occurred while communicating with 10.10.4.132@tcp. The mds_connect operation failed with -11
Lustre: Mounted lustre-client
LNet: 19772:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
LustreError: 19814:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 19814:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 1 previous similar message
LustreError: 19814:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 19814:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Lustre: Unmounted lustre-client
LNet: Removed LNI 10.10.4.5@tcp
LNet: 20001:0:(workitem.c:416:cfs_wi_sched_destroy()) waiting for 4 threads of WI sched[cfs_rh] to terminate
LNet: 20267:0:(linux-cpu.c:811:cfs_cpt_table_create()) CPU partition number 2 is larger than suggested value(1), your system may have performanceissue or run out of memory while under pressure
LNet: HW CPU cores: 4, npartitions: 2
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
LNet: 20279:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: Lustre: Build Version: 2.3.0-RC6--PRISTINE-2.6.32-279.5.1.el6.x86_64
LNet: Added LNI 10.10.4.5@tcp [8/256/0/180]
LNet: Accept secure, port 7988
Lustre: Echo OBD driver; http://www.lustre.org/
Lustre: DEBUG MARKER: conf-sanity test_17: @@@@@@ FAIL: test_17 failed with 42
conf-sanity.sh returned 0

MDS console:

Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Wed Mar 20 22:55:33 PDT 2013
Lustre: DEBUG MARKER: only running test 17
Lustre: DEBUG MARKER: excepting tests:
LustreError: 2918:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8804192f4800 x1430092447242932/t0(0) o13->lustre-OST0001-osc-MDT0000@10.10.4.134@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 2918:0:(client.c:1048:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
LustreError: Skipped 59 previous similar messages
Lustre: 10439:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363845337/real 1363845337]  req@ffff880212fa2000 x1430092447242948/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1363845343 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 10439:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
Lustre: server umount lustre-MDT0000 complete
Lustre: Skipped 10 previous similar messages
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.4.134@tcp, operation ost_statfs failed with -107.
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 2915:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88040e894c00 x1430092447243224/t0(0) o13->lustre-OST0000-osc-MDT0000@10.10.4.134@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: 11234:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363845413/real 1363845413]  req@ffff88021c120800 x1430092447243236/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1363845419 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: == conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 22:57:00 (1363845420)
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: lustre-MDT0000: used disk, loading
Lustre: 11463:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
Lustre: 11463:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) Skipped 1 previous similar message
LNet: 11555:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
LNet: 11555:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 2 previous similar messages
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.4.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: lustre-MDT0000: used disk, loading
Lustre: DEBUG MARKER: conf-sanity test_17: @@@@@@ FAIL: test_17 failed with 42
LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message


 Comments   
Comment by Oleg Drokin [ 21/Mar/13 ]

From the logs: LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
So the MDT tries to connect to itself on startup and fails?

Comment by Minh Diep [ 22/Mar/13 ]

Oleg, actually I see a similar message every time I mount a new MDS/MGS. Should this be a bug?

LDISKFS-fs (sdc1): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (sdc1): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: ctl-widelust-MDT0000: No data found on store. Initialize space
Lustre: srv-widelust-MDT0000: No data found on store. Initialize space
Lustre: widelust-MDT0000: new disk, initializing
LustreError: 11-0: widelust-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.

Comment by Oleg Drokin [ 22/Mar/13 ]

Well, then the message is a red herring?

Comment by Bob Glossman (Inactive) [ 22/Mar/13 ]

This problem doesn't appear to require interop. I can make it happen with both client and server built from current master. The test log looks like:

== conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 12:16:38 (1363979798)
start mds service on centos2
Starting mds1: /dev/sdb /mnt/mds1
Started lustre-MDT0000
start ost1 service on centos2
. . .

stop mds service on centos2
Stopping /mnt/mds1 (opts:-f) on centos2
pdsh@centos1: centos2: ssh exited with exit code 1
modules unloaded.
Remove mds config log
centos2: debugfs 1.42.6.wc2 (10-Dec-2012)
centos2: unlink: Filesystem not open
start ost1 service on centos2
Starting ost1: /dev/sdc /mnt/ost1
Started lustre-OST0000
start mds service on centos2
Starting mds1: /dev/sdb /mnt/mds1
Started lustre-MDT0000
conf-sanity test_17: @@@@@@ FAIL: test_17 failed with 42
Loading modules from /usr/lib64/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
debug=-1
subsystem_debug=all -lnet -lnd -pinger
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4013:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4036:error()
= /usr/lib64/lustre/tests/test-framework.sh:4276:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4309:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4130:run_test()
= /usr/lib64/lustre/tests/conf-sanity.sh:625:main()
Dumping lctl log to /tmp/test_logs/2013-03-22/121525/conf-sanity.test_17.*.1363979875.log
FAIL 17 (78s)
conf-sanity returned 0
Finished at Fri Mar 22 12:17:56 PDT 2013 in 151s
/usr/lib64/lustre/tests/auster: completed with rc 0

Comment by Bob Glossman (Inactive) [ 22/Mar/13 ]

I see the following suspicious-looking error in the server log:

20000000:01000000:0.0:1363979874.430338:0:21269:0:(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 501: rc = -2
req@ffff88001872f050 x1430235639728852/t0(0) o501->38dab23a-7f36-96c0-1775-c9234a48b981@0@lo:0/0 lens 296/0 e 0 to 0 dl 1363979880 ref 1 fl Interpret:/0/ffffffff rc 0/-1

opc = 501 is LLOG_ORIGIN_HANDLE_CREATE

Not sure if this is expected or not.
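The opcode-to-name mapping is easy to double-check against the source; for example, assuming a checked-out Lustre tree with the usual 2.x layout (path is an assumption, adjust as needed):

# Locate the definition of opcode 501 in the Lustre headers
grep -rn "LLOG_ORIGIN_HANDLE_CREATE" lustre/include/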

Comment by Bob Glossman (Inactive) [ 22/Mar/13 ]

Pretty sure I was looking in the wrong place. There is no real problem in Lustre itself; the problem is in the test script(s).

The key fact can be seen in the following test log fragment:

Remove mds config log
fat-amd-1: debugfs 1.42.6.wc2 (10-Dec-2012)
fat-amd-1: unlink: Filesystem not open

The debugfs command is failing to unlink the CONFIGS/lustre-MDT0000 llog file. The missing llog is supposed to make the following start_mds() return an error. Since the llog unlink failed, the start_mds() that is supposed to fail instead succeeds, which causes the reported failure in the subtest.
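For context, here is a rough paraphrase of the subtest's flow (not the literal conf-sanity.sh code); the point is that a successful start_mds() after removing the llog is treated as the failure, which is why the logs report error 42:

# Rough paraphrase of the test_17 flow
echo "Remove mds config log"
do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $MGSDEV || return \$?" || return $?
start_ost || error "OST start failed"
# with the config llog gone, the next MDS start is expected to fail in
# mds_postsetup; a successful start_mds here is treated as the failure (42)
start_mds && return 42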

The debugfs unlink in the following conf-sanity.sh script line is failing:

do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $MGSDEV || return \$?" || return $?

This line fails because, somewhere in the changeover of the test-framework scripts and cfg files to multiple MDTs, the environment variable MGSDEV stopped being set by anything.

I will push a patch to fix it in this subtest by using an available alternative.
There may be uses of MGSDEV elsewhere that also need attention.
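As an illustration only (the actual change is in the review linked in the next comment), one way to avoid the unset variable would be to derive the device from an existing test-framework.sh helper instead of $MGSDEV; the sketch below assumes a combined MGS/MDT setup where mdsdevname 1 yields the correct device:

# Hypothetical alternative for the failing line, not necessarily the
# landed patch: use the mds1 device helper in place of the unset $MGSDEV.
local mgsdev=$(mdsdevname 1)    # assumes MGS is co-located with MDT0000
do_facet mgs "$DEBUGFS -w -R 'unlink CONFIGS/$FSNAME-MDT0000' $mgsdev || return \$?" || return $?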

Comment by Bob Glossman (Inactive) [ 22/Mar/13 ]

http://review.whamcloud.com/5816

Comment by Peter Jones [ 27/Mar/13 ]

Landed for 2.4
