[LU-3796] Interop 2.1.5<->2.5 failure on test suite conf-sanity test_35a: failed to start LWP
Created: 20/Aug/13  Updated: 17/Jul/17  Resolved: 17/Jul/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Niu Yawei (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

server: lustre-master build #1617
client: 2.1.5


Severity: 3
Rank (Obsolete): 9814

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/0daeb618-06df-11e3-87de-52540035b04c.

The sub-test test_35a failed with the following error:

test_35a failed with 6

Lustre: DEBUG MARKER: == conf-sanity test 35a: Reconnect to the last active server first == 01:31:45 (1376641905)
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o loop,user_xattr,acl  /dev/lvm-MDS/P1 /mnt/mds1
LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Set up a fake failnode for the MDS
Lustre: DEBUG MARKER: Set up a fake failnode for the MDS
Lustre: DEBUG MARKER: lctl get_param -n devices
Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre-MDT0000.failover.node=127.0.0.2
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for RECONNECT_INTERVAL seconds \(10s\)
Lustre: DEBUG MARKER: Wait for RECONNECT_INTERVAL seconds (10s)
LustreError: 11151:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity.sh test_35a 2013-08-16 1h32m11s
Lustre: DEBUG MARKER: conf-sanity.sh test_35a 2013-08-16 1h32m11s
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Stopping the MDT:
Lustre: DEBUG MARKER: Stopping the MDT:
Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
LustreError: 3060:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88031d622400 x1443503658857732/t0(0) o13->lustre-OST0000-osc-MDT0000@192.168.4.21@o2ib:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Not available for connect from 192.168.4.21@o2ib (stopping)
Lustre: lustre-MDT0000: Not available for connect from 192.168.4.23@o2ib (stopping)
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 192.168.4.21@o2ib (no target)
Lustre: 11340:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376641938/real 1376641938]  req@ffff880312bfd800 x1443503658857752/t0(0) o251->MGC192.168.4.20@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1376641944 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 11340:0:(client.c:1896:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Restarting the MDT:
Lustre: DEBUG MARKER: Restarting the MDT:
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o loop,user_xattr,acl  /dev/lvm-MDS/P1 /mnt/mds1
LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 11591:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 11537:0:(obd_mount_server.c:1274:server_start_targets()) lustre-MDT0000: failed to start LWP: -2
LustreError: 11537:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -2
Lustre: Failing over lustre-MDT0000
LustreError: 11537:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount  (-2)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_35a: @@@@@@ FAIL: test_35a failed with 6 


 Comments   
Comment by Jodi Levi (Inactive) [ 21/Aug/13 ]

Johann,
Could you please comment on this?
Thank you!

Comment by Johann Lombardi (Inactive) [ 21/Aug/13 ]

We failed to set up the local LWP connection on the MDT. Niu wrote this code.

From the log:

00000020:00000010:0.0:1376641955.674789:0:11896:0:(obd_mount_server.c:692:lustre_lwp_add_conn()) kmalloced 'bufs': 104 at ffff88030f528740.
00000020:00000001:0.0:1376641955.674790:0:11896:0:(lustre_cfg.h:235:lustre_cfg_new()) Process entered
00000020:00000001:0.0:1376641955.674791:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674792:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000001:0.0:1376641955.674793:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674794:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000001:0.0:1376641955.674795:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674796:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000010:0.0:1376641955.674797:0:11896:0:(lustre_cfg.h:238:lustre_cfg_new()) kmalloced 'lcfg': 88 at ffff88030f397840.
00000020:00000001:0.0:1376641955.674799:0:11896:0:(lustre_cfg.h:251:lustre_cfg_new()) Process leaving (rc=18446612145454544960 : -131928255006656 : ffff88030f397840)
00000020:00000001:0.0:1376641955.674801:0:11896:0:(obd_config.c:777:class_add_conn()) Process entered
00000020:00000001:0.0:1376641955.674802:0:11896:0:(obd_class.h:961:obd_add_conn()) Process entered
00010000:00000001:0.0:1376641955.674803:0:11896:0:(ldlm_lib.c:67:import_set_conn()) Process entered
00000100:00000200:0.0:1376641955.674807:0:11896:0:(events.c:550:ptlrpc_uuid_to_peer()) 127.0.0.2@tcp->12345-ffffffec@<0:0>
00000100:00000100:0.0:1376641955.674808:0:11896:0:(client.c:84:ptlrpc_uuid_to_connection()) cannot find peer 127.0.0.2@tcp!
00010000:00080000:0.0:1376641955.674809:0:11896:0:(ldlm_lib.c:76:import_set_conn()) can't find connection 127.0.0.2@tcp
00010000:00000001:0.0:1376641955.674811:0:11896:0:(ldlm_lib.c:77:import_set_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00000001:0.0:1376641955.674812:0:11896:0:(obd_class.h:968:obd_add_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00000001:0.0:1376641955.674814:0:11896:0:(obd_config.c:802:class_add_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00020000:0.0:1376641955.674816:0:11896:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
00000020:00000010:0.0:1376641955.685478:0:11896:0:(obd_mount_server.c:708:lustre_lwp_add_conn()) kfreed 'bufs': 104 at ffff88030f528740.

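So we are trying to connect to 127.0.0.2@tcp, which is the fake failover address used in conf-sanity test 35a. No peer can be found for that NID, so import_set_conn() returns -2 (ENOENT), and that error propagates up to lustre_lwp_add_conn().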
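
For anyone reading the trace above, here is a minimal sketch of how the "(rc=... : ... : ...)" triples can be decoded. It only shows that the unsigned value 18446744073709551614 is the 64-bit two's-complement form of -2, which maps to ENOENT. The names to_signed64 and decode_rc are hypothetical helpers written for this note and are not part of Lustre. The 4095 cutoff follows the Linux kernel's usual errno range and is an assumption here.

```python
import errno
import re

# Matches the "(rc=<unsigned> : <signed> : <hex>)" suffix seen in the trace above.
RC_PATTERN = re.compile(r"\(rc=(\d+) : (-?\d+) : ([0-9a-f]+)\)")


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def decode_rc(line):
    """Return (signed_rc, errno_name) for a 'Process leaving' line, or None.

    errno_name is None when the value is not a small negative errno,
    for example when the function returned a pointer.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    signed = to_signed64(int(match.group(1)))
    # Assumption: only values in [-4095, -1] are treated as errno codes.
    name = errno.errorcode.get(-signed) if -4095 <= signed < 0 else None
    return signed, name


line = ("(ldlm_lib.c:77:import_set_conn()) Process leaving "
        "(rc=18446744073709551614 : -2 : fffffffffffffffe)")
print(decode_rc(line))  # (-2, 'ENOENT')
```

The same helper leaves pointer returns alone: the lustre_cfg_new() line decodes to a large negative number with no errno name, so it is not mistaken for an error.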

Comment by Niu Yawei (Inactive) [ 22/Aug/13 ]

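This is because the fix 82a0cc9ee5489340406a6fc64494f37989099729 (LU-2140 test: add fake nid with proper nettype) wasn't backported to 2.1. The 2.1.5 test script sets failover.node=127.0.0.2 with no net type, so the NID is taken as 127.0.0.2@tcp, while the nodes in this run are on o2ib.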
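
As a rough illustration of the mismatch (not the actual LU-2140 patch, which is a change to the test script), the sketch below shows a bare address picking up a default net type and how a fake NID could instead borrow the net type of a real NID. split_nid and fake_failnode are hypothetical names. The "tcp" default is inferred from the trace in this ticket, where 127.0.0.2 was looked up as 127.0.0.2@tcp.

```python
def split_nid(nid, default_net="tcp"):
    """Split 'addr@net' into (addr, net).

    Assumption: a NID given without '@net' is treated as tcp, as the
    debug log in this ticket suggests.
    """
    addr, sep, net = nid.partition("@")
    return addr, (net if sep else default_net)


def fake_failnode(addr, reference_nid):
    """Build a fake failover NID on the same net type as reference_nid."""
    _, net = split_nid(reference_nid)
    return f"{addr}@{net}"


# What the 2.1.5 test effectively configured:
print(split_nid("127.0.0.2"))                          # ('127.0.0.2', 'tcp')
# A fake NID matching the net type used by the nodes in this run:
print(fake_failnode("127.0.0.2", "192.168.4.21@o2ib"))  # 127.0.0.2@o2ib
```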

Comment by Niu Yawei (Inactive) [ 22/Aug/13 ]

Backport of the LU-2140 fix: http://review.whamcloud.com/7417

Comment by Niu Yawei (Inactive) [ 17/Jul/17 ]

The backport patch for b2_1 was ready, but I don't think we need it anymore. Closing this ticket.
