[LU-3796] Interop 2.1.5<->2.5 failure on test suite conf-sanity test_35a: failed to start LWP Created: 20/Aug/13 Updated: 17/Jul/17 Resolved: 17/Jul/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: lustre-master build #1617 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9814 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: http://maloo.whamcloud.com/test_sets/0daeb618-06df-11e3-87de-52540035b04c. The sub-test test_35a failed with the following error:
Lustre: DEBUG MARKER: == conf-sanity test 35a: Reconnect to the last active server first == 01:31:45 (1376641905)
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o loop,user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Set up a fake failnode for the MDS
Lustre: DEBUG MARKER: Set up a fake failnode for the MDS
Lustre: DEBUG MARKER: lctl get_param -n devices
Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre-MDT0000.failover.node=127.0.0.2
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for RECONNECT_INTERVAL seconds \(10s\)
Lustre: DEBUG MARKER: Wait for RECONNECT_INTERVAL seconds (10s)
LustreError: 11151:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity.sh test_35a 2013-08-16 1h32m11s
Lustre: DEBUG MARKER: conf-sanity.sh test_35a 2013-08-16 1h32m11s
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Stopping the MDT:
Lustre: DEBUG MARKER: Stopping the MDT:
Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
LustreError: 3060:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88031d622400 x1443503658857732/t0(0) o13->lustre-OST0000-osc-MDT0000@192.168.4.21@o2ib:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Not available for connect from 192.168.4.21@o2ib (stopping)
Lustre: lustre-MDT0000: Not available for connect from 192.168.4.23@o2ib (stopping)
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 192.168.4.21@o2ib (no target)
Lustre: 11340:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376641938/real 1376641938] req@ffff880312bfd800 x1443503658857752/t0(0) o251->MGC192.168.4.20@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1376641944 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 11340:0:(client.c:1896:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Restarting the MDT:
Lustre: DEBUG MARKER: Restarting the MDT:
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o loop,user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
LustreError: 11591:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 11537:0:(obd_mount_server.c:1274:server_start_targets()) lustre-MDT0000: failed to start LWP: -2
LustreError: 11537:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -2
Lustre: Failing over lustre-MDT0000
LustreError: 11537:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount (-2)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_35a: @@@@@@ FAIL: test_35a failed with 6 |
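To make the failure path in the console log above easier to follow, here is a minimal sketch that pulls the reporting function and trailing return code out of each `LustreError` message. The helper name `error_chain` and both regular expressions are illustrative assumptions, not part of Lustre or its test framework; the trailing-rc match is a heuristic that only fits messages ending in a negative code.

```python
import re

# Console-log error format seen above:
#   "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>"
# (hypothetical pattern, written for this ticket's messages only)
ERR_PATTERN = re.compile(
    r"^LustreError: \d+:\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)
# Heuristic: a negative code at the very end, e.g. "rc = -2", ": -2" or "(-2)".
RC_PATTERN = re.compile(r"(-\d+)\)?\s*$")


def error_chain(messages):
    """Return [(function, rc), ...] for every LustreError message with a source location."""
    chain = []
    for message in messages:
        match = ERR_PATTERN.match(message)
        if match is None:
            continue  # not a LustreError carrying a (file:line:func()) tag
        rc_match = RC_PATTERN.search(match.group("msg"))
        rc = int(rc_match.group(1)) if rc_match else None
        chain.append((match.group("func"), rc))
    return chain


# Messages copied from the MDT restart in the log above.
console_log = [
    "LustreError: 11591:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2",
    "LustreError: 11537:0:(obd_mount_server.c:1274:server_start_targets()) lustre-MDT0000: failed to start LWP: -2",
    "LustreError: 11537:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -2",
    "Lustre: Failing over lustre-MDT0000",
    "LustreError: 11537:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount (-2)",
]

chain = error_chain(console_log)
for func, rc in chain:
    print(f"{func}: rc = {rc}")
```

Run on the restart portion of the log, this lists the same -2 travelling from `lustre_lwp_add_conn()` up through `server_start_targets()` and `server_fill_super()` to `lustre_fill_super()`.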
| Comments |
| Comment by Jodi Levi (Inactive) [ 21/Aug/13 ] |
|
Johann, |
| Comment by Johann Lombardi (Inactive) [ 21/Aug/13 ] |
|
We failed to set up the local LWP connection on the MDT. Niu actually wrote this code. From the log:
00000020:00000010:0.0:1376641955.674789:0:11896:0:(obd_mount_server.c:692:lustre_lwp_add_conn()) kmalloced 'bufs': 104 at ffff88030f528740.
00000020:00000001:0.0:1376641955.674790:0:11896:0:(lustre_cfg.h:235:lustre_cfg_new()) Process entered
00000020:00000001:0.0:1376641955.674791:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674792:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000001:0.0:1376641955.674793:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674794:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000001:0.0:1376641955.674795:0:11896:0:(lustre_cfg.h:216:lustre_cfg_len()) Process entered
00000020:00000001:0.0:1376641955.674796:0:11896:0:(lustre_cfg.h:222:lustre_cfg_len()) Process leaving (rc=88 : 88 : 58)
00000020:00000010:0.0:1376641955.674797:0:11896:0:(lustre_cfg.h:238:lustre_cfg_new()) kmalloced 'lcfg': 88 at ffff88030f397840.
00000020:00000001:0.0:1376641955.674799:0:11896:0:(lustre_cfg.h:251:lustre_cfg_new()) Process leaving (rc=18446612145454544960 : -131928255006656 : ffff88030f397840)
00000020:00000001:0.0:1376641955.674801:0:11896:0:(obd_config.c:777:class_add_conn()) Process entered
00000020:00000001:0.0:1376641955.674802:0:11896:0:(obd_class.h:961:obd_add_conn()) Process entered
00010000:00000001:0.0:1376641955.674803:0:11896:0:(ldlm_lib.c:67:import_set_conn()) Process entered
00000100:00000200:0.0:1376641955.674807:0:11896:0:(events.c:550:ptlrpc_uuid_to_peer()) 127.0.0.2@tcp->12345-ffffffec@<0:0>
00000100:00000100:0.0:1376641955.674808:0:11896:0:(client.c:84:ptlrpc_uuid_to_connection()) cannot find peer 127.0.0.2@tcp!
00010000:00080000:0.0:1376641955.674809:0:11896:0:(ldlm_lib.c:76:import_set_conn()) can't find connection 127.0.0.2@tcp
00010000:00000001:0.0:1376641955.674811:0:11896:0:(ldlm_lib.c:77:import_set_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00000001:0.0:1376641955.674812:0:11896:0:(obd_class.h:968:obd_add_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00000001:0.0:1376641955.674814:0:11896:0:(obd_config.c:802:class_add_conn()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000020:00020000:0.0:1376641955.674816:0:11896:0:(obd_mount_server.c:704:lustre_lwp_add_conn()) lustre-MDT0000-lwp-MDT0000: can't add conn: rc = -2
00000020:00000010:0.0:1376641955.685478:0:11896:0:(obd_mount_server.c:708:lustre_lwp_add_conn()) kfreed 'bufs': 104 at ffff88030f528740.
So we are trying to connect to 127.0.0.2, which is the fake failover address used in conf-sanity test 35a. No peer can be found for that NID, so import_set_conn() returns -2 and the error propagates back up to lustre_lwp_add_conn(). |
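Each "Process leaving" entry in the debug log above prints the same return value three ways: unsigned 64-bit, signed, and hex. As a rough sketch (the helper names `to_signed64` and `decode_rc` are made up for illustration and are not Lustre tooling), the following shows how the unsigned field maps back to the signed code and, for negative values, to an errno name:

```python
import errno
import re

# Matches the "(rc=<unsigned> : <signed> : <hex>)" suffix of a "Process leaving" line.
RC_PATTERN = re.compile(r"\(rc=(\d+) : (-?\d+) : ([0-9a-f]+)\)")


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed value."""
    return value - (1 << 64) if value >= (1 << 63) else value


def decode_rc(line):
    """Return (signed_rc, errno_name_or_None), or None if the line has no rc triple."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    unsigned, signed, hexval = match.groups()
    rc = to_signed64(int(unsigned))
    # All three fields should describe the same value; reject the line otherwise.
    if rc != int(signed) or rc != to_signed64(int(hexval, 16)):
        raise ValueError("inconsistent rc fields: " + match.group(0))
    # Negative return codes are negated errno values; anything else gets no name.
    name = errno.errorcode.get(-rc) if rc < 0 else None
    return rc, name


# Line copied from the debug log above.
line = (
    "00010000:00000001:0.0:1376641955.674811:0:11896:0:"
    "(ldlm_lib.c:77:import_set_conn()) Process leaving "
    "(rc=18446744073709551614 : -2 : fffffffffffffffe)"
)
result = decode_rc(line)
print(result)
```

For the `import_set_conn()` line this gives -2, which is ENOENT, consistent with the "cannot find peer 127.0.0.2@tcp!" message just before it.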
| Comment by Niu Yawei (Inactive) [ 22/Aug/13 ] |
|
This is actually because the fix 82a0cc9ee5489340406a6fc64494f37989099729 ( |
| Comment by Niu Yawei (Inactive) [ 22/Aug/13 ] |
|
backport fix of |
| Comment by Niu Yawei (Inactive) [ 17/Jul/17 ] |
|
The backport patch for b2_1 was ready, but I don't think we need it anymore. Closing this ticket. |