[LU-3101] Interop 1.8.9<->2.4 failure on test suite replay-single test_61d: cannot restart mgs Created: 03/Apr/13  Updated: 19/Aug/13  Resolved: 19/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

client: 1.8.9
server: lustre-master build #1346


Severity: 3
Rank (Obsolete): 7538

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/a0617196-9725-11e2-9ec7-52540035b04c.

The sub-test test_61d failed with the following error:

cannot restart mgs

MDS console shows:

00:09:37:Lustre: DEBUG MARKER: == replay-single test 61d: error in llog_setup should cleanup the llog context correctly == 00:09:35 (1364368175)
00:09:37:Lustre: DEBUG MARKER: grep -c /mnt/mds' ' /proc/mounts
00:09:37:Lustre: DEBUG MARKER: umount -d /mnt/mds
00:09:49:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
00:09:50:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000605
00:09:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds
00:09:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds; mount -t lustre -o loop  /dev/lvm-MDS/P1 /mnt/mds
00:09:50:LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
00:09:50:Lustre: *** cfs_fail_loc=605, val=0***
00:09:50:LustreError: 5059:0:(llog_obd.c:207:llog_setup()) MGS: ctxt 0 lop_setup=ffffffffa0631ce0 failed: rc = -95
00:09:50:LustreError: 5059:0:(obd_config.c:572:class_setup()) setup MGS failed (-95)
00:09:50:LustreError: 5059:0:(obd_mount.c:378:lustre_start_simple()) MGS setup error -95
00:09:50:LustreError: 15e-a: Failed to start MGS 'MGS' (-95). Is the 'mgs' module loaded?
00:09:50:LustreError: 5059:0:(obd_mount.c:1379:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
00:09:50:LustreError: 5059:0:(obd_mount.c:2115:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
00:09:50:LustreError: 5059:0:(obd_mount.c:2145:server_put_super()) no obd lustre-MDT0000
00:09:51:LustreError: 5059:0:(obd_mount.c:139:server_deregister_mount()) lustre-MDT0000 not registered
00:09:51:LustreError: 5059:0:(obd_mount.c:2989:lustre_fill_super()) Unable to mount /dev/loop0 (-95)
00:09:51:Lustre: DEBUG MARKER: lctl set_param fail_loc=0
00:09:51:Lustre: DEBUG MARKER: mkdir -p /mnt/mds
00:09:51:Lustre: DEBUG MARKER: mkdir -p /mnt/mds; mount -t lustre -o loop  /dev/lvm-MDS/P1 /mnt/mds
00:09:51:LustreError: 15d-9: The MGS service was already started from server
00:09:51:LustreError: 5228:0:(obd_mount.c:1379:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
00:09:51:LustreError: 5228:0:(obd_mount.c:2115:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
00:09:51:LustreError: 5228:0:(obd_mount.c:2145:server_put_super()) no obd lustre-MDT0000
00:09:51:LustreError: 5228:0:(obd_mount.c:139:server_deregister_mount()) lustre-MDT0000 not registered
00:09:51:LustreError: 5228:0:(obd_mount.c:2989:lustre_fill_super()) Unable to mount  (-114)
00:09:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_61d: @@@@@@ FAIL: cannot restart mgs 
00:09:51:Lustre: DEBUG MARKER: replay-single test_61d: @@@@@@ FAIL: cannot restart mgs
00:09:51:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2013-03-26/lustre-master-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--2_4_1__1346__-70011898121780-141237/replay-single.test_61d.debug_log.$(hostname -s).1364368184.log;
00:09:51:         dmesg > /logdir/test_logs/2013-03-26/lu
00:09:51:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
00:09:51:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && echo $(< /proc/sys/lnet/catastrophe) || echo 0);
00:09:51:if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
00:09:51:exit $rc;
00:09:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 62: don\'t mis-drop resent replay == 00:09:46 \(1364368186\)
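
Pieced together from the console log above, the failing sequence amounts to a two-step mount on the combined MGS/MDS node: the first mount is expected to fail because the injected fault (cfs_fail_loc 605) makes llog_setup() return -95 during MGS setup, and the retry after clearing fail_loc should bring the MGS back but instead fails with -114. A minimal reproduction sketch, using the device and mount point names from this run:

# Fault injection and mount commands as they appear in the DEBUG MARKER lines above.
lctl set_param fail_loc=0x80000605               # make llog_setup() fail during MGS setup
mkdir -p /mnt/mds
mount -t lustre -o loop /dev/lvm-MDS/P1 /mnt/mds # fails with -95 (EOPNOTSUPP), as intended
lctl set_param fail_loc=0                        # clear the fault injection
mount -t lustre -o loop /dev/lvm-MDS/P1 /mnt/mds # should restart the MGS, but fails with
                                                 # -114 (EALREADY): "MGS service was already started"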


 Comments   
Comment by Peter Jones [ 04/Apr/13 ]

Hongchao

Could you please investigate?

Thanks

Peter

Comment by Hongchao Zhang [ 12/Apr/13 ]

The issue reproduces on master locally; it is caused by incorrect cleanup after the MGS fails to start up.
The patch is tracked at http://review.whamcloud.com/#change,6035
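
If the root cause is leftover MGS state from the failed start, the stale device should still be visible on the MDS node after the first mount attempt, which would explain the -114 (EALREADY) on the retry. A quick check along those lines (a sketch, not part of the test script):

# A leftover MGS obd device after the failed mount would block a second MGS start.
lctl dl | grep -i mgs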

Comment by Jian Yu [ 14/Aug/13 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/31/

replay-single test 61d hit the same failure:
https://maloo.whamcloud.com/test_sets/cf9987d6-0486-11e3-90ba-52540035b04c

Hi Oleg,
Could you please cherry-pick the patch to the Lustre b2_4 branch? Thanks.

Comment by Hongchao Zhang [ 19/Aug/13 ]

The patch has landed on master.

Comment by Jian Yu [ 19/Aug/13 ]

The patch was also cherry-picked to the Lustre b2_4 branch.
