[LU-4190] LustreError: 18166:0:(genops.c:1570:obd_exports_barrier()) ASSERTION( list_empty(&obd->obd_exports) ) failed: Created: 30/Oct/13  Updated: 15/Dec/19  Resolved: 15/Dec/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: yueyuling Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 2.4.0, with 2 servers and 1 client; kernel version: 2.6.32-358.6.2.l2.08
MGSnode: MGS, 1 MDT and 4 OSTs
Failnode: 1 MDT and 4 OSTs


Issue Links:
Related
is related to LU-4916 mount failure when adding failover no... Resolved
Severity: 3
Rank (Obsolete): 11330

 Description   

The 2 servers work normally in active-active status.
1. Mount the Lustre FS on the client and write and read data;
2. Umount the MDT on the Failnode;
3. Read data on the client from the Lustre FS: successful;
4. Mount the MDT on the Failnode;
5. Umount the MDT on the MGSnode;
6. Read data on the client from the Lustre FS: failed;
7. Mount the MDT on the MGSnode; the MGSnode then crashes and prints the following:

LustreError: 18166:0:(genops.c:320:class_newdev()) Device MGC192.168.22.50@tcp already exists at 2, won't add
LustreError: 18166:0:(obd_config.c:374:class_attach()) Cannot create device MGC192.168.22.50@tcp of type mgc : -17
LustreError: 18166:0:(obd_mount.c:196:lustre_start_simple()) MGC192.168.22.50@tcp attach error -17
LustreError: 18166:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
LustreError: 18166:0:(obd_mount_server.c:1426:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
LustreError: 18166:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-MDT0000
LustreError: 18166:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-MDT0000 not registered
LustreError: 18166:0:(genops.c:1570:obd_exports_barrier()) ASSERTION( list_empty(&obd->obd_exports) ) failed:
LustreError: 18166:0:(genops.c:1570:obd_exports_barrier()) LBUG
Pid: 18166, comm: mount.lustre

Call Trace:
[<ffffffffa070a8a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa070aeb7>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0813d91>] obd_exports_barrier+0x181/0x190 [obdclass]
[<ffffffffa0f1e886>] mgs_device_fini+0xf6/0x5c0 [mgs]
[<ffffffffa083e837>] class_cleanup+0x817/0xe00 [obdclass]
[<ffffffffa0817e2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
[<ffffffffa0842e9b>] class_process_config+0x1b6b/0x2f60 [obdclass]
[<ffffffffa070bb90>] ? cfs_alloc+0x30/0x60 [libcfs]
[<ffffffffa0844723>] class_manual_cleanup+0x493/0xe80 [obdclass]
[<ffffffff8147a1fe>] ? _read_unlock+0xe/0x10
[<ffffffffa0817e2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
[<ffffffffa087fb9d>] server_put_super+0x42d/0x2580 [obdclass]
[<ffffffffa0882440>] server_fill_super+0x750/0x1580 [obdclass]
[<ffffffffa084fc98>] lustre_fill_super+0x1d8/0x530 [obdclass]
[<ffffffffa084fac0>] ? lustre_fill_super+0x0/0x530 [obdclass]
[<ffffffff8114d21f>] get_sb_nodev+0x5f/0xa0
[<ffffffffa08473f5>] lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff8114c74b>] vfs_kern_mount+0x7b/0x1b0
[<ffffffff8114c8f2>] do_kern_mount+0x52/0x130
[<ffffffff81168912>] do_mount+0x2d2/0x8c0
[<ffffffff81168f90>] sys_mount+0x90/0xe0
[<ffffffff81002f5b>] system_call_fastpath+0x16/0x1b

Message from Kernel panic - not syncing: LBUG
Pid: 18166, comm: mount.lustre Tainted: GF --------------- 2.6.32-358.6.2.l2.08 #2
Call Trace:
[<ffffffff81476fa7>] ? panic+0xa1/0x163
[<ffffffffa070af0b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0813d91>] ? obd_exports_barrier+0x181/0x190 [obdclass]
[<ffffffffa0f1e886>] ? mgs_device_fini+0xf6/0x5c0 [mgs]
[<ffffffffa083e837>] ? class_cleanup+0x817/0xe00 [obdclass]
[<ffffffffa0817e2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
[<ffffffffa0842e9b>] ? class_process_config+0x1b6b/0x2f60 [obdclass]
[<ffffffffa070bb90>] ? cfs_alloc+0x30/0x60 [libcfs]
[<ffffffffa0844723>] ? class_manual_cleanup+0x493/0xe80 [obdclass]
[<ffffffff8147a1fe>] ? _read_unlock+0xe/0x10
[<ffffffffa0817e2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
[<ffffffffa087fb9d>] ? server_put_super+0x42d/0x2580 [obdclass]
[<ffffffffa0882440>] ? server_fill_super+0x750/0x1580 [obdclass]
[<ffffffffa084fc98>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
[<ffffffffa084fac0>] ? lustre_fill_super+0x0/0x530 [obdclass]
[<ffffffff8114d21f>] ? get_sb_nodev+0x5f/0xa0
[<ffffffffa08473f5>] ? lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff8114c74b>] ? vfs_kern_mount+0x7b/0x1b0
[<ffffffff8114c8f2>] ? do_kern_mount+0x52/0x130
[<ffffffff81168912>] ? do_mount+0x2d2/0x8c0
[<ffffffff81168f90>] ? sys_mount+0x90/0xe0
[<ffffffff81002f5b>] ? system_call_fastpath+0x16/0x1b
*******show para for nt_memcpy16*******
src: ffff8802e118fc40, dst: ffffc901125a8d70, len: 80
*******show para for panic done*******



 Comments   
Comment by Andreas Dilger [ 30/Oct/13 ]

The 2 servers work normally in active-active status.
1. Mount the Lustre FS on the client and write and read data;
2. Umount the MDT on the Failnode;
3. Read data on the client from the Lustre FS: successful;
4. Mount the MDT on the Failnode;
5. Umount the MDT on the MGSnode;
6. Read data on the client from the Lustre FS: failed;

Are you mounting the same MDT device (lustre-MDT0000) on both nodes? That is bad and will lead to filesystem corruption. You should only mount it on one MDS node at a time. I suggest you enable "MMP" on your devices with "tune2fs -O mmp /dev/<mdt_or_ost_device>" (this happens automatically if you format the filesystem with --failnode).
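
As a hedged illustration of the suggestion above (the device paths and NIDs are placeholders, not taken from this ticket), enabling MMP on an already formatted target, or having it enabled automatically at format time, looks roughly like this:

# enable multiple-mount protection on an existing target (run while the target is unmounted)
tune2fs -O mmp /dev/mdt_or_ost_device
# MMP is enabled automatically when a failover NID is given at format time, e.g.:
mkfs.lustre --mdt --index=0 --fsname=lustre --mgsnode=<mgs_nid> --failnode=<failover_nid> /dev/mdt_device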

Comment by yueyuling [ 31/Oct/13 ]

Thank you for your response! But I didn't mount the same MDT device on both nodes. There are two MDT devices in my Lustre FS, and I mount one MDT device on each node.
So I have revised my description as follows:
The 2 servers work normally in active-active status.
1. Mount the MGS, MDT0000 and 4 OSTs on the MGSnode, mount MDT0001 and the other 4 OSTs on the Failnode, then mount the Lustre FS on the client and write and read data;
2. Umount MDT0001 on the Failnode;
3. Read data on the client from the Lustre FS: successful;
4. Mount MDT0001 on the Failnode;
5. Umount MDT0000 on the MGSnode;
6. Read data on the client from the Lustre FS: failed;
7. Mount MDT0000 on the MGSnode; the MGSnode crashes with the output shown above. (A hedged command sketch of this sequence is shown below.)
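
For readability, a minimal command sketch of the sequence above follows. The hostnames, device paths, and mount points are hypothetical (not taken from this ticket), and the OST mounts are omitted for brevity:

# Step 1: initial mounts
[root@mgsnode ~]# mount -t lustre /dev/mgs_dev /mnt/mgs
[root@mgsnode ~]# mount -t lustre /dev/mdt0_dev /mnt/mdt0
[root@failnode ~]# mount -t lustre /dev/mdt1_dev /mnt/mdt1
[root@client ~]# mount -t lustre mgsnode@tcp:/lustre /mnt/lustre   # then write and read data
# Steps 2-4: cycle MDT0001 on the Failnode, reading from the client in between
[root@failnode ~]# umount /mnt/mdt1
[root@failnode ~]# mount -t lustre /dev/mdt1_dev /mnt/mdt1
# Steps 5-7: cycle MDT0000 on the MGSnode; the crash happens at the remount
[root@mgsnode ~]# umount /mnt/mdt0
[root@mgsnode ~]# mount -t lustre /dev/mdt0_dev /mnt/mdt0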

Comment by Di Wang [ 05/Apr/14 ]

I tried this test on current master.

MDT1

[root@client-2 ~]# mkfs.lustre --reformat --mgs --mdt --index=0 --fsname lustre --failnode=10.10.4.3@tcp /dev/disk/by-id/scsi-1IET_00040001

MDT2

[root@client-3 ~]#  mkfs.lustre --reformat --mgsnode=10.10.4.2@tcp --mgsnode=10.10.4.3@tcp --mdt --index=1 --fsname lustre  --failnode=10.10.4.2@tcp /dev/disk/by-id/scsi-1IET_00020001 

But unfortunately it failed when I tried to mount mdt2:

[root@client-3 ~]# mount -t lustre /dev/disk/by-id/scsi-1IET_00020001 /mnt/mds2/
mount.lustre: mount /dev/sdj at /mnt/mds2 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@client-3 ~]# 
...
LDISKFS-fs (sdj): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: srv-lustre-MDT0001: No data found on store. Initialize space
Lustre: lustre-MDT0001: new disk, initializing
LustreError: 11-0: lustre-MDT0000-osp-MDT0001: Communicating with 10.10.4.2@tcp, operation mds_connect failed with -11.
LustreError: 13a-8: Failed to get MGS log params and no local copy.
LustreError: 2354:0:(obd_mount_server.c:699:lustre_lwp_add_conn()) lustre-MDT0001: can't find lwp device.
LustreError: 15c-8: MGC10.10.4.2@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 2242:0:(obd_mount_server.c:1321:server_start_targets()) lustre-MDT0001: failed to start LWP: -2
LustreError: 2242:0:(obd_mount_server.c:1776:server_fill_super()) Unable to start targets: -2
Lustre: Failing over lustre-MDT0001
Lustre: server umount lustre-MDT0001 complete
LustreError: 2242:0:(obd_mount.c:1338:lustre_fill_super()) Unable to mount  (-2)

config log

[root@client-2 ~]# llog_reader /mnt/mds1/CONFIGS/lustre-client 
Header size : 8192
Time : Fri Apr  4 20:36:36 2014
Number of records: 30
Target uuid : config_uuid 
-----------------------
#01 (224)marker   4 (flags=0x01, v2.5.57.0) lustre-clilov   'lov setup' Fri Apr  4 20:36:36 2014-
#02 (120)attach    0:lustre-clilov  1:lov  2:lustre-clilov_UUID  
#03 (168)lov_setup 0:lustre-clilov  1:(struct lov_desc)
		uuid=lustre-clilov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
#04 (224)marker   4 (flags=0x02, v2.5.57.0) lustre-clilov   'lov setup' Fri Apr  4 20:36:36 2014-
#05 (224)marker   5 (flags=0x01, v2.5.57.0) lustre-clilmv   'lmv setup' Fri Apr  4 20:36:36 2014-
#06 (120)attach    0:lustre-clilmv  1:lmv  2:lustre-clilmv_UUID  
#07 (168)lov_setup 0:lustre-clilmv  1:(struct lov_desc)
		uuid=lustre-clilmv_UUID  stripe:cnt=0 size=0 offset=0 pattern=0
#08 (224)marker   5 (flags=0x02, v2.5.57.0) lustre-clilmv   'lmv setup' Fri Apr  4 20:36:36 2014-
#09 (224)marker   6 (flags=0x01, v2.5.57.0) lustre-MDT0000  'add mdc' Fri Apr  4 20:36:36 2014-
#10 (080)add_uuid  nid=10.10.4.2@tcp(0x200000a0a0402)  0:  1:10.10.4.2@tcp  
#11 (128)attach    0:lustre-MDT0000-mdc  1:mdc  2:lustre-clilmv_UUID  
#12 (136)setup     0:lustre-MDT0000-mdc  1:lustre-MDT0000_UUID  2:10.10.4.2@tcp  
#13 (080)add_uuid  nid=10.10.4.3@tcp(0x200000a0a0403)  0:  1:10.10.4.3@tcp  
#14 (104)add_conn  0:lustre-MDT0000-mdc  1:10.10.4.3@tcp  
#15 (160)modify_mdc_tgts add 0:lustre-clilmv  1:lustre-MDT0000_UUID  2:0  3:1  4:lustre-MDT0000-mdc_UUID  
#16 (224)marker   6 (flags=0x02, v2.5.57.0) lustre-MDT0000  'add mdc' Fri Apr  4 20:36:36 2014-
#17 (224)marker   7 (flags=0x01, v2.5.57.0) lustre-client   'mount opts' Fri Apr  4 20:36:36 2014-
#18 (120)mount_option 0:  1:lustre-client  2:lustre-clilov  3:lustre-clilmv  
#19 (224)marker   7 (flags=0x02, v2.5.57.0) lustre-client   'mount opts' Fri Apr  4 20:36:36 2014-
#20 (224)marker  11 (flags=0x01, v2.5.57.0) lustre-MDT0001  'add mdc' Fri Apr  4 20:50:05 2014-
#21 (080)add_uuid  nid=10.10.4.3@tcp(0x200000a0a0403)  0:  1:10.10.4.3@tcp  
#22 (128)attach    0:lustre-MDT0001-mdc  1:mdc  2:lustre-clilmv_UUID  
#23 (136)setup     0:lustre-MDT0001-mdc  1:lustre-MDT0001_UUID  2:10.10.4.3@tcp  
#24 (080)add_uuid  nid=10.10.4.2@tcp(0x200000a0a0402)  0:  1:10.10.4.2@tcp  
#25 (104)add_conn  0:lustre-MDT0001-mdc  1:10.10.4.2@tcp  
#26 (160)modify_mdc_tgts add 0:lustre-clilmv  1:lustre-MDT0001_UUID  2:1  3:1  4:lustre-MDT0001-mdc_UUID  
#27 (224)marker  11 (flags=0x02, v2.5.57.0) lustre-MDT0001  'add mdc' Fri Apr  4 20:50:05 2014-
#28 (224)marker  12 (flags=0x01, v2.5.57.0) lustre-client   'mount opts' Fri Apr  4 20:50:05 2014-
#29 (120)mount_option 0:  1:lustre-client  2:lustre-clilov  3:lustre-clilmv  
#30 (224)marker  12 (flags=0x02, v2.5.57.0) lustre-client   'mount opts' Fri Apr  4 20:50:05 2014-

It might be related to the change http://review.whamcloud.com/7666. Fan Yong, could you please comment here? Thanks!

Comment by nasf (Inactive) [ 10/Apr/14 ]

The original issue happened on Lustre 2.4, but the patch http://review.whamcloud.com/#/c/7666/ has only been applied to Lustre 2.6, so even if that patch has some issues, it should not affect Lustre 2.4, right?

Comment by Di Wang [ 16/Apr/14 ]

Oh, I am not asking about the original issue shown in this ticket, but about the failure I met in my test, which stops me from continuing the test on 2.6. Hmm, I will create a new ticket then.

Comment by yueyuling [ 17/Apr/14 ]

In addition, I've created the MGS, MDT0000 and MDT0001 separately on different devices. So the MGS and MDT0000 are on different devices.

Comment by Andreas Dilger [ 02/May/14 ]

Mike, could you please try configuring a test system as described here to see if a similar problem still exists in master? This seems similar to the failure in LU-4916.

Comment by Mikhail Pershin [ 10/May/14 ]

I've tried to repeat those steps after the LU-4916 fix and everything works. Please clarify how you wrote/read data from the client; I'd like to repeat all the steps as closely as possible.

Comment by Jodi Levi (Inactive) [ 12/May/14 ]

Duplicate of LU-4916

Comment by Mikhail Pershin [ 13/May/14 ]

Jodi, this is not a duplicate of LU-4916; it was just blocked by LU-4916. The reported bug happened in Lustre 2.4, and LU-4916 doesn't even exist there. It looks like this issue doesn't exist in current master and is not a blocker for 2.6, but it exists in 2.4 as reported.

Comment by Doug Oucharek (Inactive) [ 13/May/14 ]

This is not a duplicate of LU-4916. LU-4916 blocks the ability to reproduce this issue, but does not resolve it. As such, I am reopening and giving it a lower priority (since it cannot be reproduced thanks to LU-4916).

Comment by yueyuling [ 14/May/14 ]

The steps for writing/reading data (a command sketch follows below):
1. Mount the Lustre FS on the client and write and read data:
Repeat the following 100 times:
a. Create a new directory;
b. Copy data from the client to the Lustre FS, 5 files per directory, each file 1.2 GB;
c. Read the files in the directory with MD5 and record the MD5 values;
3. Read data on the client from the Lustre FS, successfully:
Use MD5 to read the files that were written in step 1 and compare the values.
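
A minimal sketch of this workload, assuming hypothetical paths /data/src (source files on the client) and /mnt/lustre (the client mount point); it only illustrates the steps described above and is not the reporter's actual script:

# step 1: write 100 directories of 5 files each and record checksums
for i in $(seq 1 100); do
    mkdir /mnt/lustre/dir$i
    cp /data/src/file{1..5} /mnt/lustre/dir$i/          # 5 files, ~1.2 GB each
    md5sum /mnt/lustre/dir$i/* >> /root/md5.write.log   # record MD5 values right after writing
done
# reproduce step 3: re-read the same files and compare the checksums
md5sum /mnt/lustre/dir*/* > /root/md5.read.log
diff <(sort /root/md5.write.log) <(sort /root/md5.read.log)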

Comment by Mikhail Pershin [ 15/Dec/19 ]

Outdated
