|
The system was using the default e2fsprogs until very recently. We have now installed 1.42.7.wc1 with the patch to get the quota working.
The filesystem originally had both of the servers each target could run on listed as failnodes, rather than just the other node in the pair.
At one point we had 26 out of 30 OSTs available, but the last 4 would not come online (this was after a period when only those 4 were working).
We replaced the kernel modules and user-space tools after LU-4111; however, this configuration has worked previously on three other filesystems.
At present, when we mount an OST it does not come UP in the MDT's lctl device list but stays in AT, e.g.
root@lus04-mds1:~# lctl dl
0 UP mgs MGS MGS 7
1 UP mgc MGC172.17.148.4@tcp 3f2155e6-d9ad-156f-d431-406faeb78ef5 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 AT osc lus04-OST0000-osc lus04-mdtlov_UUID 1
We have downgraded the system to 1.8.9wc1 and it behaves in the same way.
|
|
MGS/MDT mount logs
Oct 18 14:29:06 lus04-mds1 kernel: [ 7549.035620] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
Oct 18 14:29:06 lus04-mds1 kernel: [ 7549.081427] Lustre: Added LNI 172.17.148.4@tcp [8/256/0/180]
Oct 18 14:29:06 lus04-mds1 kernel: [ 7549.081520] Lustre: Accept secure, port 988
Oct 18 14:29:06 lus04-mds1 kernel: [ 7549.135974] Lustre: Lustre Client File System; http://www.lustre.org/
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.145508] LDISKFS-fs (dm-1): barriers disabled
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.147409] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.261222] LDISKFS-fs (dm-1): barriers disabled
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.262804] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.289975] Lustre: MGS MGS started
Oct 18 14:29:07 lus04-mds1 kernel: [ 7549.290696] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.133538] LDISKFS-fs (dm-0): barriers disabled
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.136647] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.265245] LDISKFS-fs (dm-0): barriers disabled
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.267566] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.298400] LustreError: 13c-e: Client log lus04-client has disappeared! Regenerating all logs.
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.298589] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.313147] Lustre: Enabling user_xattr
Oct 18 14:29:16 lus04-mds1 kernel: [ 7558.343685] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/mapper/lus04--mdt0-lus04 with recovery enabled
Oct 18 14:29:47 lus04-mds1 kernel: [ 7590.037063] Lustre: MGS: Regenerating lus04-OST0000 log by user request.
Oct 18 14:29:51 lus04-mds1 kernel: [ 7593.422391] LustreError: 21629:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
Oct 18 14:29:51 lus04-mds1 kernel: [ 7593.422561] LustreError: 21629:0:(obd_config.c:372:class_setup()) setup lus04-OST0000-osc failed (-2)
Oct 18 14:29:51 lus04-mds1 kernel: [ 7593.422695] LustreError: 21629:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Oct 18 14:29:51 lus04-mds1 kernel: [ 7593.422832] Lustre: cmd=cf003 0:lus04-OST0000-osc 1:lus04-OST0000_UUID 2:0@<0:0>
OSS mount logs
Oct 18 14:29:47 lus04-oss1 kernel: [11025.183166] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
Oct 18 14:29:47 lus04-oss1 kernel: [11025.229796] Lustre: Added LNI 172.17.148.6@tcp [8/256/0/180]
Oct 18 14:29:47 lus04-oss1 kernel: [11025.229842] Lustre: Accept secure, port 988
Oct 18 14:29:47 lus04-oss1 kernel: [11025.285325] Lustre: Lustre Client File System; http://www.lustre.org/
Oct 18 14:29:47 lus04-oss1 kernel: [11025.370273] LDISKFS-fs (dm-14): barriers disabled
Oct 18 14:29:47 lus04-oss1 kernel: [11025.392095] LDISKFS-fs (dm-14): mounted filesystem with ordered data mode
Oct 18 14:29:47 lus04-oss1 kernel: [11025.578277] LDISKFS-fs (dm-14): barriers disabled
Oct 18 14:29:47 lus04-oss1 kernel: [11025.587805] LDISKFS-fs (dm-14): mounted filesystem with ordered data mode
Oct 18 14:29:47 lus04-oss1 kernel: [11025.601175] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 18 14:29:47 lus04-oss1 kernel: [11025.644539] Lustre: Filtering OBD driver; http://wiki.whamcloud.com/
Oct 18 14:29:48 lus04-oss1 kernel: [11026.007326] Lustre: lus04-OST0000: Now serving lus04-OST0000 on /dev/mapper/vd00 with recovery enabled
root@lus04-oss1:/# cat /proc/fs/lustre/obdfilter/lus04-OST0000/recovery_status
status: INACTIVE
|
|
Bob,
Could you please have a look at this one?
Thank you!
|
|
Looking into possible causes. I suspect it may be a side effect of upgrading/downgrading between 1.8.9 and 2.4.1. Hope to have more and better information to give you soon.
|
|
Bob, we had the same symptoms when the system was running 2.4.1.
I can try to generate a log of the discussion with DDN and email it, or work out how to attach it as a private file.
|
|
What state is your fs in now? All unmounted? All mounted but some OSTs inactive? Just the MDT mounted?
|
|
We have the MGS and MDT mounted and a single OSS.
We have stopped there, as lctl dl on the MGS/MDT shows:
0 UP mgs MGS MGS 7
1 UP mgc MGC172.17.148.4@tcp 3f2155e6-d9ad-156f-d431-406faeb78ef5 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 AT osc lus04-OST0000-osc lus04-mdtlov_UUID 1
Note the OSC for the OST is stuck in AT, not UP.
|
|
Additionally, when we try to mount the filesystem from a client:
root@isg-disc-mon-05:~# dmesg
[127708.147231] Lustre: MGC172.17.148.4@tcp: Reactivating import
[127708.149699] LustreError: 2312:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[127708.149912] LustreError: 2312:0:(obd_config.c:372:class_setup()) setup lus04-MDT0000-mdc-ffff8806181bcc00 failed (-2)
[127708.150204] LustreError: 2312:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
[127708.151628] Lustre: cmd=cf003 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
[127708.151957] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[127708.152374] LustreError: 2302:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -2
[127708.152695] LustreError: 2302:0:(obd_config.c:443:class_cleanup()) Device 2 not setup
[127708.152975] LustreError: 2302:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
[127708.153170] LustreError: 2302:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
[127708.155666] Lustre: client lus04-client(ffff8806181bcc00) umount complete
[127708.155791] LustreError: 2302:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-2)
|
|
A couple of things to try and verify. Some of these may be repeats of things you've already done.
Confirm you have the same version of e2fsprogs on all your servers.
Unmount everything.
Bring up at least LNet everywhere with 'modprobe lnet'; it should already be loaded on nodes where you have attempted mounts. Get the list of NIDs on every server with 'lctl list_nids' and check that all the NIDs shown look real and sensible. The error message "Lustre: cmd=cf003 0:lus04-OST0000-osc 1:lus04-OST0000_UUID 2:0@<0:0>" in particular suggests something wrong with the NIDs; 2:0 doesn't look like a sensible value.
Examine the config log with llog_reader.
Mount the MGS with -t ldiskfs.
Do 'llog_reader <mountpoint>/CONFIGS/$FSNAME-client', where $FSNAME is your fs name (lus04?); see the sketch after this list for the sequence.
Unmount it again.
Do a fresh round of tunefs.lustre --writeconf on all your MDTs and OSTs.
Try mounting again, MGS/MDT first, then OSTs in index order.
Worst case, if we can't repair this fs, would it be possible to reformat? That seems like the most drastic solution, but the most likely to work.
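For the config-log inspection mentioned above, a rough sketch (placeholder device and mountpoint names; adjust to your setup):
# mount the MGS backing device as ldiskfs (read-only is fine for inspection)
mount -t ldiskfs -o ro <mgs-device> /mnt/mgs
llog_reader /mnt/mgs/CONFIGS/lus04-client
umount /mnt/mgs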
|
|
"A couple of things to try and verify. Some of these may be repeats of things you've already done.
Confirm you have the same version of e2fsprogs on all your servers."
lus04-mds1: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-mds1: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
lus04-mds2: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-mds2: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
lus04-oss1: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-oss1: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
lus04-oss2: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-oss2: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
lus04-oss3: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-oss3: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
lus04-oss4: e2fsck 1.42.7.wc1 (12-Apr-2013)
lus04-oss4: Using EXT2FS Library version 1.42.7.wc1, 12-Apr-2013
One thing I have noticed is that I can't list the NIDs if I only modprobe lnet rather than lustre.
root@it-admin:~# dsh -M -g lus04 "modprobe lnet"
root@it-admin:~# dsh -M -g lus04 "lctl list_nids"
lus04-mds1: 172.17.148.4@tcp
lus04-mds2: IOC_LIBCFS_GET_NI error 100: Network is down
lus04-oss1: 172.17.148.6@tcp
lus04-oss2: IOC_LIBCFS_GET_NI error 100: Network is down
lus04-oss3: IOC_LIBCFS_GET_NI error 100: Network is down
lus04-oss4: IOC_LIBCFS_GET_NI error 100: Network is down
root@it-admin:~# dsh -M -g lus04 "modprobe lustre"
root@it-admin:~# dsh -M -g lus04 "lctl list_nids"
lus04-mds1: 172.17.148.4@tcp
lus04-mds2: 172.17.148.5@tcp
lus04-oss1: 172.17.148.6@tcp
lus04-oss2: 172.17.148.7@tcp
lus04-oss3: 172.17.148.8@tcp
lus04-oss4: 172.17.148.9@tcp
llog_reader CONFIGS/lus04-client
Header size : 8192
Time : Fri Oct 18 14:29:16 2013
Number of records: 18
Target uuid : config_uuid
-----------------------
#01 (224)marker 3 (flags=0x01, v1.8.9.0) lus04-clilov 'lov setup' Fri Oct 18 14:29:16 2013-
#02 (120)attach 0:lus04-clilov 1:lov 2:lus04-clilov_UUID
#03 (168)lov_setup 0:lus04-clilov 1:(struct lov_desc)
uuid=lus04-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
#04 (224)marker 3 (flags=0x02, v1.8.9.0) lus04-clilov 'lov setup' Fri Oct 18 14:29:16 2013-
#05 (224)marker 4 (flags=0x01, v1.8.9.0) lus04-MDT0000 'add mdc' Fri Oct 18 14:29:16 2013-
#06 (128)attach 0:lus04-MDT0000-mdc 1:mdc 2:lus04-MDT0000-mdc_UUID
#07 (128)setup 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
#08 (088)add_uuid nid=172.17.148.5@tcp(0x20000ac119405) 0: 1:172.17.148.5@tcp
#09 (112)add_conn 0:lus04-MDT0000-mdc 1:172.17.148.5@tcp
#10 (128)mount_option 0: 1:lus04-client 2:lus04-clilov 3:lus04-MDT0000-mdc
#11 (224)marker 4 (flags=0x02, v1.8.9.0) lus04-MDT0000 'add mdc' Fri Oct 18 14:29:16 2013-
#12 (224)marker 7 (flags=0x01, v1.8.9.0) lus04-OST0000 'add osc' Fri Oct 18 14:29:47 2013-
#13 (128)attach 0:lus04-OST0000-osc 1:osc 2:lus04-clilov_UUID
#14 (128)setup 0:lus04-OST0000-osc 1:lus04-OST0000_UUID 2:0@<0:0>
#15 (088)add_uuid nid=172.17.148.7@tcp(0x20000ac119407) 0: 1:172.17.148.7@tcp
#16 (112)add_conn 0:lus04-OST0000-osc 1:172.17.148.7@tcp
#17 (128)lov_modify_tgts add 0:lus04-clilov 1:lus04-OST0000_UUID 2:0 3:1
#18 (224)marker 7 (flags=0x02, v1.8.9.0) lus04-OST0000 'add osc' Fri Oct 18 14:29:47 2013-
I will do the "tunefs.lustre --writeconf" and report back later.
While the filesystem does not contain irreplaceable information and could be regenerated, it was in production, and we would prefer to copy the data off to another system before reformatting it.
|
|
Oh, sorry. I left out a step. After doing 'modprobe lnet' do 'lctl net up'. That should enable the following lctl cmd to work.
|
|
"Oh, sorry. I left out a step. After doing 'modprobe lnet' do 'lctl net up'. That should enable the following lctl cmd to work."
root@it-admin:~# dsh -M -g lus04 "lustre_rmmod"
root@it-admin:~# dsh -M -g lus04 "modprobe lnet"
root@it-admin:~# dsh -M -g lus04 "lctl net up"
lus04-mds1: LNET configured
lus04-mds2: LNET configured
lus04-oss1: LNET configured
lus04-oss2: LNET configured
lus04-oss3: LNET configured
lus04-oss4: LNET configured
root@it-admin:~# dsh -M -g lus04 "lctl list_nids"
lus04-mds1: 172.17.148.4@tcp
lus04-mds2: 172.17.148.5@tcp
lus04-oss1: 172.17.148.6@tcp
lus04-oss2: 172.17.148.7@tcp
lus04-oss3: 172.17.148.8@tcp
lus04-oss4: 172.17.148.9@tcp
I will start off with the "tunefs.lustre --writeconf". You say MDTs and OSTs; do you want the MGS done as well?
|
|
Yes, please, all devices, MGS as well. You have a separate MGS and MDT? I was just assuming the MGS/MDT was a combo device.
|
|
Commands run and sample output for MGS/MDT/OSS writeconf
tunefs.lustre --writeconf --mgs --failnode 172.17.148.5 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=172.17.148.5@tcp
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=172.17.148.5@tcp failover.node=172.17.148.5@tcp
Writing CONFIGS/mountdata
tunefs.lustre --writeconf --erase-params --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --failnode=172.17.148.5@tcp /dev/lus04-mdt0/lus04
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1001
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.5@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1141
(MDT update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.5@tcp
Writing CONFIGS/mountdata
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd15
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd17
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd19
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd21
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd23
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd25
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd27
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.8 --ost /dev/mapper/vd29
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd16
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd18
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd20
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd22
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd24
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd26
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.9 --ost /dev/mapper/vd28
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd01
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd03
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd05
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd07
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd09
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd11
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6 --ost /dev/mapper/vd13
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd00
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd02
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd04
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd06
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd08
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd10
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd12
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.7 --ost /dev/mapper/vd14
Sample output
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1102
(OST writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.7@tcp
Permanent disk data:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.8@tcp
Writing CONFIGS/mountdata
|
|
Possibly need a second opinion, but those params don't look at all right to me. I see 2 --mgsnode options + a --failnode option. I think there should be only 1 --mgsnode + 1 --failnode.
Just for simplification purposes could you try just setting 1 --mgsnode and no --failnode? I understand this probably isn't your desired target config, but in the interest of getting the fs back up so you can copy files off, simpler is better. If we can get that to work we can work on turning failover back on again later.
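As a sketch of what I mean, reusing device paths from your earlier output (not necessarily your final config):
# MDT: single MGS NID, no failover
tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp /dev/lus04-mdt0/lus04
# each OST likewise, e.g.
tunefs.lustre --writeconf --erase-params --ost --mgsnode=172.17.148.4@tcp /dev/mapper/vd00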
|
|
root@lus04-mds1:~# dmesg -c
root@lus04-mds1:~# mount -t lustre /dev/lus04-mgs0/lus04 /export/MGS
root@lus04-mds1:~# mount -t lustre /dev/lus04-mdt0/lus04 /export/MDT0
root@lus04-mds1:~# dmesg
[38471.319550] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
[38471.415542] Lustre: Lustre Client File System; http://www.lustre.org/
[38471.424833] LDISKFS-fs (dm-1): barriers disabled
[38471.426341] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
[38471.524900] LDISKFS-fs (dm-1): barriers disabled
[38471.526415] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
[38471.553022] Lustre: MGS MGS started
[38471.553767] Lustre: MGC172.17.148.4@tcp: Reactivating import
[38484.993123] LDISKFS-fs (dm-0): barriers disabled
[38484.996208] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
[38485.108837] LDISKFS-fs (dm-0): barriers disabled
[38485.111140] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
[38485.142781] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
[38485.156338] Lustre: Enabling user_xattr
[38485.170944] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/mapper/lus04--mdt0-lus04 with recovery enabled
root@lus04-mds1:~#
And the kernel logfile of the server with the MGS and MDT
[38634.935325] Lustre: MGS: Regenerating lus04-OST0000 log by user request.
[38639.574055] LustreError: 12608:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38639.574234] LustreError: 12608:0:(obd_config.c:372:class_setup()) setup lus04-OST0000-osc failed (-2)
[38639.574391] LustreError: 12608:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
[38639.574563] Lustre: cmd=cf003 0:lus04-OST0000-osc 1:lus04-OST0000_UUID 2:0@<0:0>
[38644.876011] Lustre: MGS: Regenerating lus04-OST0001 log by user request.
[38648.094895] Lustre: MGS: Regenerating lus04-OST0002 log by user request.
[38649.236649] LustreError: 12609:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38649.236820] Lustre: 12609:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0000_UUID
[38649.246930] LustreError: 12609:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0000_UUID not set up
[38649.247075] LustreError: 12609:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0000_UUID
[38649.247213] LustreError: 12609:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
[38649.247349] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0000_UUID 2:0 3:1
[38650.948357] Lustre: MGS: Regenerating lus04-OST0003 log by user request.
[38655.146204] LustreError: 12622:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38655.146368] LustreError: 12622:0:(obd_config.c:372:class_setup()) setup lus04-OST0001-osc failed (-2)
[38655.146503] LustreError: 12622:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
[38655.146641] Lustre: cmd=cf003 0:lus04-OST0001-osc 1:lus04-OST0001_UUID 2:0@<0:0>
[38655.453621] Lustre: MGS: Regenerating lus04-OST0004 log by user request.
[38659.318763] LustreError: 12624:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38659.318917] Lustre: 12624:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0001_UUID
[38659.339303] LustreError: 12624:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0001_UUID not set up
[38659.339447] LustreError: 12624:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0001_UUID
[38659.339588] LustreError: 12624:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
[38659.339722] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0001_UUID 2:1 3:1
[38662.002368] Lustre: MGS: Regenerating lus04-OST0006 log by user request.
[38662.002372] Lustre: Skipped 1 previous similar message
[38666.809608] LustreError: 12633:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38666.809768] LustreError: 12633:0:(obd_config.c:372:class_setup()) setup lus04-OST0002-osc failed (-2)
[38666.809896] LustreError: 12633:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
[38666.810024] Lustre: cmd=cf003 0:lus04-OST0002-osc 1:lus04-OST0002_UUID 2:0@<0:0>
[38672.517992] Lustre: MGS: Regenerating lus04-OST0009 log by user request.
[38672.517996] Lustre: Skipped 2 previous similar messages
[38673.545529] LustreError: 12634:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38673.545678] Lustre: 12634:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0002_UUID
[38673.604532] LustreError: 12635:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1e3:0x0: rc -2
[38673.604684] LustreError: 12635:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1e3:0: rc -2
[38673.604826] LustreError: 12635:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1e3
[38673.605968] LustreError: 12634:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0002_UUID not set up
[38673.606117] LustreError: 12634:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0002_UUID
[38673.606258] LustreError: 12634:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
[38673.606401] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0002_UUID 2:2 3:1
[38680.560870] LustreError: 12638:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38680.561031] LustreError: 12638:0:(obd_config.c:372:class_setup()) setup lus04-OST0003-osc failed (-2)
[38680.561162] Lustre: cmd=cf003 0:lus04-OST0003-osc 1:lus04-OST0003_UUID 2:0@<0:0>
[38685.703831] LustreError: 12651:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38685.703987] Lustre: 12651:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0003_UUID
[38685.748801] LustreError: 12652:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1e8:0x0: rc -2
[38685.748936] LustreError: 12652:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1e8:0: rc -2
[38685.749062] LustreError: 12652:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1e8
[38685.750193] LustreError: 12651:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0003_UUID not set up
[38685.750328] LustreError: 12651:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0003_UUID
[38685.750468] LustreError: 12651:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
[38685.750608] LustreError: 12651:0:(obd_config.c:1199:class_config_llog_handler()) Skipped 1 previous similar message
[38685.750748] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0003_UUID 2:3 3:1
[38689.533835] Lustre: MGS: Regenerating lus04-OST000e log by user request.
[38689.533839] Lustre: Skipped 4 previous similar messages
[38693.310511] LustreError: 12659:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38693.310678] LustreError: 12659:0:(obd_config.c:372:class_setup()) setup lus04-OST0004-osc failed (-2)
[38693.310824] Lustre: cmd=cf003 0:lus04-OST0004-osc 1:lus04-OST0004_UUID 2:0@<0:0>
[38724.566711] Lustre: MGS: Regenerating lus04-OST000f log by user request.
[38727.657075] LustreError: 12691:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38727.657250] Lustre: 12691:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0004_UUID
[38727.697633] LustreError: 12692:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1e7:0x0: rc -2
[38727.697806] LustreError: 12692:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1e7:0: rc -2
[38727.697988] LustreError: 12692:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1e7
[38727.699077] LustreError: 12691:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0004_UUID not set up
[38727.699234] LustreError: 12691:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0004_UUID
[38727.699394] LustreError: 12691:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
[38727.699546] LustreError: 12691:0:(obd_config.c:1199:class_config_llog_handler()) Skipped 1 previous similar message
[38727.699700] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0004_UUID 2:4 3:1
[38739.668093] LustreError: 12693:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38739.668302] LustreError: 12693:0:(obd_config.c:372:class_setup()) setup lus04-OST0005-osc failed (-2)
[38739.668467] Lustre: cmd=cf003 0:lus04-OST0005-osc 1:lus04-OST0005_UUID 2:0@<0:0>
[38745.892767] LustreError: 12705:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38745.892918] Lustre: 12705:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0005_UUID
[38745.936667] LustreError: 12706:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1ec:0x0: rc -2
[38745.936816] LustreError: 12706:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1ec:0: rc -2
[38745.936956] LustreError: 12706:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1ec
[38745.938176] LustreError: 12705:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0005_UUID not set up
[38745.938322] LustreError: 12705:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0005_UUID
[38745.938463] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0005_UUID 2:5 3:1
[38750.820202] LustreError: 12707:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38750.820379] LustreError: 12707:0:(obd_config.c:372:class_setup()) setup lus04-OST0006-osc failed (-2)
[38750.820521] Lustre: cmd=cf003 0:lus04-OST0006-osc 1:lus04-OST0006_UUID 2:0@<0:0>
[38758.362456] LustreError: 12715:0:(obd_config.c:611:class_add_conn()) try to add conn on immature client dev
[38758.362609] Lustre: 12715:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0006_UUID
[38758.402883] LustreError: 12716:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1eb:0x0: rc -2
[38758.403024] LustreError: 12716:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1eb:0: rc -2
[38758.403159] LustreError: 12716:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1eb
[38758.404281] LustreError: 12715:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0006_UUID not set up
[38758.404420] LustreError: 12715:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0006_UUID
[38758.404559] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0006_UUID 2:6 3:1
[38762.319547] LustreError: 12718:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
[38762.319692] LustreError: 12718:0:(obd_config.c:1199:class_config_llog_handler()) Skipped 4 previous similar messages
[38762.319835] Lustre: cmd=cf003 0:lus04-OST0007-osc 1:lus04-OST0007_UUID 2:0@<0:0>
[38769.310496] LustreError: 12721:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1f0:0x0: rc -2
[38769.310651] LustreError: 12721:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1f0:0: rc -2
[38769.310794] LustreError: 12721:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1f0
[38769.311941] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0007_UUID 2:7 3:1
[38775.699590] LustreError: 12735:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
[38775.699736] LustreError: 12735:0:(ldlm_lib.c:333:client_obd_setup()) Skipped 1 previous similar message
[38775.699899] LustreError: 12735:0:(obd_config.c:372:class_setup()) setup lus04-OST0008-osc failed (-2)
[38775.700035] LustreError: 12735:0:(obd_config.c:372:class_setup()) Skipped 1 previous similar message
[38775.700175] Lustre: cmd=cf003 0:lus04-OST0008-osc 1:lus04-OST0008_UUID 2:0@<0:0>
Kernel messages from one of the OSSes
[42070.117243] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
[42070.215106] Lustre: Lustre Client File System; http://www.lustre.org/
[42070.269066] LDISKFS-fs (dm-14): barriers disabled
[42070.290981] LDISKFS-fs (dm-14): mounted filesystem with ordered data mode
[42070.480698] LDISKFS-fs (dm-14): barriers disabled
[42070.490424] LDISKFS-fs (dm-14): mounted filesystem with ordered data mode
[42070.503841] Lustre: MGC172.17.148.4@tcp: Reactivating import
[42070.561443] Lustre: Filtering OBD driver; http://wiki.whamcloud.com/
[42070.592412] Lustre: lus04-OST0000: Now serving lus04-OST0000 on /dev/mapper/vd00 with recovery enabled
[42083.497821] LDISKFS-fs (dm-2): barriers disabled
[42083.518965] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode
[42083.681216] LDISKFS-fs (dm-2): barriers disabled
[42083.690062] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode
[42083.972670] Lustre: lus04-OST0002: Now serving lus04-OST0002 on /dev/mapper/vd02 with recovery enabled
[42090.839376] LDISKFS-fs (dm-5): barriers disabled
[42090.859872] LDISKFS-fs (dm-5): mounted filesystem with ordered data mode
[42091.040129] LDISKFS-fs (dm-5): barriers disabled
[42091.048959] LDISKFS-fs (dm-5): mounted filesystem with ordered data mode
[42091.389277] Lustre: lus04-OST0004: Now serving lus04-OST0004 on /dev/mapper/vd04 with recovery enabled
[42097.403696] LDISKFS-fs (dm-13): barriers disabled
[42097.425964] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
[42097.588491] LDISKFS-fs (dm-13): barriers disabled
[42097.597649] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
[42098.006214] Lustre: lus04-OST0006: Now serving lus04-OST0006 on /dev/mapper/vd06 with recovery enabled
[42104.273626] LDISKFS-fs (dm-6): barriers disabled
[42104.295832] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode
[42104.452328] LDISKFS-fs (dm-6): barriers disabled
[42104.461200] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode
[42104.809165] Lustre: lus04-OST0008: Now serving lus04-OST0008 on /dev/mapper/vd08 with recovery enabled
[42111.294964] LDISKFS-fs (dm-1): barriers disabled
[42111.317643] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
[42111.483790] LDISKFS-fs (dm-1): barriers disabled
[42111.493072] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
[42111.836007] Lustre: lus04-OST000a: Now serving lus04-OST000a on /dev/mapper/vd10 with recovery enabled
[42117.397322] LDISKFS-fs (dm-3): barriers disabled
[42117.417872] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode
[42117.577000] LDISKFS-fs (dm-3): barriers disabled
[42117.585901] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode
[42124.916269] LDISKFS-fs (dm-12): barriers disabled
[42124.940165] LDISKFS-fs (dm-12): mounted filesystem with ordered data mode
[42125.119575] LDISKFS-fs (dm-12): barriers disabled
[42125.128708] LDISKFS-fs (dm-12): mounted filesystem with ordered data mode
[42125.516470] Lustre: lus04-OST000e: Now serving lus04-OST000e on /dev/mapper/vd14 with recovery enabled
[42125.516474] Lustre: Skipped 1 previous similar message
root@lus04-oss1:~#
root@lus04-mds1:~# lctl dl
0 UP mgs MGS MGS 13
1 UP mgc MGC172.17.148.4@tcp ec0779df-4709-87c5-fbb3-c1370bccaab5 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 AT osc lus04-OST0000-osc lus04-mdtlov_UUID 1
6 AT osc lus04-OST0001-osc lus04-mdtlov_UUID 1
7 AT osc lus04-OST0002-osc lus04-mdtlov_UUID 1
8 AT osc lus04-OST0003-osc lus04-mdtlov_UUID 1
9 AT osc lus04-OST0004-osc lus04-mdtlov_UUID 1
10 AT osc lus04-OST0005-osc lus04-mdtlov_UUID 1
11 AT osc lus04-OST0006-osc lus04-mdtlov_UUID 1
12 AT osc lus04-OST0007-osc lus04-mdtlov_UUID 1
13 AT osc lus04-OST0008-osc lus04-mdtlov_UUID 1
|
|
"Possibly need a second opinion, but those params don't look at all right to me. I see 2 --mgsnode options + a --failnode option. Think there should be only 1 --mdsnode + ! --failnode.
Just for simplification purposes could you try just setting 1 --mgsnode and no --failnode? I understand this probably isn't your desired target config, but in the interest of getting the fs back up to possibly copy files off simpler is better. If we can get that to work we can work on turning failover back on again later."
Of course, I also missed the --erase-params on the MGS. Right now we only want to get the filesystem back for a few days so we can copy the data off and reformat it (so a day or 2 for 240 TB).
|
|
After unmounting all the OSSes and attempting to unmount the MDT
Oct 18 23:10:38 lus04-mds1 kernel: [38784.645245] Lustre: 12744:0:(mds_lov.c:1114:mds_notify()) MDS lus04-MDT0000: add target lus04-OST0008_UUID
Oct 18 23:10:38 lus04-mds1 kernel: [38784.645247] Lustre: 12744:0:(mds_lov.c:1114:mds_notify()) Skipped 1 previous similar message
Oct 18 23:10:38 lus04-mds1 kernel: [38784.686343] LustreError: 12745:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x1ef:0x0: rc -2
Oct 18 23:10:38 lus04-mds1 kernel: [38784.686490] LustreError: 12745:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1ef:0: rc -2
Oct 18 23:10:38 lus04-mds1 kernel: [38784.686625] LustreError: 12745:0:(llog_obd.c:291:cat_cancel_cb()) Cannot find handle for log 0x1ef
Oct 18 23:10:38 lus04-mds1 kernel: [38784.687672] LustreError: 12744:0:(lov_obd.c:289:lov_connect_obd()) Target lus04-OST0008_UUID not set up
Oct 18 23:10:38 lus04-mds1 kernel: [38784.687816] LustreError: 12744:0:(lov_obd.c:289:lov_connect_obd()) Skipped 1 previous similar message
Oct 18 23:10:38 lus04-mds1 kernel: [38784.687952] LustreError: 12744:0:(lov_obd.c:727:lov_add_target()) connect or notify failed (-22) for lus04-OST0008_UUID
Oct 18 23:10:38 lus04-mds1 kernel: [38784.688093] LustreError: 12744:0:(lov_obd.c:727:lov_add_target()) Skipped 1 previous similar message
Oct 18 23:10:38 lus04-mds1 kernel: [38784.688235] Lustre: cmd=cf00d 0:lus04-mdtlov 1:lus04-OST0008_UUID 2:8 3:1
Oct 18 23:26:35 lus04-mds1 kernel: [39740.309228] Lustre: Failing over lus04-MDT0000
Oct 18 23:26:35 lus04-mds1 kernel: [39740.310959] LustreError: 13468:0:(lov_obd.c:1012:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Oct 18 23:27:05 lus04-mds1 kernel: [39770.254590] Lustre: Mount still busy with 14 refs after 30 secs.
Oct 18 23:27:35 lus04-mds1 kernel: [39800.201341] Lustre: Mount still busy with 14 refs after 60 secs.
Oct 18 23:28:05 lus04-mds1 kernel: [39830.148112] Lustre: Mount still busy with 14 refs after 90 secs.
Oct 18 23:28:35 lus04-mds1 kernel: [39860.094852] Lustre: Mount still busy with 14 refs after 120 secs.
Oct 18 23:29:05 lus04-mds1 kernel: [39890.041601] Lustre: Mount still busy with 14 refs after 150 secs.
Oct 18 23:29:35 lus04-mds1 kernel: [39919.988359] Lustre: Mount still busy with 14 refs after 180 secs.
Oct 18 23:30:05 lus04-mds1 kernel: [39949.935117] Lustre: Mount still busy with 14 refs after 210 secs.
Oct 18 23:30:35 lus04-mds1 kernel: [39979.881872] Lustre: Mount still busy with 14 refs after 240 secs.
How should I proceed with the unmount of the MDT?
|
|
Try umount -f on the MDT. If that doesn't work then just force-reboot the server and bring it up again.
Looking back over the history here, I suspect something is wrong with the failover config.
All the tunefs output shows the no_primnode flag set and also shows a failnode param set. Some of the earlier logs suggest that the MGS and MDT are attempting to communicate with the failover NID when you are mounting OSTs on the primary server. I think we may get a bit farther without failover for now.
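Something like this (using your MDT mountpoint from the earlier mount commands):
umount -f /export/MDT0    # force the MDT unmount
reboot -f                 # last resort if the mount refs never drop: force-reboot the MDS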
|
|
I am going to be doing that in an instant, but I thought it would be worth mentioning that this filesystem was originally formatted 3+ years ago with both possible servers specified as failover NIDs, not just the alternative host.
This is from my notes; note that we had two failnodes configured, one of which was the "primary" node.
[15/10/2013 12:22:35] James Beal:
root@lus04-oss1:~# tunefs.lustre --erase-params --writeconf --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.6@tcp --failnode 172.17.148.7@tcp --param ost.quota_type=ug /dev/mapper/vd08
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-OST0008
Index: 8
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.5@tcp mgsnode=172.17.148.4@tcp failover.node=172.17.148.6@tcp failover.node=172.17.148.7@tcp ost.quota_type=ug2
Permanent disk data:
Target: lus04-OST0008
Index: 8
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x142
(OST update writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.6@tcp failover.node=172.17.148.7@tcp ost.quota_type=ug
Writing CONFIGS/mountdata
|
|
Hmm. I notice that there is never a --fsname= option in your tunefs cmds, but the Lustre FS reported in the params being set is still lus04. That suggests to me that only some partial erasure is going on.
Just to satisfy me, could you do a round of tunefs where the only cmd line option is --erase-params, then follow that with another round where you do --writeconf with all the desired params, including --fsname=lus04, but no --erase-params?
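I.e. something along these lines for each device (sketch, MDT device path as in your earlier output):
# round 1: only erase the params
tunefs.lustre --erase-params /dev/lus04-mdt0/lus04
# round 2: writeconf plus all the desired params, including --fsname, but no --erase-params
tunefs.lustre --writeconf --mgsnode=172.17.148.4@tcp --fsname=lus04 /dev/lus04-mdt0/lus04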
|
|
It occurs to me to ask: are you possibly running the same tunefs cmds for a given device on both the primary node and the backup node where that device is visible? I think that would have very ill effects.
|
|
"It occurs to me to ask are you possibly running the same tunefs cmds for a given device on both the primary node and the backup node where that device is visible? Think that would have very ill effects."
It is always worth asking those questions, but no, I am only running the tunefs on the active node; the MGS and MDT I do by hand, and for the OSSes I am keying off the commented mounts in /etc/fstab. I am just about to do this with the OSSes now.
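Roughly what the OSS side looks like; purely a sketch, since the exact layout of the commented fstab lines isn't shown here:
# hypothetical: pull the device out of each commented-out lustre mount in /etc/fstab
grep '^#.*lustre' /etc/fstab | sed 's/^#//' | awk '{print $1}' | while read dev; do
    tunefs.lustre --erase-params "$dev"
done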
root@lus04-mds1:~# tunefs.lustre --erase-params /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
root@lus04-mds1:~# tunefs.lustre --erase-params /dev/lus04-mdt0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1001
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1041
(MDT update no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
root@lus04-mds1:~# tunefs.lustre --writeconf --mgs --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
root@lus04-mds1:~# tunefs.lustre --writeconf --mgsnode 172.17.148.4@tcp --fsname=lus04 /dev/lus04-mdt0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1041
(MDT update no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1141
(MDT update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Writing CONFIGS/mountdata
root@lus04-mds1:~#
|
|
And a sample OSS
tunefs.lustre --erase-params
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters:
Writing CONFIGS/mountdata
tunefs.lustre --ost --writeconf --mgsnode 172.17.148.4@tcp --fsname=lus04
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters:
Permanent disk data:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.148.4@tcp
Writing CONFIGS/mountdata
I haven't attempted to mount them yet, as the unmounting of the MDT does not go well.
|
|
I see all your nodes still have the no_primnode flag. I have been trying to get that set on a local test fs and can't figure out how. Looking at the code of mkfs.lustre, it seems to be associated with setting a failover, failnode, or servicenode option, but none of those options are in your recent tunefs commands. It seems like some things are not getting cleared by erase or writeconf. I may need to consult wiser experts.
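In the meantime, the on-disk flags and params can be re-checked without writing anything; tunefs.lustre with no modifying options just prints the current values and exits before the disk write, e.g. (device path as in your earlier output):
tunefs.lustre /dev/lus04-mdt0/lus04    # prints "Read previous values" and "Permanent disk data", then exits before disk write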
|
|
In the last few days this filesystem has had tunefs.lustre run on it a lot; we have used servicenode and failnode, and have had it configured with and without failover. I do have a record of what has happened with the filesystem, but it is over 100 pages in Word....
It is now 12:45 in the morning local time for me; I appreciate your efforts. If there is anything more I can do then add it to the ticket and I will try to do it tomorrow, family permitting. As I said, this system was formatted with, say on .4, failnode=.4 and failnode=.5, which worked well enough; it feels as though fixing that broke things, but I have no evidence for that.
|
|
Sorry, I didn't realize it was so late for you. By all means quit for now. I will try to check back on this ticket over the weekend, but can't promise. Thanks for your quick response and turnaround on my requests.
|
|
"Sorry, I didn't realize it was so late for you. by all means quit for now. I will try to check back on this ticket over the weekend, but can't promise. Thanks for your quick response and turnaround on my requests."
I realise that with you in the States we will not have that much overlap, so I am happy to stay up late to get the support. I completely understand about the weekend. Have a good weekend, and if anything comes up then put it here and I will attend to it as promptly as I can manage.
|
|
Part one of trying everything on the failnode to see if it all works when a primary node is not defined.
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params --mgs --failnode 172.17.148.5@tcp --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=172.17.148.5@tcp
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=172.17.148.5@tcp
Writing CONFIGS/mountdata
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mgsnode=172.17.148.5@tcp --failnode 172.17.148.5@tcp --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=172.17.148.5@tcp
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.5@tcp
Writing CONFIGS/mountdata
Now to mount these on their partner node .5
Oct 19 13:22:51 lus04-mds2 kernel: [94212.007840] Lustre: MGS MGS started
Oct 19 13:22:51 lus04-mds2 kernel: [94212.008444] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 19 13:23:05 lus04-mds2 kernel: [94225.421033] LDISKFS-fs (dm-0): barriers disabled
Oct 19 13:23:05 lus04-mds2 kernel: [94225.424375] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
Oct 19 13:23:05 lus04-mds2 kernel: [94225.538080] LDISKFS-fs (dm-0): barriers disabled
Oct 19 13:23:05 lus04-mds2 kernel: [94225.540969] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode
Oct 19 13:23:05 lus04-mds2 kernel: [94225.557986] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
Oct 19 13:23:05 lus04-mds2 kernel: [94225.574965] Lustre: Enabling user_xattr
Oct 19 13:23:05 lus04-mds2 kernel: [94225.604777] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/mapper/lus04--mdt0-lus04 with recovery enabled
tunefs.lustre --ost --writeconf --mgsnode 172.17.148.5@tcp --fsname=lus04 --ost --failnode=172.17.148.7@tcp /dev/mapper/vd00
And to mount on the partner pair. Note "sent from MGC172.17.148.4@tcp to NID 172.17.148.4@tcp 105s ago has timed out"; the MGS is mounted on .5.
Oct 19 13:27:55 lus04-oss2 kernel: [93559.887640] LDISKFS-fs (dm-13): barriers disabled
Oct 19 13:27:55 lus04-oss2 kernel: [93559.910142] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
Oct 19 13:27:55 lus04-oss2 kernel: [93560.104185] LDISKFS-fs (dm-13): barriers disabled
Oct 19 13:27:55 lus04-oss2 kernel: [93560.113434] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
Oct 19 13:29:40 lus04-oss2 kernel: [93664.929907] Lustre: 32416:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1449276099002414 sent from MGC172.17.148.4@tcp to NID 172.17.148.4@tcp 105s ago has timed out (105s prior to deadline).
Oct 19 13:29:40 lus04-oss2 kernel: [93664.929910] req@ffff8802f5e27c00 x1449276099002414/t0 o250->MGS@MGC172.17.148.4@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1382185780 ref 1 fl Rpc:N/0/0 rc 0/0
Oct 19 13:30:00 lus04-oss2 kernel: [93684.895825] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 19 13:30:00 lus04-oss2 kernel: [93684.914641] LustreError: 2228:0:(llog.c:381:llog_process()) cannot start thread: -513
Oct 19 13:30:00 lus04-oss2 kernel: [93684.914671] LustreError: 2228:0:(mgc_request.c:1094:mgc_copy_llog()) Failed to copy remote log lus04-OST0000 (-513)
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944136] LustreError: 2228:0:(llog.c:381:llog_process()) cannot start thread: -513
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944173] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-OST0000' failed (-513). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944224] LustreError: 2228:0:(obd_mount.c:1143:server_start_targets()) failed to start server lus04-OST0000: -513
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944253] LustreError: 2228:0:(obd_mount.c:1672:server_fill_super()) Unable to start targets: -513
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944381] LustreError: 2228:0:(obd_mount.c:1455:server_put_super()) no obd lus04-OST0000
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944531] LustreError: 2228:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 19 13:30:00 lus04-oss2 kernel: [93684.944662] LustreError: 2228:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 19 13:30:00 lus04-oss2 kernel: [93685.048913] Lustre: server umount lus04-OST0000 complete
Oct 19 13:30:00 lus04-oss2 kernel: [93685.048919] LustreError: 2228:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-513)
Oct 19 13:30:00 lus04-oss2 kernel: [93685.076179] LDISKFS-fs (dm-13): barriers disabled
Oct 19 13:30:00 lus04-oss2 kernel: [93685.085465] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
Oct 19 13:30:00 lus04-oss2 kernel: [93685.194214] LDISKFS-fs (dm-13): barriers disabled
Oct 19 13:30:00 lus04-oss2 kernel: [93685.203402] LDISKFS-fs (dm-13): mounted filesystem with ordered data mode
Oct 19 13:30:04 lus04-oss2 kernel: [93689.375697] LustreError: 137-5: UUID 'lus04-OST0000_UUID' is not available for connect (no target)
Oct 19 13:30:04 lus04-oss2 kernel: [93689.375831] LustreError: 2735:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (19) req@ffff880614092000 x1449326212546592/t0 o8><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1382185904 ref 1 fl Interpret:/0/0 rc -19/0
Oct 19 13:31:45 lus04-oss2 kernel: [93790.023923] Lustre: 32416:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1449276099002422 sent from MGC172.17.148.4@tcp to NID 172.17.148.4@tcp 105s ago has timed out (105s prior to deadline).
Oct 19 13:31:45 lus04-oss2 kernel: [93790.023926] req@ffff8802f5a62800 x1449276099002422/t0 o250->MGS@MGC172.17.148.4@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1382185905 ref 1 fl Rpc:N/0/0 rc 0/0
Oct 19 13:31:55 lus04-oss2 kernel: [93799.663025] LustreError: 2228:0:(mgc_request.c:365:mgc_requeue_add()) log lus04-OST0000: cannot start requeue thread (-513),no more log updates!
Oct 19 13:31:55 lus04-oss2 kernel: [93799.663175] LustreError: 2228:0:(mgc_request.c:638:mgc_blocking_ast()) cancel CB failed -513:
Oct 19 13:31:55 lus04-oss2 kernel: [93799.663308] LustreError: 2228:0:(mgc_request.c:639:mgc_blocking_ast()) ### MGC ast ns: MGC172.17.148.4@tcp lock: ffff8802a8324e00/0x85ea67077e7a36d8 lrc: 5/0,0 mode: --/CR res: 224151172460/0 rrc: 1 type: PLN flags: 0x4002c90 remote: 0x0 expref: -99 pid: 2228 timeout 0
Oct 19 13:31:55 lus04-oss2 kernel: [93799.664025] LustreError: 2228:0:(llog.c:381:llog_process()) cannot start thread: -513
Oct 19 13:31:55 lus04-oss2 kernel: [93799.664164] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-OST0000' failed (-513). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 19 13:31:55 lus04-oss2 kernel: [93799.664426] LustreError: 2228:0:(obd_mount.c:1143:server_start_targets()) failed to start server lus04-OST0000: -513
Oct 19 13:31:55 lus04-oss2 kernel: [93799.664560] LustreError: 2228:0:(obd_mount.c:1672:server_fill_super()) Unable to start targets: -513
Oct 19 13:31:55 lus04-oss2 kernel: [93799.664741] LustreError: 2228:0:(obd_mount.c:1455:server_put_super()) no obd lus04-OST0000
Oct 19 13:31:55 lus04-oss2 kernel: [93799.801800] Lustre: server umount lus04-OST0000 complete
Oct 19 13:31:55 lus04-oss2 kernel: [93799.801806] LustreError: 2228:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-513)
|
|
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params --mgs --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.5@tcp
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
At this point I made a mistake: I wrote the MDT config onto the MGS by accident.
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params --mgsnode=172.17.148.4@tcp --mdt --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1144
(MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lus04-MDTffff
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1145
(MDT MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Writing CONFIGS/mountdata
I tried to reverse this.
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1145
(MDT MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1145
(MDT MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
I now have two mistakes I need to fix in the flags.
root@lus04-mds1:~# tunefs.lustre --writeconf --erase-params --mgs --fsname=lus04 /dev/lus04-mgs0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1145
(MDT MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1145
(MDT MGS update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
I will copy a backup of the MGS that I took on Thursday over to the machine locally and attempt to mount it as a loop device.
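(A minimal sketch of that plan, assuming the backup is the lus04-mgs_in_use image that shows up in the later losetup -a output; the loop device number is illustrative and the ldiskfs mount is only for inspecting CONFIGS before serving it with -t lustre:)
losetup /dev/loop1 /mnt/users/jb23/lus04-mgs_in_use
mount -t ldiskfs /dev/loop1 /export/MGS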
|
|
I can now get the OSSs to mount and they look OK; however, a client will not mount.
Start with a clean MGS
root@lus04-mds1:/# dd if=/dev/zero bs=1M count=128 of=/home/MGS_DONOT_DELETE
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0937643 s, 1.4 GB/s
root@lus04-mds1:/# losetup /dev/loop2 /home/MGS_DONOT_DELETE
root@lus04-mds1:/# mkfs.lustre --mgs --fsname=lus04 /dev/loop2
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x74
(MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
checking for existing Lustre data: not found
device size = 128MB
formatting backing filesystem ldiskfs on /dev/loop2
target name MGS
4k blocks 32768
options -q -O uninit_bg,dir_nlink -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L MGS -q -O uninit_bg,dir_nlink -E lazy_journal_init -F /dev/loop2 32768
Writing CONFIGS/mountdata
root@lus04-mds1:~# mount -t lustre /dev/loop2 /export/MGS
And the MDT
tunefs.lustre /dev/lus04-mdt0/lus04
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1001
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1001
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
exiting before disk write.
Do all the OSS mounts on the partners and look at the devices
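(The mount loops themselves are not pasted here; on each OSS they were presumably of the same shape as the loops shown later in this ticket, for example, with the device range per server assumed:)
for i in `seq -w 00 07` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i ; done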
root@lus04-mds1:/root# lctl dl
0 UP mgs MGS MGS 5
1 UP mgc MGC172.17.148.4@tcp 5024ddc3-8729-483f-148c-5bbfe6326be2 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 UP osc lus04-OST0000-osc lus04-mdtlov_UUID 5
6 UP osc lus04-OST0001-osc lus04-mdtlov_UUID 5
7 UP osc lus04-OST0002-osc lus04-mdtlov_UUID 5
8 UP osc lus04-OST0003-osc lus04-mdtlov_UUID 5
9 UP osc lus04-OST0004-osc lus04-mdtlov_UUID 5
10 UP osc lus04-OST0005-osc lus04-mdtlov_UUID 5
11 UP osc lus04-OST0006-osc lus04-mdtlov_UUID 5
12 UP osc lus04-OST0007-osc lus04-mdtlov_UUID 5
13 UP osc lus04-OST0008-osc lus04-mdtlov_UUID 5
14 UP osc lus04-OST0009-osc lus04-mdtlov_UUID 5
15 UP osc lus04-OST000a-osc lus04-mdtlov_UUID 5
16 UP osc lus04-OST000b-osc lus04-mdtlov_UUID 5
17 UP osc lus04-OST000c-osc lus04-mdtlov_UUID 5
18 UP osc lus04-OST000d-osc lus04-mdtlov_UUID 5
19 UP osc lus04-OST000e-osc lus04-mdtlov_UUID 5
20 UP osc lus04-OST000f-osc lus04-mdtlov_UUID 5
21 UP osc lus04-OST0010-osc lus04-mdtlov_UUID 5
22 UP osc lus04-OST0011-osc lus04-mdtlov_UUID 5
23 UP osc lus04-OST0012-osc lus04-mdtlov_UUID 5
24 UP osc lus04-OST0013-osc lus04-mdtlov_UUID 5
25 UP osc lus04-OST0014-osc lus04-mdtlov_UUID 5
26 UP osc lus04-OST0015-osc lus04-mdtlov_UUID 5
27 UP osc lus04-OST0016-osc lus04-mdtlov_UUID 5
28 UP osc lus04-OST0017-osc lus04-mdtlov_UUID 5
29 UP osc lus04-OST0018-osc lus04-mdtlov_UUID 5
30 UP osc lus04-OST0019-osc lus04-mdtlov_UUID 5
31 UP osc lus04-OST001a-osc lus04-mdtlov_UUID 5
32 UP osc lus04-OST001b-osc lus04-mdtlov_UUID 5
33 UP osc lus04-OST001c-osc lus04-mdtlov_UUID 5
34 UP osc lus04-OST001d-osc lus04-mdtlov_UUID 5
See that a client mount fails
mount /lustre/scratch104
mount.lustre: mount lus04-mds1@tcp0:/lus04 at /lustre/scratch104 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
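(For reference, the short "mount /lustre/scratch104" form relies on an fstab entry; judging from the error message above, it is presumably equivalent to something like the following, with the MGS NID and mount point taken from that message rather than from the actual fstab:)
mount -t lustre lus04-mds1@tcp0:/lus04 /lustre/scratch104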
root@bc-11-2-07:~# tail -f /var/log/kern.log
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.814834] Lustre: cmd=cf003 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.815018] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.815345] LustreError: 11016:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -2
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.815751] LustreError: 11016:0:(obd_config.c:443:class_cleanup()) Device 68 not setup
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.816019] LustreError: 11016:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.816192] LustreError: 11016:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.816344] LustreError: 11016:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.816497] LustreError: 11016:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.817443] Lustre: client lus04-client(ffff88015c402800) umount complete
Oct 19 16:26:29 bc-11-2-07 kernel: [17774424.817530] LustreError: 11016:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-2)
Clean out the client config from the MGS
root@lus04-mds1:/# umount /export/MGS
root@lus04-mds1:/# mount -t ldiskfs /dev/loop1 /export/MGS
root@lus04-mds1:/# cd /export/MGS/CONFIGS
root@lus04-mds1:/export/MGS/CONFIGS# mkdir ../CONFIGS_OLD1
root@lus04-mds1:/export/MGS/CONFIGS# mv lus04-client ../CONFIGS_OLD1
root@lus04-mds1:/export/MGS/CONFIGS# cd /
root@lus04-mds1:/# umount /export/MGS
Now tunefs the OSTs so that they believe they are owned by one server, but have a failover nid
root@lus04-oss1:~# for i in `seq -w 00 07 ` ; do tunefs.lustre --ost --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 --ost --failnode=172.17.148.7@tcp /dev/mapper/vd$i ; done
Now mount them on the partner node
root@lus04-oss2:~# for i in `seq -w 00 07 ` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i; done
See that the MGS/MDT looks ok
root@lus04-mds1:~# lctl dl
0 UP mgs MGS MGS 13
1 UP mgc MGC172.17.148.4@tcp 75489c46-b5a0-50ff-86e9-3f688e8a1de8 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 UP osc lus04-OST0000-osc lus04-mdtlov_UUID 5
6 UP osc lus04-OST0001-osc lus04-mdtlov_UUID 5
7 UP osc lus04-OST0002-osc lus04-mdtlov_UUID 5
8 UP osc lus04-OST0003-osc lus04-mdtlov_UUID 5
9 UP osc lus04-OST0004-osc lus04-mdtlov_UUID 5
10 UP osc lus04-OST0005-osc lus04-mdtlov_UUID 5
11 UP osc lus04-OST0006-osc lus04-mdtlov_UUID 5
12 UP osc lus04-OST0007-osc lus04-mdtlov_UUID 5
13 UP osc lus04-OST0008-osc lus04-mdtlov_UUID 5
14 UP osc lus04-OST0009-osc lus04-mdtlov_UUID 5
15 UP osc lus04-OST000a-osc lus04-mdtlov_UUID 5
16 UP osc lus04-OST000b-osc lus04-mdtlov_UUID 5
17 UP osc lus04-OST000c-osc lus04-mdtlov_UUID 5
18 UP osc lus04-OST000d-osc lus04-mdtlov_UUID 5
19 UP osc lus04-OST000e-osc lus04-mdtlov_UUID 5
20 UP osc lus04-OST000f-osc lus04-mdtlov_UUID 5
21 UP osc lus04-OST0010-osc lus04-mdtlov_UUID 5
22 UP osc lus04-OST0011-osc lus04-mdtlov_UUID 5
23 UP osc lus04-OST0012-osc lus04-mdtlov_UUID 5
24 UP osc lus04-OST0013-osc lus04-mdtlov_UUID 5
25 UP osc lus04-OST0014-osc lus04-mdtlov_UUID 5
26 UP osc lus04-OST0015-osc lus04-mdtlov_UUID 5
27 UP osc lus04-OST0016-osc lus04-mdtlov_UUID 5
28 UP osc lus04-OST0017-osc lus04-mdtlov_UUID 5
29 UP osc lus04-OST0018-osc lus04-mdtlov_UUID 5
30 UP osc lus04-OST0019-osc lus04-mdtlov_UUID 5
31 UP osc lus04-OST001a-osc lus04-mdtlov_UUID 5
32 UP osc lus04-OST001b-osc lus04-mdtlov_UUID 5
33 UP osc lus04-OST001c-osc lus04-mdtlov_UUID 5
34 UP osc lus04-OST001d-osc lus04-mdtlov_UUID 5
root@lus04-mds1:~#
Try and mount the client and get frustrated.
Oct 19 16:49:04 bc-11-2-07 kernel: [17775779.740161] Lustre: Removed LNI 172.17.115.32@tcp
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.590501] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.662391] Lustre: Added LNI 172.17.115.32@tcp [8/256/0/180]
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.662573] Lustre: Accept secure, port 988
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.752886] Lustre: Lustre Client File System; http://www.lustre.org/
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.773139] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.775766] LustreError: 11961:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.775996] LustreError: 11961:0:(obd_config.c:372:class_setup()) setup lus04-MDT0000-mdc-ffff88041200d000 failed (-2)
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.776149] LustreError: 11961:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.776301] Lustre: cmd=cf003 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.776486] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.776819] LustreError: 11891:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -2
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.777235] LustreError: 11891:0:(obd_config.c:443:class_cleanup()) Device 2 not setup
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.777504] LustreError: 11891:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.777659] LustreError: 11891:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.786546] Lustre: client lus04-client(ffff88041200d000) umount complete
Oct 19 16:49:12 bc-11-2-07 kernel: [17775787.786635] LustreError: 11891:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-2)
Try and reset the lus04-client log
root@lus04-mds1:~# umount /export/MGS
root@lus04-mds1:~# mount -t ldiskfs /dev/loop2 /export/MGS
root@lus04-mds1:~# cd /export/MGS/CONFIGS/
root@lus04-mds1:/export/MGS/CONFIGS# mkdir ../CONFIGS1
root@lus04-mds1:/export/MGS/CONFIGS# mv lus04-client ../CONFIGS1
root@lus04-mds1:/export/MGS/CONFIGS# cd /
root@lus04-mds1:/# umount /export/MGS
root@lus04-mds1:/# mount -t lustre /dev/loop2 /export/MGS
Mount a client
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.418669] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.421159] LustreError: 156-2: The client profile 'lus04-client' could not be read from the MGS. Does that filesystem exist?
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.421516] LustreError: 11971:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.421688] LustreError: 11971:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.422630] Lustre: client lus04-client(ffff8801c6b9d400) umount complete
Oct 19 16:51:43 bc-11-2-07 kernel: [17775938.422713] LustreError: 11971:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-22)
Reintroduce the MDT
root@lus04-mds1:/# umount /export/MDT0
root@lus04-mds1:/# mount -t lustre /dev/lus04-mdt0/lus04 /export/MDT0
And try the client mount again.
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.316139] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.319088] LustreError: 11998:0:(ldlm_lib.c:333:client_obd_setup()) can't add initial connection
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.319323] LustreError: 11998:0:(obd_config.c:372:class_setup()) setup lus04-MDT0000-mdc-ffff88040f93fc00 failed (-2)
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.319477] LustreError: 11998:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.319628] Lustre: cmd=cf003 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.319815] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.320147] LustreError: 11988:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -2
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.320558] LustreError: 11988:0:(obd_config.c:443:class_cleanup()) Device 2 not setup
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.320812] LustreError: 11988:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 19 16:52:23 bc-11-2-07 kernel: [17775979.320983] LustreError: 11988:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 19 16:52:24 bc-11-2-07 kernel: [17775979.329645] Lustre: client lus04-client(ffff88040f93fc00) umount complete
Oct 19 16:52:24 bc-11-2-07 kernel: [17775979.329734] LustreError: 11988:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-2)
|
|
losetup /dev/loop3 /nfs/ssg_data01/jb23/lus04-mdt_in_use_talk_to_me_first
tune2fs -O ^quota /dev/loop3
tune2fs 1.42.7.wc1 (12-Apr-2013)
tunefs.lustre --dryrun /dev/loop3
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04=MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1101
(MDT writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1101
(MDT writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.4@tcp
exiting before disk write.
tunefs.lustre --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 /dev/loop3
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04=MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1101
(MDT writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1141
(MDT update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Writing CONFIGS/mountdata
Unmount the OSSs and reset the config.
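(The unmounts are not shown in the paste; presumably a loop of the same shape as the mounts, for example, with the device range per server assumed:)
for i in `seq -w 00 07` ; do umount /export/vd$i ; done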
root@lus04-oss1:~# for i in `seq -w 08 14` ; do tunefs.lustre --ost --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 --ost --failnode=172.17.148.7@tcp /dev/mapper/vd$i ; done
root@lus04-oss2:~# for i in `seq -w 00 07 ` ; do tunefs.lustre --ost --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 --ost --failnode=172.17.148.6@tcp /dev/mapper/vd$i ; done
root@lus04-oss3:~# for i in `seq -w 15 22` ; do tunefs.lustre --ost --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 --ost --failnode=172.17.148.9@tcp /dev/mapper/vd$i ; done
root@lus04-oss4:~# for i in `seq -w 23 29` ; do tunefs.lustre --ost --writeconf --erase-params --mgsnode 172.17.148.4@tcp --fsname=lus04 --ost --failnode=172.17.148.8@tcp /dev/mapper/vd$i ; done
root@lus04-mds1:~# mount | grep lus
/dev/loop2 on /export/MGS type lustre (rw)
root@lus04-mds1:~# mount -t lustre /dev/loop3 /export/MDT0
root@lus04-mds1:~# tail /var/log/kern.log
Oct 20 08:09:44 lus04-mds1 kernel: [66192.187279] LDISKFS-fs (loop3): barriers disabled
Oct 20 08:09:44 lus04-mds1 kernel: [66192.214036] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
Oct 20 08:10:25 lus04-mds1 kernel: [66232.710174] LDISKFS-fs (loop3): barriers disabled
Oct 20 08:10:25 lus04-mds1 kernel: [66232.712679] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
Oct 20 08:10:25 lus04-mds1 kernel: [66232.790005] LDISKFS-fs (loop3): barriers disabled
Oct 20 08:10:25 lus04-mds1 kernel: [66232.791717] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
Oct 20 08:10:25 lus04-mds1 kernel: [66232.803940] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
Oct 20 08:10:25 lus04-mds1 kernel: [66232.806465] Lustre: Enabling user_xattr
Oct 20 08:10:25 lus04-mds1 kernel: [66232.889602] Lustre: Mounting lus04-MDT0000 at first time on 2.0 FS, remove all clients for interop needs
Oct 20 08:10:25 lus04-mds1 kernel: [66232.941098] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/loop3 with recovery enabled
root@lus04-oss1:~# for i in `seq -w 00 07` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i ; done
root@lus04-oss2:~# for i in `seq -w 08 14` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i ; done
root@lus04-oss4:~# for i in `seq -w 15 22` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i ; done
root@lus04-oss3:~# for i in `seq -w 23 29` ; do mount -t lustre /dev/mapper/vd$i /export/vd$i ; done
Wait for a bit
grep -i statu /proc/fs/lustre/obdfilter/lus04-OST*/recovery_status
/proc/fs/lustre/obdfilter/lus04-OST0017/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST0018/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST0019/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST001a/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST001b/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST001c/recovery_status:status: RECOVERING
/proc/fs/lustre/obdfilter/lus04-OST001d/recovery_status:status: RECOVERING
root@lus04-mds1:~# lctl dl
0 UP mgs MGS MGS 13
1 UP mgc MGC172.17.148.4@tcp f450e477-6ad3-c2d7-72f2-ace4ad6d6513 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 UP osc lus04-OST0000-osc lus04-mdtlov_UUID 5
6 UP osc lus04-OST0001-osc lus04-mdtlov_UUID 5
7 UP osc lus04-OST0002-osc lus04-mdtlov_UUID 5
8 UP osc lus04-OST0003-osc lus04-mdtlov_UUID 5
9 UP osc lus04-OST0004-osc lus04-mdtlov_UUID 5
10 UP osc lus04-OST0005-osc lus04-mdtlov_UUID 5
11 UP osc lus04-OST0006-osc lus04-mdtlov_UUID 5
12 UP osc lus04-OST0007-osc lus04-mdtlov_UUID 5
13 UP osc lus04-OST0008-osc lus04-mdtlov_UUID 5
14 UP osc lus04-OST0009-osc lus04-mdtlov_UUID 5
15 UP osc lus04-OST000a-osc lus04-mdtlov_UUID 5
16 UP osc lus04-OST000b-osc lus04-mdtlov_UUID 5
17 UP osc lus04-OST000c-osc lus04-mdtlov_UUID 5
18 UP osc lus04-OST000d-osc lus04-mdtlov_UUID 5
19 UP osc lus04-OST000e-osc lus04-mdtlov_UUID 5
20 UP osc lus04-OST000f-osc lus04-mdtlov_UUID 5
21 UP osc lus04-OST0010-osc lus04-mdtlov_UUID 5
22 UP osc lus04-OST0011-osc lus04-mdtlov_UUID 5
23 UP osc lus04-OST0012-osc lus04-mdtlov_UUID 5
24 UP osc lus04-OST0013-osc lus04-mdtlov_UUID 5
25 UP osc lus04-OST0014-osc lus04-mdtlov_UUID 5
26 UP osc lus04-OST0015-osc lus04-mdtlov_UUID 5
27 UP osc lus04-OST0016-osc lus04-mdtlov_UUID 5
28 UP osc lus04-OST0017-osc lus04-mdtlov_UUID 5
29 UP osc lus04-OST0018-osc lus04-mdtlov_UUID 5
30 UP osc lus04-OST0019-osc lus04-mdtlov_UUID 5
31 UP osc lus04-OST001a-osc lus04-mdtlov_UUID 5
32 UP osc lus04-OST001b-osc lus04-mdtlov_UUID 5
33 UP osc lus04-OST001c-osc lus04-mdtlov_UUID 5
34 UP osc lus04-OST001d-osc lus04-mdtlov_UUID 5
root@lus04-mds1:~#
Mounting the client fails again.
Try and reset the client log
root@lus04-mds1:~# losetup -a
/dev/loop0: [001c]:5830 (/mnt/users/jb23/lus04-mgs_in_use)
/dev/loop1: [001c]:5830 (/mnt/users/jb23/lus04-mgs_in_use)
/dev/loop2: [0846]:134525730 (/home/MGS_DONOT_DELETE)
/dev/loop3: [0019]:10440828 (/nfs/ssg_data01/jb23/lus04-mdt_in_use_talk_to_me_first)
root@lus04-mds1:~# mount -t ldiskfs /dev/loop3 /export/MDT0
root@lus04-mds1:~# mount -t ldiskfs /dev/loop2 /export/MGS
root@lus04-mds1:~# cd /export/MDT0/
CATALOGS lost+found/ oi.16.10 oi.16.20 oi.16.30 oi.16.40 oi.16.50 oi.16.60 REMOTE_PARENT_DIR/
changelog_catalog lov_objid oi.16.11 oi.16.21 oi.16.31 oi.16.41 oi.16.51 oi.16.61 ROOT/
changelog_users lov_objseq oi.16.12 oi.16.22 oi.16.32 oi.16.42 oi.16.52 oi.16.62 seq_ctl
CONFIGS/ lquota.group oi.16.13 oi.16.23 oi.16.33 oi.16.43 oi.16.53 oi.16.63 seq_srv
CONFIGS_OLD/ lquota.user oi.16.14 oi.16.24 oi.16.34 oi.16.44 oi.16.54 oi.16.7
fld lquota_v2.group oi.16.15 oi.16.25 oi.16.35 oi.16.45 oi.16.55 oi.16.8
health_check lquota_v2.user oi.16.16 oi.16.26 oi.16.36 oi.16.46 oi.16.56 oi.16.9
last_rcvd O/ oi.16.17 oi.16.27 oi.16.37 oi.16.47 oi.16.57 OI_scrub
lfsck_bookmark OBJECTS/ oi.16.18 oi.16.28 oi.16.38 oi.16.48 oi.16.58 PENDING/
lfsck_namespace oi.16.0 oi.16.19 oi.16.29 oi.16.39 oi.16.49 oi.16.59 quota_master/
LOGS/ oi.16.1 oi.16.2 oi.16.3 oi.16.4 oi.16.5 oi.16.6 quota_slave/
root@lus04-mds1:~# cd /export/MDT0/CONFIGS
root@lus04-mds1:/export/MDT0/CONFIGS# mkdir ../C
CATALOGS CONFIGS/ CONFIGS_OLD/
root@lus04-mds1:/export/MDT0/CONFIGS# mkdir ../CONFIGS_OLD1
root@lus04-mds1:/export/MDT0/CONFIGS# mv lus04-client ../CONFIGS_OLD1
root@lus04-mds1:/export/MDT0/CONFIGS# cd /
root@lus04-mds1:/# umount /export/MDT0
root@lus04-mds1:/# cd /export/MGS/CONFIGS/
root@lus04-mds1:/export/MGS/CONFIGS# mkdir ../CONFIGS1
root@lus04-mds1:/export/MGS/CONFIGS# mv lus04-client ../CONFIGS1
root@lus04-mds1:/export/MGS/CONFIGS# cd /
root@lus04-mds1:/# umount /export/MGS
root@lus04-mds1:/# lustre_rmmod
fsfilt_ldiskfs 75227 0
obdclass 582304 10 mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,mdc,lquota,osc,ptlrpc
lvfs 43190 12 mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,mdc,lquota,osc,ptlrpc,obdclass
libcfs 248201 14 mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,mdc,lquota,osc,ksocklnd,ptlrpc,obdclass,lnet,lvfs
ldiskfs 291319 1 fsfilt_ldiskfs
mbcache 8134 2 ldiskfs,ext4
jbd2 63282 3 fsfilt_ldiskfs,ldiskfs,ext4
crc16 1659 2 ldiskfs,ext4
root@lus04-mds1:/# lustre_rmmod
open /proc/sys/lnet/dump_kernel failed: No such file or directory
open(dump_kernel) failed: No such file or directory
root@lus04-mds1:/# mount -t lustre /dev/loop2 /export/MGS
root@lus04-mds1:/# mount -t lustre /dev/loop3 /export/MDT0
root@lus04-mds1:/# tail /var/log/kern.log
Oct 20 08:22:09 lus04-mds1 kernel: [66935.739538] Lustre: MGS MGS started
Oct 20 08:22:09 lus04-mds1 kernel: [66935.740321] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 20 08:22:15 lus04-mds1 kernel: [66942.327177] LDISKFS-fs (loop3): barriers disabled
Oct 20 08:22:15 lus04-mds1 kernel: [66942.342193] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
Oct 20 08:22:16 lus04-mds1 kernel: [66942.480158] LDISKFS-fs (loop3): barriers disabled
Oct 20 08:22:16 lus04-mds1 kernel: [66942.481802] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
Oct 20 08:22:16 lus04-mds1 kernel: [66942.511948] LustreError: 13c-e: Client log lus04-client has disappeared! Regenerating all logs.
Oct 20 08:22:16 lus04-mds1 kernel: [66942.514410] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
Oct 20 08:22:16 lus04-mds1 kernel: [66942.516976] Lustre: Enabling user_xattr
Oct 20 08:22:16 lus04-mds1 kernel: [66942.579581] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/loop3 with recovery enabled
Tune the OSTs and remount them.
Mounting the client still fails.
mount /lustre/scratch104
mount.lustre: mount lus04-mds1@tcp0:/lus04 at /lustre/scratch104 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
root@bc-11-2-02:~# tail /var/log/kern.log
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.917419] LustreError: 23097:0:(obd_config.c:372:class_setup()) setup lus04-MDT0000-mdc-ffff88041f2c3400 failed (-2)
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.917572] LustreError: 23097:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.917742] Lustre: cmd=cf003 0:lus04-MDT0000-mdc 1:lus04-MDT0000_UUID 2:0@<0:0>
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.917930] LustreError: 15c-8: MGC172.17.148.4@tcp: The configuration from log 'lus04-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.918272] LustreError: 23027:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -2
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.918686] LustreError: 23027:0:(obd_config.c:443:class_cleanup()) Device 2 not setup
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.918942] LustreError: 23027:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.919097] LustreError: 23027:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.927817] Lustre: client lus04-client(ffff88041f2c3400) umount complete
Oct 20 08:24:16 bc-11-2-02 kernel: [510327.927906] LustreError: 23027:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-2)
|
|
We have managed to mount the file system. I will update the ticket with more information once I have started the operation to copy all the data off it. Thanks for your help, Bob.
|
|
James, Glad to hear you worked out a way to mount the file system successfully. Will be curious to see exactly what combination of settings worked for you. Looking forward to your post of more information.
|
|
I have checked with Sven that he is happy with me posting this chat log.
After breaking the MGS by adding the --mdt flag to it, we formatted a new MGS, which is documented earlier in this bug (19/Oct/13):
mkfs.lustre --mgs --fsname=lus04 /dev/loop2
We tried using service nodes rather than failnodes. I also suspect that mounting things on the "secondary" node helped, but I can't prove this: since it was working, I was loath to break the sequence to run experiments on it.
[21/10/2013 10:39:55] James Beal: root@lus04-mds1:/# tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.4 --servicenode=172.17.148.5 --fsname=lus04 --mgs /dev/loop2
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x74
(MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1174
(MGS needs_index first_time update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
[21/10/2013 10:51:55] James Beal: root@lus04-mds1:/# tunefs.lustre --writeconf --erase-params --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --servicenode=172.17.148.4 --servicenode=172.17.148.5 --fsname=lus04 /dev/loop3
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1001
(MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp
Permanent disk data:
Target: lus04-MDT0000
Index: 0
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1141
(MDT update writeconf no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp failover.node=172.17.148.4@tcp failover.node=172.17.148.5@tcp
Writing CONFIGS/mountdata
Unload the lustre modules on mds1
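(Presumably with lustre_rmmod, as earlier in this ticket, where a second pass was needed to clear everything. Sketch only:)
lustre_rmmod
lustre_rmmod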
[21/10/2013 10:53:56] James Beal: root@lus04-mds1:/# mount -t lustre /dev/loop2 /export/MGS
[21/10/2013 10:54:02] James Beal: mount -t lustre /dev/loop3 /export/MDT0
[21/10/2013 10:54:22] Sven Trautmann: ok
[21/10/2013 10:54:24] James Beal: 162261.673514] LDISKFS-fs (loop3): barriers disabled
[162261.676015] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
[162261.765128] LDISKFS-fs (loop3): barriers disabled
[162261.766875] LDISKFS-fs (loop3): mounted filesystem with ordered data mode
[162261.798124] Lustre: MGS: Logs for fs lus04 were removed by user request. All servers must be restarted in order to regenerate the logs.
[162261.800164] Lustre: Enabling user_xattr
[162261.801410] Lustre: lus04-MDT0000: Now serving lus04-MDT0000 on /dev/loop3 with recovery enabled
[21/10/2013 10:54:30] James Beal: Shall I try and mount a client ?
[21/10/2013 10:55:18] Sven Trautmann: try a client, i'm not very confident it will work
[21/10/2013 10:55:34] James Beal: I think it is as likely to work as with the OST's
[21/10/2013 10:55:43] Sven Trautmann: :x
[21/10/2013 10:56:35] James Beal: taking its time
[21/10/2013 10:57:22] James Beal: Oct 21 10:55:49 bc-11-2-07 kernel: [17927384.672219] Lustre: Acceptor stopping
Oct 21 10:55:51 bc-11-2-07 kernel: [17927386.672142] Lustre: Removed LNI 172.17.115.32@tcp
Oct 21 10:56:21 bc-11-2-07 kernel: [17927416.866605] Lustre: Build Version: v1_8_9_WC1sanger1--PRISTINE-2.6.32.59-sles-lustre-1.8.8wc1
Oct 21 10:56:21 bc-11-2-07 kernel: [17927416.937734] Lustre: Added LNI 172.17.115.32@tcp [8/256/0/180]
Oct 21 10:56:21 bc-11-2-07 kernel: [17927416.937915] Lustre: Accept secure, port 988
Oct 21 10:56:21 bc-11-2-07 kernel: [17927417.028382] Lustre: Lustre Client File System; http://www.lustre.org/
Oct 21 10:56:21 bc-11-2-07 kernel: [17927417.048662] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 21 10:56:21 bc-11-2-07 kernel: [17927417.052298] LustreError: 11-0: an error occurred while communicating with 172.17.148.4@tcp. The mds_connect operation failed with -11
[21/10/2013 10:57:30] James Beal: Still trying
[21/10/2013 10:58:17] Sven Trautmann: resource temporarily unavailable, sounds like missing ost
[21/10/2013 10:58:26] James Beal: Oct 21 10:58:15 bc-11-2-07 kernel: [17927530.816907] LustreError: 11-0: an error occurred while communicating with 172.17.148.4@tcp. The mds_connect operation failed with -11
[21/10/2013 10:58:31] James Beal: How about adding one OST
[21/10/2013 10:58:34] James Beal: vd00 ?
[21/10/2013 10:58:46] James Beal: Or moving the MDS to .5 ?
[21/10/2013 10:58:47] Sven Trautmann: you did the last one already?
[21/10/2013 10:58:54] Sven Trautmann: 29
[21/10/2013 10:59:05] James Beal: Ok I will need to writeconf it again.
[21/10/2013 10:59:10] Sven Trautmann: ok
[21/10/2013 10:59:25] James Beal: root@lus04-mds1:/# lctl dl
0 UP mgs MGS MGS 13
1 UP mgc MGC172.17.148.4@tcp 0654a817-b2ec-a591-d828-bb850199cfe1 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
[21/10/2013 10:59:57] James Beal: umounting all the OSS
[21/10/2013 11:00:16] Sven Trautmann: the ost where still mounted?
[21/10/2013 11:00:22] James Beal: Yes.
[21/10/2013 11:00:32] Sven Trautmann: ok
[21/10/2013 11:01:11] James Beal: tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/vd29
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1002
(OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=172.17.148.9@tcp failover.node=172.17.148.8@tcp mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp
Permanent disk data:
Target: lus04-OST001d
Index: 29
Lustre FS: lus04
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=172.17.148.9@tcp failover.node=172.17.148.8@tcp mgsnode=172.17.148.4@tcp mgsnode=172.17.148.5@tcp
Writing CONFIGS/mountdata
[21/10/2013 11:01:34] Sven Trautmann: ok
[21/10/2013 11:01:51] James Beal: root@lus04-oss4:~# mount -t lustre /dev/mapper/vd29 /export/vd29
[21/10/2013 11:02:08] James Beal: Oct 21 11:01:44 lus04-oss4 kernel: [247251.590225] Lustre: lus04-OST001d: Now serving lus04-OST001d on /dev/mapper/vd29 with recovery enabled
Oct 21 11:01:44 lus04-oss4 kernel: [247251.590229] Lustre: lus04-OST001d: Will be in recovery for at least 5:00, or until 1 client reconnects
Oct 21 11:01:48 lus04-oss4 kernel: [247255.743705] LustreError: 20470:0:(ldlm_lib.c:887:target_handle_connect()) lus04-OST001d: NID 172.17.148.4@tcp (lus04-mdtlov_UUID) reconnected with 1 conn_cnt; cookies not random?
Oct 21 11:01:48 lus04-oss4 kernel: [247255.743908] LustreError: 20470:0:(ldlm_lib.c:887:target_handle_connect()) Skipped 7 previous similar messages
Oct 21 11:01:48 lus04-oss4 kernel: [247255.744049] LustreError: 20470:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (114) req@ffff88062b1e5c00 x1449498209419298/t0 o8><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1382349808 ref 1 fl Interpret:/0/0 rc -114/0
Oct 21 11:01:48 lus04-oss4 kernel: [247255.744250] LustreError: 20470:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 7 previous similar messages
[21/10/2013 11:02:20] Sven Trautmann: good
[21/10/2013 11:02:27] James Beal: root@lus04-mds1:/# lctl dl
0 UP mgs MGS MGS 9
1 UP mgc MGC172.17.148.4@tcp 0654a817-b2ec-a591-d828-bb850199cfe1 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lus04-mdtlov lus04-mdtlov_UUID 4
4 UP mds lus04-MDT0000 lus04-MDT0000_UUID 3
5 UP osc lus04-OST001d-osc lus04-mdtlov_UUID 5
[21/10/2013 11:02:32] Sven Trautmann: good
[21/10/2013 11:02:39] James Beal: Client still waiting
[21/10/2013 11:02:43] James Beal: Shall I interupt it
[21/10/2013 11:02:49] James Beal: And try again ?
[21/10/2013 11:02:54] Sven Trautmann: no, the recovery needs to finish
[21/10/2013 11:02:58] James Beal: 
[21/10/2013 11:03:09] Sven Trautmann: what is the mds saying about recovery?
[21/10/2013 11:03:31] James Beal: Oct 21 11:02:25 lus04-mds1 kernel: [162781.493095] Lustre: lus04-MDT0000: temporarily refusing client connection from 172.17.115.32@tcp
Oct 21 11:02:25 lus04-mds1 kernel: [162781.493110] LustreError: 12347:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (11) req@ffff88061f970800 x1449498384531486/t0 o38><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1382349845 ref 1 fl Interpret:/0/0 rc -11/0
root@lus04-mds1:/#
[21/10/2013 11:03:42] James Beal: Which looks hopeful
[21/10/2013 11:04:17] Sven Trautmann: agree
[21/10/2013 11:04:23] James Beal: root@lus04-oss4:~# cat /proc/fs/lustre/obdfilter/lus04-OST001d/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
delayed_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 103079215105
[21/10/2013 11:04:39] James Beal: I would be happier if that was a bit different
[21/10/2013 11:06:15] James Beal: Oct 21 11:05:55 lus04-oss4 kernel: [247502.171140] Lustre: lus04-OST001d: Recovery period over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 21 11:05:55 lus04-oss4 kernel: [247502.171145] Lustre: Skipped 7 previous similar messages
Oct 21 11:05:55 lus04-oss4 kernel: [247502.171148] Lustre: lus04-OST001d: sending delayed replies to recovered clients
Oct 21 11:05:55 lus04-oss4 kernel: [247502.171150] Lustre: Skipped 7 previous similar messages
Oct 21 11:05:55 lus04-oss4 kernel: [247502.171824] Lustre: lus04-OST001d: received MDS connection from 172.17.148.4@tcp
Oct 21 11:05:55 lus04-oss4 kernel: [247502.171828] Lustre: Skipped 6 previous similar messages
Oct 21 11:05:55 lus04-oss4 kernel: [247502.172491] Lustre: 20474:0:(filter.c:3129:filter_destroy_precreated()) lus04-OST001d: deleting orphan objects from 50974178 to 50974199, orphan objids won't be reused any more.
Oct 21 11:05:55 lus04-oss4 kernel: [247502.172497] Lustre: 20474:0:(filter.c:3129:filter_destroy_precreated()) Skipped 6 previous similar messages
[21/10/2013 11:06:26] James Beal: cat /proc/fs/lustre/obdfilter/lus04-OST001d/recovery_status
status: COMPLETE
recovery_start: 1382349955
recovery_duration: 0
delayed_clients: 0/1
completed_clients: 1/1
replayed_requests: 0
last_transno: 103079215104
[21/10/2013 11:06:40] James Beal: Oct 21 11:05:55 lus04-mds1 kernel: [162991.353819] Lustre: lus04-OST001d-osc: Connection restored to service lus04-OST001d using nid 172.17.148.9@tcp.
Oct 21 11:05:55 lus04-mds1 kernel: [162991.354453] Lustre: MDS lus04-MDT0000: lus04-OST001d_UUID now active, resetting orphans
[21/10/2013 11:06:48] James Beal: Oct 21 11:06:35 bc-11-2-07 kernel: [17928030.831136] LustreError: 18685:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-513)
Oct 21 11:06:35 bc-11-2-07 kernel: [17928030.839158] Lustre: MGC172.17.148.4@tcp: Reactivating import
Oct 21 11:06:35 bc-11-2-07 kernel: [17928030.844564] Lustre: Client lus04-client(ffff880260dc8c00) mount complete
[21/10/2013 11:07:07] Sven Trautmann: nice 
[21/10/2013 11:07:28] James Beal: Lets get the rest of the OSS up.
[21/10/2013 11:07:35] Sven Trautmann: 
[21/10/2013 11:07:42] Sven Trautmann: writeconf on all
[21/10/2013 11:07:47] James Beal: Yes
[21/10/2013 11:07:57] James Beal: A quick break 
[21/10/2013 11:08:04] Sven Trautmann: sure
[21/10/2013 11:11:12] James Beal: Back.
[21/10/2013 11:12:12] Sven Trautmann: ok, you do the tunefs on all OST?
[21/10/2013 11:12:18] James Beal: I will
[21/10/2013 11:12:19] James Beal: root@lus04-oss3:/# for i in `seq -w 23 28 ` ; do echo tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/$i ; done
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/23
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/24
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/25
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/26
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/27
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.9 --servicenode=172.17.148.8 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/28
[21/10/2013 11:13:05] James Beal: for i in `seq -w 15 22 ` ; do echo tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/$i ; done
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/15
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/16
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/17
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/18
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/19
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/20
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/21
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.8 --servicenode=172.17.148.9 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/22
[21/10/2013 11:14:12] Sven Trautmann: looks ok
[21/10/2013 11:14:29] James Beal: root@lus04-oss2:~# for i in `seq -w 07 14 ` ; do echo tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/$i ; done
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/07
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/08
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/09
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/10
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/11
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/12
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/13
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.6 --servicenode=172.17.148.7 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/14
[21/10/2013 11:15:07] Sven Trautmann: i guess, i'm not sure if the servicenode order has any influence at all
[21/10/2013 11:15:13] James Beal: root@lus04-oss1:~# for i in `seq -w 00 06 ` ; do echo tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/$i ; done
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/00
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/01
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/02
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/03
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/04
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/05
tunefs.lustre --writeconf --erase-params --servicenode=172.17.148.7 --servicenode=172.17.148.6 --mgsnode 172.17.148.4@tcp --mgsnode 172.17.148.5@tcp --fsname=lus04 /dev/mapper/06
[21/10/2013 11:15:21] James Beal: Nor do I but I am trying to be consistent
[21/10/2013 11:15:34] James Beal: Happy with those ?
[21/10/2013 11:17:17] James Beal: ping
[21/10/2013 11:17:37] Sven Trautmann: as happy as can be
[21/10/2013 11:17:45] James Beal: ok going to do that
[21/10/2013 11:17:55] Sven Trautmann: 
[21/10/2013 11:18:16] James Beal: Was missing a vd...
[21/10/2013 11:19:19] James Beal: I will start the OST mounts
[21/10/2013 11:19:51] Sven Trautmann: ok
[21/10/2013 11:21:36] James Beal: for i in `seq -w 01 06 `;
do
mount -t lustre /dev/mapper/vd$i /export/vd$i
date
sleep 10
done
Mon Oct 21 11:21:11 BST 2013
Mon Oct 21 11:21:21 BST 2013
Mon Oct 21 11:21:31 BST 2013
[21/10/2013 11:22:21] James Beal: root@lus04-oss2:~# for i in `seq -w 01 06 `;
do
mount -t lustre /dev/mapper/vd$i /export/vd$i
date
sleep 10
done
Mon Oct 21 11:21:11 BST 2013
Mon Oct 21 11:21:21 BST 2013
Mon Oct 21 11:21:31 BST 2013
Mon Oct 21 11:21:42 BST 2013
Mon Oct 21 11:21:52 BST 2013
Mon Oct 21 11:22:02 BST 2013
root@lus04-oss2:~#
[21/10/2013 11:23:19] Sven Trautmann: lctl dl on the client?
[21/10/2013 11:23:41] James Beal: root@bc-11-2-07:~# lctl dl
0 UP mgc MGC172.17.148.4@tcp b2a666ed-eb18-98c4-5c4e-a98d7120a06b 5
1 UP lov lus04-clilov-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 4
2 UP mdc lus04-MDT0000-mdc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
3 UP osc lus04-OST001d-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
4 UP osc lus04-OST0000-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
5 UP osc lus04-OST0001-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
6 UP osc lus04-OST0002-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
7 UP osc lus04-OST0003-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
8 UP osc lus04-OST0004-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
9 UP osc lus04-OST0005-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
10 UP osc lus04-OST0006-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
11 UP osc lus04-OST0007-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
12 UP osc lus04-OST0008-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
13 UP osc lus04-OST0009-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
14 UP osc lus04-OST000a-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
15 UP osc lus04-OST000b-osc-ffff880260dc8c00 ddad28f4-48ec-9bb6-043d-230b0b8696f1 5
[21/10/2013 11:23:42] James Beal: 
[21/10/2013 11:24:04] Sven Trautmann: there are some missing?
[21/10/2013 11:24:15] James Beal: I havn't finished the mounts
[21/10/2013 11:24:18] Sven Trautmann: ok
[21/10/2013 11:24:24] James Beal: root@lus04-oss1:~# for i in `seq -w 07 14 `; do mount -t lustre /dev/mapper/vd$i /export/vd$i; date ; sleep 10; done
Mon Oct 21 11:22:46 BST 2013
Mon Oct 21 11:22:56 BST 2013
Mon Oct 21 11:23:06 BST 2013
Mon Oct 21 11:23:16 BST 2013
Mon Oct 21 11:23:27 BST 2013
Mon Oct 21 11:23:37 BST 2013
Mon Oct 21 11:23:47 BST 2013
Mon Oct 21 11:23:58 BST 2013
[21/10/2013 11:24:37] Sven Trautmann: right
[21/10/2013 11:26:32] James Beal: for i in `seq -w 15 22 `; do mount -t lustre /dev/mapper/vd$i /export/vd$i; date ; sleep 10; done
Mon Oct 21 11:25:01 BST 2013
Mon Oct 21 11:25:11 BST 2013
Mon Oct 21 11:25:21 BST 2013
Mon Oct 21 11:25:32 BST 2013
Mon Oct 21 11:25:42 BST 2013
Mon Oct 21 11:25:52 BST 2013
Mon Oct 21 11:26:02 BST 2013
Mon Oct 21 11:26:13 BST 2013
root@lus04-oss3:/#
[21/10/2013 11:26:58] Sven Trautmann: recovery will take some time, i guess
[21/10/2013 11:27:04] James Beal: 5 minutes 
[21/10/2013 11:27:28] Sven Trautmann: if everything goes as planned, yes
[21/10/2013 11:27:49] James Beal: root@lus04-oss2:~# grep -i status /proc/fs/lustre/obdfilter/lus04-OST*/recovery_status
/proc/fs/lustre/obdfilter/lus04-OST0000/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0001/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0002/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0003/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0004/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0005/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lus04-OST0006/recovery_status:status: COMPLETE
|
|
I had a similar problem today on another system.
For cosmetic reasons I had to change the hostnames and IPs of the OSS and MDS servers.
I did the following:
1. unmounted Lustre
2. changed the IP addresses of the LNET interfaces (o2ib)
3. ran a tunefs on all targets
4. mounted MGS, MDT and OSTs
After the first mount I got this on the MDS:
Oct 22 16:32:34 pfs2n12 kernel: : LustreError: 28201:0:(ldlm_lib.c:383:client_obd_setup()) can't add initial connection
Oct 22 16:32:34 pfs2n12 kernel: : LustreError: 28201:0:(obd_config.c:565:class_setup()) setup pfs2dat2-OST000a-osc-MDT0000 failed (-2)
Oct 22 16:32:35 pfs2n12 kernel: : LustreError: 28201:0:(obd_config.c:1491:class_config_llog_handler()) Err -2 on cfg command:
Oct 22 16:32:35 pfs2n12 kernel: : Lustre: cmd=cf003 0:pfs2dat2-OST000a-osc-MDT0000 1:pfs2dat2-OST000a_UUID 2:0@<0:0>
Oct 22 16:32:50 pfs2n12 kernel: : Lustre: 28083:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from bd8a639b-be45-6048-47b8-568404d7547d@172.26.8.19@o2ib t0 exp (null) cur 1382452370 last 0
Oct 22 16:32:50 pfs2n12 kernel: : Lustre: 28083:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 2 previous similar messages
The OSC device on the MDS was in state AT:
0 UP mgs MGS MGS 19
1 UP mgc MGC172.26.8.12@o2ib c4ff7eb0-8c6f-9199-45f6-f75e490ac101 5
2 UP lov pfs2dat2-MDT0000-mdtlov pfs2dat2-MDT0000-mdtlov_UUID 4
3 UP mdt pfs2dat2-MDT0000 pfs2dat2-MDT0000_UUID 3
4 UP mds mdd_obd-pfs2dat2-MDT0000 mdd_obd_uuid-pfs2dat2-MDT0000 3
5 AT osc pfs2dat2-OST000a-osc-MDT0000 pfs2dat2-MDT0000-mdtlov_UUID 1
just like in the problem described here.
In my case the reason for the problem was that I forgot to unload the LNET kernel module after I changed the IP of the InfiniBand port, so the LNET NID no longer matched the IP of the underlying IB interface on the OSS.
After a clean lustre_rmmod the problem was gone and all OSTs could connect without problems.
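(So after changing the IP of an LNET interface, the safe sequence is roughly the following. This is a sketch rather than the exact commands that were run; lctl list_nids is just a quick way to confirm the node now advertises the new NID:)
# unmount every Lustre target on the node, then:
lustre_rmmod          # unload lustre/lnet so the stale NID is forgotten
# recheck the lnet options (e.g. "options lnet networks=o2ib(ib0)" in /etc/modprobe.d) if needed, then:
modprobe lustre
lctl list_nids        # should now report the new @o2ib NID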
|
|
Closing old bug.
|