[LU-11211] Performance degradation in mdtest Created: 04/Aug/18 Updated: 12/Aug/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Abe | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are observing performance degradation in mdtest IOPS testing at around 50K IOPS with ZFS. The configuration has DNE and DoM; however, it appears that only one MDT out of the 3 MDTs is being utilized:
Steps for the configuration in place:
[root@mds-201 ~]# zpool list
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
Snapshots of the results:
Command line used: ./mdtest -d /mnt/lustre/domdir-mdts/testdir-4 -n 47662 -F -e -u -i 1
44 tasks, 2097128 files
SUMMARY: (of 1 iterations) -- finished at 08/03/2018 09:41:01 --
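A quick way to confirm whether only one MDT is absorbing the create load is to watch per-MDT inode usage while the test runs — a minimal sketch (the probe filename below is illustrative):

lfs df -i /mnt/lustre                                        # per-MDT inode usage; only the busy MDT's IUsed grows
lfs getstripe -m /mnt/lustre/domdir-mdts/testdir-4/somefile  # -m prints the index of the MDT holding a given file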
thanks, Abe
|
| Comments |
| Comment by Andreas Dilger [ 04/Aug/18 ] |
|
There are two independent layout parameters that you need to configure for your testing - the directory layout, which controls the MDT where the files will be created, and the file layout, which controls where the data will be located. The lfs mkdir command works very similarly to the lfs setstripe command, and can also be used as lfs setdirstripe to create new subdirectories with non-default parameters. There are two possible options for distributing files across MDTs:
If you want to create a set of directories for multiple threads/jobs, create one or more remote directories on multiple MDTs as below:
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
for mdt_idx in {1..4}; do
lfs mkdir -i $mdt_idx /mnt/lustre/domdir/dir-$mdt_idx
done
The file layout will be inherited by the new subdirectories below domdir, and the files/directories created within each dir-N subdirectory will be created on that specific MDT.

If you want to create a single directory that distributes files within the directory across multiple MDTs:
lfs mkdir -c 4 /mnt/lustre/domdir
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
The files/directories created within the domdir directory will be distributed across the MDTs based on the filename hash, so a uniform distribution is not guaranteed if you are testing with e.g. 4 threads and 4 MDTs. With larger numbers of threads, the files and directories created within this specific directory will be spread relatively evenly between MDTs. Lower-level subdirectories will typically be created on the same MDT as the "top level" subdirectory itself, unless you use the "-D" option, which causes the default lfs mkdir settings to be inherited by newly-created subdirectories as well. Note that the creation of remote or striped directories is itself fairly slow, but creating files within the striped or remote subdirectories scales fairly well. |
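To sanity-check either setup, the layout tools can report where things actually landed — a short sketch, assuming the dir-N names from the loop above (the "probe" filename is illustrative):

lfs getdirstripe /mnt/lustre/domdir/dir-1      # which MDT does this remote directory live on?
touch /mnt/lustre/domdir/dir-1/probe
lfs getstripe -m /mnt/lustre/domdir/dir-1/probe  # the file's MDT index should match its parent's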
| Comment by Abe [ 04/Aug/18 ] |
|
Hi Andreas,
I tried the lfs commands, but the striped dirs are not being created:
Do these have to be run on the client or the MDS server?
[root@client1-221 ~]# lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
[root@client1-221 ~]# for mdt_idx in {1..3}; do
lfs mkdir: unable to open '/mnt/lustre/domdir/dir-3': Not a directory (20)
thanks, Abe
|
| Comment by Andreas Dilger [ 04/Aug/18 ] |
|
Also, what do "lfs df -i" and "lctl dl" on the client show? It almost seems like the client is not connected to the MDTs. |
| Comment by Abe [ 04/Aug/18 ] |
|
Hi Andreas, this is what is shown on the client:
[root@client1-221 ~]# lfs df -i
filesystem_summary: 138043726 1814 138041912 0% /mnt/lustre
[root@client1-221 ~]# lctl dl
thanks, Abe |
| Comment by Andreas Dilger [ 04/Aug/18 ] |
|
How did you manage to get MDT0005 without any of the intervening MDTs? It might be that we don't handle discontiguous MDT indices very well, but that wouldn't explain why MDT0001 is failing. Also, you appear to be missing OST0000. Not sure if that is related, but it is not standard in any case. It might be that the problem is an omission on my part: I thought the "domdir" directory already existed, but if not, then the first "lfs setstripe ... /mnt/lustre/domdir" command will create it as a regular file. Sorry for the confusion. Instead, please remove that file first and run "mkdir /mnt/lustre/domdir", or use some other new directory for testing. |
| Comment by Abe [ 04/Aug/18 ] |
|
Hi Andreas,
1. rm -rf /mnt/lustre/domdir
thanks, |
| Comment by Abe [ 05/Aug/18 ] |
|
Hi Andreas, I went ahead and rebuilt the fs and it looks cleaner now; however, I'm still getting the error:
lfs mkdir -i 3 /mnt/lustre/domdir/dir-3
[root@client1-221 ~]# ls -l /mnt/lustre/domdir
lfs mkdir: unable to open '/mnt/lustre/domdir/dir-3': Not a directory (20)
[root@client1-221 ~]# lfs df -i
filesystem_summary: 333527757 978 333526779 0% /mnt/lustre
[root@client1-221 ~]# lctl dl
Thanks, Abe |
| Comment by Andreas Dilger [ 05/Aug/18 ] |
|
Abe, to be clear, you must create the directory before the "lfs setstripe" command:
mkdir /mnt/lustre/domdir
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
lfs mkdir -i ... |
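Putting the corrected sequence together — a sketch only, with dir indices 0-2 matching the three MDTs discussed above:

mkdir /mnt/lustre/domdir                              # must already exist as a directory
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir   # then set the DoM default layout on it
for mdt_idx in {0..2}; do                             # one remote directory per MDT
    lfs mkdir -i $mdt_idx /mnt/lustre/domdir/dir-$mdt_idx
done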
| Comment by Abe [ 05/Aug/18 ] |
|
This seems to have worked after removing and re-creating the dir. How do we issue the mdtest command with the -d directory option? Do we specify all 3 created directories for the 3 MDTs, as in:
./mdtest -d /mnt/lustre/domdir/dir-0 /mnt/lustre/domdir/dir-1 /mnt/lustre/domdir/dir-2
thanks, Abe
|
| Comment by Abe [ 05/Aug/18 ] |
|
Hi Andreas, it seems to only use one MDT and not the other MDTs. When we issue the mdtest command, do we need to specify all the directories (see the note below this comment)? e.g.:
./mdtest -d /mnt/lustre/domdir/dir-0 /mnt/lustre/domdir/dir-1 /mnt/lustre/domdir/dir-2 ?
[root@mds-201 ~]# zpool list
thanks, Abe |
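For reference, mdtest takes multiple target directories as a single '@'-separated -d argument rather than as separate arguments — a sketch, assuming the mdtest build in use supports that syntax:

./mdtest -d /mnt/lustre/domdir/dir-0@/mnt/lustre/domdir/dir-1@/mnt/lustre/domdir/dir-2 -n 47662 -F -e -u -i 1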
| Comment by Andreas Dilger [ 05/Aug/18 ] |
|
If you want all of the subdirectories under domdir to also be created as remote directories, you could try the following:
lfs mkdir -c -1 /mnt/lustre/domdir
lfs setdirstripe -D -c -1 /mnt/lustre/domdir
lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir
This will make every subdirectory striped across all MDTs, and this setting will be inherited by further subdirectories. That is not necessarily ideal for every directory, but it will allow you to distribute the mdtest workload across multiple MDTs more easily. We are working on better ways to achieve this goal, but this may be sufficient for your current needs. |
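To confirm the default striping is in place and being inherited, a quick check along these lines should work (a sketch; "newsub" is an illustrative name, and -D prints a directory's default layout):

lfs getdirstripe -D /mnt/lustre/domdir        # default layout: stripe count -1, i.e. all MDTs
mkdir /mnt/lustre/domdir/newsub
lfs getdirstripe /mnt/lustre/domdir/newsub    # the new subdirectory should itself be striped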
| Comment by Abe [ 07/Aug/18 ] |
|
Hi Andreas, this seems to have worked; the workload got distributed across the MDTs:
[root@mds-201 ~]# zpool list
Performance has gone up by 10%, up to the point where CPU utilization hit 100%. If we add another MDS server, will DNE work and the workload get distributed across the 2 MDS servers using the same namespace?
thanks, Abe
|
| Comment by Andreas Dilger [ 07/Aug/18 ] |
|
In our testing in the past, adding a second MDT on the same MDS improved performance by about 50%, but with enough clients the increase in performance with a separate MDS per MDT was much better, about 90%. |
| Comment by Abe [ 07/Aug/18 ] |
|
Hi Andreas,
I'm adding a 2nd MDS server with its own MDTs. How do the clients mount the same namespace from two separate servers having different IP addresses? e.g., on the client servers (will they have 2 separate mounts?):
1st MDS server mount:
mount -t lustre 10.10.10.200@o2ib:10.10.10.201@o2ib:/tempAA /mnt/lustre
And the 2nd MDS server mount:
mount -t lustre 10.10.10.200@o2ib:10.10.10.202@o2ib:/tempAA /mnt/lustre
thanks, Abe
|
| Comment by Andreas Dilger [ 07/Aug/18 ] |
|
Abe, the IP address (or more correctly, the "Lustre NID") listed on the client mount command is the primary (and failover) address of the MGS. This will typically be located on MDS0 with MDT0000. It is preferable to have the MGS on a separate device so that it can be failed over independently of MDT0000. In any case, the clients do not need to change anything in their mount command when an MDS is added, since there is only a single MGS for the filesystem. The actual connections to the MDT(s) are handled internally by the Lustre configuration log, in the same way as with OSTs. |
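Concretely, for the configuration in this ticket the client mount should name only the MGS NID — a sketch using the NIDs quoted above:

# MGS NID only; the MDS NIDs (10.10.10.200/.201) are learned from the MGS config logs
mount -t lustre 10.10.10.251@o2ib:/tempAA /mnt/lustre

If the MGS had a failover partner, its NID would be appended after a colon (primary:failover), still with no MDS NIDs on the command line.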
| Comment by Abe [ 10/Aug/18 ] |
|
Hi Andreas, below is the config for 2 MDS servers, 1 MGS server, and 1 client. When I mount the client, access to the fs /mnt/lustre is very slow. Is there something wrong with the way I'm mounting the client? Do I need to specify the NIDs for the MGS and the 2 MDS servers in the mount command? I do see an error when I tried to mount the client:
[root@client1-221 ~]# mount -t lustre 10.10.10.251@o2ib:10.10.10.201@o2ib:10.10.10.200@o2ib:/tempAA /mnt/lustre
[root@client1-221 ~]#
[ 134.067195] LNet: 1355:0:(o2iblnd.c:943:kiblnd_create_conn()) peer 10.10.10.251@o2ib - queue depth reduced from 128 to 63 to allow for qp creation
[ 134.237519] LustreError: 1935:0:(mgc_request.c:1576:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.10.10.200@o2ib
[ 134.239654] Lustre: 1935:0:(mgc_request.c:1802:mgc_process_recover_nodemap_log()) MGC10.10.10.251@o2ib: error processing recovery log tempAA-cliir: rc = -2
[ 134.239726] LustreError: 1935:0:(mgc_request.c:2132:mgc_process_log()) MGC10.10.10.251@o2ib: recover log tempAA-cliir failed, not fatal: rc = -2
[ 134.251092] Lustre: Mounted tempAA-client
[root@client1-221 ~]# ls -l /mnt/lustre
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
2 mds config:
mds 1:
[root@mds-201 ~]# lctl dl
0 UP osd-zfs tempAA-MDT0000-osd tempAA-MDT0000-osd_UUID 18
1 UP mgc MGC10.10.10.251@o2ib ba5bb4ef-c13e-30b7-318b-ba23b06f65bd 4
2 UP mds MDS MDS_uuid 2
3 UP lod tempAA-MDT0000-mdtlov tempAA-MDT0000-mdtlov_UUID 3
4 UP mdt tempAA-MDT0000 tempAA-MDT0000_UUID 56
5 UP mdd tempAA-MDD0000 tempAA-MDD0000_UUID 3
6 UP qmt tempAA-QMT0000 tempAA-QMT0000_UUID 3
7 UP lwp tempAA-MDT0000-lwp-MDT0000 tempAA-MDT0000-lwp-MDT0000_UUID 4
8 UP osd-zfs tempAA-MDT0001-osd tempAA-MDT0001-osd_UUID 17
9 UP lod tempAA-MDT0001-mdtlov tempAA-MDT0001-mdtlov_UUID 3
10 UP mdt tempAA-MDT0001 tempAA-MDT0001_UUID 34
11 UP mdd tempAA-MDD0001 tempAA-MDD0001_UUID 3
12 UP osp tempAA-MDT0000-osp-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
13 UP lwp tempAA-MDT0000-lwp-MDT0001 tempAA-MDT0000-lwp-MDT0001_UUID 4
14 UP osd-zfs tempAA-MDT0002-osd tempAA-MDT0002-osd_UUID 17
15 UP lod tempAA-MDT0002-mdtlov tempAA-MDT0002-mdtlov_UUID 3
16 UP mdt tempAA-MDT0002 tempAA-MDT0002_UUID 32
17 UP mdd tempAA-MDD0002 tempAA-MDD0002_UUID 3
18 UP osp tempAA-MDT0000-osp-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
19 UP osp tempAA-MDT0001-osp-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
20 UP lwp tempAA-MDT0000-lwp-MDT0002 tempAA-MDT0000-lwp-MDT0002_UUID 4
21 UP osd-zfs tempAA-MDT0003-osd tempAA-MDT0003-osd_UUID 17
22 UP lod tempAA-MDT0003-mdtlov tempAA-MDT0003-mdtlov_UUID 3
23 UP mdt tempAA-MDT0003 tempAA-MDT0003_UUID 32
24 UP mdd tempAA-MDD0003 tempAA-MDD0003_UUID 3
25 UP osp tempAA-MDT0000-osp-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
26 UP osp tempAA-MDT0001-osp-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
27 UP osp tempAA-MDT0002-osp-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
28 UP lwp tempAA-MDT0000-lwp-MDT0003 tempAA-MDT0000-lwp-MDT0003_UUID 4
29 UP osd-zfs tempAA-MDT0004-osd tempAA-MDT0004-osd_UUID 17
30 UP lod tempAA-MDT0004-mdtlov tempAA-MDT0004-mdtlov_UUID 3
31 UP mdt tempAA-MDT0004 tempAA-MDT0004_UUID 32
32 UP mdd tempAA-MDD0004 tempAA-MDD0004_UUID 3
33 UP osp tempAA-MDT0000-osp-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
34 UP osp tempAA-MDT0001-osp-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
35 UP osp tempAA-MDT0002-osp-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
36 UP osp tempAA-MDT0003-osp-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
37 UP lwp tempAA-MDT0000-lwp-MDT0004 tempAA-MDT0000-lwp-MDT0004_UUID 4
38 UP osp tempAA-MDT0004-osp-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
39 UP osp tempAA-MDT0003-osp-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
40 UP osp tempAA-MDT0004-osp-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
41 UP osp tempAA-MDT0002-osp-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
42 UP osp tempAA-MDT0003-osp-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
43 UP osp tempAA-MDT0004-osp-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
44 UP osp tempAA-MDT0001-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
45 UP osp tempAA-MDT0002-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
46 UP osp tempAA-MDT0003-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
47 UP osp tempAA-MDT0004-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
48 UP osp tempAA-OST0005-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
49 UP osp tempAA-OST0006-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
50 UP osp tempAA-OST0005-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
51 UP osp tempAA-OST0006-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
52 UP osp tempAA-OST0005-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
53 UP osp tempAA-OST0006-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
54 UP osp tempAA-OST0005-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
55 UP osp tempAA-OST0006-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
56 UP osp tempAA-OST0005-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
57 UP osp tempAA-OST0006-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
58 UP osp tempAA-OST0007-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
59 UP osp tempAA-OST0007-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
60 UP osp tempAA-OST0007-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
61 UP osp tempAA-OST0007-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
62 UP osp tempAA-OST0007-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
63 UP osp tempAA-OST0008-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
64 UP osp tempAA-OST0008-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
65 UP osp tempAA-OST0008-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
66 UP osp tempAA-OST0008-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
67 UP osp tempAA-OST0008-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
68 UP osp tempAA-OST0001-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
69 UP osp tempAA-OST0002-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
70 UP osp tempAA-OST0003-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
71 UP osp tempAA-OST0001-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
72 UP osp tempAA-OST0002-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
73 UP osp tempAA-OST0003-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
74 UP osp tempAA-OST0001-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
75 UP osp tempAA-OST0002-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
76 UP osp tempAA-OST0003-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
77 UP osp tempAA-OST0001-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
78 UP osp tempAA-OST0002-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
79 UP osp tempAA-OST0003-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
80 UP osp tempAA-OST0001-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
81 UP osp tempAA-OST0002-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
82 UP osp tempAA-OST0003-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
83 UP osp tempAA-OST0004-osc-MDT0004 tempAA-MDT0004-mdtlov_UUID 4
84 UP osp tempAA-OST0004-osc-MDT0003 tempAA-MDT0003-mdtlov_UUID 4
85 UP osp tempAA-OST0004-osc-MDT0002 tempAA-MDT0002-mdtlov_UUID 4
86 UP osp tempAA-OST0004-osc-MDT0001 tempAA-MDT0001-mdtlov_UUID 4
87 UP osp tempAA-OST0004-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
[root@mds-201 ~]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=65200336k,nr_inodes=16300084,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/centos-root on / type xfs (rw,relatime,attr2,inode64,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=785)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/sda2 on /boot type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/centos-home on /home type xfs (rw,relatime,attr2,inode64,noquota)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=13042812k,mode=700)
mdtpool/mdt on /mnt/lustre/mdt type lustre (ro,svname=tempAA-MDT0000,mgsnode=10.10.10.251@o2ib:10.10.10.201@o2ib,osd=osd-zfs)
mdtpool1/mdt1 on /mnt/lustre/mdt1 type lustre (ro,svname=tempAA-MDT0001,mgsnode=10.10.10.251@o2ib:10.10.10.201@o2ib,osd=osd-zfs)
mdtpool2/mdt2 on /mnt/lustre/mdt2 type lustre (ro,svname=tempAA-MDT0002,mgsnode=10.10.10.251@o2ib:10.10.10.201@o2ib,osd=osd-zfs)
mdtpool3/mdt3 on /mnt/lustre/mdt3 type lustre (ro,svname=tempAA-MDT0003,mgsnode=10.10.10.251@o2ib:10.10.10.201@o2ib,osd=osd-zfs)
mdtpool4/mdt4 on /mnt/lustre/mdt4 type lustre (ro,svname=tempAA-MDT0004,mgsnode=10.10.10.251@o2ib:10.10.10.201@o2ib,osd=osd-zfs)
[root@mds-201 ~]#
mds 2:
[root@mgs-200 ~]# lctl dl
0 UP osd-zfs tempAA-MDT0000-osd tempAA-MDT0000-osd_UUID 18
1 UP mgc MGC10.10.10.251@o2ib 4dfe3fbf-5953-a8e6-3fb6-9eebff8592e3 4
2 UP mds MDS MDS_uuid 2
3 UP lod tempAA-MDT0000-mdtlov tempAA-MDT0000-mdtlov_UUID 3
4 UP mdt tempAA-MDT0000 tempAA-MDT0000_UUID 2
5 UP mdd tempAA-MDD0000 tempAA-MDD0000_UUID 3
6 UP qmt tempAA-QMT0000 tempAA-QMT0000_UUID 3
7 UP osp tempAA-MDT0001-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
8 UP osp tempAA-MDT0002-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
9 UP osp tempAA-MDT0003-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
10 UP osp tempAA-MDT0004-osp-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
11 UP lwp tempAA-MDT0000-lwp-MDT0000 tempAA-MDT0000-lwp-MDT0000_UUID 4
12 UP osp tempAA-OST0005-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
13 UP osp tempAA-OST0006-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
14 UP osp tempAA-OST0007-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
15 UP osp tempAA-OST0008-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
16 UP osp tempAA-OST0001-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
17 UP osp tempAA-OST0002-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
18 UP osp tempAA-OST0003-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
19 UP osp tempAA-OST0004-osc-MDT0000 tempAA-MDT0000-mdtlov_UUID 4
[root@sbb-client1 ~]# lctl dl
0 UP osd-zfs MGS-osd MGS-osd_UUID 4
1 UP mgs MGS MGS 18
2 UP mgc MGC10.10.10.251@o2ib 6f0306ce-d9bf-1556-1288-9800b8b62090 4
[root@sbb-client1 ~]# moun
-bash: moun: command not found
[root@sbb-client1 ~]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32833268k,nr_inodes=8208317,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/rhel-root on / type xfs (rw,relatime,attr2,inode64,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=17604)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/mapper/rhel-home on /home type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sda2 on /boot type xfs (rw,relatime,attr2,inode64,noquota)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6569624k,mode=700)
mgspool/mgt on /mnt/lustre/mgt type lustre (ro,svname=MGS,nosvc,mgs,osd=osd-zfs)
[root@sbb-client1 ~]#
client1 config:
[root@client1-221 ~]# lctl dl
0 UP mgc MGC10.10.10.251@o2ib 8d56ab9b-2220-9262-99f0-7558b40523ba 4
1 UP lov tempAA-clilov-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 3
2 UP lmv tempAA-clilmv-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
3 UP mdc tempAA-MDT0000-mdc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
4 UP mdc tempAA-MDT0001-mdc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
5 UP mdc tempAA-MDT0002-mdc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
6 UP mdc tempAA-MDT0003-mdc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
7 UP mdc tempAA-MDT0004-mdc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
8 UP osc tempAA-OST0005-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
9 UP osc tempAA-OST0006-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
10 UP osc tempAA-OST0007-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
11 UP osc tempAA-OST0008-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
12 UP osc tempAA-OST0001-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
13 UP osc tempAA-OST0002-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
14 UP osc tempAA-OST0003-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
15 UP osc tempAA-OST0004-osc-ffff88105b505800 05280644-a903-6f5e-abfc-21a149e8384b 4
[root@client1-221 ~]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32836108k,nr_inodes=8209027,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/centos-root on / type xfs (rw,relatime,attr2,inode64,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=33,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
/dev/sda2 on /boot type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
/dev/mapper/centos-home on /home type xfs (rw,relatime,attr2,inode64,noquota)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6569420k,mode=700)
10.10.10.251@o2ib:10.10.10.201@o2ib:/tempAA on /mnt/lustre type lustre (rw,lazystatfs)
[root@client1-221 ~]#
thanks, Abe
|
| Comment by Abe [ 10/Aug/18 ] |
|
Also, a note here: when I try to mount the fs on the client with
mount -t lustre 10.10.10.251@o2ib:10.10.10.201@o2ib:/tempAA /mnt/lustre
I get this error in the client server's dmesg:
[ 1810.256672] LustreError: 2179:0:(mgc_request.c:1576:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.10.10.200@o2ib
thanks, Abe
|
| Comment by Andreas Dilger [ 10/Aug/18 ] |
|
As I previously mentioned, you should NOT specify the MDS NIDs on the mount command line. Only the MGS NIDs (primary and backup) should be on the mount command line. Sometimes the MGS is on the same node as an MDS, but with DNE there may be many MDS nodes, and they should definitely NOT be listed. This is likely causing the slow mount, as the client is trying to contact the MGS on each of the listed NIDs. I also see that in your example you have an "MDT0000" listed on both MDS1 and MDS2. That is not a valid configuration, as each MDT needs to have a different index. Having multiple MDT0000 devices in the same filesystem would cause severe corruption. |
| Comment by Abe [ 10/Aug/18 ] |
|
Hi Andreas,
I have modified mds-200 to have only MDT0006 and not MDT0000. Also, we are only using one MGS (10.10.10.251) without a backup, and 2 MDS servers (mds-200 (10.10.10.200) and mds-201 (10.10.10.201)). The command used on the client to mount the fs:
mount -t lustre 10.10.10.251@o2ib:/tempAA /mnt/lustre
But access to the filesystem is still slow!
[root@client1-221 ~]# mkdir /mnt/lustre/aadomdir
hangs!!
mds #1:
mkfs.lustre --mgs --fsname=tempAA --reformat --servicenode=10.10.10.200@o2ib --servicenode=10.10.10.200@o2ib --mgsnode=10.10.10.251@o2ib --mgsnode=10.10.10.251@o2ib --backfstype=zfs mgspool6/mdt6
[root@mds-200 ~]# lctl dl
mds #2:
mkfs.lustre --mdt --fsname=$NAME --reformat --index=0 --servicenode=$MGS_NID --servicenode=$MDT_NID --mgsnode=$MGS_NID --mgsnode=$MDT_NID --backfstype=zfs mdtpool/mdt
mkfs.lustre --mdt --fsname=$NAME --reformat --index=1 --servicenode=$MGS_NID --servicenode=$MDT_NID --mgsnode=$MGS_NID --mgsnode=$MDT_NID --backfstype=zfs mdtpool1/mdt1
mkfs.lustre --mdt --fsname=$NAME --reformat --index=3 --servicenode=$MGS_NID --servicenode=$MDT_NID --mgsnode=$MGS_NID --mgsnode=$MDT_NID --backfstype=zfs mdtpool3/mdt3
mkfs.lustre --mdt --fsname=$NAME --reformat --index=4 --servicenode=$MGS_NID --servicenode=$MDT_NID --mgsnode=$MGS_NID --mgsnode=$MDT_NID --backfstype=zfs mdtpool4/mdt4
mount -t lustre mdtpool/mdt /mnt/lustre/mdt
[root@mds-201 ~]# lctl dl
mgs config:
zpool create -f -O canmount=off -o cachefile=none mgspool sdb
mkfs.lustre --mgs --fsname=$NAME --reformat --servicenode=$MGS_NID --servicenode=$MDT_NID --mgsnode=$MGS_NID --mgsnode=$MDT_NID --backfstype=zfs mgspool/mgt
lctl dl
zpool list
client mount:
[root@client1-221 ~]# mount
thanks, Abe |
| Comment by Andreas Dilger [ 10/Aug/18 ] |
|
Abe,
mds #1:
mkfs.lustre --mdt --fsname=tempAA --reformat --index=0 --servicenode=10.10.10.251@o2ib --servicenode=10.10.10.200@o2ib --mgsnode=10.10.10.251@o2ib --mgsnode=10.10.10.200@o2ib --backfstype=zfs mdtpool6/mdt6
mkfs.lustre --mdt --fsname=tempAA --reformat --index=0 --servicenode=10.10.10.251@o2ib --servicenode=10.10.10.200@o2ib --mgsnode=10.10.10.251@o2ib --mgsnode=10.10.10.200@o2ib --backfstype=zfs mdtpool7/mdt7
This means you have two "tempAA-MDT0000" devices (the same --index=0 option for both MDTs) formatted on mds #1, on two different ZFS datasets. That is not good. You also show:
mds #2:
mkfs.lustre --mdt --fsname=tempAA --reformat --index=0 --servicenode=10.10.10.251@o2ib --servicenode=10.10.10.201@o2ib --mgsnode=10.10.10.251@o2ib --mgsnode=10.10.10.201@o2ib --backfstype=zfs mdtpool/mdt
That means you have yet another "tempAA-MDT0000" on mds #2. There should be only a single MDT0000 in the whole filesystem. You need to use a unique --index=N option for each MDT, so --index=6 for mdtpool6/mdt6 and --index=7 for mdtpool7/mdt7 at format time. It is likely that the current filesystem is corrupted, so I would suggest reformatting it from scratch, since it is only a test filesystem.
As for the command used on the client to mount the fs (mount -t lustre 10.10.10.251@o2ib:/tempAA /mnt/lustre): this appears to be correct for a single MGS node, once the other issues are fixed up. |
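One way to keep the indices unique is to derive them from the pool/dataset name — a sketch only, reusing the mkfs.lustre options already shown in this ticket:

# On mds #1: give mdtpool6/mdt6 index 6 and mdtpool7/mdt7 index 7
for i in 6 7; do
    mkfs.lustre --mdt --fsname=tempAA --reformat --index=$i \
        --servicenode=10.10.10.200@o2ib --mgsnode=10.10.10.251@o2ib \
        --backfstype=zfs mdtpool$i/mdt$i
done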
| Comment by Abe [ 12/Aug/18 ] |
|
Hi Andreas, the fs is more accessible now after making sure mdtpool6 and mdtpool7 use indices 6 and 7. The MDTs are all participating in the mdtest workload except for mdtpool7; not sure why this is the case, since they are all configured the same. Also, the ZFS pools go to a degraded state about 5 min after starting the mdtest run. Any insight on this?
[root@mds-201 ~]# zpool list
[root@mgs-200 ~]# zpool list
pool: mdtpool4
NAME STATE READ WRITE CKSUM
thanks, Abe
|
| Comment by Andreas Dilger [ 12/Aug/18 ] |
|
If the pool is degraded like this, it means there is some problem with the devices below the Lustre level. One possibility, given that you are seeing problems with two zpools, is that the devices are configured incorrectly and a disk is shared between the two pools. Alternatively, it is possible there is a marginal cable or power supply that has problems under heavy load. |
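To rule out a disk being shared between pools (or between the two MDS nodes), comparing stable device identities is usually enough — a generic sketch, not specific to this configuration:

# Member devices and error counters for every pool on this node
zpool status -v
# Stable per-disk identity; run on both MDS nodes and diff the output.
# Any serial/WWN that appears in two pools (or on both hosts) is shared.
lsblk -o NAME,SERIAL,WWN
ls -l /dev/disk/by-id/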
| Comment by Abe [ 12/Aug/18 ] |
|
Hi Andreas, there is definitely a problem with the power supply for one of the clients; I will replace the power supplies tomorrow:
[root@client1-221 ~]#
Message from syslogd@client1-221 at Aug 12 03:49:41 ...
Message from syslogd@client1-221 at Aug 12 03:49:41 ...
Message from syslogd@client1-221 at Aug 12 03:49:41 ...
Not sure about the SSD drives being shared as the root cause of the degradation; I think ZFS does not allow a pool configuration with shared SSDs. I wonder if there is a way to check whether the SSDs are shared across the MDS servers. Output of lsblk on the SSD JBOD:
[root@mds-201 ~]# lsblk
[root@mds-201 ~]# zpool status |more
NAME STATE READ WRITE CKSUM
errors: No known data errors
pool: mdtpool1
using 'zpool clear' or replace the device with 'zpool replace'.
NAME STATE READ WRITE CKSUM
errors: No known data errors
thanks, Abe
|