[LU-9925] mount.lustre: /dev/sda has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool Created: 29/Aug/17 Updated: 20/Oct/17 Resolved: 20/Oct/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brian Murrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre: Build Version: 2.10.0_62_ge1d3a0e |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are occasionally seeing errors trying to mount a Lustre target such as:
# mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15 /mnt/testfs-OST0000
mount.lustre: /dev/sda has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
# echo $?
19
This is after previously having formatted and "registered" (i.e. initial mount) the target:
# mkfs.lustre --ost --mgsnode=10.14.82.168@tcp0 --mgsnode=10.14.82.169@tcp0 --failnode=10.14.82.171@tcp0 --index=0 --mkfsoptions=-J size=2048 --backfstype=ldiskfs --fsname=testfs /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15
# mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15 /mnt/testfs-OST0000
# umount /dev/sda
The kernel log contains, after the above:
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: LDISKFS-fs (sda): file extents enabled, maximum tree depth=5
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: LDISKFS-fs (sda): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: LDISKFS-fs (sda): file extents enabled, maximum tree depth=5
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: LDISKFS-fs (sda): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0000: new disk, initializing
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0000: No data found on store. Initialize space
Aug 29 06:01:44 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0000: Imperative Recovery not enabled, recovery window 300-900
Aug 29 06:01:45 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: Lustre: Failing over testfs-OST0000
Aug 29 06:01:46 lotus-44vm17.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0000 complete
So it sure seems as if the format and registration are successful.
However (occasionally, as noted above) when we subsequently [re-]mount that target with the command:
# mount -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15 /mnt/testfs-OST0000
we get an error (also as noted at the top of this report):
mount.lustre: /dev/sda has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
# echo $?
19
We are certain we have the correct device(s) on this node:
# ls -l /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15
lrwxrwxrwx 1 root root 9 Aug 29 08:50 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15 -> ../../sda |
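As an aside for anyone triaging this by hand: the following is only a sketch of how one might inspect such a device, assuming an ldiskfs backend and that blkid, debugfs and tunefs.lustre are installed; none of these commands are from the original report.
# Sketch: inspect what the backend device currently contains.
blkid /dev/sda                       # a formatted ldiskfs target normally shows TYPE="ext4" plus a label such as testfs-OST0000
debugfs -c -R 'stats -h' /dev/sda    # dumps superblock fields; "Bad magic number in super-block" means no ldiskfs superblock was found
tunefs.lustre --dryrun /dev/sda      # prints the Lustre target configuration without modifying anything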
| Comments |
| Comment by Oleg Drokin [ 29/Aug/17 ] |
|
Does it only happen if you fail over very soon after you have just mounted it for the first time? |
| Comment by John Hammond [ 29/Aug/17 ] |
|
Can you add --verbose to the mount options? Or, even better, add --verbose and run strace -f mount ... and attach the output? |
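For reference, a sketch of the kind of invocation being requested here; the output file name and string-length option are just illustrative choices, not something specified in this ticket.
# Sketch: trace mount and the mount.lustre child it exec()s, one output file per process.
strace -f -ff -v -s 256 -o /tmp/mount-ost0000.strace \
    mount --verbose -t lustre /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk15 /mnt/testfs-OST0000
echo "mount exit status: $?"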
| Comment by Brian Murrell (Inactive) [ 29/Aug/17 ] |
So just to be clear, our process for every target of every filesystem that we create is to format the targets first, then mount them to register them with the MGS, then immediately unmount them so that they are all back in a state to be mounted when the filesystem is started up. We run this mkfs; mount; umount sequence, followed by another mount when we want to start the filesystem up, every single time, but only intermittently see this issue. |
| Comment by Brian Murrell (Inactive) [ 01/Sep/17 ] |
|
I've attached 7 straces of this issue happening. Hopefully between them all there is enough information to go forward with. |
| Comment by John Hammond [ 01/Sep/17 ] |
|
Hi Brian, I see that in every case the target being mounted is OST0000. Does this ever happen for other targets? Are all of these from the same physical node? Has the real disk backing this target been checked for errors? Have you checked the dmesg of the VM host for anything suspicious? |
| Comment by Brian Murrell (Inactive) [ 01/Sep/17 ] |
|
Hi jhammond. In many/most cases we are only creating filesystems with a very small number of OSTs (1 or 2, probably), so the probability that this could happen on OST0000 is just naturally high. Also, those straces came from a single test run. I would have to check, but it's entirely possible that they are actually from trying to mount the same target repeatedly, in an effort to be resilient to transient errors. As for disk errors, I have not seen any in all of the investigation I have done into this issue, which has included poring over many /var/log/messages files. |
| Comment by John Hammond [ 01/Sep/17 ] |
|
> In many/most cases we are only creating filesystems with a very small number of OSTs (1 or 2, probably), so the probability that this could happen on OST0000 is just naturally high.
But can you confirm that? |
| Comment by Brian Murrell (Inactive) [ 05/Sep/17 ] |
|
I cannot confirm every instance we have had, but the ones I looked at since the end of last week were all with OST0000. To keep the statistics in perspective, though: if this fails on OST0000, the test fails at that point and no other OST is formatted and registered, so we wouldn't know of cases where it fails on OST0000 but would have succeeded on subsequent OSTs. We do know, however, that for the runs I was able to investigate it never succeeded on OST0000 and then failed on a subsequent one. |
| Comment by John Hammond [ 05/Sep/17 ] |
|
What about running mount through strace? |
| Comment by Brian Murrell (Inactive) [ 06/Sep/17 ] |
That's what the 7 attached strace logs are of:
30471 execve("/usr/bin/mount", ["mount", "--verbose", "-t", "lustre", "/dev/disk/by-id/scsi-0QEMU_QEMU_"..., "/mnt/testfs-OST0000"], [/* 38 vars */]) = 0
...
Which obviously itself then execve()s mount.lustre:
30472 execve("/sbin/mount.lustre", ["/sbin/mount.lustre", "/dev/sda", "/mnt/testfs-OST0000", "-v", "-o", "rw"], [/* 37 vars */]) = 0
Which is what is exiting with the 19 exit code:
30472 exit_group(19) = ?
30472 +++ exited with 19 +++
So the question of the hour is under what conditions will mount.lustre exit with 19? |
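For what it's worth, mount.lustre's non-zero exit status here looks like a plain errno value; that is an inference from the strace, not something stated elsewhere in this ticket. On Linux, errno 19 is ENODEV ("No such device"), which can be checked against the kernel headers:
# Sketch: confirm what errno 19 means on this platform (assumes kernel headers are installed).
grep -w 19 /usr/include/asm-generic/errno-base.h    # expect: #define ENODEV 19 /* No such device */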
| Comment by John Hammond [ 06/Sep/17 ] |
|
Sorry, my mistake. Can you attach the output as well? |
| Comment by John Hammond [ 06/Sep/17 ] |
|
Never mind about the output; I see what I need. debugfs is not finding the correct magic in the super-block:
30474 write(2, "debugfs 1.42.13.wc6 (05-Feb-2017"..., 34 <unfinished ...>
30474 <... write resumed> ) = 34
30474 open("/dev/sda", O_RDONLY <unfinished ...>
30474 <... open resumed> ) = 6
30474 stat("/dev/sda", <unfinished ...>
30474 <... stat resumed> {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 0), ...}) = 0
30474 ioctl(6, BLKDISCARDZEROES <unfinished ...>
30474 <... ioctl resumed> , [0]) = 0
30474 pread64(6, <unfinished ...>
30474 <... pread64 resumed> "\0\0\0\0\0\0\0\0\0\0\0\0'\232\36\0\372\376\t\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024, 1024) = 1024
30474 close(6) = 0
30474 write(2, "/dev/sda", 8 <unfinished ...>
30474 <... write resumed> ) = 8
30474 write(2, ": ", 2 <unfinished ...>
30474 <... write resumed> ) = 2
30474 write(2, "Bad magic number in super-block", 31 <unfinished ...>
30474 <... write resumed> ) = 31
30474 write(2, " ", 1 <unfinished ...>
30474 <... write resumed> ) = 1
30474 write(2, "while opening filesystem", 24 <unfinished ...>
Are these test sessions uploaded to maloo? If not then can you attach the dmesg contents from the OSS VM OSS and from the host machine? |
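The pread64(..., 1024, 1024) above is debugfs reading the primary ext4/ldiskfs superblock (1024 bytes at byte offset 1024). A quick way to look at the magic directly, as a sketch assuming the standard ext4 layout (s_magic at offset 0x38 within the superblock, value 0xEF53 stored little-endian):
# Sketch: dump the two s_magic bytes of the primary superblock; "53 ef" indicates a valid ext4/ldiskfs magic.
dd if=/dev/sda bs=1 skip=$((1024 + 0x38)) count=2 2>/dev/null | od -An -tx1
# A wider view of the start of the superblock, for comparison with the data seen in the strace:
dd if=/dev/sda bs=1024 skip=1 count=1 2>/dev/null | od -Ax -tx1 | head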
| Comment by Brian Murrell (Inactive) [ 07/Sep/17 ] |
|
IML's testing is not done in autotest and/or uploaded to Maloo.
Can you clarify what you are looking for? I suspect there is probably a typo in the above. `dmesg`s are long gone for any previous test runs. If you can tell me what you are looking for from them I will see about adding it to the data collection. |
| Comment by John Hammond [ 07/Sep/17 ] |
|
dmesg contents from the OSS VM and the physical host machine. |
| Comment by Brian Murrell (Inactive) [ 07/Sep/17 ] |
|
What are you looking for in them specifically? |
| Comment by John Hammond [ 08/Sep/17 ] |
|
Lustre chatter, mount, unmount, SCSI chatter, IO errors, other things that may be found in them. In other words, their contents. |
| Comment by John Hammond [ 12/Sep/17 ] |
|
Brian, could you point me to the scripts that manage mkfs, registration, and mount? |
| Comment by Brian Murrell (Inactive) [ 12/Sep/17 ] |
|
This is not a series of "scripts" that are producing this situation. This is IML and its test framework, which is quite a bit more than just a few scripts. It is a fully state-driven tool for device discovery, presentation, Lustre filesystem creation and life-cycle management, all in a browser interface, including ongoing stats reporting. I am investigating this issue further on our end and will update here when I have discovered more about what conditions or steps are producing the error. |
| Comment by Brian Murrell (Inactive) [ 18/Sep/17 ] |
|
Some more digging into this issue has raised a question... Can (or should) the MMP block of a device continue to be kept updated even if that device does not show as being mounted on the host that the MMP block says is updating it? For example, notice that /dev/sda does not appear in /proc/mounts on lotus-43vm8:
# echo $HOSTNAME
lotus-43vm8
# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=928284k,nr_inodes=232071,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/vg_00-lv_root / ext4 rw,relatime,data=ordered 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10600 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
/dev/vda1 /boot ext3 rw,relatime,data=ordered 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=6,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17033 0 0
-hosts /net autofs rw,relatime,fd=13,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17041 0 0
auto.direct /root/lab autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.direct /scratch autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.direct /root/chef autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.home /home autofs rw,relatime,fd=25,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17053 0 0
lotus-2.lotus.hpdd.lab.intel.com:/export/home/hacluster /home/hacluster nfs4 rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.14.82.147,local_lock=none,addr=10.14.80.5 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=188348k,mode=700 0 0
Yet its MMP block says it is freshly being used on lotus-43vm8:
# date +%s
1505722898
# debugfs -c -R dump_mmp /dev/sda
debugfs 1.42.13.wc6 (05-Feb-2017)
/dev/sda: catastrophic mode - not reading inode or group bitmaps
block_number: 31911
update_interval: 5
check_interval: 10
sequence: 0000003c
time: 1505722894 -- Mon Sep 18 01:21:34 2017
node_name: lotus-43vm8
device_name: sda
magic: 0x4d4d50
So the MMP block was updated within 4 seconds of checking. Maybe it was just unmounted within the last 4 seconds, so let's wait a while and check again:
# sleep 60
# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=928284k,nr_inodes=232071,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/vg_00-lv_root / ext4 rw,relatime,data=ordered 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10600 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
/dev/vda1 /boot ext3 rw,relatime,data=ordered 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=6,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17033 0 0
-hosts /net autofs rw,relatime,fd=13,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17041 0 0
auto.direct /root/lab autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.direct /scratch autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.direct /root/chef autofs rw,relatime,fd=19,pgrp=1012,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=17045 0 0
auto.home /home autofs rw,relatime,fd=25,pgrp=1012,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=17053 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=188348k,mode=700 0 0
# date +%s
1505722958
# debugfs -c -R dump_mmp /dev/sda
debugfs 1.42.13.wc6 (05-Feb-2017)
/dev/sda: catastrophic mode - not reading inode or group bitmaps
block_number: 31911
update_interval: 5
check_interval: 10
sequence: 00000048
time: 1505722954 -- Mon Sep 18 01:22:34 2017
node_name: lotus-43vm8
device_name: sda
magic: 0x4d4d50
So it is still reporting an update from lotus-43vm8 within the last update interval. But /proc/mounts says it is not mounted. So why would the MMP block continue to be updated? |
| Comment by John Hammond [ 18/Sep/17 ] |
|
> So why would the MMP block continue to be updated? Probably because some process is updating it. Have you checked the process tree? You might try 'pgrep -l kmmpd'. Are these updates continuing or are they isolated? Anything in the dmesg between the two checks of the mmp block? Can you enable tracing of the IML daemon and see what it's doing in this time? |
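A sketch combining these suggestions into a single watch loop; the interval and iteration count are arbitrary, and the pgrep pattern assumes the usual kmmpd-<device> kernel-thread naming.
# Sketch: watch for a live kmmpd thread and for MMP sequence changes on /dev/sda.
for i in $(seq 1 10); do
    date +%s
    pgrep -l kmmpd || echo "no kmmpd thread running"
    debugfs -c -R 'dump_mmp' /dev/sda 2>/dev/null | grep -E 'sequence:|time:'
    grep /dev/sda /proc/mounts || echo "/dev/sda not in /proc/mounts"
    sleep 30
done
dmesg | tail -n 50    # anything logged while we were watching?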
| Comment by Brian Murrell (Inactive) [ 18/Sep/17 ] |
Even though it's not in the /proc/mounts table any more? Shouldn't any such processes be terminated if/when the device is unmounted? Otherwise another node will be prevented from using the device even though it was supposedly freed from use elsewhere.
No, I have not done that yet. How is kmmpd supposed to work? Is it supposed to be stopped for a device that is unmounted?
What do you mean? Like, if instead of just sleeping for one 60-second interval and then checking, I did that, say, 10 times, would it continue to update? If you define a frequency and total period over which you would like me to check for MMP updates, I'll update the testing to report that.
I have not checked that either. I will add that check to the debugging patch and report back on the next instance.
Well, IML does not do anything directly about starting and stopping (i.e. mounting and unmounting) targets, but rather sets up Pacemaker to do those operations. When this situation (the MMP block continuing to be updated when the device is supposed to be unmounted) happens, Pacemaker has been shut down on both nodes that are configured to manage the target and, as such, it has unmounted the target on both nodes, which the /proc/mounts table reflects. The fact that the MMP block is continuing to be updated, though, does not seem to confirm that the devices are actually unmounted. |
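A quick way to cross-check the two views being described here, as a sketch (it assumes crm_mon from Pacemaker is available and that the Target resources carry the filesystem name in their IDs):
# Sketch: compare Pacemaker's idea of the Lustre target resources with what the kernel actually has mounted.
crm_mon -1 | grep -i testfs                                      # Pacemaker's view of the testfs target resources
grep lustre /proc/mounts || echo "no Lustre targets mounted according to /proc/mounts"
debugfs -c -R 'dump_mmp' /dev/sda | grep -E 'sequence:|time:'    # is MMP still being updated regardless?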
| Comment by John Hammond [ 18/Sep/17 ] |
|
Would it be possible for me to log in to lotus-43vm8? |
| Comment by Brian Murrell (Inactive) [ 18/Sep/17 ] |
|
Not to examine the state as it was reported in the previous comment as that state has been cleared and subsequent tests have been started. The next time we hit this issue, that would be possible. |
| Comment by John Hammond [ 20/Sep/17 ] |
|
Note also that debugfs can give you some superblock stats that may be useful here:
# debugfs -c -R stats /tmp/lustre-ost1
debugfs 1.42.13.wc5 (15-Apr-2016)
/tmp/lustre-ost1: catastrophic mode - not reading inode or group bitmaps
...
Filesystem created: Wed Sep 20 10:28:56 2017
Last mount time: Wed Sep 20 10:28:59 2017
Last write time: Wed Sep 20 10:29:00 2017
Mount count: 3
Maximum mount count: -1
Last checked: Wed Sep 20 10:28:56 2017
... |
| Comment by Brian Murrell (Inactive) [ 20/Sep/17 ] |
|
Does the Last write time get updated if nothing is actually modifying the filesystem? |
| Comment by John Hammond [ 20/Sep/17 ] |
|
> Does the Last write time get updated if nothing is actually modifying the filesystem? I don't believe so. |
| Comment by Brian Murrell (Inactive) [ 21/Sep/17 ] |
|
So, in debugging this issue further, I got an instance where a mount of a Lustre target failed to return within 300 seconds. When this happens, Pacemaker considers the mount failed and the resource not started, so any shutdown of Pacemaker will not try to stop the target. But if the mount actually did complete after the 300s, there would be an obvious state mismatch with what Pacemaker believes. I can then see how this could lead to targets being mounted where they are not believed to be, and being scribbled on by another node. I've uploaded a lustre-long-mount.dk debug log; if somebody could take a look at it, that would probably be quite useful. The Pacemaker transactions for that timed-out mount, which go along with the debug log:
Sep 20 16:50:37 localhost crmd[8542]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Sep 20 16:50:37 localhost pengine[8541]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:50:37 localhost pengine[8541]: notice: Scheduling Node lotus-57vm7.lotus.hpdd.lab.intel.com for shutdown
Sep 20 16:50:37 localhost pengine[8541]: notice: * Shutdown lotus-57vm7.lotus.hpdd.lab.intel.com
Sep 20 16:50:37 localhost pengine[8541]: notice: Move st-fencing#011(Started lotus-57vm7.lotus.hpdd.lab.intel.com -> lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:50:37 localhost pengine[8541]: notice: Move testfs-OST0000_8ea5a6#011(Started lotus-57vm7.lotus.hpdd.lab.intel.com -> lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:50:37 localhost pengine[8541]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-57.bz2
Sep 20 16:50:37 localhost crmd[8542]: notice: Initiating stop operation st-fencing_stop_0 on lotus-57vm7.lotus.hpdd.lab.intel.com
Sep 20 16:50:37 localhost crmd[8542]: notice: Initiating stop operation testfs-OST0000_8ea5a6_stop_0 on lotus-57vm7.lotus.hpdd.lab.intel.com
Sep 20 16:50:37 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:50:37 localhost crmd[8542]: notice: Initiating start operation st-fencing_start_0 locally on lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:50:38 localhost crmd[8542]: notice: Result of start operation for st-fencing on lotus-57vm8.lotus.hpdd.lab.intel.com: 0 (ok)
Sep 20 16:50:38 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:50:39 localhost crmd[8542]: notice: Initiating start operation testfs-OST0000_8ea5a6_start_0 locally on lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:50:39 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:50:39 localhost crmd[8542]: notice: do_shutdown of peer lotus-57vm7.lotus.hpdd.lab.intel.com is complete
Sep 20 16:50:39 localhost attrd[8540]: notice: Node lotus-57vm7.lotus.hpdd.lab.intel.com state is now lost
Sep 20 16:50:39 localhost attrd[8540]: notice: Removing all lotus-57vm7.lotus.hpdd.lab.intel.com attributes for peer loss
Sep 20 16:50:39 localhost attrd[8540]: notice: Lost attribute writer lotus-57vm7.lotus.hpdd.lab.intel.com
Sep 20 16:50:39 localhost attrd[8540]: notice: Purged 1 peers with id=2 and/or uname=lotus-57vm7.lotus.hpdd.lab.intel.com from the membership cache
Sep 20 16:50:39 localhost stonith-ng[8538]: notice: Node lotus-57vm7.lotus.hpdd.lab.intel.com state is now lost
Sep 20 16:50:39 localhost stonith-ng[8538]: notice: Purged 1 peers with id=2 and/or uname=lotus-57vm7.lotus.hpdd.lab.intel.com from the membership cache
Sep 20 16:50:39 localhost cib[8537]: notice: Node lotus-57vm7.lotus.hpdd.lab.intel.com state is now lost
Sep 20 16:50:39 localhost cib[8537]: notice: Purged 1 peers with id=2 and/or uname=lotus-57vm7.lotus.hpdd.lab.intel.com from the membership cache
Sep 20 16:50:39 localhost corosync[7952]: [TOTEM ] A new membership (10.14.83.47:12) was formed. Members left: 2
Sep 20 16:50:39 localhost corosync[7952]: [QUORUM] Members[1]: 1
Sep 20 16:50:39 localhost corosync[7952]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 16:50:39 localhost crmd[8542]: notice: Node lotus-57vm7.lotus.hpdd.lab.intel.com state is now lost
Sep 20 16:50:39 localhost crmd[8542]: notice: do_shutdown of peer lotus-57vm7.lotus.hpdd.lab.intel.com is complete
Sep 20 16:50:39 localhost pacemakerd[8536]: notice: Node lotus-57vm7.lotus.hpdd.lab.intel.com state is now lost
Sep 20 16:50:39 localhost kernel: LDISKFS-fs (sde): file extents enabled, maximum tree depth=5
Sep 20 16:50:40 localhost kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
Sep 20 16:51:45 localhost pacemakerd[8536]: notice: Shutting down Pacemaker
Sep 20 16:51:45 localhost pacemakerd[8536]: notice: Stopping crmd
Sep 20 16:51:45 localhost crmd[8542]: notice: Caught 'Terminated' signal
Sep 20 16:51:45 localhost crmd[8542]: notice: Shutting down cluster resource manager
Sep 20 16:51:45 localhost crmd[8542]: warning: Input I_SHUTDOWN received in state S_TRANSITION_ENGINE from crm_shutdown
Sep 20 16:51:45 localhost crmd[8542]: notice: Transition aborted by status-1-shutdown doing modify shutdown=1505951505: Transient attribute change
Sep 20 16:55:39 localhost lrmd[8539]: warning: testfs-OST0000_8ea5a6_start_0 process (PID 17811) timed out
Sep 20 16:55:39 localhost lrmd[8539]: warning: testfs-OST0000_8ea5a6_start_0:17811 - timed out after 300000ms
Sep 20 16:55:39 localhost crmd[8542]: error: Result of start operation for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: Timed Out
Sep 20 16:55:39 localhost crmd[8542]: warning: Action 7 (testfs-OST0000_8ea5a6_start_0) on lotus-57vm8.lotus.hpdd.lab.intel.com failed (target: 0 vs. rc: 1): Error
Sep 20 16:55:39 localhost crmd[8542]: warning: Action 7 (testfs-OST0000_8ea5a6_start_0) on lotus-57vm8.lotus.hpdd.lab.intel.com failed (target: 0 vs. rc: 1): Error
Sep 20 16:55:39 localhost crmd[8542]: notice: Transition 37 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-57.bz2): Complete
Sep 20 16:55:39 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:39 localhost pengine[8541]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:39 localhost pengine[8541]: warning: Processing failed op start for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: unknown error (1)
Sep 20 16:55:39 localhost pengine[8541]: warning: Processing failed op start for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: unknown error (1)
Sep 20 16:55:39 localhost pengine[8541]: notice: Scheduling Node lotus-57vm8.lotus.hpdd.lab.intel.com for shutdown
Sep 20 16:55:39 localhost pengine[8541]: notice: * Shutdown lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop st-fencing#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop testfs-OST0000_8ea5a6#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop testfs-OST0001_2b9c2d#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Calculated transition 38, saving inputs in /var/lib/pacemaker/pengine/pe-input-58.bz2
Sep 20 16:55:39 localhost pengine[8541]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:39 localhost pengine[8541]: warning: Processing failed op start for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: unknown error (1)
Sep 20 16:55:39 localhost pengine[8541]: warning: Processing failed op start for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: unknown error (1)
Sep 20 16:55:39 localhost pengine[8541]: warning: Forcing testfs-OST0000_8ea5a6 away from lotus-57vm8.lotus.hpdd.lab.intel.com after 1000000 failures (max=3)
Sep 20 16:55:39 localhost pengine[8541]: notice: Scheduling Node lotus-57vm8.lotus.hpdd.lab.intel.com for shutdown
Sep 20 16:55:39 localhost pengine[8541]: notice: * Shutdown lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop st-fencing#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop testfs-OST0000_8ea5a6#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Stop testfs-OST0001_2b9c2d#011(lotus-57vm8.lotus.hpdd.lab.intel.com)
Sep 20 16:55:39 localhost pengine[8541]: notice: Calculated transition 39, saving inputs in /var/lib/pacemaker/pengine/pe-input-59.bz2
Sep 20 16:55:39 localhost crmd[8542]: notice: Initiating stop operation st-fencing_stop_0 locally on lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:55:39 localhost crmd[8542]: notice: Initiating stop operation testfs-OST0000_8ea5a6_stop_0 locally on lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:55:39 localhost crmd[8542]: notice: Initiating stop operation testfs-OST0001_2b9c2d_stop_0 locally on lotus-57vm8.lotus.hpdd.lab.intel.com
Sep 20 16:55:39 localhost crmd[8542]: notice: Result of stop operation for st-fencing on lotus-57vm8.lotus.hpdd.lab.intel.com: 0 (ok)
Sep 20 16:55:39 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:39 localhost crmd[8542]: notice: Result of stop operation for testfs-OST0000_8ea5a6 on lotus-57vm8.lotus.hpdd.lab.intel.com: 0 (ok)
Sep 20 16:55:39 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:39 localhost kernel: Lustre: Failing over testfs-OST0001
Sep 20 16:55:39 localhost kernel: Lustre: server umount testfs-OST0001 complete
Sep 20 16:55:41 localhost crmd[8542]: notice: Result of stop operation for testfs-OST0001_2b9c2d on lotus-57vm8.lotus.hpdd.lab.intel.com: 0 (ok)
Sep 20 16:55:41 localhost stonith-ng[8538]: notice: On loss of CCM Quorum: Ignore
Sep 20 16:55:41 localhost crmd[8542]: notice: Transition 39 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
Sep 20 16:55:41 localhost crmd[8542]: notice: Disconnected from the LRM
Sep 20 16:55:41 localhost crmd[8542]: notice: Disconnected from Corosync
Sep 20 16:55:41 localhost crmd[8542]: notice: Disconnected from the CIB
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Stopping pengine
Sep 20 16:55:41 localhost pengine[8541]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Stopping attrd
Sep 20 16:55:41 localhost attrd[8540]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Stopping lrmd
Sep 20 16:55:41 localhost lrmd[8539]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Stopping stonith-ng
Sep 20 16:55:41 localhost stonith-ng[8538]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Stopping cib
Sep 20 16:55:41 localhost cib[8537]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost cib[8537]: notice: Caught 'Terminated' signal
Sep 20 16:55:41 localhost cib[8537]: notice: Disconnected from Corosync
Sep 20 16:55:41 localhost cib[8537]: notice: Disconnected from Corosync
Sep 20 16:55:41 localhost rsyslogd: imjournal: journal reloaded... [v8.24.0 try http://www.rsyslog.com/e/0 ]
Sep 20 16:55:41 localhost pacemakerd[8536]: notice: Shutdown complete
|
| Comment by Brian Murrell (Inactive) [ 23/Sep/17 ] |
|
Looking further at the lustre-long-mount.dk debug log, what I think is happening is that in certain cases, such as when the MGS is unavailable, mount takes a long time, longer than the 300 seconds that Pacemaker is willing to give it, so Pacemaker gives up waiting on the mount and assumes it failed because it didn't return. Pacemaker is then shut down, and since it believes it failed to start the OST, it doesn't do anything to try to stop it. Then, minutes after Pacemaker itself has stopped, the OST finally finishes mounting, which is now after the point at which we believe everything to be shut down. We then wipe the disks with dd, not knowing one is still actually mounted on an OSS, and in doing the wiping we nuke the MMP block, so MMP protection stops for that disk while it is still mounted. The next test comes along and uses the disks, including the OST that was mounted late, wiped, and had MMP disabled. It formats the disks, including the one that is still actually mounted because its mount completed after we believed everything to be stopped. So any scribbling that the late-mounted OST does will trash the filesystem created on that disk by the subsequent test. The biggest question to come out of all of this is: why is the mount taking so long, longer than 300 seconds even, but then finally succeeding? Should mounting an OST take that long (or even longer, as it seems), even if the MGS is unavailable? |
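Given that theory, a defensive check before the dd wipe might look something like the sketch below. This is only an illustration of the idea, not IML code; it assumes debugfs is available and that two MMP reads spaced further apart than the update_interval are enough to detect a live kmmpd.
# Sketch: refuse to wipe a device whose MMP sequence is still advancing.
dev=/dev/sda
seq1=$(debugfs -c -R 'dump_mmp' "$dev" 2>/dev/null | awk '/sequence:/ {print $2}')
sleep 15    # comfortably longer than the 5-second update_interval seen earlier in this ticket
seq2=$(debugfs -c -R 'dump_mmp' "$dev" 2>/dev/null | awk '/sequence:/ {print $2}')
if [ -n "$seq1" ] && [ "$seq1" != "$seq2" ]; then
    echo "MMP sequence still advancing on $dev ($seq1 -> $seq2); it is probably still mounted somewhere, refusing to wipe" >&2
else
    dd if=/dev/zero of="$dev" bs=1M count=16    # the wipe described in the comment above
fi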
| Comment by John Hammond [ 26/Sep/17 ] |
|
Brian, you mentioned that "in the meanwhile i will add orderly shut-down of the filesystem to our cleanup process instead of the ad-hoc-ish methods being used now." Have you had a chance to do that? And if so, was it effective in avoiding this issue? Also, is it effective in avoiding LU-9976? |
| Comment by Brian Murrell (Inactive) [ 27/Sep/17 ] |
The code change is written.
It's still in testing but looking good. It does have gaps, though. It relies on the IML manager being up and in working order to orchestrate the orderly filesystem shutdown. Given that this filesystem shutdown/cleanup functionality is in our test harness and is meant to clean up state between tests, and that the test harness is about testing new code for bugs, it's entirely possible that a patch introduces a bug that prevents the filesystem from being shut down in an orderly fashion, and the intra-test cleanup will then need to fall back on removing targets in whatever order it can (as it does currently), which is what is triggering this bug. If that happens, it could be that this failure taints the subsequent test(s) in the test run. It's also reasonable to assume that this is a regression in Lustre, since this test cleanup code has been working in IML for literally years (since its inception) and is only just now, as of our bump up to 2.10, causing these problems.
That's unknown. The frequency of LU-9976 is quite a bit lower than this issue, and the orderly-shutdown workaround for this issue is not in our master branch yet, so it is not seeing a lot of traffic. That said, we'd still like to know what our course of action should be in IML if/when IML hits LU-9976 in its processes. |
| Comment by Brian Murrell (Inactive) [ 20/Oct/17 ] |
|
I believe we can close this now. |