[LU-9799] mount doesn't return an error when failing Created: 26/Jul/17 Updated: 18/Aug/17 Resolved: 17/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brian Murrell (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre: Build Version: 2.10.0_5_gbb3c407 |
||
| Severity: | 3 | ||||||||
| Description |
|
When mount -t lustre ... has failed to actually mount a target, the exit code of mount does not reflect this:
# mount -t lustre zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS /mnt/MGS
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
# echo $?
0
This of course wreaks havoc on systems such as IML which rely on the exit code of one step in the process of starting a filesystem to decide if it should continue with subsequent steps. |
| Comments |
| Comment by Brian Murrell (Inactive) [ 31/Jul/17 ] |
|
But what is interesting about this command and error is that clearly the user is trying to mount a ZFS target so why is ldiskfs_read_ldd() being called for it? Isn't osd_read_ldd() supposed to know that the target is (supposed to be formatted) ZFS at this point through ldd->ldd_mount_type? Could this be a symptom of a failed mkfs on the target perhaps? Any other suggestions of what this could be a symptom of would be welcome. |
| Comment by Peter Jones [ 01/Aug/17 ] |
|
Nathaniel Can you please advise? Thanks Peter |
| Comment by Nathaniel Clark [ 03/Aug/17 ] |
|
ldd_mount_type should be set by osd_is_lustre() or it would error out with "... has not been formatted with mkfs.lustre..." Are you sure it didn't mount it? The errors listed are ldiskfs errors, which is okay when mounting a ZFS device. |
| Comment by Brian Murrell (Inactive) [ 03/Aug/17 ] |
Indeed, that is my understanding from reading the code also.
I got here actually because the error I was trying to find the source of was:
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1846:server_fill_super()) Unable to start targets: -108
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1560:server_put_super()) no obd testfs-OST0001
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount (-108)
So that would indicate that it didn't mount.
But why would one get any ldiskfs errors when trying to mount a ZFS device if osd_is_lustre() correctly set ldd_mount_type? If ldd_mount_type is correct then why would ldiskfs_read_ldd() (or ldiskfs_<anything> for that matter) even have any opportunity to produce any ldiskfs related errors? |
| Comment by Nathaniel Clark [ 07/Aug/17 ] |
|
Can you give me the exact versions of what you are using, and maybe an sosreport from the host? I'm having trouble reproducing even the initial error messages. |
| Comment by Brian Murrell (Inactive) [ 08/Aug/17 ] |
The Lustre build version is in the Environment field of this ticket.
Yes, it's not very reproducible. We only hit it occasionally. But since we create Lustre filesystems so many times during our test runs, even very intermittent issues can seem to hit frequently. I think the first step is to agree on the expected behaviour. My position, from reading the code, is that one should never be able to get the e2label: No such file or directory while trying to open... error if the target was properly formatted, since the codepath that produces that error should be for ldiskfs only, and if the target was formatted ZFS it should be an unreachable codepath. You seem to indicate that it could happen, though. I explained why I don't think one should be able to hit that codepath, but you must disagree with my explanation if you think it can be hit. Could you explain what I missed in my analysis? |
| Comment by Nathaniel Clark [ 08/Aug/17 ] |
But the ZFS, OS, e2fsprogs versions aren't. I can get sort of close, but not with your pool name:
[root@ieel-mds03 tmp]# mount -t lustre zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS /mnt/MGT; echo $?
zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS: No such file or directory while opening filesystem
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
mount.lustre: zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS failed to read permanent mount data: 254
If e2label were to print anything (even a newline) to stdout on error, the final return code would be 0. |
| Comment by Brian Murrell (Inactive) [ 08/Aug/17 ] |
It's probably moot, but just for clarity, the ZFS version is whatever is built by Jenkins with b2_10. e2fsprogs is most recent GA and O/S is RHEL 7.4. I doubt any of these are particularly relevant though.
I think you got much more than just "sort of close". I think you got an exact reproduction. The names of pools, etc. are, I think, quite irrelevant. The question still stands, though: when the e2label call is only in the ldiskfs OSD codepath, in ldiskfs_read_ldd(), why is that being hit for a ZFS formatted target? My reading of the code is that by the time osd_read_ldd() is supposed to call either zfs_read_ldd() or ldiskfs_read_ldd(), the format of the target is known and stored in ldd->ldd_mount_type, so only the relevant one of zfs_read_ldd() or ldiskfs_read_ldd() should be called, not both. So why are we getting an error from the e2label that is only in ldiskfs_read_ldd()? |
| Comment by Nathaniel Clark [ 08/Aug/17 ] |
|
The name of the pool is absolutely key for my reproduction:
lustre/utils/libmount_utils_ldiskfs.c:
448 /* Check whether the file exists in the device */
449 static int file_in_dev(char *file_name, char *dev_name)
450 {
451 FILE *fp;
452 char debugfs_cmd[256];
453 unsigned int inode_num;
454 int i;
455
456 /* Construct debugfs command line. */
457 snprintf(debugfs_cmd, sizeof(debugfs_cmd),
458 "%s -c -R 'stat %s' '%s' 2>&1 | egrep '(Inode|unsupported)'",
459 DEBUGFS, file_name, dev_name);
460
Notice the ...|egrep ... on line 458. That will produce output text if the egrep matches the pool name in an error message, since stderr is also redirected through the egrep. |
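The false positive described above can be sketched in isolation. This is a minimal illustration, not the Lustre code: passes_filter() is a hypothetical helper that applies the same extended regular expression as the egrep on line 458 to a single line of text, using POSIX regcomp()/regexec(). It assumes that any line surviving the filter is what the caller treats as evidence that the file exists.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Lustre): returns 1 if `line` would
 * survive an egrep '(Inode|unsupported)' filter, 0 if it would be
 * dropped, and -1 if the pattern fails to compile.
 */
int passes_filter(const char *line)
{
	regex_t re;
	int rc;

	/* Same pattern as the egrep in the quoted debugfs command line. */
	if (regcomp(&re, "(Inode|unsupported)", REG_EXTENDED | REG_NOSUB) != 0)
		return -1;

	/* regexec() returns 0 when the pattern matches anywhere in the line. */
	rc = regexec(&re, line, 0, NULL, 0);
	regfree(&re);

	return rc == 0 ? 1 : 0;
}
```

Under that assumption, the error line for the pool zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13 passes the filter only because the device name happens to contain "Inode", while the same error for zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13 is dropped. Because stderr is merged into the pipe with 2>&1, the filter cannot tell an error message from real stat output.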
| Comment by Brian Murrell (Inactive) [ 09/Aug/17 ] |
|
Ahhh. So, relevant only to force that codepath. But clearly that is not the problem in my case. |
| Comment by Nathaniel Clark [ 09/Aug/17 ] |
|
Right. But something similar must be happening, though I am at a loss as to what. |
| Comment by Nathaniel Clark [ 10/Aug/17 ] |
|
Okay, I was testing something totally different and ran into this:
[root@ieel-mds03 ~]# mount -t lustre MGS/MGT /mnt/MGT
e2label: No such file or directory while trying to open MGS/MGT
Couldn't find valid filesystem superblock.
But on closer inspection:
[root@ieel-mds03 ~]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/cl_ieel--mds03-root   6486016  1867296   4618720  29% /
devtmpfs                           496568        0    496568   0% /dev
tmpfs                              508324    39216    469108   8% /dev/shm
tmpfs                              508324    13188    495136   3% /run
tmpfs                              508324        0    508324   0% /sys/fs/cgroup
/dev/sda1                         1038336   193444    844892  19% /boot
ieel-storage:/home               40572928 38486912   2086016  95% /home
tmpfs                              101668        0    101668   0% /run/user/0
MGS                               5047168        0   5047168   0% /MGS
MGS/MGT                           5007744        0   5005696   0% /mnt/MGT
The filesystem did mount, and on each umount and remount I get the same error message, but it succeeds in mounting. |
| Comment by Gerrit Updater [ 10/Aug/17 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/28456 |
| Comment by Gerrit Updater [ 17/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28456/ |
| Comment by Peter Jones [ 17/Aug/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 17/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28581 |
| Comment by Gerrit Updater [ 18/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28581/ |