[LU-9799] mount doesn't return an error when failing Created: 26/Jul/17  Updated: 18/Aug/17  Resolved: 17/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Brian Murrell (Inactive) Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre: Build Version: 2.10.0_5_gbb3c407


Issue Links:
Duplicate
is duplicated by LU-9853 mount.lustre noisy on mount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When mount -t lustre ... has failed to actually mount a target, the exit code of mount does not reflect this:

# mount -t lustre zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS /mnt/MGS
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
# echo $?
0

This of course wreaks havoc on systems such as IML, which rely on the exit code of one step in the process of starting a filesystem to decide whether to continue with subsequent steps.
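The failure mode can be illustrated without Lustre at all. The sketch below is hypothetical (fake_mount is a stand-in, not the real mount.lustre code): a step that prints an error but still exits 0 fools any automation that keys off the exit status.

```shell
# Hypothetical stand-in for the buggy step: an error is printed on
# stderr, yet the exit status still reports success.
fake_mount() {
    echo "Couldn't find valid filesystem superblock." >&2
    return 0    # the bug: the failure is not reflected in the exit status
}

# Automation such as IML keys off the exit status, so it wrongly continues:
if fake_mount; then
    echo "step succeeded, continuing with next step"
fi
```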



 Comments   
Comment by Brian Murrell (Inactive) [ 31/Jul/17 ]

But what is interesting about this command and error is that clearly the user is trying to mount a ZFS target so why is ldiskfs_read_ldd() being called for it?

Isn't osd_read_ldd() supposed to know that the target is (supposed to be formatted) ZFS at this point through ldd->ldd_mount_type?

Could this be a symptom of a failed mkfs on the target perhaps?

Any other suggestions of what this could be a symptom of would be welcome.

Comment by Peter Jones [ 01/Aug/17 ]

Nathaniel

Can you please advise?

Thanks

Peter

Comment by Nathaniel Clark [ 03/Aug/17 ]

ldd_mount_type should be set by osd_is_lustre() or it would error out with "... has not been formatted with mkfs.lustre..."

Are you sure it didn't mount it? The errors listed are ldiskfs errors, which is okay when mounting a ZFS device.

Comment by Brian Murrell (Inactive) [ 03/Aug/17 ]

ldd_mount_type should be set by osd_is_lustre() or it would error out with "... has not been formatted with mkfs.lustre..."

Indeed, that is my understanding from reading the code also.

Are you sure it didn't mount it?

I got here actually because the error I was trying to find the source of was:

Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1846:server_fill_super()) Unable to start targets: -108
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1560:server_put_super()) no obd testfs-OST0001
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)

So that would indicate that it didn't mount.

The errors listed are ldiskfs errors, which is okay when mounting a ZFS device.

But why would one get any ldiskfs errors when trying to mount a ZFS device if osd_is_lustre() correctly set ldd_mount_type? If ldd_mount_type is correct, then why would ldiskfs_read_ldd() (or ldiskfs_&lt;anything&gt; for that matter) even have any opportunity to produce any ldiskfs-related errors?

Comment by Nathaniel Clark [ 07/Aug/17 ]

Can you give me the exact versions of what you are using, and maybe an sosreport from the host? I'm having trouble reproducing even the initial error messages.

Comment by Brian Murrell (Inactive) [ 08/Aug/17 ]

utopiabound:

Can you give me the exact versions of what you are using,

The Lustre build version is in the Environment field of this ticket.

I'm having trouble reproducing even the initial error messages

Yes, it's not very reproducible; we only hit it occasionally. But since we create Lustre filesystems so many times during our test runs, even very intermittent issues can appear to hit frequently.

I think the first step is to agree on the expected behaviour. My position, from reading the code, is that one should never see the "e2label: No such file or directory while trying to open..." error if the target was properly formatted: that error lives in a codepath that should be reached only for ldiskfs targets, so for a ZFS-formatted target it should be unreachable. You seem to indicate that it could happen, though. I explained why I don't think that codepath can be hit, so you must disagree with my explanation if you think it can be.

Could you explain what I missed in my analysis?

Comment by Nathaniel Clark [ 08/Aug/17 ]

The Lustre build version is in the Environment field of this ticket.

But the ZFS, OS, e2fsprogs versions aren't.

I can get sort of close, but not with your pool name:

[root@ieel-mds03 tmp]# mount -t lustre zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS /mnt/MGT; echo $?
zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS: No such file or directory while opening filesystem
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
mount.lustre: zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS failed to read permanent mount data: 
254

If e2label were to print anything (even a newline) to stdout on error, the final return code would be 0.
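The underlying shell behaviour is easy to demonstrate outside Lustre: a pipeline's exit status is the status of its last command, so a failing left-hand command is masked whenever the trailing grep matches something. This is a plain-shell sketch of that general rule, not the actual mount.lustre code:

```shell
# A pipeline's exit status is the status of its LAST command, so a
# failing left-hand side is masked whenever the trailing grep matches.

# Left side fails, grep matches nothing: the pipeline reports failure.
sh -c 'exit 254' 2>&1 | grep -q superblock
echo "no match:   $?"    # non-zero

# Left side fails, but its *error text* matches: the pipeline reports 0.
sh -c 'echo "cannot find valid superblock" >&2; exit 254' 2>&1 | grep -q superblock
echo "with match: $?"    # 0
```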

Comment by Brian Murrell (Inactive) [ 08/Aug/17 ]

utopiabound:

But the ZFS, OS, e2fsprogs versions aren't.

It's probably moot, but just for clarity, the ZFS version is whatever is built by Jenkins with b2_10. e2fsprogs is most recent GA and O/S is RHEL 7.4. I doubt any of these are particularly relevant though.

I can get sort of close, but not with your pool name:

I think you got much more than just "sort of close". I think you got an exact reproduction. The names of the pools, etc., are, I think, quite irrelevant.

The question still remains, though: when the e2label call exists only in the ldiskfs OSD codepath, in ldiskfs_read_ldd(), why is it being hit for a ZFS-formatted target?

My reading of the code is that by the time osd_read_ldd() calls either zfs_read_ldd() or ldiskfs_read_ldd(), the format of the target is already known and stored in ldd->ldd_mount_type, so only the relevant one of the two should be called, not both. So why are we getting an error from the e2label call that exists only in ldiskfs_read_ldd()?

Comment by Nathaniel Clark [ 08/Aug/17 ]

The name of the pool is absolutely key for my reproduction:

lustre/utils/libmount_utils_ldiskfs.c:

 448 /* Check whether the file exists in the device */
 449 static int file_in_dev(char *file_name, char *dev_name)
 450 {
 451         FILE *fp;
 452         char debugfs_cmd[256];
 453         unsigned int inode_num;
 454         int i;
 455 
 456         /* Construct debugfs command line. */
 457         snprintf(debugfs_cmd, sizeof(debugfs_cmd),
 458                  "%s -c -R 'stat %s' '%s' 2>&1 | egrep '(Inode|unsupported)'",
 459                  DEBUGFS, file_name, dev_name);
 460 

Notice the ...| egrep ... on line 458. It will emit output even on an error if the egrep matches the pool name, since stderr is also redirected through the egrep.
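The effect is easy to reproduce on its own with a pool name containing the string "Inode": when debugfs fails, it echoes the device name in its error message, and the egrep filter then matches the error text itself. This is a plain-shell sketch of that pipeline (the echo stands in for the failing debugfs run; grep -E is equivalent to egrep):

```shell
# file_in_dev() pipes debugfs output (stdout AND stderr) through
# egrep '(Inode|unsupported)'. If the pool name itself contains the
# string "Inode", debugfs's own failure message matches the filter.
POOL="zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS"

# Stand-in for the failing debugfs run: it just echoes its error text.
echo "$POOL: No such file or directory while opening filesystem" \
    | grep -E '(Inode|unsupported)'
echo "filter matched: $?"   # 0 -> file_in_dev() concludes the file exists
```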

Comment by Brian Murrell (Inactive) [ 09/Aug/17 ]

Ahhh. So the pool name is relevant only to force that codepath. But clearly that is not the problem in my case.

Comment by Nathaniel Clark [ 09/Aug/17 ]

Right. But something similar must be happening, though I am at a loss as to what.

Comment by Nathaniel Clark [ 10/Aug/17 ]

Okay, I was testing something totally different and ran into this:

[root@ieel-mds03 ~]# mount -t lustre MGS/MGT /mnt/MGT
e2label: No such file or directory while trying to open MGS/MGT
Couldn't find valid filesystem superblock.

But on closer inspection:

[root@ieel-mds03 ~]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/cl_ieel--mds03-root   6486016  1867296   4618720  29% /
devtmpfs                           496568        0    496568   0% /dev
tmpfs                              508324    39216    469108   8% /dev/shm
tmpfs                              508324    13188    495136   3% /run
tmpfs                              508324        0    508324   0% /sys/fs/cgroup
/dev/sda1                         1038336   193444    844892  19% /boot
ieel-storage:/home               40572928 38486912   2086016  95% /home
tmpfs                              101668        0    101668   0% /run/user/0
MGS                               5047168        0   5047168   0% /MGS
MGS/MGT                           5007744        0   5005696   0% /mnt/MGT

The filesystem did mount, and on each umount and remount I get the same error message, but the mount succeeds.

Comment by Gerrit Updater [ 10/Aug/17 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/28456
Subject: LU-9799 mount: Call read_ldd with initialized mount type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aeffd031d67a99896657617f7acd6dafd7a7722c

Comment by Gerrit Updater [ 17/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28456/
Subject: LU-9799 mount: Call read_ldd with initialized mount type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0108281c65545df169faaa0ce0690fb021680643

Comment by Peter Jones [ 17/Aug/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 17/Aug/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28581
Subject: LU-9799 mount: Call read_ldd with initialized mount type
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 776cae93b762e819cf80eed04d57bfc4040f09f0

Comment by Gerrit Updater [ 18/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28581/
Subject: LU-9799 mount: Call read_ldd with initialized mount type
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: f869b9da902fc305bfab8e902d0c1202aec6a7bc

Generated at Sat Feb 10 02:29:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.