[LU-9799] mount doesn't return an error when failing - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
Affects Version/s: Lustre 2.10.0
Labels:
None
Environment:
Lustre: Build Version: 2.10.0_5_gbb3c407

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

When mount -t lustre ... has failed to actually mount a target, the exit code of mount does not reflect this:

# mount -t lustre zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS /mnt/MGS
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
# echo $?
0

This of course wreaks havoc on systems such as IML which rely on the exit code of one step in the process of starting a filesystem to decide if it should continue with subsequent steps.
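As a defensive measure on the caller's side, a script that drives these steps could confirm the mount independently instead of trusting the exit code alone. The following is a minimal sketch under stated assumptions: is_mounted, checked_mount and the MOUNT_TABLE variable are hypothetical names introduced here for illustration and are not part of Lustre or IML. Mount points containing spaces are not handled, because /proc/mounts escapes them.

```shell
# Hypothetical helpers, not part of Lustre or IML.

# is_mounted DIR: succeed if DIR is listed as a mount point.
# MOUNT_TABLE defaults to /proc/mounts (field 2 is the mount point);
# it is overridable so the check can be exercised without real mounts.
is_mounted() {
    awk -v d="$1" '$2 == d { found = 1 } END { exit !found }' \
        "${MOUNT_TABLE:-/proc/mounts}"
}

# checked_mount DIR CMD...: run CMD, then confirm DIR is really mounted.
# Fails if CMD fails, or if CMD exits 0 but DIR is absent from the table.
checked_mount() {
    mnt=$1
    shift
    "$@" || return $?
    if ! is_mounted "$mnt"; then
        echo "command exited 0 but $mnt is not mounted" >&2
        return 1
    fi
}
```

With the command from this report, the call would look like: checked_mount /mnt/MGS mount -t lustre zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS /mnt/MGS. This only works around the symptom; the exit code of mount still needs to be correct.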

Issue Links

is duplicated by

LU-9853 mount.lustre noisy on mount

Resolved

Activity

Brian Murrell (Inactive) added a comment - 08/Aug/17 6:28 PM

utopiabound:

> But the ZFS, OS, e2fsprogs versions aren't.

It's probably moot, but just for clarity: the ZFS version is whatever Jenkins builds with b2_10, e2fsprogs is the most recent GA, and the O/S is RHEL 7.4. I doubt any of these are particularly relevant, though.

> I can get sort of close, but not with your pool name:

I think you got much more than just "sort of close". I think you got an exact reproduction. The names of pools, etc. are, I think, quite irrelevant.

The question remains, though: when the e2label call is only in the ldiskfs OSD codepath, in ldiskfs_read_ldd(), why is it being hit for a ZFS-formatted target?

My reading of the code is that by the time osd_read_ldd() is supposed to call either zfs_read_ldd() or ldiskfs_read_ldd(), the format of the target is known and stored in ldd->ldd_mount_type. So only the relevant one of the two should be called, not both. Why, then, are we getting an error from the e2label call that exists only in ldiskfs_read_ldd()?
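The expectation described above can be modelled in a few lines. This is a shell model of the reasoning only, not the Lustre C source: the function names mirror those named in this comment, their bodies are placeholders that just report which path ran, and passing the mount type as a string argument is an assumption made for illustration.

```shell
# Shell model (not Lustre source) of the dispatch described in the comment.
# Placeholder bodies: each backend only reports that it was reached.
zfs_read_ldd()     { echo "zfs path"; }
ldiskfs_read_ldd() { echo "ldiskfs path"; }   # the e2label call lives here

# osd_read_ldd MOUNT_TYPE: call exactly one backend, chosen by the type.
osd_read_ldd() {
    case "$1" in
        zfs)     zfs_read_ldd ;;
        ldiskfs) ldiskfs_read_ldd ;;
        *)       echo "unknown mount type: $1" >&2; return 1 ;;
    esac
}
```

Under this model, an e2label error for a ZFS target would imply that the type value was not "zfs" at the moment the dispatch ran, which is the open question in this thread.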

Nathaniel Clark added a comment - 08/Aug/17 4:08 PM

> The Lustre build version is in the Environment field of this ticket.

But the ZFS, OS, e2fsprogs versions aren't.

I can get sort of close, but not with your pool name:

[root@ieel-mds03 tmp]# mount -t lustre zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS /mnt/MGT; echo $?
zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS: No such file or directory while opening filesystem
e2label: No such file or directory while trying to open zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS
Couldn't find valid filesystem superblock.
mount.lustre: zfs_pool_scsi0QEMU_Inode_HARDDISK_disk13/MGS failed to read permanent mount data: 
254

If e2label were to print anything (even a newline) to stdout on error, the final return code would be 0.
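The masking effect described in the last sentence can be illustrated generically. The sketch below is not the mount.lustre code: it only contrasts a caller that infers success from "the tool wrote something to stdout" with one that checks the tool's exit status. All function names are hypothetical, and the failing tool is a stand-in.

```shell
# Risky pattern: success is inferred from whether any bytes reached stdout.
# The tool's own exit status is lost, because the pipeline reports wc's.
label_ok_by_output() {
    n=$(( $("$@" 2>/dev/null | wc -c) ))
    [ "$n" -gt 0 ]
}

# Safer pattern: the tool's exit status decides the result.
label_ok_by_status() {
    "$@" >/dev/null 2>&1
}

# Stand-in for a tool that fails but still prints a newline to stdout.
fake_tool_fails_with_newline() {
    echo
    return 1
}

# Stand-in for a tool that fails and writes only to stderr.
fake_tool_fails_quietly() {
    echo "error" >&2
    return 1
}
```

Here label_ok_by_output fake_tool_fails_with_newline reports success even though the tool failed, while label_ok_by_status reports the failure in both cases.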

Brian Murrell (Inactive) added a comment - 08/Aug/17 11:50 AM

utopiabound:

> Can you give me exact versions of what you are using,

The Lustre build version is in the Environment field of this ticket.

> I'm having trouble reproducing even the initial error messages

Yes, it's not very reproducible. We only hit it occasionally. But since we create Lustre filesystems so many times during our test runs, even very intermittent issues can seem to hit frequently.

I think the first step is to agree on the expected behaviour. My position, from reading the code, is that one should never be able to get the "e2label: No such file or directory while trying to open..." error if the target was properly formatted: the codepath that produces that error should be ldiskfs-only, so for a ZFS-formatted target it should be unreachable. You seem to indicate that it could happen, though. I explained why I don't think one should be able to hit that codepath, but you must disagree with my explanation if you think it can be hit.

Could you explain what I missed in my analysis?
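Since the failure is intermittent, one way to catch it is to repeat the operation and count the runs that exit 0 while still printing the error. The helper below is a hypothetical sketch, not an existing test tool; the command to repeat is supplied by the caller, and for a real mount it would have to include whatever cleanup (for example an unmount) is needed between attempts.

```shell
# count_silent_failures N PATTERN CMD...
# Run CMD N times and print how many runs exited 0 while writing a line
# matching PATTERN to stderr.
count_silent_failures() {
    n=$1
    pattern=$2
    shift 2
    hits=0
    i=0
    while [ "$i" -lt "$n" ]; do
        # Capture stderr only; stdout is discarded.
        err=$("$@" 2>&1 >/dev/null)
        rc=$?
        if [ "$rc" -eq 0 ] && printf '%s\n' "$err" | grep -q -- "$pattern"; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
    done
    echo "$hits"
}
```

A non-zero count for the pattern 'e2label:' would confirm that the error message and a zero exit code occurred in the same run.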

Nathaniel Clark added a comment - 07/Aug/17 4:15 PM

Can you give me exact versions of what you are using, and maybe an sosreport from the host? I'm having trouble reproducing even the initial error messages.

Brian Murrell (Inactive) added a comment - 03/Aug/17 3:49 PM

> ldd_mount_type should be set by osd_is_lustre() or it would error out with "... has not been formatted with mkfs.lustre..."

Indeed, that is my understanding from reading the code also.

> Are you sure it didn't mount it?

I got here actually because the error I was trying to find the source of was:

Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1846:server_fill_super()) Unable to start targets: -108
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:1560:server_put_super()) no obd testfs-OST0001
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
Jun 27 15:59:53 lotus-55vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 12529:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)

So that would indicate that it didn't mount.

> The errors listed are ldiskfs errors, which is okay when mounting a ZFS device.

But why would one get any ldiskfs errors when trying to mount a ZFS device if osd_is_lustre() correctly set ldd_mount_type? If ldd_mount_type is correct then why would ldiskfs_read_ldd() (or ldiskfs_<anything> for that matter) even have any opportunity to produce any ldiskfs related errors?

Nathaniel Clark added a comment - 03/Aug/17 12:08 PM

ldd_mount_type should be set by osd_is_lustre() or it would error out with "... has not been formatted with mkfs.lustre..."

Are you sure it didn't mount it? The errors listed are ldiskfs errors, which is okay when mounting a ZFS device.

Peter Jones added a comment - 01/Aug/17 3:31 PM

Nathaniel

Can you please advise?

Thanks

Peter

Brian Murrell (Inactive) added a comment - 31/Jul/17 4:51 PM

But what is interesting about this command and error is that clearly the user is trying to mount a ZFS target so why is ldiskfs_read_ldd() being called for it?

Isn't osd_read_ldd() supposed to know that the target is (supposed to be formatted) ZFS at this point through ldd->ldd_mount_type?

Could this be a symptom of a failed mkfs on the target perhaps?

Any other suggestions of what this could be a symptom of would be welcome.

People

Assignee: Nathaniel Clark

Reporter: Brian Murrell (Inactive)

Votes: 0

Watchers: 5

Dates

Created: 26/Jul/17 11:22 AM

Updated: 18/Aug/17 11:34 PM

Resolved: 17/Aug/17 4:32 AM