[LU-137] ioctl passthrough mechanism for Lustre OST/MDT mountpoints Created: 17/Mar/11  Updated: 15/Aug/23  Resolved: 19/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.5.0
Fix Version/s: Lustre 2.16.0

Type: New Feature Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: patch

Attachments: Text File e2fsprogs-resize_mtpt.patch     File ioctl-passthru-1_8.diff    
Issue Links:
Related
is related to LU-6202 clean up ioctl handling Open
is related to LU-11355 Add Lustre trim support Resolved
is related to LU-16835 lustre-initialization: Operation not ... Open
is related to LU-5822 health_check file not updating properly Resolved
is related to LU-4795 server_put_mount() calls server_dereg... Resolved
is related to LU-8151 OST/MDT /proc/mounts always shows "ro... Resolved
is related to LU-4931 New feature of giving server/storage ... Resolved
is related to LU-17027 missing #include <linux/file.h> in lu... Resolved
Bugzilla ID: 14,489
Rank (Obsolete): 8383

 Description   

Implement an interface for sending I/O control (ioctl) commands from userspace through the Lustre mount point to the underlying ldiskfs filesystem, to allow execution of filesystem-wide ioctl() commands such as resize. This will allow userspace tools that operate via ioctl() commands on the filesystem mountpoint to be used on Lustre MDT and OST filesystems while they are mounted and in use, subject to any limitations of the original ioctl() commands themselves.



 Comments   
Comment by Bryon Neitzel (Inactive) [ 06/Jun/11 ]

Assigning back to Andreas until we find a new owner for this project.

Comment by Andreas Dilger [ 07/Jun/11 ]

Prototype patch for ioctl passthrough. It is known to have problems (i.e. crash if used), but is close to what I think would work.

There are a few avenues for investigation as to why it is not working:

  • I had to make a change to the e2fsprogs library to allow resize2fs to work,
    because it was calling stat() on the mountpoint to verify that it was the
    same as the underlying block device, to determine if the filesystem is mounted.
    (first attachment)
  • This in turn required a change to the lustre server mountpoint to ensure that
    it is initializing the device number correctly, so that st_rdev for the Lustre
    mountpoint is the same as st_dev of the underlying device node. This is also
    included with the server_ioctl() function that is supposed to be working.
    (second attachment)
  • The pointer to the underlying ldiskfs device may be incorrect. The handling
    of the internal ldiskfs mountpoint is more complex than I'd like, so it isn't
    immediately clear that my code is pointing at the right device.
  • The size of the ioctl parameter that is being copied could be incorrect.
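
The device-number check described in the first two items can be reproduced from userspace. This is only a rough sketch, not the e2fsprogs code: the helper names are made up for illustration, and it assumes the comparison is between the st_dev reported for the mountpoint directory and the st_rdev of the block device node, which is how stat(2) defines the two fields.

```python
import os
import stat


def format_dev(dev: int) -> str:
    """Render a device number as major:minor."""
    return f"{os.major(dev)}:{os.minor(dev)}"


def mountpoint_matches_device(mountpoint: str, device_node: str) -> bool:
    """Return True if the mountpoint reports the device number of device_node.

    Assumption: the mountpoint's st_dev is compared with the block device
    node's st_rdev. This is an illustration, not the e2fsprogs implementation.
    """
    node = os.stat(device_node)
    if not stat.S_ISBLK(node.st_mode):
        return False  # not a block device node, so there is nothing to compare
    return os.stat(mountpoint).st_dev == node.st_rdev
```

With the paths used later in this ticket, `mountpoint_matches_device("/mnt/ost-x", "/dev/loop3")` would be expected to return True only once the server mountpoint reports the correct device number.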

Comment by Andreas Dilger [ 07/Jun/11 ]

Patch for the e2fsprogs resize2fs tool to allow it to specify the directory mountpoint for resizing, instead of requiring one to specify a block device for the resize operation.

I don't think this should be required for the ioctl passthrough to work, but it simplifies the usage of resize2fs, and I thought I needed it until I fixed the Lustre sb->s_dev on the mounted filesystem.

Comment by Andreas Dilger [ 07/Jun/11 ]

Updated patch that includes sb->s_dev fix.

Comment by Kalpak Shah (Inactive) [ 23/May/13 ]

Hi Andreas, I am working on this ticket.

Comment by Andreas Dilger [ 23/May/13 ]

Unfortunately, I expect that the 1.8 version of the patch is completely useless for the current 2.1 and master code...

You might want to experiment with some simple ioctl (e.g. EXT4_IOC_GETFLAGS or EXT4_IOC_GETVERSION) to get that working before you try to have resize2fs calling the online resizer. The end goal is that at least EXT4_IOC_GROUP_EXTEND/EXT4_IOC_GROUP_ADD (old resize), EXT4_IOC_RESIZE_FS (new resize), and FITRIM work.
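
A simple probe along these lines could look like the sketch below. It is a sketch under assumptions: the ioctl numbers are built with the asm-generic encoding used on x86-64 (other architectures may differ), the helper names are invented for this example, and the result is read as a 32-bit value because ext4 stores an int for both commands even though the macros are declared with long.

```python
import errno
import fcntl
import os
import struct

# asm-generic ioctl encoding (x86-64); an assumption for other architectures.
IOC_WRITE, IOC_READ = 1, 2
LONG_SIZE = struct.calcsize("l")


def ioc(direction: int, type_char: str, nr: int, size: int) -> int:
    """Build an ioctl number from direction, type character, number and size."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


EXT4_IOC_GETFLAGS = ioc(IOC_READ, "f", 1, LONG_SIZE)    # _IOR('f', 1, long)
EXT4_IOC_GETVERSION = ioc(IOC_READ, "f", 3, LONG_SIZE)  # _IOR('f', 3, long)


def read_int_ioctl(path: str, request: int):
    """Issue a read-style ioctl on path.

    Returns (value, None) on success or (None, errno_name) on failure.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        buf = bytearray(LONG_SIZE)     # zeroed buffer the kernel writes into
        fcntl.ioctl(fd, request, buf)  # a mutable buffer is updated in place
        return struct.unpack_from("I", buf)[0], None
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

Calling `read_int_ioctl("/mnt/ost1", EXT4_IOC_GETFLAGS)` on a server mountpoint without passthrough support would be expected to return an error name such as ENOTTY rather than a value.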

Comment by Swapnil Pimpale (Inactive) [ 16/Aug/13 ]

Hi Andreas,

I have ported the ioctl_passthru-1_8.patch to the latest master.
With this patch I tested EXT4_IOC_GETFLAGS and EXT4_IOC_GETVERSION ioctls on OST and MDT mountpoints. These ioctls work and return expected values.
I have added this as a test case in sanity.sh.
The patch can be found here -> http://review.whamcloud.com/#/c/7354/

I tried the EXT4_IOC_SETVERSION ioctl but that resulted in a crash.
Is that expected?

Comment by Andreas Dilger [ 19/Aug/13 ]

It definitely shouldn't crash regardless of which ioctl is used, though I don't necessarily expect it to do anything useful. Presumably the GETVERSION and GETFLAGS ioctls return the correct values from the underlying root inode?

Next step is to figure out why it crashed and fix that.

Comment by Swapnil Pimpale (Inactive) [ 19/Aug/13 ]

Yes, GETVERSION and GETFLAGS ioctls return correct values which are as follows:

MDS:
GETVERSION: 0
GETFLAGS: 0x0

OST:
GETVERSION: 0
GETFLAGS: 0x80000

The crash occurred because of a NULL pointer dereference in mnt_want_write().
The stack is as follows:

<1>BUG: unable to handle kernel NULL pointer dereference at 00000000000000e2
<1>IP: [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4>PGD 1512d067 PUD c0d4067 PMD 0 
<4>Oops: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:00.0/irq
<4>CPU 1 
<4>Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) mdd(U) mgs(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) jbd2 sha512_generic sha256_generic crc32c_intel nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 dm_mirror dm_region_hash dm_log uinput ppdev parport_pc parport e1000 sg vmware_balloon i2c_piix4 i2c_core shpchp ext3 jbd mbcache sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mod [last unloaded: libcfs]
<4>
<4>Pid: 26460, comm: ioctl_passthru Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff81193c34>]  [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4>RSP: 0018:ffff880020525cc8  EFLAGS: 00010246
<4>RAX: 0000000000000000 RBX: ffff880020525d78 RCX: 0000000000000003
<4>RDX: 0000000000000001 RSI: 0000000040086604 RDI: 0000000000000002
<4>RBP: ffff880020525cc8 R08: ffffffffa0462c80 R09: 0000000000000000
<4>R10: 0000000000000000 R11: 0000000000000206 R12: 00007fff78321100
<4>R13: ffff88003c9be5b0 R14: 00007fff78321100 R15: 0000000000000000
<4>FS:  00007f94d8a74700(0000) GS:ffff880002280000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00000000000000e2 CR3: 000000001dcfc000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ioctl_passthru (pid: 26460, threadinfo ffff880020524000, task ffff880020622aa0)
<4>Stack:
<4> ffff880020525d58 ffffffffa07198f4 0000000000000000 ffff88001dc499a8
<4><d> 0000000000000286 ffff88003c9be5b0 ffff880020525d18 ffffffff811902c0
<4><d> ffff880020525d18 ffff88003c9be5b0 ffff880020525d58 ffffffff8118ee10
<4>Call Trace:
<4> [<ffffffffa07198f4>] ldiskfs_ioctl+0xe4/0x940 [ldiskfs]
<4> [<ffffffff811902c0>] ? iput+0x30/0x70
<4> [<ffffffff8118ee10>] ? d_obtain_alias+0xc0/0x230
<4> [<ffffffffa0451b2a>] server_ioctl+0xba/0xf0 [obdclass]
<4> [<ffffffff81312cb3>] ? pty_write+0x73/0x80
<4> [<ffffffff8130c34e>] ? do_output_char+0x1de/0x210
<4> [<ffffffff81090d4c>] ? remove_wait_queue+0x3c/0x50
<4> [<ffffffff81052223>] ? __wake_up+0x53/0x70
<4> [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
<4> [<ffffffff81310efe>] ? tty_ldisc_deref+0xe/0x10
<4> [<ffffffff81309e93>] ? tty_write+0x233/0x2a0
<4> [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
<4> [<ffffffff81176602>] ? vfs_write+0x132/0x1a0
<4> [<ffffffff81189731>] sys_ioctl+0x81/0xa0
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 8b 40 58 83 e0 01 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 65 8b 14 25 b8 e0 00 00 48 63 d2 <48> 8b 87 e0 00 00 00 48 03 04 d5 20 81 bf 81 83 00 01 0f ae f0
<1>RIP  [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4> RSP <ffff880020525cc8>
<4>CR2: 00000000000000e2

This was because the active_filp.f_path.mnt was not filled in before calling the ioctl.
I have fixed this problem in the next patchset (http://review.whamcloud.com/#/c/7354/4)
With this fix, I am able to test the SETVERSION and SETFLAGS ioctl.
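
The register dump above can be cross-checked with a little arithmetic. The sketch below assumes the asm-generic ioctl encoding used on x86-64, that RSI still holds the command argument passed to ldiskfs_ioctl(), and that RDI is the pointer handed to mnt_want_write(). The faulting instruction bytes "<48> 8b 87 e0 00 00 00" decode to mov 0xe0(%rdi),%rax, so the faulting address should be RDI + 0xe0.

```python
def decode_ioctl(cmd: int) -> dict:
    """Split a Linux ioctl number into its fields (asm-generic layout)."""
    directions = {0: "none", 1: "write", 2: "read", 3: "read|write"}
    return {
        "dir": directions[(cmd >> 30) & 0x3],
        "size": (cmd >> 16) & 0x3FFF,
        "type": chr((cmd >> 8) & 0xFF),
        "nr": cmd & 0xFF,
    }


# Register values copied from the oops above.
RSI = 0x0000000040086604
RDI = 0x0000000000000002
CR2 = 0x00000000000000E2

# Displacement taken from the faulting instruction: mov 0xe0(%rdi),%rax
FIELD_OFFSET = 0xE0

print(decode_ioctl(RSI))        # {'dir': 'write', 'size': 8, 'type': 'f', 'nr': 4}
print(hex(RDI + FIELD_OFFSET))  # 0xe2, equal to CR2
```

The decoded command matches _IOW('f', 4, long), which is EXT4_IOC_SETVERSION, and RDI + 0xe0 equals CR2. That suggests the mount pointer held a garbage value of 0x2 rather than a valid vfsmount, which fits f_path.mnt not having been filled in.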

I also tested the old online resizefs ioctl LDISKFS_IOC_GROUP_EXTEND (latest e2fsprogs) as follows and it seems to be working:

# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              28G   20G  6.3G  76% /
tmpfs                 435M   88K  435M   1% /dev/shm
/dev/sda1             486M  109M  352M  24% /boot
/dev/loop0            147M   18M  120M  13% /mnt/mds1
/dev/loop1            184M   26M  149M  15% /mnt/ost1
/dev/loop2            184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre       367M   51M  297M  15% /mnt/lustre

# dd if=/dev/zero of=/tmp/lustre-ost-x bs=1M count=200

# mkfs.lustre --fsname=lustre --mgsnode=192.168.100.26@tcp0 --ost --index=2 --device-size=100000 /tmp/lustre-ost-x

# mount -t lustre -o loop /tmp/lustre-ost-x /mnt/ost-x/

# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              28G   20G  6.3G  76% /
tmpfs                 435M   88K  435M   1% /dev/shm
/dev/sda1             486M  109M  352M  24% /boot
/dev/loop0            147M   18M  120M  13% /mnt/mds1
/dev/loop1            184M   26M  149M  15% /mnt/ost1
/dev/loop2            184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre       458M   56M  378M  13% /mnt/lustre
/dev/loop3             92M  5.3M   82M   7% /mnt/ost-x

~/e2fsprogs# ./build/resize/resize2fs /dev/loop3 125M
resize2fs 1.43-WIP (21-Jan-2013)
Filesystem at /dev/loop3 is mounted on /mnt/ost-x; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/loop3 to 32000 (4k) blocks.
The filesystem on /dev/loop3 is now 32000 blocks long.

# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              28G   20G  6.3G  76% /
tmpfs                 435M   88K  435M   1% /dev/shm
/dev/sda1             486M  109M  352M  24% /boot
/dev/loop0            147M   18M  120M  13% /mnt/mds1
/dev/loop1            184M   26M  149M  15% /mnt/ost1
/dev/loop2            184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre       486M   56M  406M  13% /mnt/lustre
/dev/loop3            119M  5.3M  109M   5% /mnt/ost-x

# umount /mnt/ost-x/

# mount -t lustre -o loop /tmp/lustre-ost-x /mnt/ost-x/

# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              28G   20G  6.3G  76% /
tmpfs                 435M   88K  435M   1% /dev/shm
/dev/sda1             486M  109M  352M  24% /boot
/dev/loop0            147M   18M  120M  13% /mnt/mds1
/dev/loop1            184M   26M  149M  15% /mnt/ost1
/dev/loop2            184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre       486M   56M  406M  13% /mnt/lustre
/dev/loop3            119M  5.3M  109M   5% /mnt/ost-x
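
The block counts in the transcript above can be checked by hand. The sketch below assumes 4 KiB blocks (as printed by resize2fs), power-of-two size suffixes, and that the mkfs.lustre --device-size value is in KiB. The helper name is made up for this example.

```python
import re

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def size_to_blocks(size: str, block_size: int = 4096) -> int:
    """Convert a size such as '125M' into a count of filesystem blocks."""
    match = re.fullmatch(r"(\d+)([KMG])", size)
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    return int(match.group(1)) * UNITS[match.group(2)] // block_size


old_blocks = size_to_blocks("100000K")  # --device-size=100000, assumed KiB
new_blocks = size_to_blocks("125M")     # target passed to resize2fs
growth_mib = (new_blocks - old_blocks) * 4096 / (1 << 20)

print(old_blocks, new_blocks)  # 25000 32000
print(round(growth_mib, 1))    # 27.3
```

The 32000 blocks agree with the resize2fs output, and the growth of roughly 27 MiB is in line with df reporting /mnt/ost-x going from 92M to 119M.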

Comment by Swapnil Pimpale (Inactive) [ 09/May/14 ]

Hi Alex,

This patch: http://review.whamcloud.com/#/c/8286/2 removes lsi_srv_mnt, lmi_mnt and ddp_mnt. The commit message says that the vfsmount has become redundant because of the introduction of the local storage device.
The ioctl passthrough patch needs a valid vfsmount.

The following is per discussion with Andreas:
It seems ext4_ioctl() needs filp for mnt_{want,drop}_write_file(), which uses file->f_path.mnt for a number of things, so it really needs a valid vfsmnt structure. The vfsmount still appears to be available in osd_dt_dev(lsi->lsi_dt_dev)->od_mnt, but od_mnt is specific to the underlying OSD device (ldiskfs or ZFS). It looks like there are no direct ioctls for the DMU; they are all handled via /dev/zfs. There are two ioctls for ZFS files in ZPL, but we don't use that, so this only needs to work for ldiskfs mountpoints for now. However, it isn't safe to dig into the OSD structure directly, since this could crash on a ZFS-backed filesystem. It might make sense to add the vfsmnt into dt_device_param. We might need to add a ->dt_ioctl() method to dt_device_operations, since accessing the superblock directly is also bad.

Sadly, looking at this patch (LU-137) in light of ZFS-backed devices (i.e. anything 2.4 and later), it seems that there are quite a few things that are not "right" about it. ZFS doesn't even have a superblock, so that makes much of the patch invalid. Some of the info can be fetched via the osd_conf_get() interface, in particular s_dev is important for the resize ioctl. Note that even accessing the osd_device outside of the OSD code isn't possible, because the osd_device structure is different for each OSD.

What would be the best way to proceed in this case?

Comment by Alex Zhuravlev [ 12/May/14 ]

I don't think you want to manage ZFS pools (or btrfs devices) directly. We can provide the vfsmount or superblock via osd_conf_get(), but again that will work for ldiskfs only.

Comment by Andreas Dilger [ 29/Aug/14 ]

Alex, what about adding a new dt_ioctl() method for the OSD API? It would of course be fine if the underlying OSD doesn't support the given ioctl (e.g. returns -ENOTTY) but gives a way to add features like this. I see that this was handled for FIEMAP by adding a dbo_fiemap_get() method (though it could return -EOPNOTSUPP a LOT earlier in the request processing), but that might result in an explosion of different methods if we add one for every ioctl. Other candidates that we might need in the future include FITRIM to trim unallocated space for SSD or thin-provisioned devices, EXT4_IOC_PRECACHE_EXTENTS to prefetch file extent metadata, EXT4_IOC_MOVE_EXT or EXT4_IOC_MIGRATE for data migration within the OST.

It isn't yet clear to me if we want separate ioctl methods for the whole device and per file, or if it is OK to just do the "device" ioctls on the root inode.
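
The dt_ioctl() idea above can be modelled in a few lines. This is purely a hypothetical model in Python, not Lustre code: the class and function names are invented, the FITRIM value is the x86-64 encoding, and the trim handler is a stub. It only illustrates a single per-OSD entry point that returns -ENOTTY for anything the backend does not support, instead of one method per ioctl.

```python
import errno

FITRIM = 0xC0185879  # _IOWR('X', 121, struct fstrim_range), x86-64 encoding


class OsdDevice:
    """Hypothetical stand-in for an OSD device; the default rejects all ioctls."""

    def dt_ioctl(self, cmd: int, arg=None) -> int:
        return -errno.ENOTTY


class LdiskfsOsd(OsdDevice):
    """Backend that passes selected commands to the underlying filesystem."""

    def __init__(self):
        # Table of supported commands; new ioctls add an entry, not a method
        # on the shared device operations.
        self._handlers = {FITRIM: self._trim}

    def _trim(self, arg) -> int:
        return 0  # stub: a real backend would call into the filesystem here

    def dt_ioctl(self, cmd: int, arg=None) -> int:
        handler = self._handlers.get(cmd)
        if handler is None:
            return -errno.ENOTTY
        return handler(arg)


class ZfsOsd(OsdDevice):
    """Backend with no passthrough support; inherits the -ENOTTY default."""


def server_ioctl(osd: OsdDevice, cmd: int, arg=None) -> int:
    """Mountpoint-level entry point that never looks inside the OSD."""
    return osd.dt_ioctl(cmd, arg)
```

In this model the caller never touches a backend-specific superblock or vfsmount, which is the property the discussion above is after.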

Comment by Gerrit Updater [ 13/May/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20161
Subject: LU-137 osd: better stat info for server mountpoints
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db5b4aaa6177d9ab179b46a8d3e5d13c5d2c2883

Comment by Gerrit Updater [ 11/Oct/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/23092
Subject: LU-137 obdclass: add dt_object_put() and use it
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cf4f59a5a7253e18ac98983216791cb730511d18

Comment by Gerrit Updater [ 26/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23092/
Subject: LU-137 obdclass: add dt_object_put() and use it
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5963af745b3aa14410d5ceb66f8a7b7d6aaf576a

Comment by James A Simmons [ 03/Oct/18 ]

Linking this since I plan on fixing the ioctl direction issue which will provide a proper interface for this as well.

Comment by Andreas Dilger [ 04/Apr/20 ]

It may be that this has been fixed via patch https://review.whamcloud.com/33131 "Subject: LU-11355 lustre: enable fstrim on lustre device", which added a generic ioctl passthrough from userspace to ldiskfs.

However, the last time I had tested this (several years ago) there were also some issues with e2fsprogs/resize2fs being unhappy that the block device (st_rdev) reported by the Lustre server stub mount did not match the underlying block device. That is what patch: http://review.whamcloud.com/20161 "LU-137 osd: better stat info for server mountpoints" was about, but I haven't updated it in several years.
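
For reference, the FITRIM request that the fstrim passthrough mentioned above carries looks roughly like the sketch below when issued from userspace. It assumes the x86-64 ioctl encoding and the struct fstrim_range layout of three 64-bit fields (start, len, minlen). The helper name is invented, and this is an illustration rather than the fstrim(8) implementation.

```python
import errno
import fcntl
import os
import struct

FITRIM = 0xC0185879                   # _IOWR('X', 121, struct fstrim_range)
FSTRIM_RANGE = struct.Struct("=QQQ")  # __u64 start, len, minlen
U64_MAX = (1 << 64) - 1


def trim_mountpoint(path: str, minlen: int = 0):
    """Ask the filesystem mounted at path to discard its unused blocks.

    Returns (bytes_trimmed, None) on success or (None, errno_name) on failure.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        # start=0 and len=U64_MAX cover the whole filesystem.
        buf = bytearray(FSTRIM_RANGE.pack(0, U64_MAX, minlen))
        fcntl.ioctl(fd, FITRIM, buf)  # kernel writes the trimmed size into len
        _start, trimmed, _minlen = FSTRIM_RANGE.unpack(buf)
        return trimmed, None
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

Running `trim_mountpoint("/mnt/ost1")` requires root, and on a backend without discard support an error such as EOPNOTSUPP would be expected instead of a byte count.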

Comment by Gerrit Updater [ 19/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/20161/
Subject: LU-137 osd-ldiskfs: pass through resize ioctl
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ac0380dc519aa15310670d164e98453861ef332a

Comment by Peter Jones [ 19/May/23 ]

Landed for 2.16. It's been a while since I closed a Jira ticket with a bugzilla id!

Generated at Sat Feb 10 01:04:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.