[LU-137] ioctl passthrough mechanism for Lustre OST/MDT mountpoints Created: 17/Mar/11 Updated: 15/Aug/23 Resolved: 19/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Bugzilla ID: | 14,489 |
| Rank (Obsolete): | 8383 |
| Description |
|
Implement an interface for sending IO Control (ioctl) commands from userspace through the Lustre mount point to the underlying ldiskfs filesystem to allow execution of filesystem-wide ioctl() commands, such as resize. This will allow user-space tools that operate via ioctl() commands on the filesystem mountpoint to be used on the Lustre MDT and OST filesystems while they are mounted and in use subject to any limitations of the original ioctl() commands themselves. |
| Comments |
| Comment by Bryon Neitzel (Inactive) [ 06/Jun/11 ] |
|
Assigning back to Andreas until we find a new owner for this project. |
| Comment by Andreas Dilger [ 07/Jun/11 ] |
|
Prototype patch for ioctl passthrough. It is known to have problems (i.e. crash if used), but is close to what I think would work. There are a few avenues for investigation as to why it is not working:
|
| Comment by Andreas Dilger [ 07/Jun/11 ] |
|
Patch for the e2fsprogs resize2fs tool to allow it to specify the directory mountpoint for resizing, instead of requiring one to specify a block device for the resize operation. I don't think this should be required for the ioctl passthrough to work, but it simplifies the usage of resize2fs, and I thought I needed it until I fixed the Lustre sb->s_dev on the mounted filesystem. |
| Comment by Andreas Dilger [ 07/Jun/11 ] |
|
Updated patch that includes sb->s_dev fix. |
| Comment by Kalpak Shah (Inactive) [ 23/May/13 ] |
|
Hi Andreas, I am working on this ticket. |
| Comment by Andreas Dilger [ 23/May/13 ] |
|
Unfortunately, I expect that the 1.8 version of the patch is completely useless for the current 2.1 and master code... You might want to experiment with some simple ioctl (e.g. EXT4_IOC_GETFLAGS or EXT4_IOC_GETVERSION) to get that working before you try to have resize2fs calling the online resizer. The end goal is that at least EXT4_IOC_GROUP_EXTEND/EXT4_IOC_GROUP_ADD (old resize), EXT4_IOC_RESIZE_FS (new resize), and FITRIM work. |
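A first experiment along these lines could be a small userspace probe. The sketch below is not the ioctl_passthru test program mentioned later in this ticket: `probe_ioctl` is a hypothetical helper name, and the generic `FS_IOC_GETFLAGS` / `FS_IOC_GETVERSION` commands from `<linux/fs.h>` are used as stand-ins for the EXT4_IOC_* names above, which is an assumption about which command numbers one wants to test.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_GETVERSION */

/*
 * Open `path` read-only (a directory such as a mountpoint works) and
 * issue `cmd` on it, storing the result in *value.
 *
 * Although the command macros are declared with `long`, the kernel
 * writes an int for these two commands, so an int is used here.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
int probe_ioctl(const char *path, unsigned long cmd, int *value)
{
    assert(path != NULL && value != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Capture errno before close() has a chance to overwrite it. */
    int rc = ioctl(fd, cmd, value) < 0 ? -errno : 0;

    close(fd);
    return rc;
}
```

Calling `probe_ioctl(mountpoint, FS_IOC_GETFLAGS, &flags)` on a server mountpoint should return 0 once the passthrough works. A return of `-ENOTTY` would indicate that the ioctl never reached a filesystem that implements it.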
| Comment by Swapnil Pimpale (Inactive) [ 16/Aug/13 ] |
|
Hi Andreas, I have ported the ioctl_passthru-1_8.patch to the latest master. I tried the EXT4_IOC_SETVERSION ioctl but that resulted in a crash. |
| Comment by Andreas Dilger [ 19/Aug/13 ] |
|
It definitely shouldn't crash regardless of what ioctl is used, though I don't necessarily expect it to do anything useful. Presumably the GETVERSION and GETFLAGS ioctls return the correct values from the underlying root inode? Next step is to figure out why it crashed and fix that. |
| Comment by Swapnil Pimpale (Inactive) [ 19/Aug/13 ] |
|
Yes, GETVERSION and GETFLAGS ioctls return correct values which are as follows:

MDS: GETVERSION: 0 GETFLAGS: 0x0
OST: GETVERSION: 0 GETFLAGS: 0x80000

The crash occurred because of a NULL pointer dereference in mnt_want_write().

<1>BUG: unable to handle kernel NULL pointer dereference at 00000000000000e2
<1>IP: [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4>PGD 1512d067 PUD c0d4067 PMD 0
<4>Oops: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:00.0/irq
<4>CPU 1
<4>Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) mdd(U) mgs(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) jbd2 sha512_generic sha256_generic crc32c_intel nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 dm_mirror dm_region_hash dm_log uinput ppdev parport_pc parport e1000 sg vmware_balloon i2c_piix4 i2c_core shpchp ext3 jbd mbcache sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mod [last unloaded: libcfs]
<4>
<4>Pid: 26460, comm: ioctl_passthru Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff81193c34>] [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4>RSP: 0018:ffff880020525cc8 EFLAGS: 00010246
<4>RAX: 0000000000000000 RBX: ffff880020525d78 RCX: 0000000000000003
<4>RDX: 0000000000000001 RSI: 0000000040086604 RDI: 0000000000000002
<4>RBP: ffff880020525cc8 R08: ffffffffa0462c80 R09: 0000000000000000
<4>R10: 0000000000000000 R11: 0000000000000206 R12: 00007fff78321100
<4>R13: ffff88003c9be5b0 R14: 00007fff78321100 R15: 0000000000000000
<4>FS: 00007f94d8a74700(0000) GS:ffff880002280000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00000000000000e2 CR3: 000000001dcfc000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ioctl_passthru (pid: 26460, threadinfo ffff880020524000, task ffff880020622aa0)
<4>Stack:
<4> ffff880020525d58 ffffffffa07198f4 0000000000000000 ffff88001dc499a8
<4><d> 0000000000000286 ffff88003c9be5b0 ffff880020525d18 ffffffff811902c0
<4><d> ffff880020525d18 ffff88003c9be5b0 ffff880020525d58 ffffffff8118ee10
<4>Call Trace:
<4> [<ffffffffa07198f4>] ldiskfs_ioctl+0xe4/0x940 [ldiskfs]
<4> [<ffffffff811902c0>] ? iput+0x30/0x70
<4> [<ffffffff8118ee10>] ? d_obtain_alias+0xc0/0x230
<4> [<ffffffffa0451b2a>] server_ioctl+0xba/0xf0 [obdclass]
<4> [<ffffffff81312cb3>] ? pty_write+0x73/0x80
<4> [<ffffffff8130c34e>] ? do_output_char+0x1de/0x210
<4> [<ffffffff81090d4c>] ? remove_wait_queue+0x3c/0x50
<4> [<ffffffff81052223>] ? __wake_up+0x53/0x70
<4> [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
<4> [<ffffffff81310efe>] ? tty_ldisc_deref+0xe/0x10
<4> [<ffffffff81309e93>] ? tty_write+0x233/0x2a0
<4> [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
<4> [<ffffffff81176602>] ? vfs_write+0x132/0x1a0
<4> [<ffffffff81189731>] sys_ioctl+0x81/0xa0
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 8b 40 58 83 e0 01 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 65 8b 14 25 b8 e0 00 00 48 63 d2 <48> 8b 87 e0 00 00 00 48 03 04 d5 20 81 bf 81 83 00 01 0f ae f0
<1>RIP [<ffffffff81193c34>] mnt_want_write+0x14/0x80
<4> RSP <ffff880020525cc8>
<4>CR2: 00000000000000e2

This was because the active_filp.f_path.mnt was not filled in before calling the ioctl.

I also tested the old online resizefs ioctl LDISKFS_IOC_GROUP_EXTEND (latest e2fsprogs) as follows and it seems to be working:

# df -kh
Filesystem       Size  Used Avail Use% Mounted on
/dev/sdb1         28G   20G  6.3G  76% /
tmpfs            435M   88K  435M   1% /dev/shm
/dev/sda1        486M  109M  352M  24% /boot
/dev/loop0       147M   18M  120M  13% /mnt/mds1
/dev/loop1       184M   26M  149M  15% /mnt/ost1
/dev/loop2       184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre  367M   51M  297M  15% /mnt/lustre

# dd if=/dev/zero of=/tmp/lustre-ost-x bs=1M count=200
# mkfs.lustre --fsname=lustre --mgsnode=192.168.100.26@tcp0 --ost --index=2 --device-size=100000 /tmp/lustre-ost-x
# mount -t lustre -o loop /tmp/lustre-ost-x /mnt/ost-x/

# df -kh
Filesystem       Size  Used Avail Use% Mounted on
/dev/sdb1         28G   20G  6.3G  76% /
tmpfs            435M   88K  435M   1% /dev/shm
/dev/sda1        486M  109M  352M  24% /boot
/dev/loop0       147M   18M  120M  13% /mnt/mds1
/dev/loop1       184M   26M  149M  15% /mnt/ost1
/dev/loop2       184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre  458M   56M  378M  13% /mnt/lustre
/dev/loop3        92M  5.3M   82M   7% /mnt/ost-x

~/e2fsprogs# ./build/resize/resize2fs /dev/loop3 125M
resize2fs 1.43-WIP (21-Jan-2013)
Filesystem at /dev/loop3 is mounted on /mnt/ost-x; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/loop3 to 32000 (4k) blocks.
The filesystem on /dev/loop3 is now 32000 blocks long.

# df -kh
Filesystem       Size  Used Avail Use% Mounted on
/dev/sdb1         28G   20G  6.3G  76% /
tmpfs            435M   88K  435M   1% /dev/shm
/dev/sda1        486M  109M  352M  24% /boot
/dev/loop0       147M   18M  120M  13% /mnt/mds1
/dev/loop1       184M   26M  149M  15% /mnt/ost1
/dev/loop2       184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre  486M   56M  406M  13% /mnt/lustre
/dev/loop3       119M  5.3M  109M   5% /mnt/ost-x

# umount /mnt/ost-x/
# mount -t lustre -o loop /tmp/lustre-ost-x /mnt/ost-x/

# df -kh
Filesystem       Size  Used Avail Use% Mounted on
/dev/sdb1         28G   20G  6.3G  76% /
tmpfs            435M   88K  435M   1% /dev/shm
/dev/sda1        486M  109M  352M  24% /boot
/dev/loop0       147M   18M  120M  13% /mnt/mds1
/dev/loop1       184M   26M  149M  15% /mnt/ost1
/dev/loop2       184M   26M  149M  15% /mnt/ost2
TM2@tcp:/lustre  486M   56M  406M  13% /mnt/lustre
/dev/loop3       119M  5.3M  109M   5% /mnt/ost-x
|
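As a sanity check on the resize2fs output above, the requested size of 125M and the reported 32000 (4k) blocks agree. The helper below is only a worked form of that arithmetic; `mib_to_blocks` is a hypothetical name, and it assumes the "M" suffix means MiB and that the block size is 4096 bytes, as the "(4k)" in the output suggests.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a size in MiB to a count of filesystem blocks of
 * `block_size` bytes, rounding down to whole blocks.
 */
uint64_t mib_to_blocks(uint64_t mib, uint32_t block_size)
{
    assert(block_size != 0);
    return mib * 1024 * 1024 / block_size;
}
```

With these assumptions, `mib_to_blocks(125, 4096)` gives 125 * 1048576 / 4096 = 32000 blocks, matching the count resize2fs printed.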
| Comment by Swapnil Pimpale (Inactive) [ 09/May/14 ] |
|
Hi Alex, This patch: http://review.whamcloud.com/#/c/8286/2 removes lsi_srv_mnt, lmi_mnt and ddp_mnt. It is mentioned in the commit message that vfsmount has become redundant because of the introduction of local storage device. The following is per discussion with Andreas: Sadly, looking at this patch ( What would be the best way to proceed in this case? |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
I don't think you want to manage ZFS pools (or btrfs devices) directly. We can provide vfsmount or superblock via osd_conf_get(), but again that will work for ldiskfs only. |
| Comment by Andreas Dilger [ 29/Aug/14 ] |
|
Alex, what about adding a new dt_ioctl() method for the OSD API? It would of course be fine if the underlying OSD doesn't support the given ioctl (e.g. returns -ENOTTY), but it gives a way to add features like this. I see that this was handled for FIEMAP by adding a dbo_fiemap_get() method (though it could return -EOPNOTSUPP a LOT earlier in the request processing), but that might result in an explosion of different methods if we add one for every ioctl. Other candidates that we might need in the future include FITRIM to trim unallocated space for SSD or thin-provisioned devices, EXT4_IOC_PRECACHE_EXTENTS to prefetch file extent metadata, and EXT4_IOC_MOVE_EXT or EXT4_IOC_MIGRATE for data migration within the OST. It isn't yet clear to me if we want separate ioctl methods for the whole device and per file, or if it is OK to just do the "device" ioctls on the root inode. |
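The single-method idea can be illustrated with a small userspace model. Everything in the sketch below is hypothetical: the `*_model` types, the command numbers and the two backends are made up for the illustration and are not the real Lustre OSD structures or any landed dt_ioctl() signature. It only shows one generic entry point forwarding to an optional per-backend handler and returning -ENOTTY when the backend does not support the command.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Lustre OSD types. */
struct dt_device_model;

struct dt_ops_model {
    /* Optional: NULL when the backend has no ioctl support at all. */
    int (*ioctl)(struct dt_device_model *dev, unsigned int cmd, void *arg);
};

struct dt_device_model {
    const char *name;
    const struct dt_ops_model *ops;
};

/* Single entry point: forward to the backend or report "not supported". */
int dt_ioctl_model(struct dt_device_model *dev, unsigned int cmd, void *arg)
{
    assert(dev != NULL && dev->ops != NULL);

    if (dev->ops->ioctl == NULL)
        return -ENOTTY;
    return dev->ops->ioctl(dev, cmd, arg);
}

/* Made-up command numbers, for the illustration only. */
enum { MODEL_CMD_TRIM = 1, MODEL_CMD_RESIZE = 2 };

/* An ldiskfs-like backend that understands only MODEL_CMD_TRIM. */
int ldiskfs_like_ioctl(struct dt_device_model *dev, unsigned int cmd, void *arg)
{
    (void)dev;

    switch (cmd) {
    case MODEL_CMD_TRIM:
        if (arg != NULL)
            *(int *)arg = 1;    /* pretend the trim ran */
        return 0;
    default:
        return -ENOTTY;         /* command unknown to this backend */
    }
}

/* One backend with a handler, one without any ioctl support. */
const struct dt_ops_model ldiskfs_like_ops = { .ioctl = ldiskfs_like_ioctl };
const struct dt_ops_model zfs_like_ops = { .ioctl = NULL };
```

In this model a new ioctl only needs a new case in the backend handler rather than a new method in the operations table, which is the trade-off against one dedicated method per ioctl described above.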
| Comment by Gerrit Updater [ 13/May/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20161 |
| Comment by Gerrit Updater [ 11/Oct/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/23092 |
| Comment by Gerrit Updater [ 26/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23092/ |
| Comment by James A Simmons [ 03/Oct/18 ] |
|
Linking this since I plan on fixing the ioctl direction issue which will provide a proper interface for this as well. |
| Comment by Andreas Dilger [ 04/Apr/20 ] |
|
It may be that this has been fixed via patch https://review.whamcloud.com/33131 "Subject: However, the last time I had tested this (several years ago) there were also some issues with e2fsprogs/resize2fs being unhappy that the block device (st_rdev) reported by the Lustre server stub mount did not match the underlying block device. That is what patch: http://review.whamcloud.com/20161 " |
| Comment by Gerrit Updater [ 19/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/20161/ |
| Comment by Peter Jones [ 19/May/23 ] |
|
Landed for 2.16. It's been a while since I closed a Jira ticket with a bugzilla id! |