[LU-14596] Ubuntu combined MGT/MDT issue Created: 08/Apr/21  Updated: 24/Jan/22  Resolved: 18/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: ubuntu

Issue Links:
Related
is related to LU-14776 Ubuntu 20.04 HWE support Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Testing on an Ubuntu 20.04 VM with a combined MGT/MDT gives a "No space left on device" error when mounting:

ashehata@lustre03:mgs$ mountmgs
mount.lustre: mount /dev/sdb at /mnt/mgs failed: No space left on device

Apr  8 21:24:00 lustre03 kernel: [  839.547863] LustreError: 1833:0:(fld_index.c:372:fld_index_init()) srv-lustrewt-MDT0000: Can't find "fld" obj -28
Apr  8 21:24:00 lustre03 kernel: [  839.548577] LustreError: 1833:0:(obd_config.c:775:class_setup()) setup lustrewt-MDT0000 failed (-28)
Apr  8 21:24:00 lustre03 kernel: [  839.549055] LustreError: 1833:0:(obd_config.c:2037:class_config_llog_handler()) MGC192.168.122.67@tcp: cfg command failed: rc = -28
Apr  8 21:24:00 lustre03 kernel: [  839.550340] Lustre:    cmd=cf003 0:lustrewt-MDT0000  1:lustrewt-MDT0000_UUID  2:0  3:lustrewt-MDT0000-mdtlov  4:f  
Apr  8 21:24:00 lustre03 kernel: [  839.550340] 
Apr  8 21:24:00 lustre03 kernel: [  839.550354] LustreError: 15c-8: MGC192.168.122.67@tcp: Confguration from log lustrewt-MDT0000 failed from MGS -28. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
Apr  8 21:24:00 lustre03 kernel: [  839.551338] LustreError: 1791:0:(obd_mount_server.c:1423:server_start_targets()) failed to start server lustrewt-MDT0000: -28
Apr  8 21:24:00 lustre03 kernel: [  839.551849] LustreError: 1791:0:(obd_mount_server.c:2058:server_fill_super()) Unable to start targets: -28
Apr  8 21:24:00 lustre03 kernel: [  839.552493] LustreError: 1791:0:(obd_config.c:828:class_cleanup()) Device 5 not setup
Apr  8 21:24:06 lustre03 kernel: [  845.593726] Lustre: 1791:0:(client.c:2312:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1617917040/real 1617917040]  req@00000000984dbaa9 x1696508945630016/t0(0) o251->MGC192.168.122.67@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1617917046 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
Apr  8 21:24:06 lustre03 kernel: [  845.607482] Lustre: server umount lustrewt-MDT0000 complete
Apr  8 21:24:06 lustre03 kernel: [  845.607487] LustreError: 1791:0:(obd_mount.c:1760:lustre_fill_super()) Unable to mount  (-28)

With a separate MGS, the target seems to mount properly.



 Comments   
Comment by David Bestor [ 16/Aug/21 ]

Ubuntu 20.04:
Linux builder20041 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre built from git at various checkouts, 2.14.51–2.14.53.

I see the same issue when using "llmount.sh" from /usr/lib/lustre/tests.
By default, llmount.sh uses a combined MGS/MDS.

Setup mgs, mdt, osts
Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: No space left on device

It's related to this commit:
LU-14388 utils: always enable ldiskfs project quota
Commit : 79642e08969eb4455bd8e23574b76f0a84d4db23

I get the same "No space" error with any checkout from 2.14.51 to 2.14.53.

If I revert it, rerun make, and copy just the rebuilt mount_osd_ldiskfs.so to
/usr/lib/lustre/mount_osd_ldiskfs.so
the error goes away.

tune2fs 1.46.2.wc3 (18-Jun-2021)
Filesystem volume name: lustre:MDT0000
Last mounted on: /
Filesystem UUID: 9bf2ed76-8620-45e3-9772-7ca2096b613e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype flex_bg ea_inode dirdata large_dir sparse_super large_file huge_file uninit_bg dir_nlink quota project
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 100000
Block count: 62500
Reserved block count: 2809
Overhead clusters: 29188
Free blocks: 33046
Free inodes: 99810
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 26
Blocks per group: 18720
Fragments per group: 18720
Inodes per group: 25000
Inode blocks per group: 6250
Flex block group size: 16
Filesystem created: Mon Aug 16 14:02:09 2021
Last mount time: Mon Aug 16 14:02:12 2021
Last write time: Mon Aug 16 14:02:20 2021
Mount count: 3
Maximum mount count: -1
Last checked: Mon Aug 16 14:02:09 2021
Check interval: 0 (<none>)
Lifetime writes: 19 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 1024
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 1f675fc3-fdf1-496b-8f6c-f16d6026acef
Journal backup: inode blocks
User quota inode: 3
Group quota inode: 4
Project quota inode: 12

Comment by James A Simmons [ 16/Aug/21 ]

What is the MDSSIZE / MGSSIZE in your test/cfg/***.sh file?

Comment by David Bestor [ 16/Aug/21 ]

I haven't changed anything, so the defaults in local.sh:

MDSSIZE=${MDSSIZE:-250000}
MGSSIZE=${MGSSIZE:-$MDSSIZE}
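For reference, the `${VAR:-default}` expansion used by local.sh falls back to the default only when the variable is unset or empty — a minimal sketch:

```shell
# Minimal sketch of the ${VAR:-default} expansion used by the test config:
# the default applies only when the variable is unset or empty.
unset MDSSIZE MGSSIZE
MDSSIZE=${MDSSIZE:-250000}      # default MDT size
MGSSIZE=${MGSSIZE:-$MDSSIZE}    # MGS follows the MDT size unless overridden
echo "MDSSIZE=$MDSSIZE MGSSIZE=$MGSSIZE"   # prints MDSSIZE=250000 MGSSIZE=250000
```

So exporting MDSSIZE before running llmount.sh overrides both defaults at once.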

Comment by James A Simmons [ 16/Aug/21 ]

If you add a zero, does it work? Just want to see if it's a general problem or a configuration issue.

Comment by David Bestor [ 16/Aug/21 ]

Same error. Just for giggles, I mounted it as ldiskfs after the failure:

Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: No space left on device
root@builder20041:/usr/lib/lustre/tests# ls -lah /tmp/lustre-mdt1
-rw-r--r-- 1 root root 2.4G Aug 16 15:33 /tmp/lustre-mdt1
root@builder20041:/usr/lib/lustre/tests# mount -t ldiskfs /dev/mapper/mds1_flakey /mnt/lustre-mds1
root@builder20041:/usr/lib/lustre/tests# df -h |grep lustre
/dev/mapper/mds1_flakey 1.4G 1.2M 1.3G 1% /mnt/lustre-mds1
root@builder20041:/usr/lib/lustre/tests# umount /mnt/lustre-mds1

Comment by David Bestor [ 16/Aug/21 ]

Not that it matters, but if I remove the project feature after llmount.sh fails, it then mounts:

root@builder20041:/usr/lib/lustre/tests# mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre failed: No space left on device
root@builder20041:/usr/lib/lustre/tests# tune2fs -O ^project /dev/mapper/mds1_flakey
tune2fs 1.46.2.wc3 (18-Jun-2021)
root@builder20041:/usr/lib/lustre/tests# mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre
root@builder20041:/usr/lib/lustre/tests# umount /mnt/lustre
root@builder20041:/usr/lib/lustre/tests# tune2fs -O project /dev/mapper/mds1_flakey
tune2fs 1.46.2.wc3 (18-Jun-2021)
root@builder20041:/usr/lib/lustre/tests# mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre

Comment by David Bestor [ 16/Aug/21 ]

Tried an older kernel:

Linux builder20041 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre: Lustre: Build Version: 2.14.51

Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: No space left on device

dmesg:
[ 6033.482723] libcfs: loading out-of-tree module taints kernel.
[ 6033.483252] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 6033.488964] LNet: HW NUMA nodes: 1, HW CPU cores: 4, npartitions: 1
[ 6034.232729] Key type ._llcrypt registered
[ 6034.232730] Key type .llcrypt registered
[ 6034.247644] Lustre: DEBUG MARKER: builder20041: executing set_hostid
[ 6034.405954] Lustre: Lustre: Build Version: 2.14.51
[ 6034.483310] LNet: Added LNI 172.21.69.136@tcp [8/256/0/180]
[ 6034.485072] LNet: Accept secure, port 988
[ 6034.637015] Lustre: Echo OBD driver; http://www.lustre.org/
[ 6035.434814] LDISKFS-fs (loop6): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[ 6036.369277] LDISKFS-fs (loop6): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[ 6037.398524] LDISKFS-fs (loop6): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[ 6037.595444] blk_update_request: I/O error, dev loop6, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[ 6037.599940] blk_update_request: I/O error, dev loop6, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[ 6038.157138] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[ 6038.343717] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 6039.625522] Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
[ 6040.160134] LustreError: 1157177:0:(fld_index.c:372:fld_index_init()) srv-lustre-MDT0000: Can't find "fld" obj -28
[ 6040.161495] LustreError: 1157177:0:(obd_config.c:775:class_setup()) setup lustre-MDT0000 failed (-28)
[ 6040.161570] LustreError: 1157177:0:(obd_config.c:2037:class_config_llog_handler()) MGC172.21.69.136@tcp: cfg command failed: rc = -28
[ 6040.161673] Lustre: cmd=cf003 0:lustre-MDT0000 1:lustre-MDT0000_UUID 2:0 3:lustre-MDT0000-mdtlov 4:f

[ 6040.161760] LustreError: 15c-8: MGC172.21.69.136@tcp: Confguration from log lustre-MDT0000 failed from MGS -28. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[ 6040.161881] LustreError: 1156915:0:(obd_mount_server.c:1422:server_start_targets()) failed to start server lustre-MDT0000: -28
[ 6040.162074] LustreError: 1156915:0:(obd_mount_server.c:2023:server_fill_super()) Unable to start targets: -28
[ 6040.162216] LustreError: 1156915:0:(obd_config.c:828:class_cleanup()) Device 5 not setup
[ 6040.162869] LustreError: 1156945:0:(ldlm_lockd.c:2492:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1629152945 with bad export cookie 4995169906271046744
[ 6040.163121] LustreError: 166-1: MGC172.21.69.136@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[ 6040.240326] Lustre: server umount lustre-MDT0000 complete
[ 6040.240331] LustreError: 1156915:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -28

Comment by David Bestor [ 16/Aug/21 ]

Another point of reference: similar errors on 2.14.0 if I add "project" manually before the mount attempt.

Linux builder20041 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre: Lustre: Build Version: 2.14.0_dirty

dd if=/dev/zero of=/tmp/mdt.img bs=100M count=10
mkfs.lustre --mgs --mdt --reformat --fsname ubuntu --index 0 --mgsnode 172.21.69.136@tcp --servicenode 172.21.69.136@tcp --device-size=100000 /tmp/mdt.img
tune2fs -O project /tmp/mdt.img
tune2fs -l /tmp/mdt.img|grep features
Filesystem features: has_journal ext_attr resize_inode dir_index filetype mmp flex_bg ea_inode dirdata large_dir sparse_super large_file huge_file uninit_bg dir_nlink quota project
mount -t lustre -o loop /tmp/mdt.img /b
mount.lustre: mount /dev/loop6 at /b failed: No space left on device
tune2fs -O ^project /tmp/mdt.img
mount -t lustre -o loop /tmp/mdt.img /b
mount|grep lustre
/tmp/mdt.img on /b type lustre (ro,svname=ubuntu-MDT0000,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)

Same "No space left" error using /dev/vdb if you add "project" before mounting.

Model: Virtio Block Device (virtblk)
Disk /dev/vdb: 21.0GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number Start End Size File system Flags
1 0.00B 21.0GB 21.0GB ext4

Comment by James A Simmons [ 17/Aug/21 ]

I wonder if this is an ldiskfs issue. Have you tried ZFS?

Comment by David Bestor [ 18/Aug/21 ]

I have given up for now. I'll skip trying ZFS, since the default tests like llmount.sh don't use it without some configuration changes. For my testing, 2.14.0 works fine for now; I'll revisit later when I no longer have a functional Lustre on Ubuntu 20.04.

2.14.0 doesn't add the project feature on first mount, so there is no issue on that branch currently.

I was able to go back to 2.13.55, but I still had issues with adding "project" before first mount on a combined MDS/MGS device (loop or virtual device).

2.13.56 needed a fix, or mkfs.lustre would give error 22 (EINVAL):
git cherry-pick 9cd651aead327ae4589b58dde5818b068c89b3e5

2.13.55 needed a few more fixes:
git cherry-pick 9cd651aead327ae4589b58dde5818b068c89b3e5
git cherry-pick 1b4054790b88fa046af6dd488d4d4e643154c22d
git cherry-pick 03e6db505be90d35ccacb3af7e15277784e5d448
git cherry-pick af2f77633bf7b12d6ca1ab606ff90cf1ee58107a

But both gave the same "No space left on device" error.

On the other hand: with anything past 2.14.50, I need to revert the following, or a combined MGS/MDS gets out-of-space errors:
LU-14388 utils: always enable ldiskfs project quota
Commit: 79642e08969eb4455bd8e23574b76f0a84d4db23

With anything past 2.14.51, I also need to revert the following, or the mount fails with errors and the osp module hangs on first mount, requiring a reboot to clear:
LU-14430 mdt: fix maximum ACL handling
Commit: aa92caa21fa2a4473dce5889de7fcd17e171c1a0

I'm not sure if the two are related, but until the first gets fixed I won't know. I'm sure at some point I'll be unable to revert both of these, but hopefully by then whatever is wrong/different with the Ubuntu 20.04 kernels gets fixed.

For release 2.14.0: waiting on the next Ubuntu point release to submit a bug report (I think both fixes are in master already).
5.4.0-73 needed patching to compile:
Patched patches/ubuntu20/ext4-pdirop.patch
5.4.0-81 needed more patching:
Patched patches/rhel8/ext4-simple-blockalloc.patch
Patched patches/ubuntu20/ext4-pdirop.patch

Like I said, 2.14.0 (or master with the two reverts) is fine for now. It's a simple virt-install with 8 GB of RAM and 4 virtual CPUs on a 20.04.1 install with everything updated (5.4.0-81-generic #91-Ubuntu). Nothing exotic, so I'm not sure where the problem is or how to fix it.

Comment by James A Simmons [ 13/Dec/21 ]

I think Andreas found the problem. I'm working on a patch. Good news is I can duplicate the problem.

Comment by Gerrit Updater [ 02/Jan/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45960
Subject: LU-14596 ldiskfs: Fix mounting issues for newer kernels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3e991ef937d95899285354108853219b4c4789a

Comment by James A Simmons [ 03/Jan/22 ]

So I tracked down the reason for this failure. By default on Ubuntu 20.04 (5.4 and 5.8 kernels), the kernel's quota code is built as modules, which are shipped in the linux-modules-extra-$(uname) package — and that package is not installed by default. I updated the Debian packaging for Lustre to pull in the linux-generic package, which should install the correct modules. Over the weekend I have been testing with a 5.8 kernel for Ubuntu, and it works like a charm with ldiskfs. If you just grab and build Lustre for testing on Ubuntu, you will need to install the linux-modules-extra-$(uname) package yourself. The patch also addresses another issue Andreas caught, and thankfully it didn't break Ubuntu server support.
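Given that diagnosis, a quick pre-mount check can tell you whether the quota modules are present — a minimal sketch, assuming Ubuntu's package naming convention where the kernel release string (`uname -r`) forms the package suffix:

```shell
# Hedged sketch: on Ubuntu 20.04 the kernel quota code (e.g. quota_v2) is
# built as modules shipped in linux-modules-extra-<kernel release>, which is
# not installed by default. Check before mounting an ldiskfs target that has
# the "project" feature enabled. Package name assumes Ubuntu's convention.
pkg="linux-modules-extra-$(uname -r)"
if modinfo quota_v2 >/dev/null 2>&1; then
    echo "quota_v2 module available"
else
    echo "quota_v2 missing; try: sudo apt install $pkg"
fi
```

Installing the linux-generic metapackage, as the updated Debian packaging does, pulls in the matching linux-modules-extra package automatically.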

Comment by Gerrit Updater [ 18/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45960/
Subject: LU-14596 ldiskfs: Fix mounting issues for newer kernels
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 32c4b80192652f55bcef5786e4ec683e85234c04

Generated at Sat Feb 10 03:11:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.