[LU-14596] Ubuntu combined MGT/MDT issue Created: 08/Apr/21 Updated: 24/Jan/22 Resolved: 18/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ubuntu | ||
| Description |
|
Testing on an Ubuntu 20.04 VM with a combined MGT/MDT gives a "No space left on device" error when mounting:

ashehata@lustre03:mgs$ mountmgs
mount.lustre: mount /dev/sdb at /mnt/mgs failed: No space left on device

Apr 8 21:24:00 lustre03 kernel: [ 839.547863] LustreError: 1833:0:(fld_index.c:372:fld_index_init()) srv-lustrewt-MDT0000: Can't find "fld" obj -28
Apr 8 21:24:00 lustre03 kernel: [ 839.548577] LustreError: 1833:0:(obd_config.c:775:class_setup()) setup lustrewt-MDT0000 failed (-28)
Apr 8 21:24:00 lustre03 kernel: [ 839.549055] LustreError: 1833:0:(obd_config.c:2037:class_config_llog_handler()) MGC192.168.122.67@tcp: cfg command failed: rc = -28
Apr 8 21:24:00 lustre03 kernel: [ 839.550340] Lustre: cmd=cf003 0:lustrewt-MDT0000 1:lustrewt-MDT0000_UUID 2:0 3:lustrewt-MDT0000-mdtlov 4:f
Apr 8 21:24:00 lustre03 kernel: [ 839.550340]
Apr 8 21:24:00 lustre03 kernel: [ 839.550354] LustreError: 15c-8: MGC192.168.122.67@tcp: Confguration from log lustrewt-MDT0000 failed from MGS -28. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
Apr 8 21:24:00 lustre03 kernel: [ 839.551338] LustreError: 1791:0:(obd_mount_server.c:1423:server_start_targets()) failed to start server lustrewt-MDT0000: -28
Apr 8 21:24:00 lustre03 kernel: [ 839.551849] LustreError: 1791:0:(obd_mount_server.c:2058:server_fill_super()) Unable to start targets: -28
Apr 8 21:24:00 lustre03 kernel: [ 839.552493] LustreError: 1791:0:(obd_config.c:828:class_cleanup()) Device 5 not setup
Apr 8 21:24:06 lustre03 kernel: [ 845.593726] Lustre: 1791:0:(client.c:2312:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1617917040/real 1617917040] req@00000000984dbaa9 x1696508945630016/t0(0) o251->MGC192.168.122.67@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1617917046 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
Apr 8 21:24:06 lustre03 kernel: [ 845.607482] Lustre: server umount lustrewt-MDT0000 complete
Apr 8 21:24:06 lustre03 kernel: [ 845.607487] LustreError: 1791:0:(obd_mount.c:1760:lustre_fill_super()) Unable to mount (-28)

Having a separate MGS seems to mount properly. |
| Comments |
| Comment by David Bestor [ 16/Aug/21 ] |
|
Ubuntu 20.04 ... I see the same issue when using "llmount.sh" from /usr/lib/lustre/tests
Setup mgs, mdt, osts
It's related to this commit :
I get the same "No space" error with any checkout from 2.14.51 to 2.14.53. If I revert it, then rerun make, and then copy just the new mount_osd_ldiskfs.so to here:
tune2fs 1.46.2.wc3 (18-Jun-2021) |
| Comment by James A Simmons [ 16/Aug/21 ] |
|
What is the MDSSIZE / MGSSIZE in your test/cfg/***.sh file? |
| Comment by David Bestor [ 16/Aug/21 ] |
|
I haven't changed anything, so the default in local.sh?
MDSSIZE=${MDSSIZE:-250000} |
| Comment by James A Simmons [ 16/Aug/21 ] |
|
If you add a zero, does it work? I just want to see if it's a general problem or a configuration issue. |
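The MDSSIZE line quoted above uses the shell's default-value expansion, so "adding a zero" should not require editing local.sh. A minimal sketch, assuming the variable is picked up from the environment as that line suggests; resolve_mdssize is a hypothetical helper used only for illustration and is not part of the Lustre test framework:

```shell
# Sketch of the default-value expansion quoted from local.sh.
# resolve_mdssize is a hypothetical name, not a test-framework function.
resolve_mdssize() {
    # Keep a caller-supplied MDSSIZE; otherwise fall back to 250000.
    MDSSIZE=${MDSSIZE:-250000}
    printf '%s\n' "$MDSSIZE"
}
```

Under that assumption, something like `MDSSIZE=2500000 sh llmount.sh` (run from /usr/lib/lustre/tests) would override the default for a single run.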
| Comment by David Bestor [ 16/Aug/21 ] |
|
Same error. Just for giggles, mounted as ldiskfs after the failure...
Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 |
| Comment by David Bestor [ 16/Aug/21 ] |
|
Not that it matters, but if I remove project after llmount.sh fails, it then mounts.
root@builder20041:/usr/lib/lustre/tests# mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre |
| Comment by David Bestor [ 16/Aug/21 ] |
|
Tried an older kernel.
Linux builder20041 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
dmesg:
[ 6040.161760] LustreError: 15c-8: MGC172.21.69.136@tcp: Confguration from log lustre-MDT0000 failed from MGS -28. Communication error |
| Comment by David Bestor [ 16/Aug/21 ] |
|
Another point of reference: similar errors if I add "project" manually on 2.14.0 before the mount attempt.
Linux builder20041 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre: Lustre: Build Version: 2.14.0_dirty
dd if=/dev/zero of=/tmp/mdt.img bs=100M count=10
Same "No space left" using /dev/vdb if you add "project" before mounting.
Model: Virtio Block Device (virtblk)
Number Start End Size File system Flags |
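To confirm whether the "project" feature is actually set on a target before a mount attempt, the feature list in the superblock can be inspected. A minimal sketch, assuming the header printed by `dumpe2fs -h` contains a "Filesystem features:" line; has_fs_feature is a hypothetical helper name, not an e2fsprogs or Lustre command:

```shell
# Hypothetical helper: succeed if the named feature appears in the
# "Filesystem features:" line read from stdin (e.g. dumpe2fs -h output).
has_fs_feature() {
    # $1: feature name to look for, such as "project"
    sed -n 's/^Filesystem features:[[:space:]]*//p' |
        tr -s ' ' '\n' |
        grep -qx "$1"
}
```

For example, `dumpe2fs -h /dev/vdb 2>/dev/null | has_fs_feature project && echo "project is set"` would report the state of the device mentioned in the comment above.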
| Comment by James A Simmons [ 17/Aug/21 ] |
|
I wonder if this is an ldiskfs issue. Have you tried ZFS? |
| Comment by David Bestor [ 18/Aug/21 ] |
|
I have given up for now. I'll skip trying ZFS for now.
2.14.0 doesn't add the project on first mount, so there is no issue on that branch currently.
I was able to go back to 2.13.55, but I still had issues:
2.13.56 needed a fix or mkfs.lustre would give error #22
2.13.55 needed a few more fixes
But both gave the same error, "No space left on device".
On the other side: with anything past 2.14.51, I also need to revert the
I'm not sure if the two are related, but until the first gets fixed I won't
For release 2.14.0: waiting on the next Ubuntu point release to submit a bug report (I think both are in master already).
Like I said, 2.14.0 (or master with the two reverts) is fine. |
| Comment by James A Simmons [ 13/Dec/21 ] |
|
I think Andreas found the problem. I'm working on a patch. Good news is I can duplicate the problem. |
| Comment by Gerrit Updater [ 02/Jan/22 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45960 |
| Comment by James A Simmons [ 03/Jan/22 ] |
|
So I tracked down the reason for this failure. By default on Ubuntu 20 (5.4 and 5.8 kernels), the kernel's quota code is built as modules, which are shipped in the package linux-modules-extra-$(uname -r). That package is not installed by default. I updated the Debian packaging for Lustre to pull in the linux-generic package, which should install the correct modules. Over the weekend I tested with a 5.8 kernel for Ubuntu and it works like a charm with ldiskfs. If you just grab and build Lustre for testing on Ubuntu, you will need to install the linux-modules-extra-$(uname -r) package yourself. The patch also addresses another issue Andreas caught, which thankfully didn't break Ubuntu server support. |
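For anyone building Lustre by hand on Ubuntu, the check described above can be sketched as follows. This is an illustration only: the module names quota_v2 and quota_tree are an assumption about how the kernel's quota code is split into modules, and both function names are hypothetical. The package suffix is taken to be the running kernel release as printed by `uname -r`.

```shell
# Assumption: these are the quota modules ldiskfs needs; the ticket
# itself does not name them.
QUOTA_MODULES="quota_v2 quota_tree"

# Print the name of the Ubuntu package that carries the extra modules.
extra_pkg_name() {
    # $1: kernel release string, e.g. the output of `uname -r`
    printf 'linux-modules-extra-%s\n' "$1"
}

# Print each quota module that has no .ko file under the module tree.
missing_quota_modules() {
    # $1: module directory, normally /lib/modules/$(uname -r)
    for mod in $QUOTA_MODULES; do
        if ! find "$1" -name "${mod}.ko*" 2>/dev/null | grep -q .; then
            printf '%s\n' "$mod"
        fi
    done
}
```

A possible use: if `missing_quota_modules "/lib/modules/$(uname -r)"` prints anything, install the package named by `extra_pkg_name "$(uname -r)"` (for example with apt) before retrying the mount.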
| Comment by Gerrit Updater [ 18/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45960/ |