[LU-10238] adding new OSTs causes quota reporting error Created: 13/Nov/17  Updated: 02/Feb/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Liam Forbes Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7 servers
kernel-3.10.0-514.21.1.el7_lustre.x86_64
lustre-2.10.0-1.el7.x86_64
lustre-dkms-2.10.0-1.el7.noarch
lustre-osd-zfs-mount-2.10.0-1.el7.x86_64
lustre-resource-agents-2.10.0-1.el7.x86_64
CentOS 6 clients
lustre-client-2.10.0-1.el6.x86_64
lustre-client-dkms-2.10.0-1.el6.noarch
ZFS for OSTs & MDT
libzfs2-0.7.3-1.el7_3.x86_64
libzfs2-devel-0.7.3-1.el7_3.x86_64
zfs-0.7.3-1.el7_3.x86_64
zfs-dkms-0.7.3-1.el7_3.noarch
zfs-release-1-4.el7_3.centos.noarch
DKMS kernel modules


Attachments: File mdsLogs.tar.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have a Lustre 2.10.0 filesystem that was built with two OSSes containing 5 OSTs each. Last week I added a third OSS (exactly the same hardware, with slightly newer OS software except for the kernel and Lustre). After I created its OSTs with mkfs.lustre, the filesystem appeared to grow correctly. We currently set and enforce only group quotas.

Later that day, we noticed the output of `lfs quota -g $GROUP /center1` was showing bad values and an error message. Here's an example:

chinook02:PENGUIN$ sudo lfs quota -g penguin /center1
Disk quotas for grp penguin (gid 12738):
Filesystem kbytes quota limit grace files quota limit grace
/center1 [214] 1073741824 1181116006 - 13 0 0 -
Some errors happened when getting quota info. Some devices may be not working or deactivated. The data in "[]" is inaccurate.

We found a workaround. As soon as the group has data written to the new OSTs, `lfs quota` seems to work fine.

chinook02:PENGUIN$ lfs setstripe -i -1 -c -1 loforbes
chinook02:PENGUIN$ dd of=loforbes/testfile if=/dev/urandom bs=1M count=15
15+0 records in
15+0 records out
15728640 bytes (16 MB) copied, 1.80694 s, 8.7 MB/s
chinook02:PENGUIN$ sudo lfs quota -g penguin /center1
Disk quotas for grp penguin (gid 12738):
Filesystem kbytes quota limit grace files quota limit grace
/center1 671997883 1073741824 1181116006 - 13 0 0 -
chinook02:PENGUIN$ lfs getstripe loforbes/testfile
loforbes/testfile
lmm_stripe_count: 15
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 12
obdidx objid objid group
12 31981 0x7ced 0
7 62233208 0x3b59a78 0
14 32068 0x7d44 0
8 72183233 0x44d6dc1 0
10 31854 0x7c6e 0
11 31849 0x7c69 0
2 68917015 0x41b9717 0
5 71171215 0x43dfc8f 0
1 69395583 0x422e47f 0
13 32088 0x7d58 0
9 68211489 0x410d321 0
6 70389457 0x4320ed1 0
4 70225352 0x42f8dc8 0
3 66783438 0x3fb08ce 0
0 65674625 0x3ea1d81 0

We figured out that it isn't necessary to have data on the 10 original OSTs; having data on just the 5 new ones is enough for this to work. I've implemented this workaround for all projects using our Lustre filesystem.
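For reference, the per-project workaround described above could be scripted roughly as follows. This is only a sketch: the `.quota_seed` filename, the `FSROOT` default, and the project path are illustrative, and with `DRY_RUN=1` (the default) the script only prints the Lustre commands instead of running them.

```shell
#!/bin/sh
# Sketch of the workaround: for a given project directory, create a file
# striped across every OST (-c -1) and write a little data into it, so the
# owning group accumulates usage on the newly added OSTs.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.

run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

seed_group_usage() {
    dir="$1"
    # Stripe the seed file across all OSTs, including the new ones.
    run lfs setstripe -c -1 "$dir/.quota_seed"
    # Write enough data that every OST object records some usage.
    run dd if=/dev/urandom of="$dir/.quota_seed" bs=1M count=15
}

seed_group_usage "${FSROOT:-/center1}/projects/penguin"
```

Once the group owning each project directory has objects on the new OSTs, `lfs quota -g $GROUP` should report clean values again, matching the behaviour observed above.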

Before implementing the workaround, we tried "deleting" a group's quota and recreating it, but that had no effect on the issue. We also tried unmounting and remounting the filesystem on a client; again, no change. Removing all of a group's files that have data on the new OSTs makes `lfs quota` show the error again.

We are considering a Lustre 2.10.1 update sometime soon.

Regards,
-liam



 Comments   
Comment by James Nunez (Inactive) [ 20/Dec/17 ]

Hongchao -

Would you please look into this issue?

Thank you

Comment by Hongchao Zhang [ 29/Dec/17 ]

Hi Liam,

I can't reproduce the issue in my local VMs. Could you please attach the logs (syslog and debug log) from when the issue occurred?
Thanks!

By the way, please add the quota subsystem to the debug log with "lctl set_param debug=+quota".

Comment by Liam Forbes [ 22/Jan/18 ]

Hongchao,

I'm attaching the syslog file from the two days when we added the new OSS (oss09) to the filesystem. Unfortunately, I can't say exactly what time that occurred. Also, I don't seem to have the syslogs from that OSS for those days either.

Here are the system logs that occur when we get the error message in the `lfs quota` output.

From a client:
Jan 22 13:51:13 chinook02 kernel: LustreError: 30907:0:(osc_quota.c:291:osc_quotactl()) ptlrpc_queue_wait failed, rc: -2
Jan 22 13:51:13 chinook02 kernel: LustreError: 30907:0:(osc_quota.c:291:osc_quotactl()) Skipped 4 previous similar messages

No messages occur on the MDS or OSS. Could this be an LNET issue?

Regards,
-liam

mdsLogs.tar.gz

Comment by Hongchao Zhang [ 02/Feb/18 ]

Hi Liam,

The issue is related to the OSS. Could you please query the quota usage of a non-existing group (say, 20000) on your site to check whether the issue can be triggered, and if so, collect the logs on the OSS (debug log by running 'lctl dk > log_file')? Thanks!
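Putting the two log-collection suggestions from this thread together, the steps might look like the sketch below. The group ID 20000 comes from the comment above; the output path is illustrative, and with `DRY_RUN=1` (the default) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the debug-log collection steps suggested in this ticket.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.

run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Add the quota subsystem to the Lustre debug mask (on the OSS).
run lctl set_param debug=+quota
# Query a non-existing group to try to trigger the error (on a client).
run lfs quota -g 20000 /center1
# Dump the kernel debug buffer to a file (on the OSS).
run sh -c 'lctl dk > /tmp/oss-quota-debug.log'
```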

Generated at Sat Feb 10 02:33:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.