Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 1.8.6
-
None
-
x86_64, CentOS5, 2.6.18-194.17.1.el5_lustre.1.8.5, OFED 1.5.2, 4 OSS nodes, 4 8TB OSTs/OSS, 700 clients (some o2ib, some tcp)
-
3
-
10141
Description
We recently deployed a new filesystem with quotas enabled. In the couple of weeks it has been in production, we've gotten over 1200 of these messages:
Lustre: 16290:0:(quota_interface.c:460:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
Some days we get none, or very few, and some days we might get 50-100. The MDS has very little load on it. We're not aware of an operational problem associated with the above messages - no one has complained to us about I/O or quota problems. But we'd like to solve whatever issue is causing these messages.
One strange thing: whenever we get one of the above messages, it is always on the 10th retry, and err and rc are always zero. It seems odd to me that the 10th call to acquire() would always succeed after failing 9 times in a row.
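One way the "always on the 10th retry" pattern could arise, offered purely as an assumption about the retry logic and not as a reading of the actual quota_interface.c source: if the warning is printed inside the retry loop only when the retry counter hits a multiple of 10, then shorter runs of retries never reach syslog at all. The sketch below models that idea; `acquire_with_retries`, `failing_n_times`, `warn_every` and `max_cycles` are hypothetical names invented for illustration.

```python
def acquire_with_retries(try_acquire, warn_every=10, max_cycles=100):
    """Retry try_acquire() until it returns True; collect warnings.

    Hypothetical model, not the Lustre implementation: a warning is
    recorded only when the retry counter reaches a multiple of
    warn_every, so runs that finish sooner leave no warning behind.
    """
    warnings = []
    cycle = 0
    while cycle < max_cycles:
        if try_acquire():
            return cycle, warnings
        cycle += 1
        if cycle % warn_every == 0:
            warnings.append(f"still not acquired after {cycle} retries")
    return cycle, warnings


def failing_n_times(n):
    """Return a callable that fails n times and then succeeds."""
    state = {"calls": 0}

    def attempt():
        state["calls"] += 1
        return state["calls"] > n

    return attempt


if __name__ == "__main__":
    for failures in (3, 9, 10):
        retries, warnings = acquire_with_retries(failing_n_times(failures))
        print(failures, retries, len(warnings))
```

Under this model a run of 9 failed attempts is silent, while a run of 10 produces exactly one message. It would not by itself explain why err and rc are both zero in the message.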
Attachments
Issue Links
Our quota_chk_acq_common() messages seem to have stopped for the time being - none since Saturday.
Before they stopped, I did some tracing with the lustre debug_daemon, as you suggested. Since we can't trigger the problem, I had to wait for the quota_chk_acq_common() messages to happen "naturally". I turned on the debug_daemon to start tracing, and when I saw the quota_chk_acq_common() message in syslog, I stopped the debug_daemon and looked at the decoded debug file. A 10GB debug file held anywhere from 15 seconds to a couple of minutes of debug output.
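The manual "watch syslog, then stop the daemon" step could be automated along these lines. This is a minimal sketch under stated assumptions: the syslog path `/var/log/messages` is assumed (it is not given above), the regular expression is modeled on the message text quoted in the description, and the helper names `first_match`, `follow` and `stop_on_message` are made up for this example. The only Lustre command it runs is `lctl debug_daemon stop`.

```python
import re
import subprocess
import time

# Modeled on the syslog message quoted in the description.
PATTERN = re.compile(r"quota_chk_acq_common\(\)\).*after \d+ retries")


def first_match(lines, pattern=PATTERN):
    """Return the first line matching pattern, or None if none does."""
    for line in lines:
        if pattern.search(line):
            return line
    return None


def follow(path, poll_seconds=1.0):
    """Yield lines appended to path after we start, similar to `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def stop_on_message(syslog_path="/var/log/messages"):
    """Block until the warning appears, then stop the debug daemon."""
    hit = first_match(follow(syslog_path))
    subprocess.run(["lctl", "debug_daemon", "stop"], check=True)
    return hit
```

Calling `stop_on_message()` right after starting the debug_daemon would stop the trace within about a second of the message reaching syslog, which keeps as much as possible of the relevant window inside the debug file.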
Is there a timing issue between when something is logged in the syslog and when the corresponding debug info gets logged by the Lustre debug daemon? Even though I'd sit and watch the syslog, I never saw a trace message that corresponded to the syslog "10 retries" messages. But I did see traces that had several retries.
I could not really learn anything new by looking at the debug_daemon output, though. Is there something you can suggest that I should look for?
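One way to narrow down a large decoded debug file is to keep only the entries from quota_chk_acq_common() for the PID that syslog reported (16290 in the message above) and compare their timestamps with the syslog time. The sketch below assumes each decoded line has the colon-separated layout `subsystem:mask:cpu:seconds.microseconds:stack:pid:extpid:(file:line:function()) message`; that layout is an assumption and should be checked against a real decoded file before relying on it. The names `LINE_RE` and `entries_for` are made up for this example.

```python
import re

# Assumed layout of one decoded debug line (verify against real output).
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-fA-F]+):(?P<mask>[0-9a-fA-F]+):(?P<cpu>[^:]*):"
    r"(?P<ts>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def entries_for(lines, func="quota_chk_acq_common", pid=None):
    """Return (timestamp, pid, message) for entries logged by func.

    Lines that do not fit the assumed layout are skipped. If pid is
    given, only entries from that process ID are kept.
    """
    out = []
    for raw in lines:
        m = LINE_RE.match(raw.rstrip("\n"))
        if m is None or m.group("func") != func:
            continue
        entry_pid = int(m.group("pid"))
        if pid is not None and entry_pid != pid:
            continue
        out.append((float(m.group("ts")), entry_pid, m.group("msg")))
    return out
```

With the decoded file opened as `f`, `entries_for(f, pid=16290)` would list that thread's entries in file order. A gap between the last timestamp in the file and the syslog time of the "10 retries" message would suggest the trace was stopped, or had wrapped, before the matching entry was written.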