[LU-6629] sanity-benchmark test_bonnie: DQACQ failed with -22 Created: 21/May/15  Updated: 30/Jan/17  Resolved: 18/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: zfs
Environment:

lustre-master build #3029


Issue Links:
Related
Severity: 3

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bcb260f2-fe71-11e4-a865-5254006e85c2.

The sub-test test_bonnie failed with the following error:

test failed to respond and timed out

This may be a dup of LU-4875.

04:45:24:Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ == 04:14:01 (1431922441)
04:45:24:Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1969152kB available, using 3844624kB file size
04:45:24:Lustre: DEBUG MARKER: min OST has 1969152kB available, using 3844624kB file size
04:45:24:LNet: Service thread pid 3517 completed after 77.68s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
04:45:24:LNet: Skipped 8 previous similar messages
04:45:24:LNet: Service thread pid 3556 completed after 84.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
04:45:24:LustreError: 7365:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0004 qtype:grp id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
04:45:24:LustreError: 7365:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 12 previous similar messages
04:45:24:LustreError: 7364:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0004 qtype:grp id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
04:45:24:LustreError: 7364:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 11 previous similar messages
04:45:24:LustreError: 7364:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0004 qtype:grp id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
04:45:24:LustreError: 7364:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 11 previous similar messages
04:45:24:LNet: Service thread pid 29001 completed after 45.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
04:45:24:LNet: Skipped 11 previous similar messages
05:14:26:********** Timeout by autotest system **********


 Comments   
Comment by Andreas Dilger [ 22/May/15 ]

I also see in the MDS logs:

04:15:58:LustreError: 14209:0:(qmt_handler.c:420:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0004_UUID release:1048576 granted:0, total:4194304 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:8533324 soft:8126976 granted:4194304 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
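For context: DQACQ is the quota RPC a quota slave (qsd, running on each OST/MDT) uses to acquire or release quota space from the quota master (qmt), and -22 is -EINVAL. The log above shows the master recording granted:0 for id 500 while the slave asks to release 1048576, i.e. the release is larger than anything the master believes is outstanding. A minimal userspace sketch of that consistency check (all names are hypothetical, not taken from the Lustre source):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* The master's record of how much quota space it has granted a slave. */
struct grant_view {
        uint64_t granted;
};

/* Master-side handling of a release: refuse to release more space than
 * the master believes the slave currently holds. */
static int master_release(struct grant_view *master, uint64_t amount)
{
        if (amount > master->granted)
                return -EINVAL; /* "Release too much!" -> DQACQ fails with -22 */
        master->granted -= amount;
        return 0;
}

int main(void)
{
        struct grant_view master = { .granted = 0 };    /* master's view */
        uint64_t stale_grant = 1048576;                 /* slave's stale view */

        /* Matches the log: release:1048576 granted:0 -> rc = -22 */
        printf("rc = %d\n", master_release(&master, stale_grant));
        return 0;
}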
Comment by Niu Yawei (Inactive) [ 27/May/15 ]

It looks like two slaves (OST4 & OST5) are not synced with the master. I can't see from the log how this happened, but I don't think it is the cause of the "too many service threads, or there were not enough hardware resources" messages.

Comment by Niu Yawei (Inactive) [ 16/Jun/15 ]

There was a defect that could lead to a quota slave reconnecting without invalidating its global locks, which could leave the quota slave and master out of sync in the end. I think this has been fixed by commit 4f53536d002c13886210b672b657795baa067144.
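As a sketch of that failure mode (the names below are illustrative, not Lustre's actual API): if the reconnect path skips invalidation, the slave keeps its pre-reconnect grant while the master has reset its record, producing exactly the release:1048576 vs. granted:0 mismatch seen in the logs.

#include <stdint.h>

/* Illustrative only: state a quota slave caches under its global locks. */
struct qsd_state {
        uint64_t cached_grant;
};

static void slave_reconnect(struct qsd_state *qsd, int invalidate_locks)
{
        if (invalidate_locks)
                qsd->cached_grant = 0;  /* drop stale state, re-fetch from master */
        /* With the defect (invalidate_locks == 0) the slave keeps its old
         * grant, e.g. 1048576, while the master has already reset its own
         * record to 0 -- the next release then fails with -EINVAL. */
}

int main(void)
{
        struct qsd_state qsd = { .cached_grant = 1048576 };
        slave_reconnect(&qsd, 0);       /* buggy path: grant stays stale */
        return 0;
}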

Comment by Niu Yawei (Inactive) [ 23/Jul/15 ]

If the error message "LustreError: (qmt_handler.c:420:qmt_dqacq0()) $$$ Release too much!" is no longer seen on master, I think we can close this ticket.

This should have been fixed by the following change in commit 4f53536d002c13886210b672b657795baa067144:

+       /* Note: lw_client is needed in MDS-MDS failover during update log
+        * processing, so we needs to allow lw_client to be connected at
+        * anytime, instead of only the initial connection */
+       lw_client = (data->ocd_connect_flags & OBD_CONNECT_LIGHTWEIGHT) != 0;
+
        if (lustre_msg_get_op_flags(req->rq_reqmsg) & MSG_CONNECT_INITIAL) {
                mds_conn = (data->ocd_connect_flags & OBD_CONNECT_MDS) != 0;
-               lw_client = (data->ocd_connect_flags &
-                            OBD_CONNECT_LIGHTWEIGHT) != 0;
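The effect of the change, per the quoted comment, is that OBD_CONNECT_LIGHTWEIGHT is now checked on every connect request rather than only when MSG_CONNECT_INITIAL is set, so a lightweight client (such as the lustre-MDT0000-lwp-OST0004 connection seen in the MDS log above) is still treated as lightweight when it reconnects, not only on its initial connection.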
Comment by Niu Yawei (Inactive) [ 18/Nov/16 ]

Patch landed.
