LU-7926

MDS sits idle with extreme slow response to clients

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Affects Version/s: Lustre 2.5.3
    • Severity: 4

    Description

      Our MDS became idle with no errors on the console or in the logs. It responds extremely slowly (it takes minutes just to do an ls). This has happened twice within 24 hours. We took a crash dump in both cases. The crash dumps show all MDT threads sitting in 'qsd_op_begin',
      like this:

      PID: 14754  TASK: ffff88400a83c040  CPU: 7   COMMAND: "mdt01_002"
       #0 [ffff883f634cd500] schedule at ffffffff81565692
       #1 [ffff883f634cd5d8] schedule_timeout at ffffffff81566572
       #2 [ffff883f634cd688] qsd_op_begin at ffffffffa0d04909 [lquota]
       #3 [ffff883f634cd738] osd_declare_qid at ffffffffa0d88449 [osd_ldiskfs]
       #4 [ffff883f634cd798] osd_declare_inode_qid at ffffffffa0d88702 [osd_ldiskfs]
       #5 [ffff883f634cd7f8] osd_declare_object_create at ffffffffa0d65d53 [osd_ldiskfs]
       #6 [ffff883f634cd858] lod_declare_object_create at ffffffffa0f4b482 [lod]
       #7 [ffff883f634cd8b8] mdd_declare_object_create_internal at ffffffffa0fa78cf [mdd]
       #8 [ffff883f634cd918] mdd_declare_create at ffffffffa0fbb4ce [mdd]
       #9 [ffff883f634cd988] mdd_create at ffffffffa0fbc631 [mdd]
      #10 [ffff883f634cda88] mdo_create at ffffffffa0e88058 [mdt]
      #11 [ffff883f634cda98] mdt_reint_open at ffffffffa0e923f4 [mdt]
      #12 [ffff883f634cdb78] mdt_reint_rec at ffffffffa0e7a481 [mdt]
      #13 [ffff883f634cdb98] mdt_reint_internal at ffffffffa0e5fed3 [mdt]
      #14 [ffff883f634cdbd8] mdt_intent_reint at ffffffffa0e6045e [mdt]
      #15 [ffff883f634cdc28] mdt_intent_policy at ffffffffa0e5dc3e [mdt]
      #16 [ffff883f634cdc68] ldlm_lock_enqueue at ffffffffa075e2c5 [ptlrpc]
      #17 [ffff883f634cdcd8] ldlm_handle_enqueue0 at ffffffffa0787ebb [ptlrpc]
      #18 [ffff883f634cdd48] mdt_enqueue at ffffffffa0e5e106 [mdt]
      #19 [ffff883f634cdd68] mdt_handle_common at ffffffffa0e62ada [mdt]
      #20 [ffff883f634cddb8] mds_regular_handle at ffffffffa0e9f505 [mdt]
      #21 [ffff883f634cddc8] ptlrpc_server_handle_request at ffffffffa07b70c5 [ptlrpc]
      #22 [ffff883f634cdea8] ptlrpc_main at ffffffffa07b989d [ptlrpc]
      #23 [ffff883f634cdf48] kernel_thread at ffffffff8100c28a
      

      I am attaching the backtraces from the two crash dumps and also the Lustre debug dumps.
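      For anyone triaging the attached backtraces, a quick way to confirm the pattern is to tally the function each MDT service thread is blocked in. Below is a rough sketch, assuming the bt.all.* attachments are plain 'foreach bt' output from crash and have been gunzipped first; the script (call it stack_tally.py) and its parsing patterns are illustrative only, not anything shipped with Lustre.

      import collections, re, sys

      counts = collections.Counter()
      thread = None
      for line in open(sys.argv[1]):                      # one gunzipped bt.all.* file
          pid = re.search(r'PID:\s+\d+\s+TASK:\s+\w+\s+.*COMMAND:\s+"(\S+)"', line)
          if pid:
              thread = pid.group(1)                       # start of a new thread's stack
              continue
          frame = re.match(r'\s*#\d+\s+\[\w+\]\s+(\S+)\s+at\s', line)
          if frame and thread and thread.startswith("mdt"):
              func = frame.group(1)
              if func not in ("schedule", "schedule_timeout"):
                  counts[func] += 1                       # first non-scheduler frame only
                  thread = None                           # skip deeper frames of this stack
      for func, n in counts.most_common(10):
          print(f"{n:5d}  {func}")

      Run it as, e.g., python3 stack_tally.py bt.all.Mar26 (after gunzip); for the dumps described above it should show qsd_op_begin at the top of the list for the mdt threads.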

      Attachments

        1. bt.all.Mar26.gz
          109 kB
        2. bt.all.Mar27.05.25.28.gz
          104 kB
        3. debug.out.1.gz
          0.2 kB
        4. debug.out.2.gz
          0.2 kB

        Issue Links

          duplicates LU-6433

          Activity

            [LU-7926] MDS sits idle with extreme slow response to clients
            pjones Peter Jones added a comment -

            So is it ok to close this ticket as a duplicate of LU-6433?


            jaylan Jay Lan (Inactive) added a comment -

            I cherry-picked the patch to b2_7_fe.

            jaylan Jay Lan (Inactive) added a comment -

            I cherry-picked the b2_5_fe port of http://review.whamcloud.com/#/c/19250/ that Niu Yawei commented on 30/Mar/16 8:21 PM:
            LU-6433 quota: handle QUOTA_DQACQ in READPAGE portal

            Not in production yet, but in our 2.5.3 code now.

            niu Niu Yawei (Inactive) added a comment -

            well, exactly. the fact that we send an RPC to get another quota unit can be the point at which we interrupt and return -EINPROGRESS to the client?

            That sounds doable; it could eliminate all the sync acquisitions.

            I guess a potential problem is to avoid a livelock where a single unit is ping-pong'ed among 2+ servers because nobody consumes that right away..

            I don't know exactly what you are referring to; could you illustrate?
            bzzz Alex Zhuravlev added a comment - edited

            well, exactly. the fact that we send an RPC to get another quota unit can be the point at which we interrupt and return -EINPROGRESS to the client? I guess a potential problem is to avoid a livelock where a single unit is ping-pong'ed among 2+ servers because nobody consumes that right away..


            niu Niu Yawei (Inactive) added a comment -

            Alex, the problem is that the thread sending the DQACQ has to wait for the RPC timeout; once it gets the timeout, it replies -EINPROGRESS to the client. So strictly speaking it's not a deadlock but a livelock problem. Our solution is to use a different set of threads for sending and handling the DQACQ requests.
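            To make the livelock concrete, here is a minimal sketch of the thread-pool exhaustion described above (plain Python, not Lustre code; POOL_SIZE, RPC_TIMEOUT and the handler names are invented for illustration). Every service thread that picks up a client operation blocks on a quota (DQACQ) sub-request that only a thread from the same pool can serve, so with a shared pool nothing completes until the timeout fires, while a single dedicated quota thread, analogous to the separate thread set in LU-6433, lets every operation finish immediately.

            import queue, threading

            POOL_SIZE = 4        # stand-in for the pool of mdt service threads
            RPC_TIMEOUT = 2.0    # stand-in for the DQACQ RPC timeout (seconds)

            def run(shared_pool):
                svc_q = queue.Queue()                        # client requests land here
                quota_q = svc_q if shared_pool else queue.Queue()
                served = []

                def handle_client_op(op_id):
                    granted = threading.Event()
                    quota_q.put(("dqacq", granted))          # send the quota acquire request ...
                    if granted.wait(RPC_TIMEOUT):            # ... and block until someone serves it
                        served.append(op_id)

                def worker(q):
                    while True:
                        kind, payload = q.get()
                        if kind == "stop":
                            return
                        if kind == "op":
                            handle_client_op(payload)
                        else:                                # "dqacq": granting is instantaneous here
                            payload.set()

                pools = [(svc_q, POOL_SIZE)]
                if not shared_pool:
                    pools.append((quota_q, 1))               # dedicated DQACQ handler thread
                threads = [threading.Thread(target=worker, args=(q,))
                           for q, n in pools for _ in range(n)]
                for t in threads:
                    t.start()
                for op_id in range(POOL_SIZE):               # enough requests to occupy every service thread
                    svc_q.put(("op", op_id))
                for q, n in pools:                           # then tell the workers to exit
                    for _ in range(n):
                        q.put(("stop", None))
                for t in threads:
                    t.join()
                return len(served)

            print("shared pool   - ops completed before the timeout:", run(True))   # expect 0
            print("separate pool - ops completed:                    ", run(False)) # expect 4

            With the shared pool the first run completes nothing before the timeout, which mirrors the attached backtraces: every mdt thread parked in qsd_op_begin waiting for a quota grant that no free thread can deliver.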

            I'd think that MDT shouldn't get blocked in this case at all - just return to the client with -EINPROGRESS or something?

            niu Niu Yawei (Inactive) added a comment - edited

            port to b2_5_fe: http://review.whamcloud.com/#/c/19250/

            niu Niu Yawei (Inactive) added a comment -

            Ah, I think it's a livelock problem: from the stack traces we can see that all MDT service threads were busy sending DQACQ requests, leaving no threads available to handle those requests. This has been fixed in master by LU-6433.

            mhanafi Mahmoud Hanafi added a comment -

            Going through the logs there is nothing to indicate that there was a loss of connections between the OSTs, MDT, or clients. There was a high load on a number of OSSes at 9:27:00. Are any quota allocations adjusted once you're over your inode soft limit?

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 8
