[LU-12212] Frequent request timeouts during dbench run Created: 21/Apr/19  Updated: 01/Feb/21  Resolved: 21/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Critical
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: LTS12

Issue Links:
Related
is related to LU-9193 Multiple hangs observed with many op... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An ordinary dbench run starts showing a lot of messages like this:

Apr 21 03:11:05 nodez kernel: Lustre: 4236:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555830658/real 1555830658] req@ffff8800a273cc00 x1631406558218272/t0(0) o101->lustre-MDT0000-mdc-ffff8800af2fb800@0@lo:12/10 lens 616/4752 e 0 to 1 dl 1555830665 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 21 03:11:05 nodez kernel: Lustre: lustre-MDT0000-mdc-ffff8800af2fb800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Apr 21 03:11:05 nodez kernel: Lustre: lustre-MDT0000: Client 2f22eecb-5055-b4d6-16b5-a958700fcbda (at 0@lo) reconnecting
Apr 21 03:11:05 nodez kernel: Lustre: lustre-MDT0000: Connection restored to 4e8c3a45-3b8a-cd6a-047f-22363e8171e6 (at 0@lo)
Apr 21 03:11:05 nodez kernel: Lustre: Skipped 3 previous similar messages
Apr 21 03:11:48 nodez kernel: Lustre: 4237:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555830701/real 1555830701] req@ffff8800862f3440 x1631406567790080/t0(0) o101->lustre-MDT0000-mdc-ffff8800af2fb800@0@lo:12/10 lens 616/4752 e 0 to 1 dl 1555830708 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 21 03:11:48 nodez kernel: Lustre: lustre-MDT0000-mdc-ffff8800af2fb800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Apr 21 03:11:48 nodez kernel: Lustre: lustre-MDT0000: Client 2f22eecb-5055-b4d6-16b5-a958700fcbda (at 0@lo) reconnecting
Apr 21 03:12:12 nodez kernel: Lustre: 4238:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555830725/real 1555830725] req@ffff8800861750c0 x1631406572864272/t0(0) o101->lustre-MDT0000-mdc-ffff8800af2fb800@0@lo:12/10 lens 616/4752 e 0 to 1 dl 1555830732 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 21 03:12:12 nodez kernel: Lustre: lustre-MDT0000-mdc-ffff8800af2fb800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Apr 21 03:12:12 nodez kernel: Lustre: lustre-MDT0000: Client 2f22eecb-5055-b4d6-16b5-a958700fcbda (at 0@lo) reconnecting
Apr 21 03:12:12 nodez kernel: Lustre: lustre-MDT0000: Connection restored to 4e8c3a45-3b8a-cd6a-047f-22363e8171e6 (at 0@lo)
Apr 21 03:12:12 nodez kernel: Lustre: Skipped 3 previous similar messages
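
For anyone comparing such messages across runs, here is a minimal sketch that pulls a few fields out of the "timed out for slow reply" lines above. The regular expression and the field labels (xid, opcode, target, allowed_seconds) are my own assumptions based only on the layout of the lines in this excerpt, not an official description of the Lustre log format. Applied to the three requests quoted above, the gap between "sent" and "dl" is 7 seconds in each case.

```python
import re

# Pattern derived from the excerpt in this ticket only; the group names are
# illustrative labels, not official Lustre field names.
EXPIRE_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*?"
    r"x(?P<xid>\d+)/t\d+\(\d+\) o(?P<opc>\d+)->(?P<target>\S+?)@"
    r".*? dl (?P<deadline>\d+) "
)


def parse_expired_request(line):
    """Return a dict of fields from one 'slow reply' line, or None if no match."""
    m = EXPIRE_RE.search(line)
    if m is None:
        return None
    sent = int(m.group("sent"))
    deadline = int(m.group("deadline"))
    return {
        "xid": int(m.group("xid")),
        "opcode": int(m.group("opc")),
        "target": m.group("target"),
        "sent": sent,
        "deadline": deadline,
        # Seconds between the send time and the deadline printed after "dl".
        "allowed_seconds": deadline - sent,
    }


def summarize(log_text):
    """Parse every matching line in a block of log text."""
    parsed = (parse_expired_request(line) for line in log_text.splitlines())
    return [p for p in parsed if p is not None]
```

Feeding the full syslog through summarize() gives one record per expired request, which makes it easy to check whether every timeout involves the same opcode and target, as all three do in the excerpt above.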

This started after the LU-9193 patch landed (found by git bisect). There is nothing special about the test setup: no SELinux, just a local run with dbench -D /mnt/lustre/testdir 4
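
The ticket does not say how the bisect was driven. As a rough sketch only, a check like the following could classify a build for git bisect run, which treats exit code 0 as good and 1 as bad. The function name, the marker string match, and the max_timeouts threshold are hypothetical; the marker text is copied from the messages above.

```python
# Text that appears in every timeout message quoted in this ticket.
TIMEOUT_MARKER = "Request sent has timed out for slow reply"


def count_timeouts(log_text):
    """Count log lines that contain the timeout marker."""
    return sum(1 for line in log_text.splitlines() if TIMEOUT_MARKER in line)


def bisect_exit_code(log_text, max_timeouts=0):
    """Map a captured kernel log to a git bisect run exit code.

    Returns 0 (good) if the number of timeout lines is at most
    max_timeouts, otherwise 1 (bad). The threshold is an assumption.
    """
    return 1 if count_timeouts(log_text) > max_timeouts else 0
```

A wrapper script would build and mount the tree under test, run the dbench command above, capture the kernel log produced during the run, and pass the result of bisect_exit_code() to sys.exit().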



 Comments   
Comment by Gerrit Updater [ 22/Apr/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34734
Subject: LU-12212 mdt: fix SECCTX reply buffer handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 63a53d131de12c61657d4299530a5eafcb7dab8e

Comment by Gerrit Updater [ 21/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34734/
Subject: LU-12212 mdt: fix SECCTX reply buffer handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cb61ed93f8563c26b6a6db396478fe54f8dc42cb

Comment by Peter Jones [ 21/May/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 21/May/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34910
Subject: LU-12212 mdt: fix SECCTX reply buffer handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 68f7f348553c5d55a14165a6227a9a40756dfb50

Comment by Sebastien Piechurski [ 07/Oct/19 ]

Is there a reason not to re-apply LU-9193 on b2_12 with this patch on top of it?

LU-9193 is a very long-standing issue for several of our customers, and we would rather not wait for the next LTS to have it solved.
