LU-13692

MDS slow/hung threads at mdt_object_local_lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Component/s: None
    • Environment: clients running l2.12.3
    • Severity: 2

    Description

      MDS threads are hung or slow; the watchdog dumped stack traces:

      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
      Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
       

      Attachments

        1. lustre-log.1592343544.8525
          4.75 MB
        2. messages
          51 kB
        3. vmcore-dmesg.txt
          1021 kB

        Issue Links

          Activity

            [LU-13692] MDS slow/hung threads at mdt_object_local_lock
            green Oleg Drokin added a comment -

            I discussed this with Amir, and he thinks your situation resembles a case where there are not enough peer credits, so messages are dropped.

            I checked the logs and I do see this:

            00000800:00000200:5.0:1595883687.321977:0:15687:0:(o2iblnd_cb.c:900:kiblnd_post_tx_locked()) 10.151.26.238@o2ib: no credits
            

            but the date is strange; it's from July. There are no newer messages, but this line is only output at trace level "net", which you have not been including in the later captures (I guess you were not removing older server.dk files?).

            He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to 1/2 of peer_credits to ensure proper reclaim of credits for certain types of messages (and growing peer_credits might not be a bad idea, depending on what your settings are now).

            "lnetctl net show -v 4", "lnetctl peer show -v" should show some related stats and min credits.


            mhanafi Mahmoud Hanafi added a comment -

            In the last set of logs I had "nettrace" and "neterror" enabled. We didn't see any network errors. I can run with +net if you would like (a sketch of the commands is below).
            Would it make sense to see if you can reproduce this locally?
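            For reference, a minimal sketch of how that could be enabled on the servers before the next run; the flag names are the standard lctl debug masks and the dump path is only an example:

                # add net tracing on top of the existing debug mask
                lctl set_param debug=+net
                lctl set_param debug=+nettrace
                lctl set_param debug=+neterror
                lctl get_param debug              # verify the resulting mask
                # after reproducing, dump the debug buffer for upload
                lctl dk /tmp/server.dk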


            adilger Andreas Dilger added a comment -

            Oleg, in addition to the BL AST "granting" the lock with LDLM_FL_CBPENDING, could it also wake up the thread waiting for the lock reply to allow it to finish processing? There would need to be some way to trigger this so that the client thinks the RPC was received successfully.

             

            Probably a better idea is for the RPC wait in that thread to have a periodic timeout that checks whether the lock is already granted, and exits the wait itself? Then the BL AST can mark the lock granted with CBPENDING, and when the waiting thread next wakes up it will see that the lock is granted, clean itself up, and continue processing the syscall.

            green Oleg Drokin added a comment -

            I don't see anything different in this log vs. the previous one:

            00010000:00010000:4.0:1598299915.176480:0:14307:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff9c95908f1200/0xcb77a9c20dcbc88e lrc: 4/0,0 mode: CR/CR res: [0x20000d6f1:0x1:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.214@o2ib417 remote: 0xdd261a7834f00346 expref: 98 pid: 14307 timeout: 6658 lvb_type: 0
            00010000:00010000:4.0:1598299915.176495:0:14307:0:(ldlm_lockd.c:1479:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff9c95908f1200, rc 0)
            00000100:00100000:4.0:1598299915.176508:0:14307:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:221c8e76-3a4a-d7bb-7d3c-6ea3526db183+98:4765:x1675936175078016:12345-10.141.6.214@o2ib417:101 Request processed in 8860us (8880us total) trans 107374186758 rc 0/0
            00000100:00100000:4.0:1598299915.176512:0:14307:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.214@o2ib417, seq: 2640
            00000100:00100000:4.0:1598299915.176518:0:14307:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.3.114@o2ib417, seq: 2761
            00000100:00100000:4.0:1598299915.176521:0:14307:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:44cfaec0-6c8c-6d1b-0b71-5fc557b7218e+97:4797:x1675936177175232:12345-10.141.3.114@o2ib417:101
            

            The request was handled and the reply was supposedly sent (at least there are no messages from LNet indicating otherwise), yet the client did not receive it.


            mhanafi Mahmoud Hanafi added a comment -

            Any updates?

            mhanafi Mahmoud Hanafi added a comment - edited

            Uploaded new debug logs with nettrace
            ftp:/uploads/LU-13692/LU-13692_patch2_testing_2020_08_24.tgz

            Server-side threads remain stuck until the server is rebooted.

            green Oleg Drokin added a comment -

            Andreas, while we can grant the lock on receipt of the bl callback, I am not sure that alone would unstick the enqueue thread that's waiting for the RPC to return first. If we force-resume it, then I imagine all sorts of strange behavior will crop up in normal races.

            Also just granting the lock will not make it go away because the stuck thread is holding a reference on it.


            mhanafi Mahmoud Hanafi added a comment -

            It is highly unlikely that we would have a drop on the net. I will re-run with nettrace.


            adilger Andreas Dilger added a comment -

            Oleg, if the client gets a blocking callback for a lock that it doesn't think is granted, should it "grant" the lock at that time with the blocking flag, and reply to the server AST, so that the lock is cancelled immediately after use but doesn't cause the client to stall?

            Alternatively, the client could drop the lock, but then it doesn't make any forward progress.

            green Oleg Drokin added a comment -

            So this last log 100% matches the first log: the server receives request xid 1675670653565120, processes it, and sends a reply to the client. The client never receives the reply, so when the server sends the BL AST, the client, not aware of the lock being granted, does not release the lock, which leads to eviction.

            This log does not include nettrace/neterror, so I cannot see what actually happened to the reply, i.e. whether it was lost on the wire or got stuck in some outgoing queue.


            mhanafi Mahmoud Hanafi added a comment -

            Uploaded debug for patchset 2 - ftp:/uploads/LU-13692/LU-13692_patch2_2020_08_21.tgz.

            I was able to reproduce the issue. I stopped debugging after the first client eviction. Maybe it was too soon... Please take a look and let me know if you want me to re-run it.


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: