Lustre / LU-13692

MDS slow/hung threads at mdt_object_local_lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: clients running l2.12.3
    • Severity: 2

    Description

      MDS threads are hung or slow, and we are getting stack trace dumps:

      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
      Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
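      As a side note, the /tmp/lustre-log.<timestamp>.<pid> files referenced above are binary Lustre debug dumps; assuming the standard lctl utility is available, they can be converted to readable text roughly as follows (the file name is just the one from the messages above):

      # Decode a binary debug dump into plain text for inspection
      lctl debug_file /tmp/lustre-log.1592342774.14118 /tmp/lustre-log.1592342774.14118.txt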
       

      Attachments

        1. lustre-log.1592343544.8525
          4.75 MB
        2. messages
          51 kB
        3. vmcore-dmesg.txt
          1021 kB


          Activity

            green Oleg Drokin added a comment -

            I checked for the first eviction, and that seems to have been removed from the client log by the time it was dumped, so it's not possible to see what was going on on the client at the time.

            I see there are still a bunch of stuck threads after that; I need to see if I can find out why. Probably another lost AST, but it's not a given that the client log (whatever it was) survived either, I guess.


            mhanafi Mahmoud Hanafi added a comment -

            Uploaded logs running with patch #4.

            debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck"

            ftp:/uploads/LU-13692/LU-13692_patch4_20200902.tgz
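            For reference, a rough sketch of applying a mask like the one above and collecting a dump with lctl (standard lctl subcommands; the output path is only an example):

            # Set the debug mask used for this run (this overwrites the current mask)
            lctl set_param debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck"

            # Clear the in-kernel debug buffer, reproduce the problem, then dump the buffer to a file
            lctl clear
            lctl dk /tmp/server.dk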

            green Oleg Drokin added a comment -

            Patchset 4, if you can, please.

            I agree we have at least two issues at hand.


            mhanafi Mahmoud Hanafi added a comment -

            I think we have two issues here: first the LDLM deadlock, and second the credit issue.
            I am going to retest with different credit values. Should I test with Patch #2 or Patch #3?

            green Oleg Drokin added a comment -

            Andreas, the thread is sleeping in ptlrpc_set_wait in ptlrpc, so waking it up from ldlm is going to be very nontrivial. We can find the thread easily, as it's recorded in the lock, but then we need to determine the ptlrpc set, which is privately allocated and referenced only on that thread's stack.

            green Oleg Drokin added a comment -

            I discussed this with Amir, and he thinks your situation resembles a case where there are not enough peer credits, so messages are dropped.

            I checked the logs and I do see this:

            00000800:00000200:5.0:1595883687.321977:0:15687:0:(o2iblnd_cb.c:900:kiblnd_post_tx_locked()) 10.151.26.238@o2ib: no credits
            

            But the date is strange: it's from July. There are no newer messages, but this is only printed at trace level "net", which you have not been including in the later runs (I guess you were not removing older server.dk files?).

            He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to 1/2 of peer_credits to ensure proper reclaim of credits for certain types of messages (growing peer_credits might not be a bad idea either, depending on what your settings are now).

            "lnetctl net show -v 4" and "lnetctl peer show -v" should show some related stats and min credits.


            mhanafi Mahmoud Hanafi added a comment -

            In the last set of logs I had "nettrace" and "neterror". We didn't see any network errors. I can run with +net if you would like.
            Would it make sense to see if you can reproduce this locally?
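            (Assuming lctl is available on the MDS, a sketch of adding that flag at runtime without clearing the rest of the mask:)

            # Append the "net" flag to the current debug mask, then verify the result
            lctl set_param debug=+net
            lctl get_param debug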


            adilger Andreas Dilger added a comment -

            Oleg, in addition to the BL AST "granting" the lock with LDLM_FL_CBPENDING, could it also wake up the thread waiting for the lock reply to allow it to finish processing? There would need to be some way to trigger this so that the client thinks the RPC was received successfully.

            Probably a better idea is for the RPC wait in that thread to have a periodic timeout that checks whether the lock is already granted, and to exit the wait itself. Then the BL AST can mark the lock granted with CBPENDING, and when the waiting thread next wakes up it will see the lock is granted, clean itself up, and continue processing the syscall.

            green Oleg Drokin added a comment -

            I don't see anything different in this log vs. the previous one:

            00010000:00010000:4.0:1598299915.176480:0:14307:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff9c95908f1200/0xcb77a9c20dcbc88e lrc: 4/0,0 mode: CR/CR res: [0x20000d6f1:0x1:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.214@o2ib417 remote: 0xdd261a7834f00346 expref: 98 pid: 14307 timeout: 6658 lvb_type: 0
            00010000:00010000:4.0:1598299915.176495:0:14307:0:(ldlm_lockd.c:1479:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff9c95908f1200, rc 0)
            00000100:00100000:4.0:1598299915.176508:0:14307:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:221c8e76-3a4a-d7bb-7d3c-6ea3526db183+98:4765:x1675936175078016:12345-10.141.6.214@o2ib417:101 Request processed in 8860us (8880us total) trans 107374186758 rc 0/0
            00000100:00100000:4.0:1598299915.176512:0:14307:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.214@o2ib417, seq: 2640
            00000100:00100000:4.0:1598299915.176518:0:14307:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.3.114@o2ib417, seq: 2761
            00000100:00100000:4.0:1598299915.176521:0:14307:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:44cfaec0-6c8c-6d1b-0b71-5fc557b7218e+97:4797:x1675936177175232:12345-10.141.3.114@o2ib417:101
            

            The request was handled and the reply was supposedly sent (at least there are no messages from LNet indicating otherwise), yet the client did not receive it.


            mhanafi Mahmoud Hanafi added a comment -

            Any updates?
            mhanafi Mahmoud Hanafi added a comment - edited

            Uploaded new debug logs with nettrace
            ftp:/uploads/LU-13692/LU-13692_patch2_testing_2020_08_24.tgz

            Server-side threads remain stuck until the server is rebooted.


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15
