[LU-10993] Fix for LU-10826 is problematic and skips recovery Created: 03/May/18  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10826 Regression in LU-9372 on OPA envirome... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

I think patch https://review.whamcloud.com/#/c/31690/ for LU-10826 is problematic.
After applying patch https://review.whamcloud.com/#/c/31690/ and setting test_req_buffer_pressure=1, OOM is prevented, but some clients are being skipped during recovery.

[root@voss05 ~]#  lctl get_param obdfilter.*.recovery_status
obdfilter.scratch-OST0024.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7249/7249
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0025.recovery_status=
status: COMPLETE
recovery_start: 1525317353
recovery_duration: 56
completed_clients: 7031/7031
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0026.recovery_status=
status: COMPLETE
recovery_start: 1525317352
recovery_duration: 57
completed_clients: 8168/8168
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0027.recovery_status=
status: COMPLETE
recovery_start: 1525317350
recovery_duration: 59
completed_clients: 8195/8195
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0028.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7984/7984
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0029.recovery_status=
status: COMPLETE
recovery_start: 1525317352
recovery_duration: 57
completed_clients: 7985/7985
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002a.recovery_status=
status: COMPLETE
recovery_start: 1525317354
recovery_duration: 55
completed_clients: 8329/8329
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002b.recovery_status=
status: COMPLETE
recovery_start: 1525317351
recovery_duration: 58
completed_clients: 8291/8291
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002c.recovery_status=
status: COMPLETE
recovery_start: 1525317350
recovery_duration: 59
completed_clients: 8286/8286
replayed_requests: 0
last_transno: 94489280512
VBR: DISABLED
IR: ENABLED

Also, sometimes recovery is still never triggered, e.g. in a failover situation.
I see the following messages after restarting the OSTs:

[ 9169.158440] Lustre: 14598:0:(events.c:368:request_in_callback()) All ost request buffers busy
[ 9169.158447] Lustre: 14598:0:(events.c:368:request_in_callback()) Skipped 3508 previous similar messages
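
For reference, a minimal sketch of how the tunable was enabled (this assumes test_req_buffer_pressure is exposed as a ptlrpc module parameter on this build; the owning module and exact path may differ by release):

# /etc/modprobe.d/lustre.conf on each OSS
options ptlrpc test_req_buffer_pressure=1

# verify once the Lustre modules are loaded
cat /sys/module/ptlrpc/parameters/test_req_buffer_pressure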


 Comments   
Comment by Bruno Faccini (Inactive) [ 03/May/18 ]

Hello Shuichi,
Why do you think that some recovery clients are being missed? Because you expect the number of completed_clients to be the same for each OST?

Also, the "(events.c:368:request_in_callback()) All ost request buffers busy" is expected to occur when running when test_req_buffer_pressure=1.

Comment by Shuichi Ihara (Inactive) [ 03/May/18 ]

Yes, and I've checked the client side: some clients did not reconnect to the OSTs even though the recovery status is COMPLETE.
Another problem: there are 40 OSSs here, and while some of them triggered recovery, many OSSs never triggered recovery at all.
Actually, if we unmount the OSTs and remount them on those OSSs, recovery is re-triggered, but still not all clients recover.
This might be due to an incomplete patch at https://review.whamcloud.com/#/c/31690/.
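
This is roughly how I checked the client side (the parameter layout may differ slightly between releases; adjust the pattern to the local fsname):

# on a client: current state of each OSC import for the scratch filesystem
lctl get_param osc.scratch-*.import | grep -E 'target:|state:'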

Comment by Bruno Faccini (Inactive) [ 04/May/18 ]

I know it is not a simple task, but since you seem able to reproduce this easily, can you try to reduce the test to a minimal subset of OSSs/OSTs and connected clients and then take a full Lustre debug log on the OSS and the clients? I would like to get at least the trace from the OSS and from both a successful and a failed client.
In the meantime I will try to reproduce this on a test platform.
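
Something along these lines on each OSS and on both clients should be enough (a sketch; adjust the debug buffer size and output path as needed):

# enable full debugging and enlarge the trace buffer before the test
lctl set_param debug=-1
lctl set_param debug_mb=1024
lctl clear
# ... reproduce the restart / failed recovery ...
# then dump the kernel debug log to a file
lctl dk /tmp/lustre-debug-$(hostname).log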

Comment by Shuichi Ihara (Inactive) [ 04/May/18 ]

OK, let me know exactly what information you need.
At least the situation where recovery is never re-triggered should be easy to reproduce.

Comment by Bruno Faccini (Inactive) [ 04/May/18 ]

> OK, let me know exactly what information you need.
Well, as I already indicated in my previous comment: "can you try to reduce the test to a minimal subset of OSSs/OSTs and connected clients and then take a full Lustre debug log on the OSS and the clients? I would like to get at least the trace from the OSS and from both a successful and a failed client."

Comment by Bruno Faccini (Inactive) [ 01/Jun/18 ]

Hello Shuichi,
Just a small update to let you know that all attempts to reproduce this problem have been unsuccessful so far.
BTW, did you find some time to reproduce it again on your side and provide the information I requested earlier?

Comment by Peter Jones [ 23/Aug/18 ]

Mike

Could you please assess this situation?

Thanks

Peter

Comment by Peter Jones [ 12/Oct/18 ]

Descoping from 2.12 for now as there is not enough to work on. We can certainly continue to work on this as soon as more data is available.
