[LU-10993] Fix for LU-10826 is problematic and skips recovery Created: 03/May/18 Updated: 16/Jan/22 Resolved: 16/Jan/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 2 |
| Description |
|
I think patch https://review.whamcloud.com/#/c/31690/ for LU-10826 is problematic and skips recovery. On voss05, every OST reports recovery as COMPLETE with no replayed requests:

[root@voss05 ~]# lctl get_param obdfilter.*.recovery_status
obdfilter.scratch-OST0024.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7249/7249
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0025.recovery_status=
status: COMPLETE
recovery_start: 1525317353
recovery_duration: 56
completed_clients: 7031/7031
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0026.recovery_status=
status: COMPLETE
recovery_start: 1525317352
recovery_duration: 57
completed_clients: 8168/8168
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0027.recovery_status=
status: COMPLETE
recovery_start: 1525317350
recovery_duration: 59
completed_clients: 8195/8195
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0028.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7984/7984
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST0029.recovery_status=
status: COMPLETE
recovery_start: 1525317352
recovery_duration: 57
completed_clients: 7985/7985
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002a.recovery_status=
status: COMPLETE
recovery_start: 1525317354
recovery_duration: 55
completed_clients: 8329/8329
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002b.recovery_status=
status: COMPLETE
recovery_start: 1525317351
recovery_duration: 58
completed_clients: 8291/8291
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
obdfilter.scratch-OST002c.recovery_status=
status: COMPLETE
recovery_start: 1525317350
recovery_duration: 59
completed_clients: 8286/8286
replayed_requests: 0
last_transno: 94489280512
VBR: DISABLED
IR: ENABLED

Also, recovery sometimes still never triggers, e.g. in a failover situation:

[ 9169.158440] Lustre: 14598:0:(events.c:368:request_in_callback()) All ost request buffers busy
[ 9169.158447] Lustre: 14598:0:(events.c:368:request_in_callback()) Skipped 3508 previous similar messages
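When comparing many OSTs, it can help to scan the `lctl get_param obdfilter.*.recovery_status` output programmatically. The sketch below uses a hypothetical helper (`parse_recovery_status` and `complete_without_replay` are illustrative names, not part of Lustre or lctl). It assumes that each target begins with a `<param>.recovery_status=` header line followed by one `key: value` field per line, and it lists targets that report `status: COMPLETE` together with `replayed_requests: 0`.

```python
from typing import Dict, List, Optional

# Hypothetical helper (not part of Lustre/lctl): parse the text printed by
# `lctl get_param obdfilter.*.recovery_status` into {target: {field: value}}.
# Assumes each target starts with a "<param>.recovery_status=" header line,
# followed by one "key: value" field per line.
def parse_recovery_status(text: str) -> Dict[str, Dict[str, str]]:
    targets: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    suffix = ".recovery_status="
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(suffix):
            # "obdfilter.scratch-OST0024.recovery_status=" -> "scratch-OST0024"
            param = line[: -len(suffix)]
            name = param.split(".", 1)[-1]
            current = targets.setdefault(name, {})
        elif current is not None and ":" in line:
            # Split on the first colon only: "completed_clients: 7249/7249"
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return targets


def complete_without_replay(targets: Dict[str, Dict[str, str]]) -> List[str]:
    # Targets whose recovery reports COMPLETE although no request was replayed.
    return sorted(
        name
        for name, fields in targets.items()
        if fields.get("status") == "COMPLETE"
        and fields.get("replayed_requests") == "0"
    )


if __name__ == "__main__":
    sample = """obdfilter.scratch-OST0024.recovery_status=
status: COMPLETE
recovery_start: 1525317355
recovery_duration: 54
completed_clients: 7249/7249
replayed_requests: 0
last_transno: 98784247808
VBR: DISABLED
IR: ENABLED
"""
    status = parse_recovery_status(sample)
    print(status["scratch-OST0024"]["completed_clients"])  # 7249/7249
    print(complete_without_replay(status))  # ['scratch-OST0024']
```

A COMPLETE status with zero replayed requests is not by itself proof of a problem (there may simply have been nothing to replay); the helper is only a quick way to spot that pattern across many targets.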
| Comments |
| Comment by Bruno Faccini (Inactive) [ 03/May/18 ] |
|
Hello Shuichi, Also, the "(events.c:368:request_in_callback()) All ost request buffers busy" message is expected to occur when running with test_req_buffer_pressure=1.
| Comment by Shuichi Ihara (Inactive) [ 03/May/18 ] |
|
Yes, and I've checked the client side: the clients did not connect to the OST even though the recovery status is COMPLETE.
| Comment by Bruno Faccini (Inactive) [ 04/May/18 ] |
|
I know it is not a simple task, but since you seem to be able to reproduce this easily, can you try to reduce the test to a minimal subset of the OSS's OSTs and connected clients, and then take a full Lustre debug log on the OSS and the clients? I would like to get at least the traces from the OSS and from both a successful and a failed client.
| Comment by Shuichi Ihara (Inactive) [ 04/May/18 ] |
|
OK, let me know exactly what information you need.
| Comment by Bruno Faccini (Inactive) [ 04/May/18 ] |
|
> ok, let me know what exact information do you need. |
| Comment by Bruno Faccini (Inactive) [ 01/Jun/18 ] |
|
Hello Shuichi, |
| Comment by Peter Jones [ 23/Aug/18 ] |
|
Mike, could you please assess this situation? Thanks, Peter
| Comment by Peter Jones [ 12/Oct/18 ] |
|
Descoping from 2.12 for now, as there is not enough information to work on. We can certainly continue to work on this as soon as more data is available.