[LU-13131] Partial writes on multi-client strided files Created: 14/Jan/20 Updated: 01/Aug/20 Resolved: 20/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4, Lustre 2.10.6, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jeff Johnson | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12, zfs | ||
| Environment: |
CentOS 7.6, Lustre/ZFS 2.10.4, 2.10.6, 2.12.2, socklnd and o2iblnd |
||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Issue: Client partial writes during multi-node N:1 2MB strided writes to striped files. The problem occurs at scale; triggering starts at around 50 concurrent files, where each file receives 2MB strided writes from 128 jobs spread across two Lustre client nodes. Sampling of client-side errors:
08-Jan 08:36:26 WARNING: partial write 1048576 != 4194304 at offset 543099453440 - skywalker09
08-Jan 08:36:26 WARNING: write of 4194304 failed at offset 318150541312 Input/output error [retry 0] - skywalker09
08-Jan 08:37:13 WARNING: partial write 2097152 != 4194304 at offset 144921591808 - skywalker28
08-Jan 08:37:13 WARNING: write of 4194304 failed at offset 356448731136 Input/output error [retry 0] - skywalker28
08-Jan 08:35:48 WARNING: partial write 1048576 != 4194304 at offset 356121575424 - skywalker19
08-Jan 08:35:48 WARNING: write of 4194304 failed at offset 281496518656 Input/output error [retry 0] - skywalker19
08-Jan 08:39:59 WARNING: close failed - Input/output error - skywalker37
08-Jan 08:34:32 WARNING: partial write 2097152 != 4194304 at offset 464540139520 - skywalker87
08-Jan 08:34:32 WARNING: write of 4194304 failed at offset 506952941568 Input/output error [retry 0] - skywalker74
08-Jan 08:38:06 WARNING: write of 2097152 failed at offset 183297376256 Input/output error [retry 0] - skywalker74
08-Jan 08:38:06 WARNING: partial write 2097152 != 4194304 at offset 138403643392 - skywalker63
08-Jan 08:40:10 WARNING: close failed - Input/output error - skywalker63
Lustre Servers: Lustre + ZFS; 2.10.4, 2.10.6, 2.12.2
Lustre Clients: 2.10.4, 2.10.6, 2.12.2
Tested fabrics: Socklnd (bonded 40GbE), o2iblnd (FDR & EDR), direct connect and routed o2ib->tcp. 1MB, 4MB and 16MB RPC sizes tested.
References to other JIRA tickets that are or may be related:
Details: A single 600GB file is written to by 128 jobs in a 2MB strided write format. The 128 jobs are spread across two Lustre client nodes.
The file is written with Lustre stripe options of 4-8 OSTs and a stripe_size of 1MB. With the total number of files being written concurrently at ≤ 30, the partial writes are not experienced. In the 30-file case this corresponds to 60 Lustre clients grouped in pairs, with 64 jobs (threads) per client and each pair writing a different strided file. The same scenario scaled up to ≥ 60 concurrent files, which equates to 120 (or more) Lustre clients in pairs performing strided writes to unique files, triggers the partial-write scenario. One observation: the combination of the 2MB strided writes and the 1MB OST stripe would result in a significant number of objects per file, which scales up as the number of files and concurrent writers increases. Perhaps there is a trigger there (I'm spitballing). Error messages witnessed: there have been randomly occurring client-side error messages. It is not known whether they are causing the partial-write events or occurring as a result of the partial writes.
tgt_grant_check lfs-OSTxxxx cli xxxxxxxxx claims X GRANT, real grant Y
ldlm_cancel from xx.xx.xx.xx@tcp
(lib-move.c:4183:lnet_parse()) xx.xx.xx.xx@o2ib, src xx.xx.xx.xx@o2ib: Bad dest nid xx.xx.xx.xx@tcp (it's my nid but on a different network)
No notable server-side (MDS/OSS) error messages are observed that coincide with the partial-write errors. The OSS nodes are in an HA configuration and no failovers have occurred as a result of the issue; if either OSS were becoming overloaded, corosync heartbeats would likely be missed and trigger a failover. Additional information: with 2MB strided writes and a 1MB OST stripe size, it would seem that certain larger RPC size / brw_size / ZFS recordsize combinations could trigger significant write amplification, perhaps enough to exacerbate the issue. I have tried to trigger the issue on a testbed filesystem and have run through permutations of server-side / client-side versions, socklnd and o2iblnd direct, o2iblnd LNet-routed to socklnd, and 1/4/16MB RPC sizes, and I cannot trigger the partial-write events. Sixteen 64-core AMD Rome clients (eight concurrent files written) isn't enough to trigger it. |
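For reference, a directory layout matching the description (stripe count 4-8, stripe size 1MB) would be created with roughly the following; the mount point and path are illustrative only, not the actual production values:

$ lfs setstripe -c 8 -S 1M /mnt/lustre/strided_out     # 8 OSTs, 1MB stripe size (example path)
$ lfs getstripe /mnt/lustre/strided_out                # verify the layout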
| Comments |
| Comment by Jeff Johnson [ 14/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Adding to the original description: the clients and filesystem otherwise function normally and are performant. Other I/O, different from what is outlined above, is normal. IOR benchmark runs are successful and performant. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 14/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
The errors are easier to trigger when the OSTs selected in the 4 to 8 way OST stripe are on the same OSS or OSS pair. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 15/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Jeff, to clarify your description of the write workload:
At most levels this probably doesn't make a difference, but I'm wondering whether, if the jobs are independent executables running against the same output file, they may be less synchronized than multiple threads from the same program writing to the file together (which would result in more merging on the client instead of "random" IO).
Lustre doesn't allocate objects for different "chunks" of the file, per se, unlike Ceph, which allocates an "object" for each 32MB of the file or similar. There is exactly one OST object for each of the "stripe_count" parts of a file; in your description above, the stripe_count is 4 or 8. Writes are round-robined across the objects by file offset, but Lustre does not create new "objects" for the file as it grows larger (excluding PFL/SEL, which still have file layouts that are relatively static).
In this case, with an 8-stripe file, the 0-1MB, 8-9MB, 16-17MB, ... chunks would be written to stripe0; the 1-2MB, 9-10MB, 17-18MB, ... chunks would be written to stripe1, and so on. Having the threads write 2MB interleaved chunks to a file with stripe_size=1MB is certainly sub-optimal, depending on how the "jobs" are distributed across the nodes. Using stripe_size=2MB would at least cut the IO/lock contention in half. | ||||||||||||||||||||||||||||||||||||
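As a quick check of that mapping, the stripe index for a byte offset in a plain RAID-0 layout is (offset / stripe_size) mod stripe_count; a minimal shell sketch, assuming stripe_size=1MB and stripe_count=8 as above:

$ stripe_size=$((1024*1024)); stripe_count=8
$ offset=$((17*1024*1024))                            # start of the 17-18MB chunk
$ echo $(( (offset / stripe_size) % stripe_count ))   # chunk lands on stripe1
1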
| Comment by Jeff Johnson [ 15/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
I wondered the same thing. I'm trying to ascertain where the 4MB writes are coming from; the application's writes are 2MB, but I haven't dug deep into the application to find where the 4MB IO is. A "job" is an independent, separate task writing to the same file at the stride offsets described; no MPI involved. Thanks for clarifying the objects. I figured the size mismatch between the file striping and the strided writes might have triggered a new object per job. I may move this test to AWS to get enough scale to trigger the error where I can dump the debug buffer at all endpoints and try to sort this out. It's far too infrequent at the scale of my testbed to catch it. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 15/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
I looked into the code: it varies the IO sizes and offsets in order to trigger the issue being seen in an application that can't be modified. The application performs 2MB strided writes as originally reported; the testbed code, written to try to trigger the issue for diagnosis, writes varied IO sizes and offsets to accomplish that. That being said, according to the included error messages, the error triggers on stride sizes other than 2MB as well. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 15/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
I just triggered the issue. It occurred on one node, shortly after an OSS remount and recovery. Recovery had completed and IO from clients had started. My guess is that the burst of activity from 18 nodes all releasing IO at once somehow exposed what it usually takes 60+ nodes to start to see. I converted human-readable timestamps from dmesg/messages to epoch time to help line events with Lustre debug logs. Application: wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 188106145792 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 474214301696 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 387142647808 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 144621699072 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 268926189568 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 219213201408 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 26434600960 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 169443590144 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 57487130624 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 399555690496 - palpatine20.cluster wstmp_c70QBR: 1579118408 WARNING: partial write 4096 != 2097152 at offset 523937775616 - palpatine20.cluster Client: 1579118408 palpatine20 kernel: LustreError: 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 1579118408 palpatine20 kernel: Lustre: testbed01-OST0001-osc-ffff9bcb5c37c000: Connection to testbed01-OST0001 (at 10.0.50.12@tcp) was lost; in progress operations using this service will wait for recovery to complete 1579118408 palpatine20 kernel: LustreError: 167-0: testbed01-OST0001-osc-ffff9bcb5c37c000: This client was evicted by testbed01-OST0001; in progress operations using this service will fail. 
1579118408 palpatine20 kernel: Lustre: 9208:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) testbed01: dirty page discard: 10.0.50.11@tcp:/testbed01/fid: [0x20000041c:0x8a:0x0]/ may get corrupted (rc -5) 1579118410 palpatine20 kernel: Lustre: 9205:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) testbed01: dirty page discard: 10.0.50.11@tcp:/testbed01/fid: [0x20000041c:0x88:0x0]/ may get corrupted (rc -5) 1579118410 palpatine20 kernel: Lustre: testbed01-OST0001-osc-ffff9bcb5c37c000: Connection restored to 10.0.50.12@tcp (at 10.0.50.12@tcp) Client Lustre Debug: 00000100:00080000:52.0:1579118406.851776:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 00000100:02020000:106.0F:1579118408.371668:0:9204:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:00080000:106.0:1579118408.371689:0:9204:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:02000400:106.0:1579118408.371691:0:9204:0:(import.c:179:ptlrpc_set_import_discon()) testbed01-OST0001-osc-ffff9bcb5c37c000: Connection to testbed01-OST0001 (at 10.0.50.12@tcp) was lost; in progress operations using this service will wait for recovery to complete 00000100:00080000:106.0:1579118408.371694:0:9204:0:(import.c:187:ptlrpc_set_import_discon()) ffff9bdb64a57000 testbed01-OST0001_UUID: changing import state from FULL to DISCONN 00000100:00080000:106.0:1579118408.371696:0:9204:0:(import.c:684:ptlrpc_connect_import()) ffff9bdb64a57000 testbed01-OST0001_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:106.0:1579118408.371698:0:9204:0:(import.c:528:import_select_connection()) testbed01-OST0001-osc-ffff9bcb5c37c000: connect to NID 10.0.50.12@tcp last attempt 766971 00000100:00080000:106.0:1579118408.371701:0:9204:0:(import.c:604:import_select_connection()) testbed01-OST0001-osc-ffff9bcb5c37c000: import ffff9bdb64a57000 using connection 10.0.50.12@tcp/10.0.50.12@tcp 00000100:02020000:18.0F:1579118408.372306:0:9207:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:02020000:11.0F:1579118408.372309:0:9206:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:00080000:18.0:1579118408.372311:0:9207:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:02020000:5.0F:1579118408.372311:0:9208:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:00080000:18.0:1579118408.372312:0:9207:0:(import.c:202:ptlrpc_set_import_discon()) osc: import ffff9bdb64a57000 already not connected (conn 78, was 77): CONNECTING 00000100:00080000:11.0:1579118408.372312:0:9206:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:02020000:12.0F:1579118408.372313:0:9205:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: 
operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:00080000:11.0:1579118408.372313:0:9206:0:(import.c:202:ptlrpc_set_import_discon()) osc: import ffff9bdb64a57000 already not connected (conn 78, was 77): CONNECTING 00000100:00080000:5.0:1579118408.372314:0:9208:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:00080000:5.0:1579118408.372315:0:9208:0:(import.c:202:ptlrpc_set_import_discon()) osc: import ffff9bdb64a57000 already not connected (conn 78, was 77): CONNECTING 00000100:00080000:12.0:1579118408.372316:0:9205:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:02020000:3.0:1579118408.372316:0:9210:0:(client.c:1279:ptlrpc_check_status()) 11-0: testbed01-OST0001-osc-ffff9bcb5c37c000: operation ost_write to node 10.0.50.12@tcp failed: rc = -107 00000100:00080000:12.0:1579118408.372317:0:9205:0:(import.c:202:ptlrpc_set_import_discon()) osc: import ffff9bdb64a57000 already not connected (conn 78, was 77): CONNECTING 00000100:00080000:3.0:1579118408.372318:0:9210:0:(recover.c:236:ptlrpc_request_handle_notconn()) import testbed01-OST0001-osc-ffff9bcb5c37c000 of testbed01-OST0001_UUID@10.0.50.12@tcp abruptly disconnected: reconnecting 00000100:00080000:3.0:1579118408.372332:0:9210:0:(import.c:202:ptlrpc_set_import_discon()) osc: import ffff9bdb64a57000 already not connected (conn 78, was 77): CONNECTING 00000100:00080000:53.0:1579118408.760387:0:9172:0:(import.c:1047:ptlrpc_connect_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: connect to target with instance 2 00000080:00080000:53.0:1579118408.760421:0:9172:0:(lcommon_misc.c:70:cl_init_ea_size()) updating def/max_easize: 72/96 00000100:00080000:53.0:1579118408.760425:0:9172:0:(import.c:1247:ptlrpc_connect_interpret()) @@@ testbed01-OST0001-osc-ffff9bcb5c37c000: evicting (reconnect/recover flags not set: 4) req@ffff9bc57bbbde80 x1655026382181952/t0(0) o8->testbed01-OST0001-osc-ffff9bcb5c37c000@10.0.50.12@tcp:28/4 lens 520/384 e 0 to 0 dl 1579118415 ref 1 fl Interpret:RN/0/0 rc 0/0 00000100:00080000:53.0:1579118408.760428:0:9172:0:(import.c:1250:ptlrpc_connect_interpret()) ffff9bdb64a57000 testbed01-OST0001_UUID: changing import state from CONNECTING to EVICTED 00000100:02020000:53.0:1579118408.760430:0:9172:0:(import.c:1483:ptlrpc_import_recovery_state_machine()) 167-0: testbed01-OST0001-osc-ffff9bcb5c37c000: This client was evicted by testbed01-OST0001; in progress operations using this service will fail. 00000100:00080000:53.0:1579118408.760435:0:9172:0:(import.c:1488:ptlrpc_import_recovery_state_machine()) evicted from testbed01-OST0001_UUID@10.0.50.12@tcp; invalidating 00000100:00080000:169.0:1579118408.760587:0:195679:0:(import.c:1429:ptlrpc_invalidate_import_thread()) thread invalidate import testbed01-OST0001-osc-ffff9bcb5c37c000 to testbed01-OST0001_UUID@10.0.50.12@tcp 00000100:00080000:169.0:1579118408.760592:0:195679:0:(import.c:214:ptlrpc_deactivate_and_unlock_import()) setting import testbed01-OST0001_UUID INVALID 00000008:00080000:141.0F:1579118408.760659:0:9250:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 
00000008:00080000:106.0:1579118408.760663:0:9187:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:40.0F:1579118408.760692:0:9251:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:21.0F:1579118408.760698:0:9195:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:12.0:1579118408.760701:0:9205:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2490, rc = -5. 00000008:00080000:18.0:1579118408.760702:0:9207:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:11.0:1579118408.760703:0:9206:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:5.0:1579118408.760704:0:9208:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:3.0:1579118408.760709:0:9210:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2490, rc = -5. 00000080:00000400:5.0:1579118408.760752:0:9208:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) testbed01: dirty page discard: 10.0.50.11@tcp:/testbed01/fid: [0x20000041c:0x8a:0x0]/ may get corrupted (rc -5) 00000080:00000400:12.0:1579118408.760760:0:9205:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) testbed01: dirty page discard: 10.0.50.11@tcp:/testbed01/fid: [0x20000041c:0x88:0x0]/ may get corrupted (rc -5) 00020000:01000000:169.0:1579118408.761035:0:195679:0:(lov_obd.c:413:lov_set_osc_active()) Marking OSC testbed01-OST0001_UUID inactive 00000080:00080000:169.0:1579118408.761040:0:195679:0:(lcommon_misc.c:70:cl_init_ea_size()) updating def/max_easize: 72/72 00000008:00080000:12.0:1579118408.761483:0:9188:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 00000008:00080000:10.0F:1579118408.761829:0:9204:0:(osc_request.c:2040:brw_interpret()) testbed01-OST0001-osc-ffff9bcb5c37c000: resend cross eviction for object: 0x0:2492, rc = -5. 
00000100:00080000:183.0F:1579118410.510893:0:195679:0:(import.c:1438:ptlrpc_invalidate_import_thread()) ffff9bdb64a57000 testbed01-OST0001_UUID: changing import state from EVICTED to RECOVER 00000100:00080000:183.0:1579118410.510902:0:195679:0:(import.c:1549:ptlrpc_import_recovery_state_machine()) ffff9bdb64a57000 testbed01-OST0001_UUID: changing import state from RECOVER to FULL 00020000:01000000:183.0:1579118410.510908:0:195679:0:(lov_obd.c:413:lov_set_osc_active()) Marking OSC testbed01-OST0001_UUID active 00000080:00080000:183.0:1579118410.510913:0:195679:0:(lcommon_misc.c:70:cl_init_ea_size()) updating def/max_easize: 72/96 00000100:02000000:183.0:1579118410.510916:0:195679:0:(import.c:1555:ptlrpc_import_recovery_state_machine()) testbed01-OST0001-osc-ffff9bcb5c37c000: Connection restored to 10.0.50.12@tcp (at 10.0.50.12@tcp) 00000400:00000100:0.0:1579118412.616427:0:9033:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.0.50.12@tcp portal 4 match 1655026382181328 offset 192 length 416: 4 00000400:00000100:0.0:1579118412.616441:0:9033:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.0.50.12@tcp portal 4 match 1655026382181264 offset 192 length 416: 4 00000400:00000100:0.0:1579118412.616449:0:9033:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.0.50.12@tcp portal 4 match 1655026382181248 offset 192 length 416: 4 00000400:00000100:0.0:1579118412.616456:0:9033:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.0.50.12@tcp portal 4 match 1655026382181312 offset 192 length 416: 4 00000400:00000100:0.0:1579118412.616463:0:9033:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.0.50.12@tcp portal 4 match 1655026382181600 offset 192 length 416: 4 00000100:00080000:52.0:1579118431.940570:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:52.0:1579118431.940573:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 00000100:00080000:52.0:1579118457.026057:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:52.0:1579118457.026059:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 00000100:00080000:52.0:1579118482.116613:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:52.0:1579118482.116615:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 00000100:00080000:52.0:1579118507.206958:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:52.0:1579118507.206964:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 
00000002:00080000:145.0:1579118511.205503:0:195714:0:(mdc_request.c:903:mdc_close()) @@@ matched open req@ffff9bd563fa8d80 x1655026382205120/t4294985495(4294985495) o101->testbed01-MDT0000-mdc-ffff9bcb5c37c000@10.0.50.11@tcp:12/10 lens 600/640 e 0 to 0 dl 1579118518 ref 2 fl Complete:RP/4/0 rc 301/301 00000002:00080000:145.0:1579118511.205756:0:195714:0:(mdc_request.c:903:mdc_close()) @@@ matched open req@ffff9bd563faf980 x1655026382205104/t4294985494(4294985494) o101->testbed01-MDT0000-mdc-ffff9bcb5c37c000@10.0.50.11@tcp:12/10 lens 568/560 e 0 to 0 dl 1579118518 ref 1 fl Complete:RP/4/ffffffff rc 0/-1 00000100:00080000:52.0:1579118532.288241:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:52.0:1579118532.288247:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 00000100:00080000:52.0:1579118557.376963:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 OSS: [1579118407] LustreError: 19423:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.50.20@tcp ns: filter-testbed01-OST0001_UUID lock: ffff88075b407740/0x401aaf4595d45ff7 lrc: 3/0,0 mode: PW/PW res: [0x9ba:0x0:0x0].0x0 rrc: 1592 type: EXT [466731663360->466765217791] (req 466731663360->466733760511) flags: 0x60000400020020 nid: 10.0.50.20@tcp remote: 0xd6b81379aed3c452 expref: 449 pid: 22131 timeout: 624 lvb_type: 0 [1579118407] LustreError: 19422:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880f78d1ec00 x1655825318823424/t0(0) o105->testbed01-OST0001@10.0.50.20@tcp:15/16 lens 360/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 [1579118408] Lustre: testbed01-OST0001: Connection restored to b2aa93cc-306c-345d-b999-acf8e974b6eb (at 10.0.50.20@tcp) [1579118409] perf: interrupt took too long (6302 > 5581), lowering kernel.perf_event_max_sample_rate to 31000 [1579118448] LustreError: 33156:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118458] LustreError: 33150:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118458] LustreError: 33150:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 2 previous similar messages [1579118470] Lustre: testbed01-OST0001: haven't heard from client 6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 (at 10.0.60.25@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8806dc8a6800, cur 1579118470 expire 1579118320 last 1579118243 [1579118470] LustreError: 33176:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118470] LustreError: 33176:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 1 previous similar message [1579118475] Lustre: testbed01-OST0000: haven't heard from client 6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 (at 10.0.60.25@o2ib) in 231 seconds. I think it's dead, and I am evicting it. 
exp ffff880fdb937c00, cur 1579118475 expire 1579118325 last 1579118244 [1579118483] LustreError: 33163:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118483] LustreError: 33163:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 3 previous similar messages [1579118492] LustreError: 33189:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118496] perf: interrupt took too long (8663 > 7877), lowering kernel.perf_event_max_sample_rate to 23000 [1579118501] LustreError: 38102:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118501] LustreError: 38102:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 53 previous similar messages [1579118517] LustreError: 38100:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118517] LustreError: 38100:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 92 previous similar messages [1579118551] LustreError: 33046:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118551] LustreError: 33046:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 175 previous similar messages [1579118615] LustreError: 19439:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118615] LustreError: 19439:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 380 previous similar messages [1579118653] perf: interrupt took too long (10997 > 10828), lowering kernel.perf_event_max_sample_rate to 18000 [1579118743] LustreError: 33138:0:(tgt_grant.c:750:tgt_grant_check()) testbed01-OST0001: cli b2aa93cc-306c-345d-b999-acf8e974b6eb claims 17432576 GRANT, real grant 0 [1579118743] LustreError: 33138:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 730 previous similar messages [1579118896] perf: interrupt took too long (13777 > 13746), lowering kernel.perf_event_max_sample_rate to 14000
OSS Lustre Debug: <<<<Attached as gzip file, too large to post as textblock** MDS: Quiet, no log entries at this time (var/log/messages, dmesg) MDS Lustre Debug: 00000400:00000100:11.0:1579118411.756361:0:7034:0:(lib-move.c:2772:lnet_finalize_expired_responses()) Response timed out: md = ffff8807abb6b760: nid = 10.0.50.13@tcp 00000100:00080000:29.0:1579118428.244006:0:50569:0:(pinger.c:610:ping_evictor_main()) evicting all exports of obd MGS older than 1579118278 00000100:02000400:29.0:1579118428.244010:0:50569:0:(pinger.c:636:ping_evictor_main()) MGS: haven't heard from client 9b8ec3a9-c7f9-4529-2076-d827aa8ab8f5 (at 10.0.60.25@o2ib) in 230 seconds. I think it's dead, and I am evicting it. exp ffff881039f6a000, cur 1579118428 expire 1579118278 last 1579118198 00000100:00080000:29.0:1579118428.244016:0:50569:0:(pinger.c:638:ping_evictor_main()) Last request was at 1579118198 00000020:00080000:29.0:1579118428.244018:0:50569:0:(genops.c:1664:class_fail_export()) disconnecting export ffff881039f6a000/9b8ec3a9-c7f9-4529-2076-d827aa8ab8f5 00000020:00000080:29.0:1579118428.244024:0:50569:0:(genops.c:1503:class_disconnect()) disconnect: cookie 0x3fae46a8ab5b698b 00000100:00080000:29.0:1579118428.244026:0:50569:0:(import.c:1777:ptlrpc_cleanup_imp()) ffff880f40f1b000 : changing import state from FULL to CLOSED 00000020:00080000:29.0:1579118428.244030:0:50569:0:(genops.c:1681:class_fail_export()) disconnected export ffff881039f6a000/9b8ec3a9-c7f9-4529-2076-d827aa8ab8f5 00000020:00000080:6.0:1579118428.244061:0:50559:0:(genops.c:1024:class_export_put()) final put ffff881039f6a000/9b8ec3a9-c7f9-4529-2076-d827aa8ab8f5 00000020:00000080:6.0:1579118428.244067:0:40:0:(genops.c:974:class_export_destroy()) destroying export ffff881039f6a000/9b8ec3a9-c7f9-4529-2076-d827aa8ab8f5 for MGS 00000020:00000080:6.0:1579118428.244073:0:40:0:(genops.c:1224:obd_zombie_import_free()) destroying import ffff880f40f1b000 for MGS 00000100:00080000:29.0:1579118451.219628:0:50569:0:(pinger.c:610:ping_evictor_main()) evicting all exports of obd testbed01-MDT0000 older than 1579118301 00000100:02000400:29.0:1579118451.219632:0:50569:0:(pinger.c:636:ping_evictor_main()) testbed01-MDT0000: haven't heard from client 6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 (at 10.0.60.25@o2ib) in 236 seconds. I think it's dead, and I am evicting it. 
exp ffff881039f6c000, cur 1579118451 expire 1579118301 last 1579118215 00000100:00080000:29.0:1579118451.219638:0:50569:0:(pinger.c:638:ping_evictor_main()) Last request was at 1579118215 00000020:00080000:29.0:1579118451.219639:0:50569:0:(genops.c:1664:class_fail_export()) disconnecting export ffff881039f6c000/6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 00000020:00000080:29.0:1579118451.219645:0:50569:0:(genops.c:1503:class_disconnect()) disconnect: cookie 0x3fae46a8ab5b69ae 00000100:00080000:29.0:1579118451.219647:0:50569:0:(import.c:1777:ptlrpc_cleanup_imp()) ffff880f40f1b800Vr=��Vr=���: changing import state from FULL to CLOSED 00000020:00080000:29.0:1579118451.219731:0:50569:0:(genops.c:1681:class_fail_export()) disconnected export ffff881039f6c000/6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 00000020:00000080:5.0:1579118455.562245:0:7389:0:(genops.c:1024:class_export_put()) final put ffff881039f6c000/6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 00000020:00000080:5.0:1579118455.562252:0:454:0:(genops.c:974:class_export_destroy()) destroying export ffff881039f6c000/6c8b6b2d-e3f6-8cd5-55bf-e9773802c513 for testbed01-MDT0000 00000020:00000080:5.0:1579118455.562262:0:454:0:(genops.c:1224:obd_zombie_import_free()) destroying import ffff880f40f1b800 for testbed01-MDT0000 00000004:00080000:8.0:1579118482.417488:0:54868:0:(mdt_identity.c:109:mdt_identity_do_upcall()) testbed01-MDT0000: invoked upcall /usr/sbin/l_getidentity testbed01-MDT0000 1000, time 50us 00000040:00080000:8.0:1579118482.418975:0:54868:0:(llog_osd.c:709:llog_osd_write_rec()) added record [0x1:0x3:0x0].2448, 64 off164864 00000004:00080000:8.0:1579118482.418978:0:54868:0:(osp_sync.c:1584:osp_sync_add_commit_cb()) testbed01-OST0000-osc-MDT0000: add commit cb at 5611876436689024ns, next at 5611877436688303ns, rc = 0 00000040:00080000:8.0:1579118482.418992:0:54868:0:(llog_osd.c:709:llog_osd_write_rec()) added record [0x1:0x6:0x0].2447, 64 off164800 00000004:00080000:8.0:1579118482.418993:0:54868:0:(osp_sync.c:1584:osp_sync_add_commit_cb()) testbed01-OST0001-osc-MDT0000: add commit cb at 5611876436704698ns, next at 5611877436704471ns, rc = 0 00000004:00080000:8.0:1579118482.423377:0:54868:0:(osp_internal.h:646:osp_update_last_fid()) Gap in objids: start=[0x100000000:0x9c4:0x0], count =62 00000004:00080000:8.0:1579118482.423380:0:54868:0:(osp_object.c:1575:osp_create()) Writing gap [0x100000000:0x9c4:0x0]+62 in llog 00000004:00080000:8.0:1579118482.423387:0:54868:0:(osp_object.c:1592:osp_create()) testbed01-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0xa02:0x0], index 0: 0 00000004:00080000:8.0:1579118482.423388:0:54868:0:(osp_internal.h:646:osp_update_last_fid()) Gap in objids: start=[0x100010000:0x9c5:0x0], count =93 00000004:00080000:8.0:1579118482.423389:0:54868:0:(osp_object.c:1575:osp_create()) Writing gap [0x100010000:0x9c5:0x0]+93 in llog 00000004:00080000:8.0:1579118482.423391:0:54868:0:(osp_object.c:1592:osp_create()) testbed01-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0xa22:0x0], index 1: 0 00000004:00080000:23.0:1579118483.645314:0:52539:0:(osp_sync.c:1584:osp_sync_add_commit_cb()) testbed01-OST0000-osc-MDT0000: add commit cb at 5611877663025258ns, next at 5611878663024623ns, rc = 0 00000004:00080000:20.0:1579118483.645339:0:52823:0:(osp_sync.c:1584:osp_sync_add_commit_cb()) testbed01-OST0001-osc-MDT0000: add commit cb at 5611877663050318ns, next at 5611878663049961ns, rc = 0 00000001:00080000:5.0:1579118485.562540:0:7389:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-MDT0000: transno 4294985497 is committed 
00000004:00080000:5.0:1579118485.562545:0:7389:0:(osp_sync.c:1549:osp_sync_local_commit_cb()) testbed01-OST0001-osc-MDT0000: 750 committed 00000004:00080000:5.0:1579118485.562548:0:7389:0:(osp_sync.c:1549:osp_sync_local_commit_cb()) testbed01-OST0000-osc-MDT0000: 751 committed 00000004:00080000:5.0:1579118485.562551:0:7389:0:(osp_sync.c:1549:osp_sync_local_commit_cb()) testbed01-OST0001-osc-MDT0000: 751 committed 00000004:00080000:6.0:1579118485.562552:0:52822:0:(osp_sync.c:1135:osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight 00000004:00080000:5.0:1579118485.562553:0:7389:0:(osp_sync.c:1549:osp_sync_local_commit_cb()) testbed01-OST0000-osc-MDT0000: 752 committed 00000004:00080000:52.0:1579118485.562555:0:52538:0:(osp_sync.c:1135:osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight 00000004:00080000:6.0:1579118485.562573:0:52822:0:(osp_sync.c:985:osp_sync_process_record()) found record 10692404, 64, idx 2447, id 749 00000004:00080000:52.0:1579118485.562580:0:52538:0:(osp_sync.c:985:osp_sync_process_record()) found record 10692404, 64, idx 2448, id 750 00000004:00080000:23.0:1579118486.053455:0:7131:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff88080b383f00/2, rc 0, transno 24306 00000004:00080000:3.0:1579118488.285849:0:7126:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff880858309200/2, rc 0, transno 19847 00000004:00080000:18.0:1579118498.669487:0:7138:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff880858309200, transno 8589954439 00000004:00080000:19.0:1579118498.669492:0:7139:0:(osp_sync.c:510:osp_sync_request_commit_cb()) commit req ffff88080b383f00, transno 8589958898 00000400:00000100:12.0:1579118536.756346:0:7034:0:(lib-move.c:2772:lnet_finalize_expired_responses()) Response timed out: md = ffff8807abb6b760: nid = 10.0.50.13@tcp 00000400:00000100:12.0:1579118836.756362:0:7034:0:(lib-move.c:2772:lnet_finalize_expired_responses()) Response timed out: md = ffff8807f9f65320: nid = 10.0.50.13@tcp 00000400:00000100:12.0:1579118961.756334:0:7034:0:(lib-move.c:2772:lnet_finalize_expired_responses()) Response timed out: md = ffff8807f9f65320: nid = 10.0.50.13@tcp | ||||||||||||||||||||||||||||||||||||
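For lining up the syslog timestamps above with the epoch values in the Lustre debug logs, date(1) converts in both directions; a small sketch (UTC shown for illustration, the testbed's local timezone may differ):

$ date -u -d @1579118408                  # epoch value seen in the debug log above
Wed Jan 15 20:00:08 UTC 2020
$ date -u -d "Jan 15 2020 20:00:08" +%s   # and back again
1579118408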
| Comment by Andreas Dilger [ 16/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
There are two interesting things in the client and OSS logs here (the MDS logs show nothing of interest):
I haven't traced the whole lock request/callback/idle disconnect/reconnect sequence through the logs yet (not sure how far back the client and server Lustre debug logs go, or whether they would still contain the original request from 100s earlier), but it at least seems like this is related? I've previously asked to set llite.*.idle_timeout=0 on the clients to eliminate idle disconnect/reconnect as a source of problems, and in theory that has been eliminated as the root cause, but once again it is showing up here as a potential trigger for the problem (the "connect idle client just before the error was hit" situation was also seen in other logs). Jeff, if you can reproduce this with some specific frequency (though I understand it is not easy; we haven't been able to hit it at all), could you please run lctl set_param -P llite.*.idle_timeout=0 on the MGS, wait a few seconds, and verify on the clients that lctl get_param llite.*.idle_timeout returns 0 before trying to reproduce again. My speculation is that (somehow) the idle disconnect is leaving the client with some stale lock handles on the server import (or similar, maybe after recovery?) that eventually get an LDLM_CP_CALLBACK=105 from the OSS, but the client doesn't even get the RPC since it is disconnected? Possibly the OSS/DLM needs to consider that, if the client has idle-disconnected, the client doesn't have any outstanding locks on its side? The GRANT messages appear to be a byproduct of the client eviction (40s earlier), and while potentially annoying I don't think they are in any way causing this problem to be hit. It is the eviction itself that is causing the short write/EIO to be seen on the client. | ||||||||||||||||||||||||||||||||||||
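A sketch of the requested sequence, using the parameters named above (set_param runs on the MGS, verification on each client):

mgs# lctl set_param -P llite.*.idle_timeout=0
# wait a few seconds, then on each client:
client# lctl get_param llite.*.idle_timeout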
| Comment by Andreas Dilger [ 16/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oh, nm. I now see that the idle connection message in the client debug log was for OST0000, not OST0001 where the client was evicted. That likely eliminates the theory that "idle disconnect" is the root of the problem. This will need further digging into the client and OSS debug logs to see whether the client got the original LDLM_CP_CALLBACK message or not, and/or why it didn't reply in a timely manner. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 16/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
I added the full Lustre debug logs for the OSS and the client that hit the triggered error. Client: palpatine20-lustredbg.txt.gz OSS: testbed01-oss00-lustredbg-FULL.txt.gz
| ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 17/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Hi Jeff, thanks for the debug logs. I looked through the logs for the relevant information related to this client, but unfortunately it looks like they are captured at the default debug level and do not contain the rpctrace and dlmtrace log messages needed to trace the cause of the client eviction. The server evicted the client at 1579118407.432055 after 99s waiting for the lock callback, and the client noticed this at 1579118408.371668, but looking in the client logs 99s earlier there is nothing of interest for OST0001: 1579118281.410886:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 1579118306.501460:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 1579118306.501463:0:151405:0:(pinger.c:261:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: not pinging (in recovery or recovery disabled: IDLE) 1579118331.592129:0:151405:0:(pinger.c:248:ptlrpc_pinger_process_import()) b2aa93cc-306c-345d-b999-acf8e974b6eb->testbed01-OST0000_UUID: level IDLE/11 force 1 force_next 0 deactive 0 pingable 1 suppress 0 Similarly, 99s earlier in the server logs there is just business as usual, committing transactions in the background, but no details about what is actually being sent over the network: 1579118306.736148:0:86248:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0001: transno 8589942253 is committed 1579118306.736154:0:86248:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0001: transno 8589942255 is committed 1579118306.736155:0:83616:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0001: transno 8589942256 is committed 1579118308.826828:0:82903:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0000: transno 8589946839 is committed 1579118308.826829:0:79772:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0000: transno 8589946814 is committed 1579118308.826841:0:19991:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) testbed01-OST0000: transno 8589946837 is committed We are also trying to reproduce this locally, but any chance you could try to reproduce this again there with "lctl set_param debug='+rpctrace +dlmtrace'" on both the client and server, and increasing the size of the debug buffers to hold enough information to debug it, like "lctl set_param debug_mb=32768" or similar (32GB, assuming the client and server have enough RAM)? | ||||||||||||||||||||||||||||||||||||
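A sketch of the requested capture settings, applied on both the clients and the servers before reproducing (the dump path is only an example; lctl clear resets the buffer between attempts):

# lctl set_param debug='+rpctrace +dlmtrace'
# lctl set_param debug_mb=32768                     # ~32GB buffer, per the note above
# ...reproduce the problem...
# lctl dk /tmp/lustre-debug.$(hostname).log         # dump the kernel debug buffer to a file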
| Comment by Jeff Johnson [ 17/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
I'm on it. Don't know how long it will take to reproduce, hopefully I'll get lucky. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Managed to capture another occurrence, this time on three clients at roughly the same time. Due to the size of the Lustre debug files I cut everything down to a ten-second window; if more is needed, such as data from before that window, I can generate more files to upload. In the uploaded files there is a lustrelog_X file and a messages_X file for servers mds00, oss00 and oss01, and for clients r4i5n12, r4i6n3 and r4i6n5. The first occurrence of the partial writes is at:
r4i5n12: walltime: Jan 29 20:48:59 / epoch: 1580330939
r4i6n3: walltime: Jan 29 20:50:02 / epoch: 1580331122
r4i6n5: walltime: Jan 29 20:48:49 / epoch: 1580330929
rpctrace and dlmtrace are enabled in the Lustre logging.
All of the output and log files cover a ten-second window, epoch 1580330930-1580330939:
mds00: lustrelog_mds00_158033093x.txt.gz
oss00: lustrelog_oss00_158033093x.txt.gz
oss01: lustrelog_oss01_158033093x_A.txt.gz
client-r4i5n12: lustrelog_r4i5n12_158033093x.txt.gz
client-r4i6n3: lustrelog_r4i6n3_158033093x.txt.gz
client-r4i6n5: lustrelog_r4i6n5_158033093x.txt.gz
| ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Jeff, I haven't had a chance to look at the logs yet, but if it is the same as the previous failures, I will need the OSS logs from 100s before the clients are evicted (counting back from the time the eviction message is printed on the OSS console). | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
They're quite large. What's the max upload here? I can place them on a public share and add the link here. | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
That would work - or else you could use the Whamcloud write-only FTP site if you want to ensure that it is only Whamcloud people who get access to them (I can post details) | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Nothing confidential. Post or email me the FTP details; that just sounds easier. | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your JIRA issue as its name. In the following example, there are three debug logs in a single directory and the JIRA issue is LU-4242:
$ ls -lh
total 333M
-rw-r--r-- 1 mjmac mjmac 98M Feb 23 17:36 mds-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
$ ncftp ftp.whamcloud.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.whamcloud.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir LU-4242
ncftp /uploads > cd LU-4242
Directory successfully changed.
ncftp /uploads/LU-4242 > put *
mds-debug: 97.66 MB 11.22 MB/s
oss-00-debug: 117.19 MB 11.16 MB/s
oss-01-debug: 117.48 MB 11.18 MB/s
ncftp /uploads/LU-4242 >
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than Whamcloud staff) be able to see or read them. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 30/Jan/20 ] | ||||||||||||||||||||||||||||||||||||
|
Uploaded to WC FTP per instructions. | ||||||||||||||||||||||||||||||||||||
| Comment by Zhenyu Xu [ 02/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
I observed one eviction case between client r4i5n12 and oss01 (OST0006); note that the client's clock runs about 14 seconds behind oss01's, so the timestamps below are labelled with the node they were taken from:
[client] 1580330836.689047 ldlm_cli_enqueue() for EXT lock from OST0006
[oss01] 1580330850.7572429 ldlm_handle_enqueue0() got the lock request, granted it, and sent the reply to the client
[client] 1580330836.689210 ldlm_cli_enqueue_fini() got the grant from OST0006
...
[oss01] 1580330853.372794 ldlm_server_blocking_ast() someone else requested a lock conflicting with this one, sending blocking AST
[oss01] 1580330853.372856 ldlm_add_waiting_lock() adding to waiting list (timeout: 100)
[client] 1580330839.304821 ldlm_handle_bl_callback() client blocking AST callback handler, got the blocking request
[client] 1580330839.304828 ldlm_cli_cancel_local() client-side cancel
[oss01] 1580330952.665030 expired_lock_main() blocking AST timed out, evicting the client
[client] 1580330938.660675 ptlrpc_connect_interpret() got the evict event
[client] 1580330938.672393 ldlm_cli_cancel_req() skipping cancel on invalid import
After oss01 sent the blocking AST RPC, the client had not finished handling it (the lock cancel was done locally, but the reply was not sent to oss01 immediately) before oss01 timed out waiting for the blocking AST reply and evicted the client. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 02/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Bobijam, is there any record of the OSS sending the blocking AST callback to the client for that lock handle in either the client or server logs? | ||||||||||||||||||||||||||||||||||||
| Comment by Zhenyu Xu [ 03/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Andreas, yes. The client log records appear in r4i5n12.log, and the OSS records in oss01_20200129_2046_03.log and oss01_20200129_2046_04.log; the oss01 entries are shown first below, followed by the client entries:
1580330853.372856:0:46915:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-lustre-OST0006_UUID lock: ffff88e5c35298c0/0xdc8659a15ab03659 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 4681 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x70000400000020 nid: 10.148.0.85@o2ib remote: 0x507863eddeba3639 expref: 1529 pid: 15923 timeout: 2328723 lvb_type: 0
1580330853.372863:0:46915:0:(client.c:1631:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost01_128:lustre-OST0006_UUID:46915:1654660910834016:10.148.0.85@o2ib:104
1580330853.372882:0:46915:0:(client.c:2368:ptlrpc_set_wait()) set ffff88ebf6072c80 going to sleep for 6seconds
1580330839.304799:0:26811:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb01_007:LOV_OSC_UUID+4:46915:x1654660910834016:12345-10.148.1.253@o2ib:104
1580330839.304816:0:26811:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb01_007:LOV_OSC_UUID+4:46915:x1654660910834016:12345-10.148.1.253@o2ib:104 Request processed in 17us (44us total) trans 0 rc 0/0
1580330839.304820:0:26811:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.148.1.253@o2ib, seq: 436262
1580330839.304821:0:9524:0:(ldlm_lockd.c:1727:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 3/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 254 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x420000000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1
The problem seems to be in the client, after receiving the blocking request from OSS, it cancel the lock locally, and after that the records show that other ldlm_cb threads starting handle other blocking requests. The one thing I noticed is that before the client got evicted due to the designated lock, the lock has been matched/used for IO, that could possibly explained why the lock cannot be cancelled right away. 1580330938.669799:0:12730:0:(osc_cache.c:1936:try_to_add_extent_for_io()) ### extent: ffff9b6975cd1550 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 1580330938.669806:0:12730:0:(osc_cache.c:1114:osc_extent_make_ready()) ### extent: ffff9b71a7bebe40 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b83d4e64d80/0x507863eddeb90550 lrc: 2/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [76764151808->76766248959] (req 76764151808->76765200383) flags: 0x20000020000 nid: local remote: 0xdc8659a15a97aef0 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 .. 1580330938.670574:0:12719:0:(osc_cache.c:1936:try_to_add_extent_for_io()) ### extent: ffff9b6975cd1550 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 1580330938.670596:0:12719:0:(osc_cache.c:1114:osc_extent_make_ready()) ### extent: ffff9b6975cd1550 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 1580330938.670847:0:12719:0:(ldlm_lock.c:1468:ldlm_lock_match_with_skip()) ### matched (77265371136 77265375231) ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 5/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 ... 1580330938.671254:0:12727:0:(osc_cache.c:866:osc_extent_finish()) ### extent: ffff9b6975cd1550 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 258 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 .. 1580330938.671514:0:12727:0:(osc_cache.c:488:__osc_extent_remove()) ### extent: ffff9b6975cd1550 ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 4/0,0 mode: PW/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x429400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 .. .. 
1580330938.672393:0:9524:0:(ldlm_request.c:1352:ldlm_cli_cancel_req()) skipping cancel on invalid import ffff9b897d3f5800 1580330938.672395:0:9524:0:(ldlm_lockd.c:1767:ldlm_handle_bl_callback()) ### client blocking callbackhandler END ns: lustre-OST0006-osc-ffff9b7a29056800 lock: ffff9b76f8fb33c0/0x507863eddeba3639 lrc: 1/0,0 mode: --/PW res: [0x78b:0x0:0x0].0x0 rrc: 257 type: EXT [77265371136->77270614015] (req 77265371136->77266419711) flags: 0x4c69400000000 nid: local remote: 0xdc8659a15ab03659 expref: -99 pid: 9573 timeout: 0 lvb_type: 1 | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 03/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Bobijam, thanks for finding this information. This makes sense given the symptoms seen, but now we need to find why the client was evicted. When a client gets a blocking AST callback, it still needs to flush all of the dirty pages under that lock before the lock can be cancelled, so that may prevent the client from cancelling the lock immediately. However, the client does need to reply to the OST for the blocking AST so that the OSS knows the AST was received. If not, the OSS should try to resend the AST at least once in case it was lost. After that, the client can continue to submit write RPCs under that lock handle to prolong the lock timeout on the server, since the server knows it might have a lot of data to flush. When the server gets the BRW RPCs with that lock handle embedded, they should be processed by the high priority RPC queue, and the OSS should update the lock timer so that the client is not evicted as long as it continues to send more writes. However, the client should not continue to add more dirty pages to the lock, since that is starving other clients from getting the lock. If things are working properly, the logs should show BRW RPCs being generated on the client for that OST object almost immediately, and the OSS processing them and prolonging the DLM lock timeout. Once all of the dirty pages under that lock are flushed, then the client should send an LDLM_CANCEL RPC to the OSS. One of several things could be happening:
To track this down, it makes sense to find all of the log lines that mention the lock handle on the client, 0x507863eddeba3639, and/or on the server, 0xdc8659a15ab03659 (these both refer to the same lock). Looking at the logs shown above, it seems like things are mostly working OK. The client has the right flags set on the lock, 0x429400000000:
#define LDLM_FL_CBPENDING 0x0000000400000000ULL
#define LDLM_FL_CANCEL 0x0000001000000000ULL
#define LDLM_FL_CANCELING 0x0000008000000000ULL
#define LDLM_FL_LVB_READY 0x0000020000000000ULL
#define LDLM_FL_BL_AST 0x0000400000000000ULL
It isn't clear to me whether try_to_add_extent_for_io() and osc_extent_make_ready() are for new IO, or whether this is flushing existing dirty pages. This is happening 100s after the blocking callback was sent, so it is surprising that the client wouldn't have finished flushing 1-5MB of dirty pages yet and cancelled the lock. The CANCELING flag is set, which means the client is at least trying to cancel it. Did that RPC not get sent (maybe it is blocked on the client), or maybe the cancel was piggy-backed into another RPC as an ELC request, but that RPC is not high priority and is now stuck on the OSS? | ||||||||||||||||||||||||||||||||||||
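A sketch of that search over the logs referenced earlier in this ticket (both handles refer to the same lock, as noted above):

$ grep -E '0x507863eddeba3639|0xdc8659a15ab03659' r4i5n12.log
$ grep -E '0x507863eddeba3639|0xdc8659a15ab03659' oss01_20200129_2046_03.log oss01_20200129_2046_04.log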
| Comment by Zhenyu Xu [ 03/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Attached client.txt and oss.txt files, which are the result of grepping for the designated lock handle in the client and oss01 logs. The lock was added to the LRU list before the blocking AST arrived; after ldlm_cli_cancel_local() I don't see the reply either on the client side or on the OSS side. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 03/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Additional information:
There are no unique features that would explain why a specific client encounters the issue more frequently (same config, kernel, client, slurm job parameters, etc). This may be coincidental but seems worth pointing out what has been observed. (edited to flip the assertions to client-sticky as opposed to server-sticky) | ||||||||||||||||||||||||||||||||||||
| Comment by Zhenyu Xu [ 04/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
After checking the similar ldlm_handle_bl_callback messages in the HP-259 logs: after ldlm_cli_cancel_local(), the client should flush the dirty data, but the time gap between the local cancel and try_to_add_extent_for_io() is 99 seconds, while the reproducer3-2020-01-31 client log in HP-259 shows that the client took at most around 10 seconds to start the flush, so the lock blocking AST handling could be finished and replied to the OSS before the timeout set by the OSS. My speculation is that the client is busy sending RPCs, so the flush requests cannot be sent to the OSS in a timely manner; I think increasing osc.*.max_rpcs_in_flight would decrease the chance of this error. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 04/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
We've seen the errors with max_rpcs_in_flight at 8 and at 32. Question: Is there a value you'd like to see run, or that you think has a high probability of not triggering the issue? There are LNet routers in the client->server path of some instances of the event (o2ib client / socklnd server). The issue is also seen on single-LNet configurations, all socklnd or all o2ibd. Question: Are the failed flush requests the partial writes we're seeing? Or is there more to this issue where the partial writes are another part of a larger or different issue?
| ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
I don't think that increasing osc.*.max_rpcs_in_flight can be a permanent solution to the problem, since more RPCs in flight per OST may not speed up the IO processing on the OSS (e.g. other threads on the same client, and threads on other clients will also have lots of RPCs in flight, so there is going to be contention for processing all of the RPCs on the OSS). As long as the client is sending any BRW RPCs with the contended DLM lock handle (and they are being processed by the OST) it should be enough to refresh the lock timer on the OSS, so the client shouldn't be evicted. One possibility that we might consider is including the lock cancellation in the ELC buffer as part of the last BRW RPC? That would reduce the RPC count, and ensure that the lock cancel is processed as quickly as possible, but would need some special handling on the OSS to avoid canceling the lock before the BRWs are processed, maybe a refcount on the lock on the OSS for each BRW in the pending queue, in case the BRW RPCs are processed out of order? | ||||||||||||||||||||||||||||||||||||
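To make the refcount idea concrete, a hypothetical sketch of the scheme being floated (illustrative names only, not existing Lustre code) could look like the following: the OSS takes one reference per BRW queued under the lock, and an ELC cancel carried by the last BRW only fires once every queued BRW has completed, even if the BRWs are processed out of order.

#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative server-side state; real code would hang this off the DLM lock
 * and serialize the final-put vs. cancel-request race, which is elided here. */
struct srv_lock {
	atomic_int  pending_brws;	/* BRWs queued under this lock */
	atomic_bool cancel_wanted;	/* client requested cancel via ELC */
};

void do_cancel(struct srv_lock *lk);	/* assumed cancel routine, defined elsewhere */

void brw_queued(struct srv_lock *lk)
{
	atomic_fetch_add(&lk->pending_brws, 1);
}

void brw_completed(struct srv_lock *lk)
{
	/* The last BRW to complete performs the deferred cancel, if requested. */
	if (atomic_fetch_sub(&lk->pending_brws, 1) == 1 &&
	    atomic_load(&lk->cancel_wanted))
		do_cancel(lk);
}

void elc_cancel_received(struct srv_lock *lk)
{
	atomic_store(&lk->cancel_wanted, true);
	if (atomic_load(&lk->pending_brws) == 0)
		do_cancel(lk);
}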
| Comment by Oleg Drokin [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
So I checked the logs and I came to a similar conclusion as Bobijam: we have a bunch of dirty extents, a cancel request comes in and we try to shuffle it out, but there is a lot of other cancel activity ongoing, so we sit and wait for our turn. What does lctl get_param osc.*.max_dirty_mb show? So if there's a problem with a "logjam" on the server side where our IO RPCs are stuck behind other IO RPCs, then even 32 might not be enough. I know it's sort of hard, but setting max_rpcs_in_flight to something like 1024 to ensure there's never any client-side waiting for any sending might be the ultimate test here. (Of course normally it's not a great idea to go that high because you can overwhelm the server side, but in this case, as we suspect there's some sort of struggle getting RPCs out, it might be an interesting experiment.) | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
So, my comment is the same as previously - if we set max_rpcs_in_flight=1024 that would get more RPCs from this client into the server queue, but also more from every other client, and not get us further ahead of the problem (disk bandwidth is not able to keep up with short-term demand). It may actually make things worse: if the queues are very deep with "normal" write RPCs, then we have a harder time getting high-priority writes processed quickly. Are the LDLM_CANCEL RPCs being blocked? That definitely shouldn't be subject to max_rpcs_in_flight, if that is currently the case. Also, is there something blocking the LDLM_CANCEL RPCs from being processed on the server? I believe I'd suggested increasing the ldlm_canceld threads on the OSS, but if e.g. they get blocked on disk IO for some reason (e.g. updating the LVB?) then that would also be a problem. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
FWIW, I've seen this occur on bonded 40GbE socklnd (not multi-rail) and InfiniBand (o2ibd). I wondered if lengthy Ethernet interrupt servicing might have been a factor, but seeing it on IB rules that out. Would setting higher values for ptlrpc:ldlm_num_threads help? Or does that dynamically increase on demand? | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Jeff, the thread count is dynamic, but the current max is only 32. I think increasing this on the server might reduce the logjam for the cancel RPCs, though it wasn't a total solution on the other test system. The real issue is that if we change parameters, there has to be some way to verify that this reduces the frequency of the problem, and it wasn't clear to me how consistently you can reproduce it. If it is easy/100% then we can experiment, but if it is only hit occasionally then it will be hard (or very lengthy) to know if the change fixes the problem. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
I think I can iterate on this somewhat. The prevailing question seems to be.....is this being considered a bug or more a matter of application tuning? Throughout this thread it appears that Lustre is doing what it is supposed to, but timeouts or misses are occurring. Also, do the missed cancel and failed flush result in what the reproducer sees as partial writes? I'm trying to tie this back to the application-side perception of the condition. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
IMHO, this is clearly in the realm of Lustre not doing what it is supposed to. While it can't make the storage run faster for this workload, it should accept the IO at the speed the storage runs and allow it to eventually finish. When the lock times out, the (seemingly unresponsive, from the OSS point of view) client is evicted and this results in pending IO from the client being errored out. That results in the partial writes seen by the application. | ||||||||||||||||||||||||||||||||||||
| Comment by Alex Zhuravlev [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
There is logic to prolong locks on the OST; perhaps we should extend that to the clients? | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
adilger I think you are missing that the server-side lock prolong logic only works when we have IO RPCs in the server queue. When the client has not sent them yet, the prolong logic does not work. bzzz But how do you let the server know the client wants to prolong a lock? Another RPC to get stuck in some queue? | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 06/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Currently the testbed is: osc.*.max_dirty_mb = 2000 osc.*.max_rpcs_in_flight = 8 Retesting with: osc.*.max_rpcs_in_flight = 256 (it rejects attempts to set to 1024) | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
hm, indeed it looks like max rpcs in flight has a hard limit at 512. Any news from the testing? | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
So far no failures. I'll post to ftp and add links with logs and debug outputs when it's triggered. lctl errors on any attempt to set osc.*.max_rpcs_in_flight higher than 256; 256 is the highest allowed value. It fails as well when writing directly to sysfs:
# echo 512 > /sys/fs/lustre/osc/testbed1-OST0000-osc-ffff98b1d3b98800/max_rpcs_in_flight
-bash: echo: write error: Numerical result out of range | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37468 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
this is really strange: lustre/include/obd.h:#define OBD_MAX_RIF_MAX 512
int obd_set_max_rpcs_in_flight(struct client_obd *cli, __u32 max)
{
	struct obd_request_slot_waiter *orsw;
	__u32 old;
	int diff;
	int i;
	char *typ_name;
	int rc;

	if (max > OBD_MAX_RIF_MAX || max < 1)
		return -ERANGE;
Anyway, I just added a patch that I think should help with the problem you are observing, by making sure all IO resulting from lock blocking is submitted at all times (but we need to wait until sanity testing passes before I can recommend you give it a try). | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg, is the https://review.whamcloud.com/37468 patch client side only or applied to both client and server?
| ||||||||||||||||||||||||||||||||||||
| Comment by Zhenyu Xu [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
It's a client-only patch. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 07/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37479 | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 11/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Testing the patched 2.12 client. The reproducer lasts longer, but the partial write still occurs, as does client eviction. Once a client experiences the error, subsequent errors occur more frequently on that specific client. Will be posting debug logs from client and OSS soon.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 11/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
how about the max_rpcs_in_flight at 256, did that help? | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 11/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
btw can you please produce full server-client logs with the patch in place? In theory it should be helping with this problem, so I wonder if there's a gap somewhere such that not all RPCs are sent as expected. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 11/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
@Oleg max_rpcs_in_flight=256 without patch did not help. max_rpcs_in_flight=256 with patch made the failures look different but still ended up with partial writes.
Trying to trigger again and capture in logs on both sides. Missed it the first time. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 13/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Partial write errors on multiple clients. This run used the patched client with patch (7f7545a). Traces were taken on the clients, MDS and OSS nodes. Logs uploaded to ftp.whamcloud.com/uploads in a new subdir under this LU's directory (1.97GB compressed). This run was performed with: server-side Lustre 2.12.2 / ZFS 0.7.13, max_rpcs_in_flight=256, ldlm.services.*.threads_max=2048, and 228 clients active on two OSS nodes, four OSTs each.
You'll see in the logs that the number of evictions has decreased. However, there are still significant occurrences of tgt_grant mismatches on the OSS side, and the partial writes are still seen by the client. It appears that the grants are not properly handled or reconciled during or after an eviction.....and as I mentioned....the partial writes are still seen.
| ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 17/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Additional error traces and detailed config data (lnetctl, lctl, procfs, sysfs) uploaded to WC ftp site. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 18/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
ok, so this time the trace changed significantly. The timeline is like this: In r4i5n6.log.xz: This is why I typically prefer logs to be gathered at -1 level when possible - so that all the data is there should it be needed, even though the logs do grow quite a bit. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 18/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Reproducer running with -1 logging, started this morning. Will post when logs are captured and uploaded. | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 19/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
aeonjeff could you please share details of the configuration you are using to reproduce this issue? # of MDS/MDT/OSS/OST/clients? Thanks! | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 19/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
1MDS, 1MDT, 2OSS, 8OST (4 per OSS), 228 clients. | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 19/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
> 1MDS, 1MDT, 2OSS, 8OST (4 per OSS), 228 clients. Thanks Jeff! | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 19/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Starting FTP upload of -1 logs taken yesterday as Oleg asked. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 20/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37479/ | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 21/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
something is very strange with your logs. the -1 logs from the clients are only 33M big and don't actually have the eviction event in them. Were you not using debug_daemon when collecting? | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 21/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg, The client logs are -1 level Lustre debug logs. Each one was uploaded as a xz compressed text log file at maximum compression. 00010000:00000010:31.1F:1582070557.810848:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b040af6a400. 00010000:00000010:30.1F:1582070558.032848:0:21371:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04f90d2ac0. 00010000:00000010:30.1:1582070560.341846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14bcef5b00. 00010000:00000010:30.1:1582070562.723847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04f90d0d80. 00010000:00000010:30.1:1582070565.329847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14b267cb40. 00010000:00000010:30.1:1582070567.295847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04f90d21c0. 00010000:00000010:31.1:1582070569.019846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03cdb13840. 00010000:00000010:30.1:1582070569.459849:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0434b7c000. 00010000:00000010:31.1:1582070570.638847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14b67e0d80. 00010000:00000010:31.1:1582070570.876852:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14bcc82640. 00010000:00000010:31.1:1582070571.239848:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14b67e0480. 00010000:00000010:30.1:1582070571.245848:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b040af6a880. 00010000:00000010:30.1:1582070572.252845:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03cdb121c0. 00010000:00000010:30.1:1582070573.248846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14d9a0c6c0. 00010000:00000010:31.1:1582070573.769848:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14d9a0dd40. 00010000:00000010:31.1:1582070574.078851:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8af4f507a640. 00010000:00000010:30.1:1582070574.962851:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b032eb2ef40. 00010000:00000010:31.1:1582070575.691847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b13d8f77a80. 00010000:00000010:31.1:1582070576.338847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03f9b6d680. 00010000:00000010:30.1:1582070577.739861:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b044a392880. 00010000:00000010:30.1:1582070577.953858:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b044a391d40. 00010000:00000010:31.1:1582070578.137845:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0379b06d00. 00010000:00000010:31.1:1582070580.180848:0:21406:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b13d12d98c0. 00010000:00000010:31.1:1582070581.632846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b032eb2f3c0. 00010000:00000010:31.1:1582070581.845846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14b67e0b40. 00010000:00000010:30.1:1582070582.592851:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14dafe5680. 
00010000:00000010:30.1:1582070583.312850:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0379b05b00. 00010000:00000010:31.1:1582070584.029846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b141e20c000. 00010000:00000010:30.1:1582070585.444858:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b140af4f600. 00010000:00000010:30.1:1582070586.682847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14bcc82d00. 00010000:00000010:30.1:1582070588.330859:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03f4e7aac0. 00010000:00000010:30.1:1582070588.512851:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0401ab5680. 00010000:00000010:31.1:1582070588.547847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b153ff13600. 00010000:00000010:31.1:1582070588.960847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0444f69b00. 00010000:00000010:30.1:1582070589.822846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b141e20fcc0. 00010000:00000010:31.1:1582070590.324847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14fc748d80. 00010000:00000010:30.1:1582070594.188847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14fc74b180. 00010000:00000010:30.1:1582070596.826847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03eb2498c0. 00010000:00000010:30.1:1582070598.907846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b13d12d9b00. 00010000:00000010:31.1:1582070632.074846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03c2a0c240. 00010000:00000010:31.1:1582070634.096846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04602c4d80. 00010000:00000010:31.1:1582070635.359846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b13d12d86c0. 00010000:00000010:31.1:1582070636.423846:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14aaf7c480. 00010000:00000010:31.1:1582070637.047847:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b041de90480. 00010000:00000010:31.1:1582070638.973850:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b0379b07cc0. 00010000:00000010:31.1:1582070642.645848:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04ed7c4900. 00010000:00000010:31.1:1582070646.307851:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14dfb60900. 00010000:00000010:31.1:1582070650.296859:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b03eef53a80. 00010000:00000010:30.1:1582070652.007859:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b14fc748240. 00010000:00000010:31.1:1582070652.191853:0:0:0:(ldlm_lock.c:452:lock_handle_free()) slab-freed 'lock': 560 at ffff8b04daee0240.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 21/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Well, the logs are at -1 level indeed, but there are gaps in them. For example, for the very first lock eviction, if we grep for the handle in all the logs we get:
# xzgrep b79351ea115a8522 *.log.xz
oss00_20200219_0001_005.log.xz:00010000:00020000:23.0:1582070795.016453:0:48152:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 114s: evicting client at 10.148.0.154@o2ib ns: filter-lustre-OST0003_UUID lock: ffff9052dc9a72c0/0xb79351ea115a8522 lrc: 3/0,0 mode: PW/PW res: [0x972:0x0:0x0].0x0 rrc: 9259 type: EXT [866123776->872415231] (req 866123776->867172351) flags: 0x60000400020020 nid: 10.148.0.154@o2ib remote: 0x6114ebdee1f8ce7d expref: 456 pid: 98682 timeout: 4069356 lvb_type: 0
oss00_messages.log.xz:Feb 19 00:06:35 oss00 kernel: LustreError: 48152:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 114s: evicting client at 10.148.0.154@o2ib ns: filter-lustre-OST0003_UUID lock: ffff9052dc9a72c0/0xb79351ea115a8522 lrc: 3/0,0 mode: PW/PW res: [0x972:0x0:0x0].0x0 rrc: 9259 type: EXT [866123776->872415231] (req 866123776->867172351) flags: 0x60000400020020 nid: 10.148.0.154@o2ib remote: 0x6114ebdee1f8ce7d expref: 456 pid: 98682 timeout: 4069356 lvb_type: 0
So basically neither the clients nor the servers have any records of it, other than the server having the eviction record. If we check the second evicted lock handle (cbbeb26a5a7afac0), it's seen in the server logs more, but it's again totally absent from the client logs. So in order to have whole, consistent logs, debug_daemon needs to be used to avoid any gaps. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 24/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
New logs uploaded to WC ftp. The debug_daemon method was used. One small complication, the clients are stateless (no local disk) so the size wasn't unlimited. Set for 32GB. The debug_daemon coughed up overflow and other errors periodically. I know at least one partial write/eviction sequence is captured but I don't know if anything was dropped by debug_daemon as it appeared to have issues keeping up with the output level at times.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 24/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
typically a good way to combat debug daemon overflows is by increasing debug_mb setting. I'll check the logs to see what was captured | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 25/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
The clients are stateless. The ability to increase debug_mb is ram limited. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 26/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Where are the logs going to? Some slow network filesystem? With decently fast local storage I never had a need to increase debug_mb beyond 2048. If it goes to a local ramdisk then you can keep debug_mb smaller so tmpfs could absorb more data, I guess. I checked the first couple of handles from the eviction logs and there are no matches in the client logs, it seems? | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 26/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
There is no network storage destination. The debug file is written to tmpfs on a stateless (diskless) client. We'll adjust debug_mb down to 2048 and try to trigger a stop of the reproducer when the first instance of partial writes is seen on the client side, as a way to reduce noise and get more potent debug content. Can you provide any examples of the message strings you are looking for? I want to make sure the debug output I upload contains some semblance of what you're looking for, so your time isn't wasted assessing debug output that doesn't contain any actionable data. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 26/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
Would it be of any value to spin up an LFS in AWS with enough clients to trigger the reproducer error and just provide access to assess? I've considered this but I won't go through the process deploying it if you don't find any value in it. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 28/Feb/20 ] | ||||||||||||||||||||||||||||||||||||
|
if you can organize a reproducer instance we can directly access, that would be great! | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 04/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
While we look to creating a public access reproducer, a new batch of logs have been uploaded to WC FTP, subdir contains date string 20200228.
| ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 10/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 11/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
I checked the logs provided and unfortunately the log daemon writes could not keep up with the amount of logs generated and, as a result, dropped a lot of messages, including the ones we are interested in. I guess we need to find an alternative way or drop the message level some. The goal is to ensure this sort of message does not appear in the client logs: "debug daemon buffer overflowed; discarding 10% of pages" How about this debug level, I wonder; if it does not help, please try removing info from the list: "inode super cache info warning other page dlmtrace error emerg rpctrace vfstrace console" | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 14/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37918 | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 15/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Uploaded a new debug capture. Directory name contains string '20200312'. Looks like a clean capture of a client partial write sequence with a good amount of lead up.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 17/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you for the log. Here's the culprit: 00000008:00000020:18.0:1584023888.667012:0:22780:0:(osc_cache.c:1626:osc_enter_cache()) lustre-OST0002-osc-ffff9ea3f41b7000: sleeping for cache space @ ffff9eb2dad977f0 for ffff9eaab54ae6f8
00000008:00000020:2.0F:1584023929.966471:0:22780:0:(osc_cache.c:1650:osc_enter_cache()) lustre-OST0002-osc-ffff9ea3f41b7000: grant { dirty: 3347/512000 dirty_pages: 17171/8210805 dropped: 0 avail: -64749568, dirty_grant: 22151168, reserved: 32374784, flight: 21 } lru {in list: 8192, left: 6400, waiters: 0 }finally got grant space
00010000:00010000:4.0:1584023815.137465:0:8689:0:(ldlm_lockd.c:1770:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-OST0002-osc-ffff9ea3f41b7000 lock: ffff9eaa4fb81200/0x88a95bcbaa133363 lrc: 4/0,1 mode: PW/PW res: [0xfc6:0x0:0x0].0x0 rrc: 4 type: EXT [51163168768->51166314495] (req 51163168768->51164217343) flags: 0x420400000000 nid: local remote: 0xb79351ea30b47450 expref: -99 pid: 22780 timeout: 0 lvb_type: 1 00000008:00010000:2.0:1584023929.966510:0:22780:0:(osc_cache.c:826:osc_extent_find()) ### extent: ffff9eaabe7bf8c0 ns: lustre-OST0002-osc-ffff9ea3f41b7000 lock: ffff9eaa4fb81200/0x88a95bcbaa133363 lrc: 4/0,1 mode: PW/PW res: [0xfc6:0x0:0x0].0x0 rrc: 4 type: EXT [51163168768->51166314495] (req 51163168768->51164217343) flags: 0x420400000000 nid: local remote: 0xb79351ea30b47450 expref: -99 pid: 22780 timeout: 0 lvb_type: 1 00010000:00010000:2.0:1584023929.970180:0:22780:0:(ldlm_lock.c:826:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: ?? lock: ffff9eaa4fb81200/0x88a95bcbaa133363 lrc: 5/0,1 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x10420400000000 nid: local remote: 0xb79351ea30b47450 expref: -99 pid: 22780 timeout: 0 lvb_type: 1 00010000:00010000:2.0:1584023929.970189:0:22780:0:(ldlm_lock.c:882:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: ?? lock: ffff9eaa4fb81200/0x88a95bcbaa133363 lrc: 4/0,0 mode: PW/PW res: ?? rrc=?? type: ??? flags: 0x10420400000000 nid: local remote: 0xb79351ea30b47450 expref: -99 pid: 22780 timeout: 0 lvb_type: 1 So basically osc_queue_async_io is called with ldlm lock held, gets stuck in osc_enter_cache because we don't have any free grant or memory allowance and we need to wait until an RPC arrives with more grant or frees up some previously dirty page. In the past we handled this by doing sync IO in this case to ensure nothing is stuck. Time to see why that code was reworked, I guess. | ||||||||||||||||||||||||||||||||||||
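In schematic form (an illustrative sketch only, not the actual osc code; have_grant(), wait_for_grant(), queue_async() and do_sync_write() are hypothetical stand-ins), the difference between the current behaviour and the older sync-IO fallback described above looks roughly like this:

struct page;	/* opaque here; stands in for the page being written */

int have_grant(void);			/* hypothetical: is grant/dirty allowance available? */
void wait_for_grant(void);		/* hypothetical: sleep until an RPC returns grant */
int queue_async(struct page *pg);	/* hypothetical: queue page for a later async BRW */
int do_sync_write(struct page *pg);	/* hypothetical: send the write synchronously, no grant needed */

/* Problematic shape: the caller still pins the DLM lock while it sleeps
 * waiting for grant, so the blocking callback for that lock cannot make
 * progress until some other RPC completes and frees grant. */
int queue_write_async(struct page *pg)
{
	while (!have_grant())
		wait_for_grant();	/* can block for a long time under the lock */
	return queue_async(pg);
}

/* Older behaviour, roughly: when out of grant, fall back to synchronous IO
 * immediately instead of sleeping, so nothing sits blocked under the lock. */
int queue_write_or_sync(struct page *pg)
{
	if (!have_grant())
		return do_sync_write(pg);
	return queue_async(pg);
}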
| Comment by Gerrit Updater [ 18/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37967 | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 19/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg, A few questions:
| ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 19/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37992 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 19/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
The 37967 patch is a port of 37918 to master. It is a further fix on top of 37479 (the issue it addresses can arise separately too, but 37479 is more likely to hit first, masking this other issue; based on your logs it appears you did not hit it in your testing, but please include it now just in case). The lock eviction is definitely a client-side issue, promoted by slow response of OSTs to IO requests. I just made a patch, 37992, to address this potentially excessive waiting that I'd like you to try and see if it helps in your case. I am cautiously optimistic that's going to be the end of it, but it was quite a deep rabbit hole, so who knows if another level of effects will be unearthed after this patch (in which case please collect additional logs). | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 19/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Okay, to be clear, apply 37967, 37479 and 37992 then retest? | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 19/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
No, please apply 37479, 37918 and 37922 | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 22/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Apply 37468 as well? Or does 37918 supersede 37468? | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 23/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oh, sorry. I meant 37468, 37918 and 37922. 37479 is not a fix, just a cleanup; I keep getting it mixed up with 37468, unfortunately. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 24/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37967/ | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 25/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38057 | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 26/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Do you have a sense as to when you will be able to run the next round of tests? Peter | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 26/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Gathering logs now to upload from a run with patches applied. Logged (failed) run has two client partial write failures and evictions. Will comment again when uploaded. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 28/Mar/20 ] | ||||||||||||||||||||||||||||||||||||
|
Latest logs uploading to a subdir with 20200325 date string in the name. This run was performed with all available patches applied as of 20200325. Upload should be complete within an hour or so from now. Logs are ~8GB xz compressed. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 01/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38114 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 01/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you for the logs. I see we needed some more gymnastics with the sync IO to ensure it actually is sent when we are out of grant instead of just sitting there in the queue. Can you please add the 38114 patch on top of all the other patches you already have and give that a try? Thanks. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 05/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
New logs uploaded, subdir named with date string 20200403. The partial write still occurs. However, what appears to be a larger issue is the after-effects of hitting it: the client never recovers. Once the issue occurs, the grant accounting becomes fouled and never resolves itself. The partial write and eviction create an I/O error for outstanding commands, but once the issue presents and the grant accounting is fouled, it never recovers and causes endless failures from that point forward. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 07/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38057/ | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 08/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38164 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 08/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you for this new set of logs. It's puzzling in the sense that the last patch actually did not do anything (so the RPC that was supposed to be sent as high priority is not sent like that anyway). I am not sure why you seem to have problems with grant recovery, how reproducible was that? It's a bit puzzling why the patch did not work; I wonder if the IO region is covered by multiple locks (there are two locks with a big offset difference in use in a single IO). Determining all of the potential locks that a region might be covered under when we are out of grant seems to be a bit less trivial than I hoped. As a simple proof of concept we can issue all "out of grant" locks - the patch to try this approach is https://review.whamcloud.com/38164 - while I am thinking how to better tackle this, perhaps it would be great if you could verify it actually helps as much as I hope it should. Thank you. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 08/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
So far the test patch results in a clean reproducer run. For expediency a smaller number of clients was used; we are now scaling up clients to the levels of previous runs. Stay tuned, but so far the test-only patch runs clean. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 09/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Reproducer run of 222 client nodes was successful using the new test patch. Whatever you tweaked in that patch definitely made a positive difference. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 09/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
thank you for confirmation. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 15/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37918/ | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 15/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg, 37918 is a patch already included since mid-March. Is there anything new since we reported that your test patch passed without partial write and grant errors? I want to be sure I didn't miss something. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 18/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
You are on the current set of patches. I am still working on 38164 patch replacement that actually looks at all locks that cover an extent. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 20/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38283 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 21/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
I think I have a new patch for you to try. If you can back out the last two patches (the 38114 you had, and 38164) and then reapply the current version of 38114 - that should be the proper way of implementing it. If you can give that a try, that would be appreciated. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 22/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
A 70-node reproducer run triggered the typical partial write and grant errors. Debug and associated logs uploaded to the 2020421 subdirectory. Patch set used: 37468 – always allow high priority RPCs through (KEEP)
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 24/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Hm, the logs don't actually contain "the event". The eviction is there, but the lock and the process that held it only appear at the moment of eviction for both evicted locks, so it seems that for whatever reason the log buffer overflowed and dropped some data. I also tried searching for "forced sync i/o" (a precursor to the case we have at hand with the current patches) and all occurrences of that happened after the eviction too. How to gauge the usefulness of logs: when you see an eviction message in the server logs like this:
LustreError: 34207:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.148.0.81@o2ib ns: filter-lustre-OST0003_UUID lock: ffff8dfe9532cc80/0x4c0e0269f5c19575 lrc: 3/0,0 mode: PW/PW res: [0x150c:0x0:0x0].0x0 rrc: 19369 type: EXT [81640030208->81645273087] (req 81640030208->81641078783) flags: 0x60000400020020 nid: 10.148.0.81@o2ib remote: 0x78c91a7ff8305a07 expref: 3878 pid: 67607 timeout: 1630888 lvb_type: 0
take the hex number after "remote:" and discard the leading 0x (so in our case it would be 78c91a7ff8305a07), then grep the client debug logs for it. You should have two clusters of hits with timestamps ~100 seconds apart. This means you caught both sides of the needed data. If you only see it clustered in one spot (like with the latest set of logs) - that means one side is missing. I think it would be best if you also try to collect the logs immediately after you see the eviction message in the server logs; currently your clients seem to be running for a considerable time after the eviction, which unevenly discards portions of the older (useful) logs, sometimes resulting in such gaps. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 26/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
FYI: due to space and stateless-client constraints the info messages were disabled, in case you are looking at those for context.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 27/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
the messages I am interested in are not INFO, mostly CACHE, RPC and DLM | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 27/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
New traces uploaded, dated 20200424. Same patch set as 20200421.
| ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 28/Apr/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you. These are the good traces; they show the problem clearly. When we are preparing to send a sync RPC, the lock is not yet contended, so we send it with urgent priority and the RPC gets stuck in the queue. When a blocking callback arrives, it sees the lock is busy and waits until it returns, which it obviously does not do in time. The original PoC code sidestepped that by always sending such requests as high-priority, so the situation never arose. Let's see if I can find a good way to handle that. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 01/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38453 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 04/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
I refreshed patches 37992 and 38114 and it would be great if you could give them a try again. Thank you. | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 06/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Refreshed patches appeared to work at ~70 clients and then failed when scaled up to ~200 clients. Logs uploaded to ftp, named 20200505 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 09/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you for this additional testing. Poring through the logs I noticed something I had sort of ignored before - the timeout in osc_enter_cache_try is actually for a single waiting period in a potentially unlimited while loop, so if you have a bunch of RPCs returning, the timeout keeps renewing. This code was reworked in master with patch https://review.whamcloud.com/37605 (not directly applicable to b2_12), which ensures the timeout is global for the duration of that function. The conversion to a longer timeout made it reappear in your testing because your relatively longer RPCs have more chances to return under the new timeout values and restart the timer. I'll try to port over the relevant stuff from patch 37605 to b2_12. | ||||||||||||||||||||||||||||||||||||
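The per-wait vs. global-deadline distinction can be illustrated with a small generic sketch (not Lustre code; have_grant() and wait_for_event(), which returns false on timeout and true when woken by an RPC completion, are hypothetical stand-ins):

#include <stdbool.h>
#include <time.h>

bool have_grant(void);			/* hypothetical: is grant available yet? */
bool wait_for_event(int seconds);	/* hypothetical: false on timeout, true if woken */

/* Before the rework described above: the timeout covers only one wait, so
 * every wake-up from a completing RPC restarts it and the total blocking
 * time is unbounded. */
bool wait_per_iteration(int timeout)
{
	while (!have_grant()) {
		if (!wait_for_event(timeout))
			return false;	/* only a single quiet period expired */
		/* woken by an RPC completion: loop again with a fresh timeout */
	}
	return true;
}

/* After the rework: one deadline for the whole function, so repeated
 * wake-ups cannot extend the wait indefinitely. */
bool wait_with_deadline(int timeout)
{
	time_t deadline = time(NULL) + timeout;

	while (!have_grant()) {
		int remaining = (int)(deadline - time(NULL));

		if (remaining <= 0 || !wait_for_event(remaining))
			return false;
	}
	return true;
}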
| Comment by Gerrit Updater [ 15/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38610 | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 15/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
I have all the relevant patches pulled into b2_12-next branch in our git repo and it would be great if you can give that a try. Thanks! | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38283/ | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38453/ | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38672 | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
All patches landed for 2.14. aeonjeff we're still keen to get your report back on the effectiveness of the fixes in your testing | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
So far the latest patch set has run the reproducer very stable over two days. First runs at 70 clients, subsequent runs at 200+ clients. Have never seen this level of stability running the reproducer. Gathering more nodes to scale up higher but this is looking good. | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thanks for the update Jeff - this is encouraging news. | ||||||||||||||||||||||||||||||||||||
| Comment by Oleg Drokin [ 20/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Thank you for the update, great to hear that. | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 21/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37468/ | ||||||||||||||||||||||||||||||||||||
| Comment by Jeff Johnson [ 21/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Overnight runs of 300+ client nodes. Never failed. Stick a fork in it, this one's done. #bravo | ||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 21/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Nice! | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 23/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38672/ | ||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 23/May/20 ] | ||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38610/ |