Details
Description
We are seeing a few issues at IU, which is on Lustre v2.1.6 with the LU-2943 and LU-3645 patches. Sometimes clients get evicted, and sometimes a client fails to list directory metadata information: when running 'ls' on a directory, it does not retrieve the metadata from the MDS node and appears to be locked up.
Here is an example from one client:
?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_21
?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_22
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_08_57_04
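As a side note, the '?---------' entries above are consistent with readdir() returning the entry names while the per-entry lstat() calls hang waiting on the MDS. A minimal way to confirm this on the affected client is to trace the 'ls -l' (the directory path below is only a placeholder, not the actual path from this report):

# Trace 'ls -l' and check which syscall is stuck; /lustre/testdir is a placeholder path.
strace -f -tt -o /tmp/ls-hang.strace ls -l /lustre/testdir
# The earlier getdents() calls should complete normally while the last lines
# of the trace show an lstat() still pending.
grep -E 'getdents|stat' /tmp/ls-hang.strace | tail -n 10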
Server Logs
Feb 10 17:41:10 dc2mds01 kernel: : Lustre: dc2-MDT0000: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8824273af800, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 17:41:10 dc2oss08 kernel: : Lustre: dc2-OST004e: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8815eeb28c00, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 20:22:46 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 20:42:08 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:02:47 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 21:22:09 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:42:49 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:02:11 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:22:50 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:42:12 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: dc2-OST0074: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: 15986:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0074: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t0 exp ffff881158ad4c00 cur 1392090736 last 1392090736
Feb 10 23:02:51 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 23:12:34 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 20028s: evicting client at 149.165.235.51@tcp ns: mdt-ffff881843d1a000 lock: ffff882d78e8f480/0xcf86cac7791be105 lrc: 3/0,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 211 type: IBT flags: 0x4000020 remote: 0xf3ffedf65e028084 expref: 55 pid: 28615 timeout: 7109149431
Feb 10 23:12:39 dc2mds01 kernel: : Lustre: 13955:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t225898436598 exp (null) cur 1392091959 last 0
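For correlating these with the client-side logs below: the "227 seconds" in the first two messages is simply cur minus last (1392072070 - 1392071843 = 227), and the eviction fires once cur passes expire. The epoch values can be converted to wall-clock time on the servers with GNU date, e.g.:

# Convert the epoch timestamps from the eviction messages (server local time):
date -d @1392071843   # 'last'   - last request seen from this client
date -d @1392071920   # 'expire' - deadline after which the export is considered dead
date -d @1392072070   # 'cur'    - time of the eviction message (last + 227s)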
Client Logs
Feb 10 22:39:17 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:49:19 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:49:19 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899338 sent from dc2-OST0074-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection to service dc2-OST0074 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899342 sent from dc2-OST0078-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection restored to service dc2-OST0074 using nid 10.10.0.12@o2ib.
Feb 10 22:52:16 xfer01 kernel: Lustre: Server dc2-OST0074_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 22:52:16 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0078-osc-ffff810063082c00: Connection to service dc2-OST0078 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0079-osc-ffff810063082c00: Connection restored to service dc2-OST0079 using nid 10.10.0.12@o2ib.
Feb 10 22:59:21 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:59:21 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:09:23 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 23:09:23 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Feb 10 23:12:39 xfer01 kernel: LustreError: 167-0: This client was evicted by dc2-MDT0000; in progress operations using this service will fail.
Feb 10 23:12:39 xfer01 kernel: Lustre: Server dc2-MDT0000_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 23:12:39 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(mdc_locks.c:672:mdc_enqueue()) ldlm_cli_enqueue error: -4
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81012e144400 x1459693143907855/t0 o101->dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 568/4880 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(file.c:119:ll_close_inode_openhandle()) inode 144593035466996022 mdc close failed: rc = -108
Feb 10 23:12:39 xfer01 kernel: Lustre: dc2-MDT0000-mdc-ffff810063082c00: Connection restored to service dc2-MDT0000 using nid 10.10.0.171@o2ib.
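The negative return codes in these messages are standard Linux errno values, which can be looked up with a quick one-liner (works with either python 2 or 3):

# -16 EBUSY (mds_connect refused: "still busy with 1 active RPCs"),
# -11 EAGAIN (cancel RPC), -4 EINTR (interrupted enqueue after the eviction),
# -108 ESHUTDOWN (mdc close on the invalidated import),
# -22 EINVAL (the blocking AST error shown in the MDS debug logs below).
python -c "import os; print('\n'.join('%d: %s' % (e, os.strerror(e)) for e in (4, 11, 16, 22, 108)))"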
We have collected stack traces from the MDS node and the Lustre client. Upon further investigation of the stack trace logs, it appears we are hitting a few bugs related to LU-1775, LU-2827, LU-3546 and LU-4112; it seems we are hitting all of these issues at IU.
Here are the debug log entries which report the IBT lock flags "rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737":
00010000:00010000:1.0:1392065810.879717:0:28934:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 3/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 0
00010000:00010000:1.0:1392065810.879720:0:28934:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 4/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737
00000100:00100000:1.0:1392065810.879728:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101078:922@gni:104
00000100:00100000:1.0:1392065810.879744:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101079:923@gni:104
Here are the debug log entries which report the blocking AST error "(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST":
00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330
00010000:00010000:3.0:1392065759.685218:0:20661:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881b3e213900/0xcf86cac758e34cdc lrc: 1/0,0 mode: --/PR res: 8628747995/115215 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c25e9 expref: 1106 pid: 4259 timeout: 7083055542
Here are the debug log entries which report flock lock processing, "(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000":
00010000:00010000:1.0:1392065865.202616:0:4297:0:(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000 owner 837 pid 837 mode 32 start 0 end 9223372036854775807
00000100:00100000:0.0:1392065865.202617:0:11853:0:(events.c:300:request_in_callback()) peer: 12345-1032@gni
00010000:00010000:1.0:1392065865.202621:0:4297:0:(ldlm_flock.c:108:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x40000) ns: mdt-ffff881843d1a000 lock: ffff882878911240/0xcf86cac759740b5c lrc: 2/0,0 mode: PW/PW res: 8624191356/125099 rrc: 3 type: FLK pid: 837 [0->9223372036854775807] flags: 0x0 remote: 0xc0abc3a2c7d67cb8 expref: 807 pid: 4259 timeout: 0
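For reference, these entries look like standard Lustre debug-log (dlmtrace/rpctrace) output; one way to collect more of them around the problem window, assuming the stock lctl interfaces on 2.1.6, is:

# On dc2mds01: enable DLM and RPC tracing, clear the ring buffer, wait for the
# 'ls' hang / eviction window, then dump the kernel debug buffer to a file.
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace
lctl clear
# ... reproduce the hang ...
lctl dk /tmp/dc2mds01-lustre-debug.log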
Logs are attached here, and I can also provide the complete stack trace logs that were taken from the MDS node and the Lustre client node during that time period, or while running 'ls' on the directory structure, if required.
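For completeness, one way to capture such task stack traces on both nodes is the standard sysrq trigger (this is a generic Linux sketch, not necessarily how the attached traces were taken, and it assumes kernel.sysrq is enabled):

# Dump the stacks of all tasks into the kernel log, then save them off.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/$(hostname)-task-stacks.txt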
Thank You,
Manish