[LU-4616] ls or client eviction issues (ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST Created: 12/Feb/14 Updated: 22/May/14 Resolved: 02/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Manish Patel (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Server: 2.1.6 with |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 12632 |
| Description |
|
We are seeing a few issues at IU, which is running Lustre v2.1.6. Here is an example from one client:

?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_21
?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_22
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_08_57_04

Server logs:
Feb 10 17:41:10 dc2mds01 kernel: : Lustre: dc2-MDT0000: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8824273af800, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 17:41:10 dc2oss08 kernel: : Lustre: dc2-OST004e: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8815eeb28c00, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 20:22:46 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 20:42:08 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:02:47 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 21:22:09 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:42:49 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:02:11 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:22:50 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:42:12 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: dc2-OST0074: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: 15986:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0074: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t0 exp ffff881158ad4c00 cur 1392090736 last 1392090736
Feb 10 23:02:51 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 23:12:34 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 20028s: evicting client at 149.165.235.51@tcp ns: mdt-ffff881843d1a000 lock: ffff882d78e8f480/0xcf86cac7791be105 lrc: 3/0,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 211 type: IBT flags: 0x4000020 remote: 0xf3ffedf65e028084 expref: 55 pid: 28615 timeout: 7109149431
Feb 10 23:12:39 dc2mds01 kernel: : Lustre: 13955:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t225898436598 exp (null) cur 1392091959 last 0

Client logs:
Feb 10 22:39:17 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:49:19 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:49:19 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899338 sent from dc2-OST0074-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection to service dc2-OST0074 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899342 sent from dc2-OST0078-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection restored to service dc2-OST0074 using nid 10.10.0.12@o2ib.
Feb 10 22:52:16 xfer01 kernel: Lustre: Server dc2-OST0074_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 22:52:16 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0078-osc-ffff810063082c00: Connection to service dc2-OST0078 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0079-osc-ffff810063082c00: Connection restored to service dc2-OST0079 using nid 10.10.0.12@o2ib.
Feb 10 22:59:21 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:59:21 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:09:23 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 23:09:23 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Feb 10 23:12:39 xfer01 kernel: LustreError: 167-0: This client was evicted by dc2-MDT0000; in progress operations using this service will fail.
Feb 10 23:12:39 xfer01 kernel: Lustre: Server dc2-MDT0000_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 23:12:39 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(mdc_locks.c:672:mdc_enqueue()) ldlm_cli_enqueue error: -4
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81012e144400 x1459693143907855/t0 o101->dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 568/4880 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(file.c:119:ll_close_inode_openhandle()) inode 144593035466996022 mdc close failed: rc = -108
Feb 10 23:12:39 xfer01 kernel: Lustre: dc2-MDT0000-mdc-ffff810063082c00: Connection restored to service dc2-MDT0000 using nid 10.10.0.171@o2ib.

We have collected stack traces from the MDS node and the Lustre client. Upon further investigation of the stack trace logs, it appears we are hitting a few known bugs.

Here are the stack trace log entries that report the IBT flags "rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737":
00010000:00010000:1.0:1392065810.879717:0:28934:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 3/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 0
00010000:00010000:1.0:1392065810.879720:0:28934:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 4/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737
00000100:00100000:1.0:1392065810.879728:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101078:922@gni:104
00000100:00100000:1.0:1392065810.879744:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101079:923@gni:104

Here are the stack trace log entries that report the AST error "(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST":
00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330
00010000:00010000:3.0:1392065759.685218:0:20661:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881b3e213900/0xcf86cac758e34cdc lrc: 1/0,0 mode: --/PR res: 8628747995/115215 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c25e9 expref: 1106 pid: 4259 timeout: 7083055542

Here are the stack trace log entries that report the flock lock "(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000":
00010000:00010000:1.0:1392065865.202616:0:4297:0:(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000 owner 837 pid 837 mode 32 start 0 end 9223372036854775807
00000100:00100000:0.0:1392065865.202617:0:11853:0:(events.c:300:request_in_callback()) peer: 12345-1032@gni
00010000:00010000:1.0:1392065865.202621:0:4297:0:(ldlm_flock.c:108:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x40000) ns: mdt-ffff881843d1a000 lock: ffff882878911240/0xcf86cac759740b5c lrc: 2/0,0 mode: PW/PW res: 8624191356/125099 rrc: 3 type: FLK pid: 837 [0->9223372036854775807] flags: 0x0 remote: 0xc0abc3a2c7d67cb8 expref: 807 pid: 4259 timeout: 0

Logs are attached here, and I can also provide the complete stack trace logs that were taken from the MDS node and the Lustre client node during that time period of running 'ls' on the directory structure, if required. Thank You, |
| Comments |
| Comment by Peter Jones [ 12/Feb/14 ] |
|
Bobijam, what do you advise here? Thanks, Peter |
| Comment by Zhenyu Xu [ 17/Feb/14 ] |
|
Is it reliably reproducible? If so, what are the reproduction steps? I tried to reproduce it with a 2.1.6 server and a 1.8 client and could not. |
| Comment by Manish Patel (Inactive) [ 18/Feb/14 ] |
|
Hi Zhenyu, Yes, this is reproducible just by doing an 'ls' long listing of a directory. The problem becomes reproducible roughly 15-20 days of filesystem operation after a downtime. That is, we took the MDS and OSS down to clear this same problem; after the reboot the filesystem came back online, and at that point we were not able to reproduce the problem. But after roughly 12-15 days the same problem reappeared, and we were able to reproduce it again simply by doing an 'ls' on the directory to get the directory stats.
Here is a brief explanation from IU of what they are seeing and experiencing: we're actually experiencing two issues, or at least different manifestations of a single problem. First, we have clients that hang when stat'ing certain directories and eventually get evicted from the MDS, and only the MDS. These clients log "The mds_connect operation failed with -16" until they detect that they've been evicted by the MDS. Second, we're seeing clients evicted from OSSes with messages such as:
Feb 7 15:42:32 dc2oss16 kernel: : LustreError: 138-a: dc2-OST009f: A client on nid 10.10.0.46@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 15:43:52 dc2oss16 kernel: : LustreError: 138-a: dc2-OST00a5: A client on nid 10.10.0.70@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 10.10.0.192@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST0082: A client on nid 10.10.0.250@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST001b: A client on nid 10.10.0.247@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST0019: A client on nid 10.10.0.94@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 17:25:40 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 149.165.226.203@tcp was evicted due to a lock blocking callback time out: rc -107
Thank You, |
| Comment by Zhenyu Xu [ 26/Feb/14 ] |
|
Would you mind trying this: when the issue happens, mount a new client, check whether this new client also has the issue, and collect the -1 debug logs (client, MDS, and OSS) during the procedure? |
| Comment by Zhenyu Xu [ 26/Feb/14 ] |
|
The currently provided logs do not reveal why the client returned -22 to the OSS server. |
| Comment by Manish Patel (Inactive) [ 27/Feb/14 ] |
|
Hi Zhenyu, In reference to the comment "try to mount a new client", do you mean a freshly built/created Lustre client with v1.8.9 that has never connected before? I have two complete sets of MDS and Lustre client stack traces from that time, which you can download from here: http://ddntsr.com/ftp/2014-02-26-LU-4616.rar I hope that helps shed some light; if not, let me know what logs are required from the next test. Thank You, |
| Comment by Zhenyu Xu [ 27/Feb/14 ] |
|
Yes, I mean a freshly connected new Lustre client, so that the relevant client logs you collect from it are as small as possible. |
| Comment by Zhenyu Xu [ 27/Feb/14 ] |
|
the 2014-02-26- |
| Comment by Manish Patel (Inactive) [ 27/Feb/14 ] |
|
Hi Zhenyu, Thank you for your prompt response. OK, I will check with IU and see if we can get this setup done to collect a new set of logs. In reference to "get -1 debug logs", do you mean enabling it on the client using the command below? sysctl -w lnet.printk=-1 And does that need to be done on just the newly connected client, or does it have to be enabled on the MDS and OSS as well? Other than that, do you also need stack trace, dlmtrace, or rpctrace? If yes, from which nodes, with what buffer size, and what is the right way to enable those debug options and collect the resulting logs? Thank You, |
| Comment by Zhenyu Xu [ 27/Feb/14 ] |
|
Suppose you've built a new client: set the debug level to -1 on all nodes involved in the re-hit (the new client, MDS nodes, OSS nodes) with "lctl set_param debug=-1", then mount the client and try the re-hit procedure. If the re-hit occurs, run "lctl dk <log file>" to dump the -1 logs on all involved nodes; the -1 level log contains all levels of logs. |
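For reference, a minimal sketch of that collection procedure as a shell sequence. The fsname dc2 is taken from the logs in this ticket, but the MGS NID, mount point, and test path are illustrative placeholders, and the debug buffer size is just an example:

    # on the fresh client, the MDS and the OSSes: full debug, larger buffer, clear old logs
    lctl set_param debug=-1
    lctl set_param debug_mb=1024     # optional: enlarge the kernel debug buffer
    lctl clear

    # on the fresh client only (MGS NID and mount point are placeholders)
    mount -t lustre <mgsnid>@o2ib:/dc2 /mnt/dc2
    ls -al /mnt/dc2/path/to/problem/dir     # attempt the re-hit

    # once the hang/eviction is observed, dump the buffer on every involved node
    lctl dk /tmp/lustre-debug.$(hostname).log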
| Comment by Manish Patel (Inactive) [ 03/Mar/14 ] |
|
Hi Zhenyu, We have set up a new client and collected the logs as you requested. Here is the 'ls' behavior on the directory; the `ls -al` job timestamps and output are included below:
===== Begin 2014-03-03_11:53:52
ls start 2014-03-03_11:53:52
total 5366452
d????????? ? ? ? ? ? .
drwxr-xr-x 7 917155 210 4096 Feb 8 22:01 ..
-rw-r--r-- 1 917155 210 188103 Feb 7 23:37 673.maf
drwxr-xr-x 2 917155 210 4096 Jan 29 00:32 backup
d????????? ? ? ? ? ? blasttemp
-rw-r--r-- 1 917155 210 377500 Feb 7 22:48 blasttemp2014_02_07_22_25_19_blast.log
-rw-r--r-- 1 917155 210 54362 Feb 7 22:43 blasttemp2014_02_07_22_40_05_blast.log
-rw-r--r-- 1 917155 210 34941 Feb 7 22:48 blasttemp2014_02_07_22_46_08_blast.log
drwxr-xr-x 2 917155 210 4096 Feb 8 05:32 blasttemp_2014_02_08_00_47_34
-rw-r--r-- 1 917155 210 9747929 Feb 8 05:32 blasttemp_2014_02_08_00_47_342014_02_08_00_47_34_blast.log
drwxr-xr-x 2 917155 210 4096 Feb 8 06:48 blasttemp_2014_02_08_00_49_37
-rw-r--r-- 1 917155 210 15579008 Feb 8 06:48 blasttemp_2014_02_08_00_49_372014_02_08_00_49_37_blast.log
d????????? ? ? ? ? ? blasttemp_2014_02_08_01_04_21
-????????? ? ? ? ? ? blasttemp_2014_02_08_01_04_212014_02_08_01_04_21_blast.log
d????????? ? ? ? ? ? blasttemp_2014_02_08_01_06_22
-????????? ? ? ? ? ? blasttemp_2014_02_08_01_06_222014_02_08_01_06_22_blast.log
d????????? ? ? ? ? ? blasttemp_2014_02_08_08_57_04
-rw-r--r-- 1 917155 210 20242620 Feb 8 22:03 blasttemp_2014_02_08_08_57_042014_02_08_08_57_04_blast.log
-rw-r--r-- 1 917155 210 326104853 Feb 7 18:32 lpr_comb_c95_0_18000.maf
-rw-r--r-- 1 917155 210 887278 Feb 8 05:32 lpr_comb_c95_0_18000.maf2014_02_08_00_47_34.mergelog.log
-rw-r--r-- 1 917155 210 433456072 Feb 8 05:32 lpr_comb_c95_0_18000_merge6.maf
-rw-r--r-- 1 917155 210 46735521 Feb 8 05:32 lpr_comb_c95_0_18000_merge6.maf_merged
-rw-r--r-- 1 917155 210 460095829 Feb 9 11:54 lpr_comb_c95_0_30000.maf
-????????? ? ? ? ? ? lpr_comb_c95_0_50k.maf
-rw-r--r-- 1 917155 210 215968052 Feb 7 18:30 lpr_comb_c95_100001_end.maf
-????????? ? ? ? ? ? lpr_comb_c95_100001_end.maf2014_02_08_01_06_22.mergelog.log
-rw-r--r-- 1 917155 210 2089023 Feb 8 16:13 lpr_comb_c95_100001_end.maf2014_02_08_08_57_04.mergelog.log
-????????? ? ? ? ? ? lpr_comb_c95_100001_end_merge6_2.maf
-rw-r--r-- 1 917155 210 58659088 Feb 8 16:13 lpr_comb_c95_100001_end_merge6_2.maf_merged
-rw-r--r-- 1 917155 210 49456106 Feb 8 08:02 lpr_comb_c95_100001_end_merge6.maf
-rw-r--r-- 1 917155 210 59065234 Feb 8 08:02 lpr_comb_c95_100001_end_merge6.maf_merged
-rw-r--r-- 1 917155 210 226895678 Feb 5 20:05 lpr_comb_c95_10k_20k.maf
-????????? ? ? ? ? ? lpr_comb_c95_18001_50000.maf
-rw-r--r-- 1 917155 210 1662375 Feb 8 06:48 lpr_comb_c95_18001_50000.maf2014_02_08_00_49_37.mergelog.log
-rw-r--r-- 1 917155 210 256010134 Feb 8 06:48 lpr_comb_c95_18001_50000_merge6.maf
-????????? ? ? ? ? ? lpr_comb_c95_18001_50000_merge6.maf_merged
-rw-r--r-- 1 917155 210 254099687 Feb 7 18:29 lpr_comb_c95_50001_100000.maf
-rw-r--r-- 1 917155 210 2298305 Feb 8 07:38 lpr_comb_c95_50001_100000.maf2014_02_08_01_04_21.mergelog.log
-rw-r--r-- 1 917155 210 149209527 Feb 8 07:38 lpr_comb_c95_50001_100000_merge6.maf
-rw-r--r-- 1 917155 210 75692995 Feb 8 07:38
ls end 2014-03-03_14:31:14
End 2014-03-03_14:31:31
=====
Here is the URL of those logs (below), and I have also attached one file containing the kernel logs from the MDS and OSS nodes, named "Server_System_logs_from_MDS_OSS_03-032014.txt":
http://ddntsr.com/ftp/2014-03-03-29545_run9.tar.gz Let me know if you need anything else from our end. Thank You, |
| Comment by Zhenyu Xu [ 04/Mar/14 ] |
|
From the 29545_run9/mds_run9_debug.log, I can see that the client 149.165.235.42@tcp was evicted at "Tue Mar 4 03:31:13 CST 2014".
At "Tue Mar 4 03:31:14 CST 2014", the client thought it was still connected to the MDS and tried to reconnect,
but the MDS had already evicted it and told the client so, at which point the client realized the eviction.
Note: exp is NULL, which means the MDS did not recognize this client and treated it as a new client connection request. The logs do not cover why the client failed to answer the MDS lock request, though. |
| Comment by Manish Patel (Inactive) [ 07/Mar/14 ] |
|
Hi Zhenyu, We took a maintenance window and captured a vmcore crash dump by triggering a manual SysRq. Here are the back traces that were captured, to see if they shed some light here. There are also a few Intel tickets with similar traces.
PID: 20659 TASK: ffff88180eb38ae0 CPU: 11 COMMAND: "mdt_37"
#0 [ffff88141d639c60] schedule at ffffffff8150de92
#1 [ffff88141d639d28] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
#2 [ffff88141d639d38] ptlrpc_wait_event at ffffffffa07e19e9 [ptlrpc]
#3 [ffff88141d639dc8] ptlrpc_main at ffffffffa07e956d [ptlrpc]
#4 [ffff88141d639f48] kernel_thread at ffffffff8100c0ca
PID: 20660 TASK: ffff881850f5e040 CPU: 11 COMMAND: "mdt_38"
#0 [ffff88141d669690] schedule at ffffffff8150de92
#1 [ffff88141d669758] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
#2 [ffff88141d669768] lov_create at ffffffffa0a8250a [lov]
#3 [ffff88141d669868] mdd_lov_create at ffffffffa0c7290e [mdd]
#4 [ffff88141d669908] mdd_create at ffffffffa0c84871 [mdd]
#5 [ffff88141d669a48] cml_create at ffffffffa0dae3f7 [cmm]
#6 [ffff88141d669a98] mdt_reint_open at ffffffffa0d0d19d [mdt]
#7 [ffff88141d669b78] mdt_reint_rec at ffffffffa0cf5c81 [mdt]
#8 [ffff88141d669b98] mdt_reint_internal at ffffffffa0ceced4 [mdt]
#9 [ffff88141d669be8] mdt_intent_reint at ffffffffa0ced53d [mdt]
#10 [ffff88141d669c38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
#11 [ffff88141d669c78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
#12 [ffff88141d669cd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#13 [ffff88141d669d48] mdt_enqueue at ffffffffa0cec586 [mdt]
#14 [ffff88141d669d68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#15 [ffff88141d669db8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#16 [ffff88141d669dc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#17 [ffff88141d669f48] kernel_thread at ffffffff8100c0ca
PID: 20661 TASK: ffff882cfdadf540 CPU: 2 COMMAND: "mdt_39"
#0 [ffff882745d7d870] schedule at ffffffff8150de92
#1 [ffff882745d7d938] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
#2 [ffff882745d7d948] ldlm_completion_ast at ffffffffa07b2dca [ptlrpc]
#3 [ffff882745d7d9e8] ldlm_cli_enqueue_local at ffffffffa07b2536 [ptlrpc]
#4 [ffff882745d7da78] mdt_object_lock at ffffffffa0cde2a0 [mdt]
#5 [ffff882745d7db18] mdt_getattr_name_lock at ffffffffa0ceec62 [mdt]
#6 [ffff882745d7dbd8] mdt_intent_getattr at ffffffffa0cefc1d [mdt]
#7 [ffff882745d7dc38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
#8 [ffff882745d7dc78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
#9 [ffff882745d7dcd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#10 [ffff882745d7dd48] mdt_enqueue at ffffffffa0cec586 [mdt]
#11 [ffff882745d7dd68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#12 [ffff882745d7ddb8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#13 [ffff882745d7ddc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#14 [ffff882745d7df48] kernel_thread at ffffffff8100c0ca
Another thing: when the maintenance window was over and the customer started mounting the Lustre client nodes, one of the nodes failed right after connecting to the MDS. The only difference was that it used a different mount option: originally that client node was mounted with the "localflock" option, and this time it was mounted with "flock".
Mar 5 13:18:29 c6 kernel: LustreError: 11984:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff881589b2f240 (22024075/0/0/0) (rc: 2)
Mar 5 13:18:29 c6 kernel: Lustre: dc2-OST002c-osc-ffff8880133b3400: Connection restored to service dc2-OST002c using nid 10.10.0.5@o2ib.
Mar 5 13:18:29 c6 kernel: LustreError: 11137:0:(lov_request.c:211:lov_update_enqueue_set()) enqueue objid 0x1843a subobj 0x1500f8b on OST idx 44: rc -5
Mar 5 13:18:29 c6 kernel: LustreError: 11356:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Mar 5 13:39:26 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.11@o2ib. The ldlm_enqueue operation failed with -107
Mar 5 13:39:26 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection to service dc2-OST0073 via nid 10.10.0.11@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Mar 5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Mar 5 13:39:35 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0073; in progress operations using this service will fail.
Mar 5 13:39:37 c6 kernel: Lustre: Server dc2-OST0073_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar 5 13:39:37 c6 kernel: LustreError: 13543:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar 5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff881fd195bc00 x1461691364654913/t0 o4->dc2-OST0073_UUID@10.10.0.11@o2ib:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Mar 5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Mar 5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0073-osc-ffff8880133b3400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar 5 13:39:38 c6 kernel: LustreError: 13462:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Mar 5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff8815c0741b00 (21885599/0/0/0) (rc: 1)
Mar 5 13:39:39 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection restored to service dc2-OST0073 using nid 10.10.0.11@o2ib.
Mar 5 13:44:23 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.8@o2ib. The ldlm_enqueue operation failed with -107
Mar 5 13:44:24 c6 kernel: Lustre: dc2-OST0052-osc-ffff8880133b3400: Connection to service dc2-OST0052 via nid 10.10.0.8@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 5 13:44:31 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0052; in progress operations using this service will fail.
Mar 5 13:44:31 c6 kernel: Lustre: Server dc2-OST0052_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar 5 13:44:31 c6 kernel: LustreError: 15981:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar 5 13:44:31 c6 kernel: LustreError: 15948:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar 5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar 5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) LBUG
I have attached the backtrace logs and the logs from the server and client nodes for this client eviction incident. https://jira.hpdd.intel.com/secure/attachment/14246/server_logs_03_05.log Thank You, |
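For anyone repeating this vmcore analysis, a rough sketch of extracting the same per-thread backtraces with the crash utility; it assumes standard RHEL6 paths, a kernel-debuginfo package matching the crashed kernel, and the vmcore/output paths shown are placeholders:

    # open the dump against the debuginfo vmlinux matching the crashed kernel
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore

    # inside crash: dump backtraces of every task, or of specific mdt threads by PID
    crash> foreach bt > /tmp/all-backtraces.txt
    crash> bt 20659 20660 20661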
| Comment by Zhenyu Xu [ 10/Mar/14 ] |
|
The client LBUG reported in the last comment is addressed by patch http://review.whamcloud.com/#/c/5971 ( |
| Comment by Manish Patel (Inactive) [ 10/Mar/14 ] |
|
Hi Zhenyu, Thank you for the updates. What about the backtrace of the MDS server that I provided? Can you check whether it helps explain why the stats in the directory listing were not showing, and the "lock completion callback time out" and blocking-callback timeouts? Thank You, |
| Comment by Zhenyu Xu [ 11/Mar/14 ] |
|
Some info in server_logs_03_05.log reminds me of an earlier issue:
Mar 5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 5 active RPCs
Mar 5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880767c28c00 x1461679860288976/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar 5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 2 previous similar messages
Mar 5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar 5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 2 previous similar messages
Mar 5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880883a49000 x1461679860288966/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043548 ref 1 fl Interpret:/0/0 rc 0/0
Mar 5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 1 previous similar message
Mar 5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar 5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 1 previous similar message
Mar 5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394043489/real 1394043489] req@ffff88181a7a8800 x1456766146495206/t0(0) o104->dc2-OST002c@149.165.226.206@tcp:15/16 lens 296/192 e 0 to 1 dl 1394043496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Mar 5 13:18:16 dc2oss05 kernel: : LustreError: 138-a: dc2-OST002c: A client on nid 149.165.226.206@tcp was evicted due to a lock blocking callback time out: rc -107
Mar 5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 4 active RPCs
Mar 5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880bd5827400 x1461679860288975/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar 5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 3 previous similar messages
Mar 5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar 5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
In that case it turned out there was a problem with the network setup. Does this system use a hybrid network, and would you mind considering the possibility of a network issue here? |
| Comment by Manish Patel (Inactive) [ 13/Mar/14 ] |
|
Hi Zhenyu, Yes, this is a hybrid network with IB and TCP segments and multiple subnets. At the last maintenance window we made some network changes suggested by Intel. Yesterday we had one more downtime due to an MDS lockup, and we had to reboot the MDS to fix the directory stats/listing issues. We have collected the stack traces and the logs from the server and the clients for that time period. The logs are here: http://ddntsr.com/ftp/2014-03-12-29545_stacktraces.tgz Let me know if you need anything else from our end. Thank You, |
| Comment by Oleg Drokin [ 14/Mar/14 ] |
|
From the logs it looks like some OST became super slow for some reason (overloaded?); there are signs of MDT requests taking a long time (in the OST logs). We can also see a bunch of clients dropping off (no pings happening), and I am not sure why; some network problem?
Mar 12 09:26:17 dc2oss01 kernel: : LustreError: 20182:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 73s
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 2 previous similar messages
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 79s
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 92s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 8 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 15568:0:(filter.c:3459:filter_setattr()) dc2-OST000a: slow i_alloc_sem 89s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20225:0:(filter_io.c:752:filter_preprw_write()) dc2-OST000a: slow i_alloc_sem 63s
Mar 12 09:26:34 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 111s
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 11 previous similar messages
Mar 12 09:26:57 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: dc2-OST0056: Client adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5 (at 129.79.10.14@tcp) reconnecting
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: Skipped 120 previous similar messages
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0056: connection from adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5@129.79.10.14@tcp t0 exp ffff881829c07800 cur 1394630831 last 1394630831
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 130 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 130s
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 7 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 15219:0:(filter_io.c:871:filter_preprw_write()) dc2-OST000a: slow start_page_write 129s |
| Comment by Liang Zhen (Inactive) [ 14/Mar/14 ] |
|
Hi, I have a few questions here; it would be helpful if you could provide information about them:
|
| Comment by Isaac Huang (Inactive) [ 14/Mar/14 ] |
|
When running LNet selftest between clients and servers, it makes sense to do it in 3 steps: (1) between the client and the router, (2) between the routers and the servers, and (3) between the clients and the servers end-to-end. If either step 1 or step 2 fails, there is no need to go on to step 3. If step 3 fails while 1 and 2 pass, we know there is something wrong with LNet routing, since each segment of the path has already been verified by steps 1 and 2. |
| Comment by Justin Miller (Inactive) [ 14/Mar/14 ] |
|
IB OSS - 10.10.0.1@o2ib
1. test between client and router:
/opt/ddn/bin/lnet_selftest.sh -c "396@gni" -s "30@gni" -d 1:1 > /tmp/4616.client-router.lnet
2. test between routers and servers:
/opt/ddn/bin/lnet_selftest.sh -c "30@gni" -s "10.10.0.1@o2ib" -d 1:1 > /tmp/4616.router-server.lnet
3. test between clients and servers:
/opt/ddn/bin/lnet_selftest.sh -c "396@gni" -s "10.10.0.1@o2ib" -d 1:1 > |
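For reference, the same kind of point-to-point check can be expressed with the stock LNet selftest utility (lst) instead of the site wrapper script above; a minimal sketch of step 1 using the NIDs from the commands above (the session, group, and batch names and the 30-second run time are arbitrary):

    modprobe lnet_selftest          # on every node taking part in the test
    export LST_SESSION=$$
    lst new_session lu4616_step1
    lst add_group clients 396@gni
    lst add_group servers 30@gni
    lst add_batch bulk_rw
    lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
    lst run bulk_rw
    lst stat clients servers &      # watch transfer rates while the batch runs
    sleep 30; kill %1
    lst end_session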
| Comment by Justin Miller (Inactive) [ 14/Mar/14 ] |
|
> when client(s) got evicted, I assume other clients in the same network are still functional, right?
Yes, other clients remain functional
> if they are still functional, could you run LNet stress tests (LNet selftest) between another client and servers
Yes, lnet selftest has been run from another client on that network and attached to this ticket
> when this issue appeared, could you check all routers (I assume there are four?) are still working.
There are four TCP<->IB routers for the clusters that mount over TCP. The lnet selftests attached are for the GNI<->IB router, IB server, GNI client.
> check router status on the evicted clients
All routers are in UP state
> check router status on other clients in the same network.
All routers are in UP state
> check router status on server
All routers are in UP state
> login routers to see if they are able to connect to other nodes.
GNI<->IB routers are able to ping all GNI clients, all IB Lustre servers, including nodes in the bad state |
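For completeness, a rough sketch of the kind of router/LNet status checks summarized above on a 1.8/2.1-era LNet stack; the NIDs are examples taken from this ticket, and the exact proc paths can vary by version:

    # on a client or server: list configured routers and their up/down state
    cat /proc/sys/lnet/routes

    # LNet-level ping of a router NID and a server NID
    lctl ping 30@gni
    lctl ping 10.10.0.1@o2ib

    # on a router: confirm both network interfaces (gni and o2ib/tcp sides) are present
    cat /proc/sys/lnet/nis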
| Comment by Isaac Huang (Inactive) [ 14/Mar/14 ] |
|
In 4616.client-server.lnet and 4616.router-server.lnet there is a lot of output that does not look right to me, but I don't know why yet. |
| Comment by Bruno Faccini (Inactive) [ 14/Mar/14 ] |
|
Just an update to explain the debug session I ran earlier today with the IU team:
_ They first showed me "live" how easily they can reproduce the issue. It is interesting to note that only certain directories always seem able to trigger it, and all others do not! When we do a simple "ls" in a directory known to trigger it, the "ls" just hangs, and the client is later evicted and can no longer reconnect.
_ After that, I asked them to redo it from a fresh client. Before reproducing, I enabled full debug logs, grew the buffer size to 2M, and cleared the debug log on both the MDS and the client. A few seconds after we reproduced the "ls" hang, I dumped the debug log into /tmp/lustre-debug.bfi on both the MDS and the client, then forced a crash of the client and took a full stack trace on the MDS (to be found in /var/log/messages).
Here is the console output of the commands I ran on the client, which shows the "ls" PID:
[root@c4 ~]# lctl clear
[root@c4 ~]# date ; ls /N/dc2/scratch/smccaula &
Fri Mar 14 12:51:52 EDT 2014
[1] 70546
[root@c4 ~]# lctl dk /tmp/lustre-log.bfi
Debug log: 115742 lines, 115742 kept, 0 dropped, 0 bad.
[root@c4 ~]# sync
[root@c4 ~]# echo 1 > /proc/sys/kernel/sysrq
[root@c4 ~]# echo c > /proc/sys sys/ sysrq-trigger sysvipc/
[root@c4 ~]# echo c > /proc/sysrq-trigger
After that, I requested via email that the IU team provide us the following information:
/tmp/lustre-log.bfi for both Client/c4 and MDS
/var/log/messages for both Client/c4 and MDS
crash-dump of Client + lustre-modules/kernel-debuginfo* rpms
LNET NIDs of Client+MDS |
| Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ] |
|
I checked the uploaded material and it seems that the /tmp/lustre-log.bfi (full debug log) from both Client/c4 and the MDS is missing, along with the kernel-debuginfo* rpms. They are not available on the IU dropbox either. It seems that other debug logs have been provided, but their date is more than 2 hours later than the debug session I ran. Also, does Client/c4 run a stock 2.6.32-358.11.1.el6.x86_64 kernel? |
| Comment by Peter Jones [ 15/Mar/14 ] |
|
Just to keep all the info in the ticket for those monitoring:
"We uploaded the missing /tmp/lustre-log.bfi to the ftp server as well as I am not sure why the timestamps are different. Are the different inside I believe the kernel-debuginfo* files you are looking for are inside the I will check with our Cray admin to get the answer on the c4 client Please let me know if you need anything else."
then
"Concerning the time stamps being off, we had to perform the debug/crash I will let you know about the client version when we get an update."
then
"c4 is running 1.8.9 with the ll_getname patch." |
| Comment by Justin Miller (Inactive) [ 15/Mar/14 ] |
|
The lustre-log.bfi from the MDS has been uploaded to the ftp site and to box.iu.edu as dc2mds01.lustre-log.bfi.gz. The kernel-debuginfo* rpms are in box.iu.edu as well, and also in the client and mds tgz on the ftp site.
Concerning the time stamps being off, we had to perform the debug/crash procedure multiple times before we properly captured all the debug info, so it makes sense that the time stamps are later.
The full debug information provided is from a host named oss30, which is a TCP client that connects through the dc2xfer[01-04] routers.
[root@oss30 ~]# rpm -qa | grep lustre
There is some confusion about the c4 client that needs to be cleared up. c4 was the client we were using for interactive debugging with Bruno, but its debug dumps got lost. We followed the same procedure for crashing and collecting on another client, oss30. |
| Comment by Di Wang [ 15/Mar/14 ] |
|
A few pieces of information from a discussion with Justin:
1. When the "ls" failure happens, it only hits certain directories, i.e. the client only gets evicted when running ls on those certain directories. But these "certain" directories may change after each restart.
2. The filesystem was created with Lustre 2.1.3; right now it is 2.1.6 plus some patches. But the client is 1.8.6, and it mounts at least 2 filesystems, one served by 2.1.6 and another by 1.8. (Justin, please confirm?)
3. The "ls" failure happens on all clients, both 1.8 and 2.1.3 clients.
4. Before the failure happens on these directories, no correlated applications have been identified that touch them. But this needs confirmation. |
| Comment by Peter Jones [ 15/Mar/14 ] |
| Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ] |
|
Justin: Thanks for the clarification. The "date/time" confusion I seemed to introduce comes from the fact that I was expecting files/logs/crash-dump from the session on c4/Client and did not find any clear indication that you had re-run the session on another node! |
| Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ] |
|
Andreas, Di, feel free to correct/add! Also, thanks for providing the dc2mds01.lustre-log.bfi from my debug session, but I still cannot find the Client/c4 one in what is on the ftp site or the dropbox. In fact, I may be wrong, but I think you did not actually provide anything from my debug session while I was connected at IU, only what you got from the re-run on your own. Can you help there? |
| Comment by Justin Miller (Inactive) [ 15/Mar/14 ] |
|
Bruno: You are right, we should have been clearer about the logs that were uploaded. We unfortunately were unable to save the full stack output from c4. We recreated the full stack output from another very similar client following the same procedures, and that is what we shared with everyone. It is from oss30, which I described previously. We do have the dc2mds01 syslog/messages from the time of our interactive debug session while using c4. I have uploaded that log to box as 4616.mds01syslog.20130314.txt. In fact that file is the entire March 14 messages file for dc2mds01, so it will have the c4 client debug as well as our later oss30 client debug. |
| Comment by Andreas Dilger [ 15/Mar/14 ] |
|
Looking through the mds_messages.log file, and filtering out the many other messages which are being printed there (e.g. spurious llog -ENOSPC error that would be fixed by Mar 12 09:31:29 dc2mds01 kernel: : Lustre: Service thread pid 11131 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Mar 12 09:31:29 dc2mds01 kernel: : Pid: 11131, comm: mdt_40 Mar 12 09:31:29 dc2mds01 kernel: : Mar 12 09:31:29 dc2mds01 kernel: : Call Trace: Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa04b260e>] cfs_waitq_wait+0xe/0x10 [libcfs] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0a2c50a>] lov_create+0xbaa/0x1400 [lov] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c0f90e>] mdd_lov_create+0x9ee/0x1ba0 [mdd] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c21871>] mdd_create+0xf81/0x1a90 [mdd] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0d4b3f7>] cml_create+0x97/0x250 [cmm] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0caa19d>] mdt_reint_open+0x1abd/0x28f0 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c92c81>] mdt_reint_rec+0x41/0xe0 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c89ed4>] mdt_reint_internal+0x544/0x8e0 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c8a53d>] mdt_intent_reint+0x1ed/0x530 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c88c09>] mdt_intent_policy+0x379/0x690 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa06ff391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa07251dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c89586>] mdt_enqueue+0x46/0x130 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c7e772>] mdt_handle_common+0x932/0x1750 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c7f665>] mdt_regular_handle+0x15/0x20 [mdt] Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0753b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc] Mar 12 09:31:29 dc2mds01 kernel: : LustreError: dumping log to /tmp/lustre-log.1394631089.11131 After this quite a number of other MDS threads also reported waiting and dumping stacks in lov_create(). Later on Wednesday the MDS was rebooted (17:04 and again at 18:02). The same problem with create threads waiting in lov_create() looks to have happened again a few hours later that evening (2014-03-12 23:01). An increasing number of threads appeared to become permanently blocked on the 13th and 14th, until the MDS was taken offline on the 14th, but I suspect those are just a byproduct of the other threads holding locks on the affected directories. It isn't totally clear, but this may be caused by the
I can't say for sure that this is the problem since the stacks of the other threads on the MDS are not available, but it does seem possible that this patch is causing problems. It might also just be that the OSTs are overloaded, but I never saw any messages in the rest of the console logs for any of the MDS threads that reported being blocked, and if the OSTs were just busy the threads would eventually have become unstuck. Given the number of other messages being printed, I would have expected some messages to be printed by those threads (there were other unrelated messages occasionally being printed before the threads dumped stack). I would recommend that this patch be reverted from the MDS for now, to see if it is the cause of the problem. Should the problem happen again, in particular either of the messages: Lustre: Service thread pid NNNNN was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. LustreError: ### lock timed out (enqueued at NNNNNNNNNN 200s ago); not entering recovery in server code, just going back to sleep indicating that there are hung threads on the MDS, please dump all of the process stacks on the MDS via "echo t > /proc/sysrq-trigger" and attach the stack traces from the console here. It may be that only a partial dump of the stack traces is shown in /var/log/messages because of the large volume of messages, so preferably they can be captured directly from a serial console. At this point, I don't think that the clients have anything to do with the current problem, they are just suffering from the hung MDS threads not being able to service requests, or blocking on locks held by the hung MDS service threads. I don't think the current problem relates to the one reported earlier in this ticket. The original problem may potentially be caused by the MDS being busy and not processing RPC requests in a timely manner and evicting clients too soon. I saw a number of cases in the logs where a client was evicted due to a timed-out lock callback, and then a few seconds later the MDS rejected a cancel request from the same client. That kind of problem might be fixed by increasing the AT max timeout, or more recent changes to high-priority request handling. Mar 10 17:52:11 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 149.165.238.1@tcp ns: mdt-ffff881837d4e000 lock: ffff881d6af0c000/0x78c70fdb970d7e0 lrc: 3/0,0 mode: PR/PR res: 8589943942/77226 bits 0x3 rrc: 13 type: IBT flags: 0x4000020 remote: 0xe213f1ffc604946c expref: 54330 pid: 11173 timeout: 4825691675 Mar 10 17:52:12 dc2mds01 kernel: : LustreError: 20011:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 149.165.238.1@tcp arrived at 1394488331 with bad export cookie 543933852487261410 It would probably be good to split these issues into two separate bugs. |
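A minimal sketch of capturing the stack dump requested above (standard sysrq usage; the output file name is arbitrary):

    # enable all sysrq functions, then dump every task's stack into the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger

    # save a copy; with many threads the ring buffer can wrap, hence the
    # recommendation to also capture the traces from a serial console
    dmesg > /tmp/mds-task-stacks.$(date +%Y%m%d-%H%M%S).txt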
| Comment by nasf (Inactive) [ 16/Mar/14 ] |
|
mdt_reint (including the create case) and mdt_getattr are handled via the same request portal on the MDS, so as long as the create operations are blocked because of the hung interpret, the "ls -l" on the client will still be affected even though other rdpg/seq service threads are idle. This is exactly what you have seen: the file name list is available (via the rdpg service), but you cannot get the owner/mode information (via mdt_getattr/mdt_getattr_name). So I agree with Andreas to drop the patch http://review.whamcloud.com/7559 as a temporary solution. |
| Comment by Peter Jones [ 02/May/14 ] |
|
As per DDN, it is OK to close this ticket. |