[LU-4616] ls or client eviction issues (ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST Created: 12/Feb/14  Updated: 22/May/14  Resolved: 02/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Manish Patel (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Server: 2.1.6 with LU-2943 and LU-3645 patches
Lustre Client: 1.8.7, 1.8.9 and 2.x


Attachments: File 4616.client-router.lnet     File 4616.client-server.lnet     File 4616.router-server.lnet     Text File Flock_lock_logs.txt     Text File IBT flags_logs.txt     Text File Server_System_logs_from_MDS_OSS_03-032014.txt     File aprun-DebugLogs.tar.gz     File backtrace.tar.gz     File march_12_svr.log.gz     Text File returned -22 from blocking AST_logs.txt     File rt_119855.c6.lustre_logs     Text File server_log_2_10_2014.log     Text File server_logs_03_05.log     Text File xfer_client_2_10_2014.log    
Severity: 2
Rank (Obsolete): 12632

 Description   

We are seeing a few issues at IU, which runs Lustre v2.1.6 with the LU-2943 and LU-3645 patches: sometimes clients get evicted, and sometimes a client cannot list directory metadata. When running 'ls' on a directory, it fails to get the metadata information from the MDS node and appears to lock up.

Here is an example from one client.

?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_21
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_22
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_08_57_04

Server Logs

Feb 10 17:41:10 dc2mds01 kernel: : Lustre: dc2-MDT0000: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8824273af800, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 17:41:10 dc2oss08 kernel: : Lustre: dc2-OST004e: haven't heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8815eeb28c00, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 20:22:46 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 20:42:08 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:02:47 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 21:22:09 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:42:49 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:02:11 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:22:50 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:42:12 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: dc2-OST0074: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: 15986:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0074: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t0 exp ffff881158ad4c00 cur 1392090736 last 1392090736
Feb 10 23:02:51 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 23:12:34 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 20028s: evicting client at 149.165.235.51@tcp  ns: mdt-ffff881843d1a000 lock: ffff882d78e8f480/0xcf86cac7791be105 lrc: 3/0,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 211 type: IBT flags: 0x4000020 remote: 0xf3ffedf65e028084 expref: 55 pid: 28615 timeout: 7109149431
Feb 10 23:12:39 dc2mds01 kernel: : Lustre: 13955:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t225898436598 exp (null) cur 1392091959 last 0

Client logs

Feb 10 22:39:17 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:49:19 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:49:19 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899338 sent from dc2-OST0074-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection to service dc2-OST0074 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899342 sent from dc2-OST0078-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection restored to service dc2-OST0074 using nid 10.10.0.12@o2ib.
Feb 10 22:52:16 xfer01 kernel: Lustre: Server dc2-OST0074_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 22:52:16 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0078-osc-ffff810063082c00: Connection to service dc2-OST0078 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0079-osc-ffff810063082c00: Connection restored to service dc2-OST0079 using nid 10.10.0.12@o2ib.
Feb 10 22:59:21 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:59:21 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:09:23 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 23:09:23 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Feb 10 23:12:39 xfer01 kernel: LustreError: 167-0: This client was evicted by dc2-MDT0000; in progress operations using this service will fail.
Feb 10 23:12:39 xfer01 kernel: Lustre: Server dc2-MDT0000_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 23:12:39 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(mdc_locks.c:672:mdc_enqueue()) ldlm_cli_enqueue error: -4
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012e144400 x1459693143907855/t0 o101->dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 568/4880 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(file.c:119:ll_close_inode_openhandle()) inode 144593035466996022 mdc close failed: rc = -108
Feb 10 23:12:39 xfer01 kernel: Lustre: dc2-MDT0000-mdc-ffff810063082c00: Connection restored to service dc2-MDT0000 using nid 10.10.0.171@o2ib.

We have collected stack traces from the MDS node and the Lustre client. On further investigation, the stack trace logs suggest we are hitting several known bugs related to LU-1775, LU-2827, LU-3546 and LU-4112. It appears we are hitting all of these issues at IU.

Here are the stack trace logs that report the IBT flags, "rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737":

00010000:00010000:1.0:1392065810.879717:0:28934:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 3/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 0
00010000:00010000:1.0:1392065810.879720:0:28934:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 4/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737
00000100:00100000:1.0:1392065810.879728:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101078:922@gni:104
00000100:00100000:1.0:1392065810.879744:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101079:923@gni:104

Here are the stack trace logs that report the blocking AST error, "(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST":

00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330
00010000:00010000:3.0:1392065759.685218:0:20661:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881b3e213900/0xcf86cac758e34cdc lrc: 1/0,0 mode: --/PR res: 8628747995/115215 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c25e9 expref: 1106 pid: 4259 timeout: 7083055542

Here are the stack trace logs that report the flock lock, "(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000":

00010000:00010000:1.0:1392065865.202616:0:4297:0:(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000 owner 837 pid 837 mode 32 start 0 end 9223372036854775807
00000100:00100000:0.0:1392065865.202617:0:11853:0:(events.c:300:request_in_callback()) peer: 12345-1032@gni
00010000:00010000:1.0:1392065865.202621:0:4297:0:(ldlm_flock.c:108:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x40000) ns: mdt-ffff881843d1a000 lock: ffff882878911240/0xcf86cac759740b5c lrc: 2/0,0 mode: PW/PW res: 8624191356/125099 rrc: 3 type: FLK pid: 837 [0->9223372036854775807] flags: 0x0 remote: 0xc0abc3a2c7d67cb8 expref: 807 pid: 4259 timeout: 0

Logs are attached here, and I can also provide the complete stack trace logs that were taken from the MDS node and the Lustre client node during that time period while running 'ls' on the directory structure, if required.

Thank You,
Manish



 Comments   
Comment by Peter Jones [ 12/Feb/14 ]

Bobijam

What do you advise here?

Thanks

Peter

Comment by Zhenyu Xu [ 17/Feb/14 ]

Is it reliably reproducible? What are the reproduction steps, if possible? I tried to reproduce it with a 2.1.6 server and a 1.8 client and was not able to.

Comment by Manish Patel (Inactive) [ 18/Feb/14 ]

Hi Zhenyu,

Yes, this is reproducible just by doing a long 'ls' listing of a directory. The problem only becomes reproducible after roughly 15-20 days of filesystem operation following a downtime. That is, we took down the MDS and OSS to clear this same problem, and after rebooting the MDS and OSS the filesystem came back online and at that time we could not reproduce the problem. But after about 12-15 days the same problem reappeared, and we could reproduce it again just by running 'ls' on the directory to get the directory stats.

Here is a brief explanation from IU of what they are seeing and experiencing.

we're actually experiencing two issues, or at least different manifestations of a single problem. First, we have clients that hang when stat'ing certain directories and eventually get evicted from the MDS, and only the MDS. These clients log "The mds_connect operation failed with -16" until they detect that they've been evicted by the MDS.

Second, we're seeing clients evicted from OSSes with "lock blocking callback time out: rc -107", which usually is not accompanied by an eviction from the MDS. Often 2-3 clients get evicted in the same second; see the samples below:

Feb  7 15:42:32 dc2oss16 kernel: : LustreError: 138-a: dc2-OST009f: A client on nid 10.10.0.46@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 15:43:52 dc2oss16 kernel: : LustreError: 138-a: dc2-OST00a5: A client on nid 10.10.0.70@tcp was evicted due to a lock blocking callback time out: rc -107

Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 10.10.0.192@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST0082: A client on nid 10.10.0.250@tcp was evicted due to a lock completion callback time out: rc -107

Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST001b: A client on nid 10.10.0.247@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST0019: A client on nid 10.10.0.94@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 149.165.226.203@tcp was evicted due to a lock blocking callback time out: rc -107

Thank You,
Manish

Comment by Zhenyu Xu [ 26/Feb/14 ]

Would you mind trying this: when the issue happens, mount a new client, check whether this new client also hits the issue, and collect the -1 debug logs (client, MDS and OSS) during the procedure?

Comment by Zhenyu Xu [ 26/Feb/14 ]

The currently provided log does not reveal why the client returns -22 to the OSS server.

Comment by Manish Patel (Inactive) [ 27/Feb/14 ]

Hi Zhenyu,

In reference to the comment "try to mount a new client", do you mean a freshly built/created Lustre client with v1.8.9 that has never connected before?

I have two complete sets of MDS and Lustre client stack traces from those times, which you can download from here:

http://ddntsr.com/ftp/2014-02-26-LU-4616.rar

I hope that helps shed some light; if not, let me know what logs are required from the next test.

Thank You,
Manish

Comment by Zhenyu Xu [ 27/Feb/14 ]

Yes, I mean a freshly connected new Lustre client, so that you can collect the smallest possible set of relevant client logs from it.

Comment by Zhenyu Xu [ 27/Feb/14 ]

The 2014-02-26-LU-4616.rar does not contain information revealing the root cause. If possible, I still need your help to get -1 debug logs (newly connected client, MDS and OSS) while the issue is re-hit. If the newly created client can re-hit the issue, the -1 log covering that time span should contain the relevant information. Thank you in advance.

Comment by Manish Patel (Inactive) [ 27/Feb/14 ]

Hi Zhenyu,

Thank you for your prompt response. OK, I will check with IU and see if we can get this setup done to collect a new set of logs. In reference to "get -1 debug logs", do you mean enabling it on the client using the command below?

sysctl -w lnet.printk=-1

And does that need to be done on just the newly connected client, or does it have to be enabled on the MDS and OSS as well?

Other than that, do you also need stack traces, dlmtrace or rpctrace? If so, from which nodes, with what buffer size, and what is the right way to enable those debug options and collect the logs?

Thank You,
Manish

Comment by Zhenyu Xu [ 27/Feb/14 ]

Suppose you have built a new client: set the debug level to -1 on all nodes involved in the re-hit (the new client, MDS nodes, OSS nodes) with "lctl set_param debug=-1", then mount the client and try the re-hit procedure. If the re-hit occurs, run "lctl dk <log file>" to dump the -1 logs on all involved nodes; the -1 level log contains all levels of logs.
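
For reference, a minimal sketch of that collection procedure as shell commands; the buffer size and dump path below are only examples, not values prescribed in this ticket:

lctl set_param debug=-1                     # on the new client, the MDS and the OSS nodes
lctl set_param debug_mb=1024                # optionally enlarge the debug buffer so entries are not dropped
lctl clear                                  # start each node from an empty debug log
# ... mount the new client and re-run the 'ls' that hangs ...
lctl dk /tmp/lustre-debug.$(hostname).log   # dump the -1 log on every involved node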

Comment by Manish Patel (Inactive) [ 03/Mar/14 ]

Hi Zhenyu,

We have set up a new client and collected the logs as you requested. Here is the 'ls' behavior on the directory:

The `ls -al` job timestamps and output are included below:

=====
Begin 2014-03-03_11:53:52
ls start 2014-03-03_11:53:52
total 5366452
d????????? ? ?      ?            ?            ? .
drwxr-xr-x 7 917155 210       4096 Feb  8 22:01 ..
-rw-r--r-- 1 917155 210     188103 Feb  7 23:37 673.maf
drwxr-xr-x 2 917155 210       4096 Jan 29 00:32 backup
d????????? ? ?      ?            ?            ? blasttemp
-rw-r--r-- 1 917155 210     377500 Feb  7 22:48
blasttemp2014_02_07_22_25_19_blast.log
-rw-r--r-- 1 917155 210      54362 Feb  7 22:43
blasttemp2014_02_07_22_40_05_blast.log
-rw-r--r-- 1 917155 210      34941 Feb  7 22:48
blasttemp2014_02_07_22_46_08_blast.log
drwxr-xr-x 2 917155 210       4096 Feb  8 05:32
blasttemp_2014_02_08_00_47_34
-rw-r--r-- 1 917155 210    9747929 Feb  8 05:32
blasttemp_2014_02_08_00_47_342014_02_08_00_47_34_blast.log
drwxr-xr-x 2 917155 210       4096 Feb  8 06:48
blasttemp_2014_02_08_00_49_37
-rw-r--r-- 1 917155 210   15579008 Feb  8 06:48
blasttemp_2014_02_08_00_49_372014_02_08_00_49_37_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_04_21
-????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_04_212014_02_08_01_04_21_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_06_22
-????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_06_222014_02_08_01_06_22_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_08_57_04
-rw-r--r-- 1 917155 210   20242620 Feb  8 22:03
blasttemp_2014_02_08_08_57_042014_02_08_08_57_04_blast.log
-rw-r--r-- 1 917155 210  326104853 Feb  7 18:32 lpr_comb_c95_0_18000.maf
-rw-r--r-- 1 917155 210     887278 Feb  8 05:32
lpr_comb_c95_0_18000.maf2014_02_08_00_47_34.mergelog.log
-rw-r--r-- 1 917155 210  433456072 Feb  8 05:32 lpr_comb_c95_0_18000_merge6.maf
-rw-r--r-- 1 917155 210   46735521 Feb  8 05:32
lpr_comb_c95_0_18000_merge6.maf_merged
-rw-r--r-- 1 917155 210  460095829 Feb  9 11:54 lpr_comb_c95_0_30000.maf
-????????? ? ?      ?            ?            ? lpr_comb_c95_0_50k.maf
-rw-r--r-- 1 917155 210  215968052 Feb  7 18:30 lpr_comb_c95_100001_end.maf
-????????? ? ?      ?            ?            ?
lpr_comb_c95_100001_end.maf2014_02_08_01_06_22.mergelog.log
-rw-r--r-- 1 917155 210    2089023 Feb  8 16:13
lpr_comb_c95_100001_end.maf2014_02_08_08_57_04.mergelog.log
-????????? ? ?      ?            ?            ?
lpr_comb_c95_100001_end_merge6_2.maf
-rw-r--r-- 1 917155 210   58659088 Feb  8 16:13
lpr_comb_c95_100001_end_merge6_2.maf_merged
-rw-r--r-- 1 917155 210   49456106 Feb  8 08:02
lpr_comb_c95_100001_end_merge6.maf
-rw-r--r-- 1 917155 210   59065234 Feb  8 08:02
lpr_comb_c95_100001_end_merge6.maf_merged
-rw-r--r-- 1 917155 210  226895678 Feb  5 20:05 lpr_comb_c95_10k_20k.maf
-????????? ? ?      ?            ?            ? lpr_comb_c95_18001_50000.maf
-rw-r--r-- 1 917155 210    1662375 Feb  8 06:48
lpr_comb_c95_18001_50000.maf2014_02_08_00_49_37.mergelog.log
-rw-r--r-- 1 917155 210  256010134 Feb  8 06:48 lpr_comb_c95_18001_50000_merge6.maf
-????????? ? ?      ?            ?            ?
lpr_comb_c95_18001_50000_merge6.maf_merged
-rw-r--r-- 1 917155 210  254099687 Feb  7 18:29 lpr_comb_c95_50001_100000.maf
-rw-r--r-- 1 917155 210    2298305 Feb  8 07:38
lpr_comb_c95_50001_100000.maf2014_02_08_01_04_21.mergelog.log
-rw-r--r-- 1 917155 210  149209527 Feb  8 07:38 lpr_comb_c95_50001_100000_merge6.maf
-rw-r--r-- 1 917155 210   75692995 Feb  8 07:38



ls end 2014-03-03_14:31:14
End 2014-03-03_14:31:31

=====

Here are the URLs of those logs. I have also attached one file containing the kernel logs from the MDS and OSS nodes, named "Server_System_logs_from_MDS_OSS_03-032014.txt".

http://ddntsr.com/ftp/2014-03-03-29545_run9.tar.gz
http://ddntsr.com/ftp/2014-03-03-dc2ossXX_run9_debug_logs.tar.gz

Let me know if you need anything else from our end.

Thank You,
Manish

Comment by Zhenyu Xu [ 04/Mar/14 ]

From 29545_run9/mds_run9_debug.log, I can see the client 149.165.235.42@tcp was evicted at "Tue Mar 4 03:31:13 CST 2014":

00010000:00020000:8.1:1393875073.142106:0:0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 9437s: evicting client at 149.165.235.42@tcp ns: mdt-ffff881843d1a000 lock: ffff882afe438000/0xcf86cae155f4b253 lrc: 3/0,0 mode: PR/PR res: 35127297/2218887223 bits 0x3 rrc: 382 type: IBT flags: 0x4000020 remote: 0x8a1cc956fcd9690 expref: 53 pid: 12292 timeout: 8892268113

at "Tue Mar 4 03:31:14 CST 2014", the client thought that it is still connected with MDS, and tried to reconnect.

00000100:00100000:1:1393875074.464664:0:1977:0:(import.c:720:ptlrpc_connect_import()) @@@ reconnect request 1219 req@ffff8800307ed000 x1461574032228883/t0 o38->dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 368/584 e 0 to 1 dl 0 ref 1 fl New:N/0/0 rc 0/0

But the MDS had already evicted it and told the client so, at which point the client realized it had been evicted:

00010000:00000400:3.0:1393875078.627576:0:12472:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from a69ef170-baaf-a6ea-059d-2e445759c567@149.165.235.42@tcp t230235724090 exp (null) cur 1393875078 last 0

Note: exp is NULL, which means the MDS does not recognize this client and treats it as a new client connection request.

The logs do not show why the client failed to answer the MDS lock request, though.

Comment by Manish Patel (Inactive) [ 07/Mar/14 ]

Hi Zhenyu,

We took a maintenance window and captured a vmcore crash dump by doing a manual SysRq trigger. Here are the backtraces that were captured, in case they shed some light. There are also a few Intel tickets with similar traces, LU-2807, LU-4733, LU-4335 and LU-4195, which I thought might be helpful in investigating this issue.

PID: 20659  TASK: ffff88180eb38ae0  CPU: 11  COMMAND: "mdt_37"
 #0 [ffff88141d639c60] schedule at ffffffff8150de92
 #1 [ffff88141d639d28] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff88141d639d38] ptlrpc_wait_event at ffffffffa07e19e9 [ptlrpc]
 #3 [ffff88141d639dc8] ptlrpc_main at ffffffffa07e956d [ptlrpc]
 #4 [ffff88141d639f48] kernel_thread at ffffffff8100c0ca

PID: 20660  TASK: ffff881850f5e040  CPU: 11  COMMAND: "mdt_38"
 #0 [ffff88141d669690] schedule at ffffffff8150de92
 #1 [ffff88141d669758] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff88141d669768] lov_create at ffffffffa0a8250a [lov]
 #3 [ffff88141d669868] mdd_lov_create at ffffffffa0c7290e [mdd]
 #4 [ffff88141d669908] mdd_create at ffffffffa0c84871 [mdd]
 #5 [ffff88141d669a48] cml_create at ffffffffa0dae3f7 [cmm]
 #6 [ffff88141d669a98] mdt_reint_open at ffffffffa0d0d19d [mdt]
 #7 [ffff88141d669b78] mdt_reint_rec at ffffffffa0cf5c81 [mdt]
 #8 [ffff88141d669b98] mdt_reint_internal at ffffffffa0ceced4 [mdt]
 #9 [ffff88141d669be8] mdt_intent_reint at ffffffffa0ced53d [mdt]
#10 [ffff88141d669c38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
#11 [ffff88141d669c78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
#12 [ffff88141d669cd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#13 [ffff88141d669d48] mdt_enqueue at ffffffffa0cec586 [mdt]
#14 [ffff88141d669d68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#15 [ffff88141d669db8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#16 [ffff88141d669dc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#17 [ffff88141d669f48] kernel_thread at ffffffff8100c0ca

PID: 20661  TASK: ffff882cfdadf540  CPU: 2   COMMAND: "mdt_39"
 #0 [ffff882745d7d870] schedule at ffffffff8150de92
 #1 [ffff882745d7d938] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff882745d7d948] ldlm_completion_ast at ffffffffa07b2dca [ptlrpc]
 #3 [ffff882745d7d9e8] ldlm_cli_enqueue_local at ffffffffa07b2536 [ptlrpc]
 #4 [ffff882745d7da78] mdt_object_lock at ffffffffa0cde2a0 [mdt]
 #5 [ffff882745d7db18] mdt_getattr_name_lock at ffffffffa0ceec62 [mdt]
 #6 [ffff882745d7dbd8] mdt_intent_getattr at ffffffffa0cefc1d [mdt]
 #7 [ffff882745d7dc38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
 #8 [ffff882745d7dc78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
 #9 [ffff882745d7dcd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#10 [ffff882745d7dd48] mdt_enqueue at ffffffffa0cec586 [mdt]
#11 [ffff882745d7dd68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#12 [ffff882745d7ddb8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#13 [ffff882745d7ddc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#14 [ffff882745d7df48] kernel_thread at ffffffff8100c0ca

Another thing: after the maintenance window was over and the customer started mounting the Lustre client nodes, one of the nodes failed right after connecting to the MDS. The only difference was that it used a different mount option: originally that client node was mounted with the "localflock" option, and this time it was mounted with "flock".

Mar  5 13:18:29 c6 kernel: LustreError: 11984:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff881589b2f240 (22024075/0/0/0) (rc: 2)
Mar  5 13:18:29 c6 kernel: Lustre: dc2-OST002c-osc-ffff8880133b3400: Connection restored to service dc2-OST002c using nid 10.10.0.5@o2ib.
Mar  5 13:18:29 c6 kernel: LustreError: 11137:0:(lov_request.c:211:lov_update_enqueue_set()) enqueue objid 0x1843a subobj 0x1500f8b on OST idx 44: rc -5
Mar  5 13:18:29 c6 kernel: LustreError: 11356:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Mar  5 13:39:26 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.11@o2ib. The ldlm_enqueue operation failed with -107
Mar  5 13:39:26 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection to service dc2-OST0073 via nid 10.10.0.11@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar  5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Mar  5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Mar  5 13:39:35 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0073; in progress operations using this service will fail.
Mar  5 13:39:37 c6 kernel: Lustre: Server dc2-OST0073_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar  5 13:39:37 c6 kernel: LustreError: 13543:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff881fd195bc00 x1461691364654913/t0 o4->dc2-OST0073_UUID@10.10.0.11@o2ib:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Mar  5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Mar  5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0073-osc-ffff8880133b3400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar  5 13:39:38 c6 kernel: LustreError: 13462:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Mar  5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff8815c0741b00 (21885599/0/0/0) (rc: 1)
Mar  5 13:39:39 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection restored to service dc2-OST0073 using nid 10.10.0.11@o2ib.
Mar  5 13:44:23 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.8@o2ib. The ldlm_enqueue operation failed with -107
Mar  5 13:44:24 c6 kernel: Lustre: dc2-OST0052-osc-ffff8880133b3400: Connection to service dc2-OST0052 via nid 10.10.0.8@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar  5 13:44:31 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0052; in progress operations using this service will fail.
Mar  5 13:44:31 c6 kernel: Lustre: Server dc2-OST0052_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar  5 13:44:31 c6 kernel: LustreError: 15981:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:44:31 c6 kernel: LustreError: 15948:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar  5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) LBUG
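
For reference, a minimal sketch of the two mount variants mentioned above; the MGS NID and mount point are guesses taken from the logs in this ticket, not the site's actual fstab entries:

mount -t lustre -o localflock 10.10.0.171@o2ib:/dc2 /N/dc2   # flock handled locally on the client only
mount -t lustre -o flock      10.10.0.171@o2ib:/dc2 /N/dc2   # cluster-coherent flock, served via FLK ldlm locks on the MDS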

I have attached the backtrace logs and the logs from server and client nodes for this client eviction incidence.

https://jira.hpdd.intel.com/secure/attachment/14246/server_logs_03_05.log
https://jira.hpdd.intel.com/secure/attachment/14245/rt_119855.c6.lustre_logs
https://jira.hpdd.intel.com/secure/attachment/14244/backtrace.tar.gz

Thank You,
Manish

Comment by Zhenyu Xu [ 10/Mar/14 ]

The last client LBUG is addressed by patch http://review.whamcloud.com/#/c/5971 (LU-3067).

Comment by Manish Patel (Inactive) [ 10/Mar/14 ]

Hi Zhenyu,

Thank you for the updates. What about the MDS server backtrace I provided? Can you check whether it helps explain why the directory listing stats were not showing, as well as the "lock completion callback time out" and blocking timeouts?

Thank You,
Manish

Comment by Zhenyu Xu [ 11/Mar/14 ]

Some information in server_logs_03_05.log reminds me of LU-4359; they look somewhat alike.

Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 5 active RPCs
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880767c28c00 x1461679860288976/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 2 previous similar messages
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 2 previous similar messages
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880883a49000 x1461679860288966/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043548 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 1 previous similar message
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 1 previous similar message
Mar  5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1394043489/real 1394043489]  req@ffff88181a7a8800 x1456766146495206/t0(0) o104->dc2-OST002c@149.165.226.206@tcp:15/16 lens 296/192 e 0 to 1 dl 1394043496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar  5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Mar  5 13:18:16 dc2oss05 kernel: : LustreError: 138-a: dc2-OST002c: A client on nid 149.165.226.206@tcp was evicted due to a lock blocking callback time out: rc -107
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 4 active RPCs
Mar  5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880bd5827400 x1461679860288975/t0(0) o3->86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 3 previous similar messages
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110

In that ticket it turned out there was a problem with the network setup. Does this system use a hybrid network, and would you mind considering the possibility of a network issue?

Comment by Manish Patel (Inactive) [ 13/Mar/14 ]

Hi Zhenyu,

Yes, this is a hybrid network with IB and TCP and multiple subnets. At the last maintenance window we made some network changes suggested by Intel in LU-4531, but we are still seeing the MDS lockup and directory stats listing issues, so I would like you to take another look at the old backtraces I sent last time.

Yesterday we had one more downtime due to an MDS lockup, and we had to reboot the MDS to fix the directory stats listing issues. We have collected the stack traces and the logs from the server and the clients for that time period. The logs are below:

http://ddntsr.com/ftp/2014-03-12-29545_stacktraces.tgz
https://jira.hpdd.intel.com/secure/attachment/14289/march_12_svr.log.gz
https://jira.hpdd.intel.com/secure/attachment/14290/aprun-DebugLogs.tar.gz
http://ddntsr.com/ftp/2014-03-13-29545_es_lustre_showall_2014-03-13_082243.tar.bz2

Let me know if you need anything else from our end.

Thank You,
Manish

Comment by Oleg Drokin [ 14/Mar/14 ]

From the logs it looks like some OST became super slow for some reason (overloaded?); there are signs of MDT requests taking a long time (in the OST logs).
The "all logs" file sadly does not have any logs for dc2mds01 from what I can see, so it's a bit harder to fully verify this theory, but it's pretty plausible.
We can see this picture in your earlier log excerpt too.
The code that waits for the objects to be allocated does so while holding an ldlm lock on the file and its parent, so conflicting accesses also block in a cascading manner.

Also, we can see a bunch of clients dropping (no pings happening), which I am not sure about; some network problems?

Mar 12 09:26:17 dc2oss01 kernel: : LustreError: 20182:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 73s
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 2 previous similar messages
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 79s
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 92s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 8 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 15568:0:(filter.c:3459:filter_setattr()) dc2-OST000a: slow i_alloc_sem 89s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20225:0:(filter_io.c:752:filter_preprw_write()) dc2-OST000a: slow i_alloc_sem 63s
Mar 12 09:26:34 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 111s
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 11 previous similar messages
Mar 12 09:26:57 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: dc2-OST0056: Client adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5 (at 129.79.10.14@tcp) reconnecting
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: Skipped 120 previous similar messages
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0056: connection from adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5@129.79.10.14@tcp t0 exp ffff881829c07800 cur 1394630831 last 1394630831
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 130 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 130s
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 7 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 15219:0:(filter_io.c:871:filter_preprw_write()) dc2-OST000a: slow start_page_write 129s

Comment by Liang Zhen (Inactive) [ 14/Mar/14 ]

Hi, I have a few questions here; it would be helpful if you could provide information about them:

  • when client(s) got evicted, I assume other clients in the same network are still functional, right?
  • if they are still functional, could you run LNet stress tests (LNet selftest) between another client and the servers?
  • when this issue appeared, could you check that all routers (I assume there are four?) are still working:
    • check router status on the evicted clients
    • check router status on other clients in the same network
    • check router status on the server
    • log in to the routers to see if they are able to connect to other nodes

Thanks!

Comment by Isaac Huang (Inactive) [ 14/Mar/14 ]

When running LNet selftest between client and servers, it makes sense to do it in 3 steps:
1. test between clients and routers
2. test between routers and servers
3. test between clients and servers

If either step 1 or step 2 fails, there's no need to go to step 3. If steps 1 and 2 pass but step 3 fails, we know there's something wrong with LNet routing, since the individual segments have already been verified by steps 1 and 2.

Comment by Justin Miller (Inactive) [ 14/Mar/14 ]

IB OSS - 10.10.0.1@o2ib
GNI/IB Router - 30@gni
GNI/IB Router - 10.10.0.51@o2ib
GNI Client - 396@gni

1. test between client and router

/opt/ddn/bin/lnet_selftest.sh -c "396@gni" -s "30@gni" -d 1:1 > /tmp/4616.client-router.lnet

2. test between routers and servers

/opt/ddn/bin/lnet_selftest.sh -c "30@gni" -s "10.10.0.1@o2ib" -d 1:1 > /tmp/4616.router-server.lnet

3. test between clients and servers

/opt/ddn/bin/lnet_selftest.sh -c "396@gni" -s "10.10.0.1@o2ib" -d 1:1 >
/tmp/4616.client-server.lnet
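
For sites without the wrapper script, a roughly equivalent sketch of step 3 using the stock lst utility directly; it assumes the lnet_selftest module is loaded on the console, client, router and server nodes, and the test size/mode here are illustrative only:

modprobe lnet_selftest                  # needed on every node taking part in the test
export LST_SESSION=$$
lst new_session client_server
lst add_group clients 396@gni
lst add_group servers 10.10.0.1@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw read size=1M
lst run bulk
lst stat clients servers                # let it run for a while, then Ctrl-C
lst end_session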

Comment by Justin Miller (Inactive) [ 14/Mar/14 ]

> when client(s) got evicted, I assume other clients in the same network are still functional, right?

Yes, other clients remain functional

> if they are still functional, could you run LNet stress tests (LNet selftest) between another client and servers

Yes, lnet selftest has been run from another client on that network and attached to this ticket

> when this issue appeared, could you check all routers (I assume there are four?) are still working.

There are four TCP<->IB routers for the clusters that mount over TCP.
There are twenty two GNI<->IB routers for the Cray clients on GNI.

The lnet selftests attached are for the GNI<->IB router, IB server, GNI client.

> check router status on the evicted clients

All routers are in UP state

> check router status on other clients in the same network.

All routers are in UP state

> check router status on server

All routers are in UP state

> login routers to see if they are able to connect to other nodes.

GNI<->IB routers are able to ping all GNI clients, all IB Lustre servers, including nodes in the bad state

Comment by Isaac Huang (Inactive) [ 14/Mar/14 ]

In 4616.client-server.lnet and 4616.router-server.lnet, there's lots of output like:
[R] Avg: inf MB/s Min: inf MB/s Max: inf MB/s

These don't look right, but I don't know why yet.

Comment by Bruno Faccini (Inactive) [ 14/Mar/14 ]

Just an update to explain the debug session I ran earlier today with the IU team:

_ They first showed me "live" how easily they can reproduce the issue. It is interesting to note that only certain directories seem to reliably trigger it, while all others do not. When we do a simple "ls" in a known triggering directory, it just hangs, and the client is later evicted and can no longer reconnect.

_ After that, I asked to redo it from a fresh client. Before reproducing, I enabled full debug logs, grew the buffer size to 2M, and cleared the debug log on both the MDS and the client. A few seconds after we reproduced the "ls" hang, I dumped the debug log to /tmp/lustre-debug.bfi on both the MDS and the client, then force-crashed the client and took a full stack trace on the MDS (to be found in /var/log/messages).

Here is the console output of the commands I ran on the client, which shows the "ls" PID:

[root@c4 ~]# lctl clear
[root@c4 ~]# date ; ls /N/dc2/scratch/smccaula &
Fri Mar 14 12:51:52 EDT 2014
[1] 70546
[root@c4 ~]# lctl dk /tmp/lustre-log.bfi
Debug log: 115742 lines, 115742 kept, 0 dropped, 0 bad.
[root@c4 ~]# sync
[root@c4 ~]# echo 1 > /proc/sys/kernel/sysrq
[root@c4 ~]# echo c > /proc/sys
sys/           sysrq-trigger  sysvipc/
[root@c4 ~]# echo c > /proc/sysrq-trigger

After that, I requested via email that the IU team provide us with the following information:

/tmp/lustre-log.bfi for both Client/c4 and MDS
/var/log/messages for both Client/c4 and MDS
crash-dump of Client + lustre-modules/kernel-debuginfo* rpms
LNET NIDs of Client+MDS

Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ]

I checked the uploaded material, and it seems that the /tmp/lustre-log.bfi (full debug log) from both Client/c4 and the MDS is missing, along with the kernel-debuginfo* rpms. They are not available on the IU dropbox either.

It seems that other debug logs have been provided, but their date is more than 2 hours later than the debug session I ran.

Also, does Client/c4 run a stock 2.6.32-358.11.1.el6.x86_64 kernel?

Comment by Peter Jones [ 15/Mar/14 ]

Just to keep all the info in the ticket for those monitoring

"We uploaded the missing /tmp/lustre-log.bfi to the ftp server as well as
the box site.

I am not sure why the timestamps are different. Are the different inside
the file (log entry timestamps) or is the file's timestamp off?

I believe the kernel-debuginfo* files you are looking for are inside the
tar files we uploaded. If they are not, we can re-upload.

I will check with our Cray admin to get the answer on the c4 client
version.

Please let me know if you need anything else."

then

"Concerning he time stamps being off, we had to perform the debug/crash
procedure multiple times before we properly captured all the debug info.
So it makes sense that the time stamps are later, as we performed the
actions later.

I will let you know about the client version when we get an update."

then

"c4 is running 1.8.9 with the ll_getname patch."

Comment by Justin Miller (Inactive) [ 15/Mar/14 ]

The lustre-log.bfi from the MDS has been uploaded to the ftp site and to box.iu.edu as dc2mds01.lustre-log.bfi.gz.

The kernel-debuginfo* rpms are on box.iu.edu as well, and also in the client and MDS tgz files on the ftp site.

Concerning the time stamps being off, we had to perform the debug/crash procedure multiple times before we properly captured all the debug info. So it makes sense that the time stamps are later, as we performed the actions later.

The full debug information provided is from a host named oss30, which is a TCP client that connects through the dc2xfer[01-04] routers.

[root@oss30 ~]# rpm -qa | grep lustre
lustre-client-2.1.6-2.6.32_358.11.1.el6.x86_64.x86_64
lustre-client-modules-2.1.6-2.6.32_358.11.1.el6.x86_64.x86_64
[root@oss30 ~]# uname -a
Linux oss30 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

There is some confusion about the c4 client that needs to be cleared up. c4 was the client we were using for interactive debugging with Bruno, but its debug dumps got lost. We followed the same crash-and-collect procedure on another client, oss30.

Comment by Di Wang [ 15/Mar/14 ]

A few pieces of information gathered from a discussion with Justin:

1. when "ls" failure happens, it only hit the certain directories, i.e. client only got evicted when ls these certain directories. But these "certain" directories might change after each restart.

2. The FS was created with Lustre 2.1.3; right now it is 2.1.6 plus some patches. But the client is 1.8.6, and it mounts at least 2 filesystems, one served by 2.1.6 and another by 1.8. (Justin, please confirm?)

3. The "ls" failure will happen on all clients, no matter 1.8 and 2.1.3 clients.

4. Before the failure happens on these directories, no correlated applications have been identified that touch them. But this needs confirmation.

Comment by Peter Jones [ 15/Mar/14 ]

Thanks Di. For #2 the patches are LU-2943 and LU-3645

Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ]

Justin: Thanks for the clarification. But the "date/time" confusion I seemed to introduce comes from the fact that I was expecting files/logs/crash-dump from the session on Client/c4 and did not find any clear indication that you re-ran the session on another node!

Comment by Bruno Faccini (Inactive) [ 15/Mar/14 ]

Andreas, Di, feel free to correct/add!
The full MDS stacks output in syslog/messages shows a lot of threads stuck doing renames. We actually wonder whether this could be related to the problem.
To confirm this, it would be nice if you could also provide the syslog/messages file from the MDS (dc2mds01?) covering the time when I was connected at IU and we reproduced with Client/c4. Can you provide this?

Also, thanks for providing the dc2mds01.lustre-log.bfi from my debug session, but I still cannot find the Client/c4 one in what is on the ftp or dropbox. In fact, I may be wrong, but I think you did not actually provide anything from my debug session at IU, only what you got from re-running it on your own. Can you help there?

Comment by Justin Miller (Inactive) [ 15/Mar/14 ]

Bruno: You are right, we should have been clearer about the logs that were uploaded. Unfortunately we were unable to save the full stack output from c4. We recreated the full stack output from another very similar client following the same procedures, and that is what we shared with everyone. It is from oss30, which I described previously.

We do have the dc2mds01 syslog/messages from the time of our interactive debug session while using c4. I have uploaded that log to box as 4616.mds01syslog.20130314.txt. In fact, that file is the entire March 14 messages file for dc2mds01, so it will have the c4 client debug as well as our later oss30 client debug.

Comment by Andreas Dilger [ 15/Mar/14 ]

Looking through the mds_messages.log file, and filtering out the many other messages being printed there (e.g. the spurious llog -ENOSPC error that would be fixed by the LU-2336 patch http://review.whamcloud.com/5146, the LU-532 warning that needs patch http://review.whamcloud.com/1161 for b1_8 clients, etc.), it seems the first real problem in that log is Wednesday morning (2014-03-12 09:31):

Mar 12 09:31:29 dc2mds01 kernel: : Lustre: Service thread pid 11131 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 12 09:31:29 dc2mds01 kernel: : Pid: 11131, comm: mdt_40
Mar 12 09:31:29 dc2mds01 kernel: :
Mar 12 09:31:29 dc2mds01 kernel: : Call Trace:
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa04b260e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0a2c50a>] lov_create+0xbaa/0x1400 [lov]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c0f90e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c21871>] mdd_create+0xf81/0x1a90 [mdd]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0d4b3f7>] cml_create+0x97/0x250 [cmm]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0caa19d>] mdt_reint_open+0x1abd/0x28f0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c92c81>] mdt_reint_rec+0x41/0xe0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c89ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c8a53d>] mdt_intent_reint+0x1ed/0x530 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c88c09>] mdt_intent_policy+0x379/0x690 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa06ff391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa07251dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c89586>] mdt_enqueue+0x46/0x130 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c7e772>] mdt_handle_common+0x932/0x1750 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0c7f665>] mdt_regular_handle+0x15/0x20 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [<ffffffffa0753b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : LustreError: dumping log to /tmp/lustre-log.1394631089.11131

After this quite a number of other MDS threads also reported waiting and dumping stacks in lov_create(). Later on Wednesday the MDS was rebooted (17:04 and again at 18:02). The same problem with create threads waiting in lov_create() looks to have happened again a few hours later that evening (2014-03-12 23:01). An increasing number of threads appeared to become permanently blocked on the 13th and 14th, until the MDS was taken offline on the 14th, but I suspect those are just a byproduct of the other threads holding locks on the affected directories.

It isn't totally clear, but this may be caused by the LU-3645 patch http://review.whamcloud.com/7559 that is (AFAIK) installed at IU, but not landed on b2_1 yet. Fan Yong's comments on that patch are:

osc_create_with_fakereq() is a blocking operation; calling it within the interpret environment may cause the ptlrpcd to be blocked. Such behaviour should be avoided, otherwise the whole system may hang.

If you really need to make a blocking call, you can have it processed in some non-ptlrpcd context. For example, a dedicated thread could handle the async_create callback.

I can't say for sure that this is the problem since the stacks of the other threads on the MDS are not available, but it does seem possible that this patch is causing problems. It might also just be that the OSTs are overloaded, but I never saw any messages in the rest of the console logs for any of the MDS threads that reported being blocked, and if the OSTs were just busy the threads would eventually have become unstuck. Given the number of other messages being printed, I would have expected some messages to be printed by those threads (there were other unrelated messages occasionally being printed before the threads dumped stack).

I would recommend that this patch be reverted from the MDS for now, to see if it is the cause of the problem. Should the problem happen again, in particular either of the messages:

Lustre: Service thread pid NNNNN was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later.
LustreError: ### lock timed out (enqueued at NNNNNNNNNN 200s ago); not entering recovery in server code, just going back to sleep

indicating that there are hung threads on the MDS, please dump all of the process stacks on the MDS via "echo t > /proc/sysrq-trigger" and attach the stack traces from the console here. It may be that only a partial dump of the stack traces is shown in /var/log/messages because of the large volume of messages, so preferably they can be captured directly from a serial console.
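
A minimal sketch of capturing that dump when a serial console is not at hand; the output path is just an example, and very large dumps may still be truncated by the kernel ring buffer, which is why the serial console is preferred:

echo 1 > /proc/sys/kernel/sysrq                                # make sure sysrq is enabled
echo t > /proc/sysrq-trigger                                   # dump all task stacks to the kernel log
dmesg > /tmp/dc2mds01-stacks-$(date +%Y%m%d-%H%M%S).txt        # save the ring buffer before it wraps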

At this point, I don't think that the clients have anything to do with the current problem, they are just suffering from the hung MDS threads not being able to service requests, or blocking on locks held by the hung MDS service threads. I don't think the current problem relates to the one reported earlier in this ticket.

The original problem may potentially be caused by the MDS being busy and not processing RPC requests in a timely manner and evicting clients too soon. I saw a number of cases in the logs where a client was evicted due to a timed-out lock callback, and then a few seconds later the MDS rejected a cancel request from the same client. That kind of problem might be fixed by increasing the AT max timeout, or more recent changes to high-priority request handling.
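
If the site wants to experiment with that, a hedged sketch of raising the adaptive-timeouts ceiling; the value 600 is only an example, the parameter should normally be set consistently on servers and clients, and the right value is site-specific:

lctl get_param at_max          # check the current ceiling first
lctl set_param at_max=600      # raise it on the MDS/OSS nodes and the clients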

Mar 10 17:52:11 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 149.165.238.1@tcp  ns: mdt-ffff881837d4e000 lock: ffff881d6af0c000/0x78c70fdb970d7e0 lrc: 3/0,0 mode: PR/PR res: 8589943942/77226 bits 0x3 rrc: 13 type: IBT flags: 0x4000020 remote: 0xe213f1ffc604946c expref: 54330 pid: 11173 timeout: 4825691675
Mar 10 17:52:12 dc2mds01 kernel: : LustreError: 20011:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 149.165.238.1@tcp arrived at 1394488331 with bad export cookie 543933852487261410

It would probably be good to split these issues into two separate bugs.

Comment by nasf (Inactive) [ 16/Mar/14 ]

mdt_reint (including the create case) and mdt_getattr are handled via the same request portal on the MDS, so as long as create operations are blocked because the interpret callback hangs, the "ls -l" on the client will still be affected even though other rdpg/seq service threads are idle. This is exactly what you have seen: the file name list is available (via the rdpg service), but you cannot get the owner/mode information (via mdt_getattr/mdt_getattr_name).

So I agree with Andreas to drop the patch http://review.whamcloud.com/7559 as a temporary solution.

Comment by Peter Jones [ 02/May/14 ]

As per DDN, it is OK to close this ticket.
