[LU-10090] unable to mount client through router Created: 05/Oct/17  Updated: 07/Oct/17  Resolved: 07/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Sonia Sharma (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File mountnbp1.trace.ldebug.out.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Lustre routers: 2.10.1 and 2.9
Clients Lustre 2.10.1 and 2.9
Servers 2.7.1

Some node will not mount lustre after a reboot. Once they get into this state even a reboot won't fix the mounting issue. We can ping and lctl ping the router. The mount request get to mgs and mgs replies. When the mount is attempted on the router we get this error

Oct  5 16:11:56 elrtr8 kernel: [1507245116.628251] LNetError: 13746:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out RDMA with 10.149.4.164@o2ib313 (57): c: 58, oc: 0, rc: 63

Here is the debug trace from the client:

00000020:01200004:12.0:1507244951.155731:0:7926:0:(obd_mount.c:1383:lustre_fill_super()) VFS Op: sb ffff88102bae7000
00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:845:lmd_print())   mount data:
00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:847:lmd_print()) profile: nbp1-client
00000020:01000004:12.0:1507244951.155741:0:7926:0:(obd_mount.c:848:lmd_print()) device:  10.151.26.117@o2ib:/nbp1
00000020:01000004:12.0:1507244951.155742:0:7926:0:(obd_mount.c:849:lmd_print()) flags:   2
00000020:01000004:12.0:1507244951.155743:0:7926:0:(obd_mount.c:852:lmd_print()) options: flock
00000020:01000004:12.0:1507244951.155744:0:7926:0:(obd_mount.c:1408:lustre_fill_super()) Mounting client nbp1-client
00000020:01000004:12.0:1507244951.155799:0:7926:0:(obd_mount.c:335:lustre_start_mgc()) Start MGC 'MGC10.151.26.117@o2ib'
00000020:00000080:12.0:1507244951.155803:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf005
00000020:00000080:12.0:1507244951.155806:0:7926:0:(obd_config.c:1155:class_process_config()) adding mapping from uuid MGC10.151.26.117@o2ib_0 to nid 0x500000a971a75 (10.151.26.117@o2ib)
00000020:01000004:12.0:1507244951.155825:0:7926:0:(obd_mount.c:190:lustre_start_simple()) Starting obd MGC10.151.26.117@o2ib (typ=mgc)
00000020:00000080:12.0:1507244951.155827:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
00000020:00000080:12.0:1507244951.155828:0:7926:0:(obd_config.c:358:class_attach()) attach type mgc name: MGC10.151.26.117@o2ib uuid: da6e980a-41dc-cc34-5d4d-3f8c4e7055bd
00000020:00000080:12.0:1507244951.155881:0:7926:0:(genops.c:367:class_newdev()) Adding new device MGC10.151.26.117@o2ib (ffff88102ee7f048)
00000020:00000080:12.0:1507244951.155884:0:7926:0:(obd_config.c:428:class_attach()) OBD: dev 0 attached type mgc with refcount 1
00000020:00000080:12.0:1507244951.155887:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
00010000:00080000:12.0:1507244951.161768:0:7926:0:(ldlm_lib.c:108:import_set_conn()) imp ffff88103ac54000@MGC10.151.26.117@o2ib: add connection MGC10.151.26.117@o2ib_0 at head
00000040:01000000:12.0:1507244951.161795:0:7926:0:(llog_obd.c:210:llog_setup()) obd MGC10.151.26.117@o2ib ctxt 1 is initialized
10000000:01000000:13.0:1507244951.161847:0:7935:0:(mgc_request.c:590:mgc_requeue_thread()) Starting requeue thread
00000020:00000080:12.0:1507244951.161853:0:7926:0:(obd_config.c:548:class_setup()) finished setup of obd MGC10.151.26.117@o2ib (uuid da6e980a-41dc-cc34-5d4d-3f8c4e7055bd)
00000020:00000080:12.0:1507244951.161859:0:7926:0:(genops.c:1165:class_connect()) connect: client da6e980a-41dc-cc34-5d4d-3f8c4e7055bd, cookie 0x9b89528d2f304388
00000100:00080000:12.0:1507244951.161862:0:7926:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from NEW to CONNECTING
00000100:00080000:12.0:1507244951.161865:0:7926:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 0
00000100:00080000:12.0:1507244951.161868:0:7926:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
00000100:00080000:12.0:1507244951.161880:0:7926:0:(pinger.c:409:ptlrpc_pinger_add_import()) adding pingable import da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS
00000080:01000000:12.0:1507244951.161903:0:7926:0:(llite_lib.c:107:ll_init_sbi()) generated uuid: b353adbd-7e36-52a1-59db-5414fb4a5659
00000080:01000000:12.0:1507244951.161905:0:7926:0:(llite_lib.c:793:ll_options()) Parsing opts flock
10000000:01000000:12.0:1507244951.161974:0:7926:0:(mgc_request.c:2148:mgc_process_config()) parse_log nbp1-client from 0
10000000:01000000:12.0:1507244951.161976:0:7926:0:(mgc_request.c:323:config_log_add()) adding config log nbp1-client:ffff88102bae7000
10000000:01000000:12.0:1507244951.161978:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-sptlrpc:          (null)
10000000:01000000:12.0:1507244951.161980:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-sptlrpc to resid 0x3170626e/0x0 (nbp1)
10000000:01000000:12.0:1507244951.161984:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-sptlrpc:          (null) from 1
10000000:01000000:12.0:1507244951.161985:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-sptlrpc (res 0x3170626e)
00000100:00080000:12.0:1507244951.161999:0:7926:0:(client.c:1561:ptlrpc_send_new_req()) @@@ req from PID 7926 waiting for recovery: (FULL != CONNECTING)  req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:27.0F:1507244951.162249:0:7777:0:(import.c:1043:ptlrpc_connect_interpret()) MGC10.151.26.117@o2ib: connect to target with instance 0
00000100:00080000:27.0:1507244951.162257:0:7777:0:(import.c:918:ptlrpc_connect_set_flags()) MGC10.151.26.117@o2ib: Resetting ns_connect_flags to server flags: 0x11005000020
10000000:01000000:27.0:1507244951.162261:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808005
00000100:00080000:27.0:1507244951.162263:0:7777:0:(import.c:1146:ptlrpc_connect_interpret()) ffff88103ac54000 MGS: changing import state from CONNECTING to FULL
10000000:01000000:27.0:1507244951.162266:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808004
00000100:00080000:27.0:1507244951.162271:0:7777:0:(pinger.c:171:ptlrpc_pinger_ir_up()) IR up
00000100:00080000:27.0:1507244951.162276:0:7777:0:(recover.c:241:ptlrpc_wake_delayed()) @@@ waking (set ffff88103b7d3dc0):  req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
10000000:01000000:10.0F:1507244951.168915:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-sptlrpc' failed (-2).
10000000:01000000:10.0:1507244951.168922:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log params:ffff88102bae7000
10000000:01000000:10.0:1507244951.168925:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log params to resid 0x736d61726170/0x3 (params)
10000000:01000000:10.0:1507244951.168928:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-client:ffff88102bae7000
10000000:01000000:10.0:1507244951.168930:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-client to resid 0x3170626e/0x0 (nbp1)
10000000:01000000:10.0:1507244951.168932:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-cliir:ffff88102bae7000
10000000:01000000:10.0:1507244951.168933:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-cliir to resid 0x3170626e/0x2 (nbp1)
10000000:01000000:10.0:1507244951.168935:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-client:ffff88102bae7000 from 1
10000000:01000000:10.0:1507244951.168937:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-client (res 0x3170626e)
00000800:00000100:14.0:1507245116.707129:0:7765:0:(o2iblnd_cb.c:1915:kiblnd_close_conn_locked()) Closing conn to 10.149.25.204@o2ib313: error 0(waiting)
00000100:00000400:53.0F:1507245293.173620:0:7926:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507244951/real 1507244951]  req@ffff88102bccbc80 x1580459456725424/t0(0) o503->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507245293 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:02020000:53.0:1507245293.173648:0:7926:0:(import.c:185:ptlrpc_set_import_discon()) 166-1: MGC10.151.26.117@o2ib: Connection to MGS (at 10.151.26.117@o2ib) was lost; in progress operations using this service will fail
00000100:00080000:53.0:1507245293.173652:0:7926:0:(import.c:187:ptlrpc_set_import_discon()) ffff88103ac54000 MGS: changing import state from FULL to DISCONN
10000000:01000000:53.0:1507245293.173655:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808001
00000100:00080000:53.0:1507245293.173657:0:7926:0:(pinger.c:178:ptlrpc_pinger_ir_down()) IR down
00000100:00080000:53.0:1507245293.173658:0:7926:0:(import.c:436:ptlrpc_fail_import()) import MGS@MGC10.151.26.117@o2ib_0 for MGC10.151.26.117@o2ib not replayable, auto-deactivating
00000100:00080000:53.0:1507245293.173660:0:7926:0:(import.c:214:ptlrpc_deactivate_and_unlock_import()) setting import MGS INVALID
10000000:01000000:53.0:1507245293.173663:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808002
00000100:00080000:53.0:1507245293.173665:0:7926:0:(import.c:413:ptlrpc_pinger_force()) MGS: waking up pinger s:DISCONN
00000100:00080000:27.0:1507245293.173673:0:7834:0:(pinger.c:217:ptlrpc_pinger_process_import()) da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:27.0:1507245293.173677:0:7834:0:(recover.c:58:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:27.0:1507245293.173679:0:7834:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:27.0:1507245293.173683:0:7834:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 4296875997
00000100:00080000:27.0:1507245293.173687:0:7834:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
10000000:01000000:53.0:1507245293.173688:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-client' failed (-5).
00000020:02020000:53.0:1507245293.173694:0:7926:0:(obd_mount.c:114:lustre_process_log()) 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
10000000:01000000:53.0:1507245293.173704:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log nbp1-cliir
10000000:01000000:53.0:1507245293.173706:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log params
10000000:01000000:53.0:1507245293.173708:0:7926:0:(mgc_request.c:501:config_log_end()) end config log nbp1-client (0)
00000080:02000400:53.0:1507245293.173836:0:7926:0:(llite_lib.c:1135:ll_put_super()) Unmounted nbp1-client




 Comments   
Comment by Mahmoud Hanafi [ 05/Oct/17 ]

Here is the log from the MGS

00000100:00100000:21.0:1507244951.155843:0:5942:0:(events.c:349:request_in_callback()) peer: 12345-10.149.4.164@o2ib313
00000100:00100000:20.0:1507244951.155882:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016795
00000100:00100000:20.0:1507244951.155887:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:0+-99:7777:x1580459456725344:12345-10.149.4.164@o2ib313:250
00010000:00080000:20.0:1507244951.155905:0:58048:0:(ldlm_lib.c:1034:target_handle_connect()) MGS: connection from da6e980a-41dc-cc34-5d4d-3f8c4e7055bd@10.149.4.164@o2ib313 t0 exp (null) cur 1507244951 last 0
00000100:00100000:20.0:1507244951.155979:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+4:7777:x1580459456725344:12345-10.149.4.164@o2ib313:250 Request procesed in 94us (141us total) trans 0 rc 0/0
00000100:00100000:20.0:1507244951.155983:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016795
00000100:00100000:9.0:1507244951.156258:0:5934:0:(events.c:349:request_in_callback()) peer: 12345-10.149.4.164@o2ib313
00000100:00100000:20.0:1507244951.156301:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016796
00000100:00100000:20.0:1507244951.156305:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+4:7926:x1580459456725360:12345-10.149.4.164@o2ib313:101
00000100:00100000:20.0:1507244951.162257:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+5:7926:x1580459456725360:12345-10.149.4.164@o2ib313:101 Request procesed in 5951us (5998us total) trans 0 rc 0/0
00000100:00100000:20.0:1507244951.162261:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016796
00000100:00100000:20.0:1507244951.162551:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016802
00000100:00100000:20.0:1507244951.162554:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+5:7926:x1580459456725376:12345-10.149.4.164@o2ib313:501
00000100:00100000:20.0:1507244951.162592:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+5:7926:x1580459456725376:12345-10.149.4.164@o2ib313:501 Request procesed in 38us (58us total) trans 0 rc -2/-2
00000100:00100000:20.0:1507244951.162595:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016802
00000100:00100000:9.0:1507244951.162917:0:5934:0:(events.c:349:request_in_callback()) peer: 12345-10.149.4.164@o2ib313
00000100:00100000:20.0:1507244951.162935:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016803
00000100:00100000:20.0:1507244951.162938:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+5:7926:x1580459456725392:12345-10.149.4.164@o2ib313:101
00000100:00100000:20.0:1507244951.164599:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+6:7926:x1580459456725392:12345-10.149.4.164@o2ib313:101 Request procesed in 1659us (1682us total) trans 0 rc 0/0
00000100:00100000:20.0:1507244951.164602:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016803
00000100:00100000:16.0:1507244951.164877:0:5939:0:(events.c:349:request_in_callback()) peer: 12345-10.149.4.164@o2ib313
00000100:00100000:20.0:1507244951.164900:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016806
00000100:00100000:20.0:1507244951.164905:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+6:7926:x1580459456725408:12345-10.149.4.164@o2ib313:501
00000100:00100000:20.0:1507244951.164954:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+7:7926:x1580459456725408:12345-10.149.4.164@o2ib313:501 Request procesed in 50us (81us total) trans 0 rc 0/0
00000100:00100000:20.0:1507244951.164960:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016806
00000100:00100000:20.0:1507244951.165168:0:58048:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016807
00000100:00100000:20.0:1507244951.165173:0:58048:0:(service.c:2069:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+7:7926:x1580459456725424:12345-10.149.4.164@o2ib313:503
00000100:00100000:20.0:1507244951.165222:0:58048:0:(service.c:2119:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0006:da6e980a-41dc-cc34-5d4d-3f8c4e7055bd+7:7926:x1580459456725424:12345-10.149.4.164@o2ib313:503 Request procesed in 49us (77us total) trans 0 rc 0/0
00000100:00100000:20.0:1507244951.165227:0:58048:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.149.4.164@o2ib313, seq: 2519016807
Comment by Mahmoud Hanafi [ 06/Oct/17 ]

Attached more detail trace from  the client.

 

Comment by Joseph Gmitter (Inactive) [ 06/Oct/17 ]

Hi Sonia,

Can you please investigate this?

Thanks.
Joe

Comment by Mahmoud Hanafi [ 07/Oct/17 ]

We found the issue to be a misbehaving ib switch. This case can be close.

 

Comment by Joseph Gmitter (Inactive) [ 07/Oct/17 ]

Thanks, will close the case.

Generated at Sat Feb 10 02:31:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.