Details
Bug
Resolution: Not a Bug
Major
Description
Routers: Lustre 2.10.1 and 2.9
Clients: Lustre 2.10.1 and 2.9
Servers: Lustre 2.7.1
Some nodes will not mount Lustre after a reboot. Once a node gets into this state, even another reboot won't fix the mount issue. We can ping and lctl ping the router. The mount request gets to the MGS and the MGS replies. When the mount is attempted, we get this error on the router:
Oct 5 16:11:56 elrtr8 kernel: [1507245116.628251] LNetError: 13746:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out RDMA with 10.149.4.164@o2ib313 (57): c: 58, oc: 0, rc: 63
Here is the debug trace from the client:
00000020:01200004:12.0:1507244951.155731:0:7926:0:(obd_mount.c:1383:lustre_fill_super()) VFS Op: sb ffff88102bae7000
00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:845:lmd_print()) mount data:
00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:847:lmd_print()) profile: nbp1-client
00000020:01000004:12.0:1507244951.155741:0:7926:0:(obd_mount.c:848:lmd_print()) device: 10.151.26.117@o2ib:/nbp1
00000020:01000004:12.0:1507244951.155742:0:7926:0:(obd_mount.c:849:lmd_print()) flags: 2
00000020:01000004:12.0:1507244951.155743:0:7926:0:(obd_mount.c:852:lmd_print()) options: flock
00000020:01000004:12.0:1507244951.155744:0:7926:0:(obd_mount.c:1408:lustre_fill_super()) Mounting client nbp1-client
00000020:01000004:12.0:1507244951.155799:0:7926:0:(obd_mount.c:335:lustre_start_mgc()) Start MGC 'MGC10.151.26.117@o2ib'
00000020:00000080:12.0:1507244951.155803:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf005
00000020:00000080:12.0:1507244951.155806:0:7926:0:(obd_config.c:1155:class_process_config()) adding mapping from uuid MGC10.151.26.117@o2ib_0 to nid 0x500000a971a75 (10.151.26.117@o2ib)
00000020:01000004:12.0:1507244951.155825:0:7926:0:(obd_mount.c:190:lustre_start_simple()) Starting obd MGC10.151.26.117@o2ib (typ=mgc)
00000020:00000080:12.0:1507244951.155827:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
00000020:00000080:12.0:1507244951.155828:0:7926:0:(obd_config.c:358:class_attach()) attach type mgc name: MGC10.151.26.117@o2ib uuid: da6e980a-41dc-cc34-5d4d-3f8c4e7055bd
00000020:00000080:12.0:1507244951.155881:0:7926:0:(genops.c:367:class_newdev()) Adding new device MGC10.151.26.117@o2ib (ffff88102ee7f048)
00000020:00000080:12.0:1507244951.155884:0:7926:0:(obd_config.c:428:class_attach()) OBD: dev 0 attached type mgc with refcount 1
00000020:00000080:12.0:1507244951.155887:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
00010000:00080000:12.0:1507244951.161768:0:7926:0:(ldlm_lib.c:108:import_set_conn()) imp ffff88103ac54000@MGC10.151.26.117@o2ib: add connection MGC10.151.26.117@o2ib_0 at head
00000040:01000000:12.0:1507244951.161795:0:7926:0:(llog_obd.c:210:llog_setup()) obd MGC10.151.26.117@o2ib ctxt 1 is initialized
10000000:01000000:13.0:1507244951.161847:0:7935:0:(mgc_request.c:590:mgc_requeue_thread()) Starting requeue thread
00000020:00000080:12.0:1507244951.161853:0:7926:0:(obd_config.c:548:class_setup()) finished setup of obd MGC10.151.26.117@o2ib (uuid da6e980a-41dc-cc34-5d4d-3f8c4e7055bd)
00000020:00000080:12.0:1507244951.161859:0:7926:0:(genops.c:1165:class_connect()) connect: client da6e980a-41dc-cc34-5d4d-3f8c4e7055bd, cookie 0x9b89528d2f304388
00000100:00080000:12.0:1507244951.161862:0:7926:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from NEW to CONNECTING
00000100:00080000:12.0:1507244951.161865:0:7926:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 0
00000100:00080000:12.0:1507244951.161868:0:7926:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
00000100:00080000:12.0:1507244951.161880:0:7926:0:(pinger.c:409:ptlrpc_pinger_add_import()) adding pingable import da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS
00000080:01000000:12.0:1507244951.161903:0:7926:0:(llite_lib.c:107:ll_init_sbi()) generated uuid: b353adbd-7e36-52a1-59db-5414fb4a5659
00000080:01000000:12.0:1507244951.161905:0:7926:0:(llite_lib.c:793:ll_options()) Parsing opts flock
10000000:01000000:12.0:1507244951.161974:0:7926:0:(mgc_request.c:2148:mgc_process_config()) parse_log nbp1-client from 0
10000000:01000000:12.0:1507244951.161976:0:7926:0:(mgc_request.c:323:config_log_add()) adding config log nbp1-client:ffff88102bae7000
10000000:01000000:12.0:1507244951.161978:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-sptlrpc: (null)
10000000:01000000:12.0:1507244951.161980:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-sptlrpc to resid 0x3170626e/0x0 (nbp1)
10000000:01000000:12.0:1507244951.161984:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-sptlrpc: (null) from 1
10000000:01000000:12.0:1507244951.161985:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-sptlrpc (res 0x3170626e)
00000100:00080000:12.0:1507244951.161999:0:7926:0:(client.c:1561:ptlrpc_send_new_req()) @@@ req from PID 7926 waiting for recovery: (FULL != CONNECTING) req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:27.0F:1507244951.162249:0:7777:0:(import.c:1043:ptlrpc_connect_interpret()) MGC10.151.26.117@o2ib: connect to target with instance 0
00000100:00080000:27.0:1507244951.162257:0:7777:0:(import.c:918:ptlrpc_connect_set_flags()) MGC10.151.26.117@o2ib: Resetting ns_connect_flags to server flags: 0x11005000020
10000000:01000000:27.0:1507244951.162261:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808005
00000100:00080000:27.0:1507244951.162263:0:7777:0:(import.c:1146:ptlrpc_connect_interpret()) ffff88103ac54000 MGS: changing import state from CONNECTING to FULL
10000000:01000000:27.0:1507244951.162266:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808004
00000100:00080000:27.0:1507244951.162271:0:7777:0:(pinger.c:171:ptlrpc_pinger_ir_up()) IR up
00000100:00080000:27.0:1507244951.162276:0:7777:0:(recover.c:241:ptlrpc_wake_delayed()) @@@ waking (set ffff88103b7d3dc0): req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
10000000:01000000:10.0F:1507244951.168915:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-sptlrpc' failed (-2).
10000000:01000000:10.0:1507244951.168922:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log params:ffff88102bae7000
10000000:01000000:10.0:1507244951.168925:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log params to resid 0x736d61726170/0x3 (params)
10000000:01000000:10.0:1507244951.168928:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-client:ffff88102bae7000
10000000:01000000:10.0:1507244951.168930:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-client to resid 0x3170626e/0x0 (nbp1)
10000000:01000000:10.0:1507244951.168932:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-cliir:ffff88102bae7000
10000000:01000000:10.0:1507244951.168933:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-cliir to resid 0x3170626e/0x2 (nbp1)
10000000:01000000:10.0:1507244951.168935:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-client:ffff88102bae7000 from 1
10000000:01000000:10.0:1507244951.168937:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-client (res 0x3170626e)
00000800:00000100:14.0:1507245116.707129:0:7765:0:(o2iblnd_cb.c:1915:kiblnd_close_conn_locked()) Closing conn to 10.149.25.204@o2ib313: error 0(waiting)
00000100:00000400:53.0F:1507245293.173620:0:7926:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507244951/real 1507244951] req@ffff88102bccbc80 x1580459456725424/t0(0) o503->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507245293 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:02020000:53.0:1507245293.173648:0:7926:0:(import.c:185:ptlrpc_set_import_discon()) 166-1: MGC10.151.26.117@o2ib: Connection to MGS (at 10.151.26.117@o2ib) was lost; in progress operations using this service will fail
00000100:00080000:53.0:1507245293.173652:0:7926:0:(import.c:187:ptlrpc_set_import_discon()) ffff88103ac54000 MGS: changing import state from FULL to DISCONN
10000000:01000000:53.0:1507245293.173655:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808001
00000100:00080000:53.0:1507245293.173657:0:7926:0:(pinger.c:178:ptlrpc_pinger_ir_down()) IR down
00000100:00080000:53.0:1507245293.173658:0:7926:0:(import.c:436:ptlrpc_fail_import()) import MGS@MGC10.151.26.117@o2ib_0 for MGC10.151.26.117@o2ib not replayable, auto-deactivating
00000100:00080000:53.0:1507245293.173660:0:7926:0:(import.c:214:ptlrpc_deactivate_and_unlock_import()) setting import MGS INVALID
10000000:01000000:53.0:1507245293.173663:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808002
00000100:00080000:53.0:1507245293.173665:0:7926:0:(import.c:413:ptlrpc_pinger_force()) MGS: waking up pinger s:DISCONN
00000100:00080000:27.0:1507245293.173673:0:7834:0:(pinger.c:217:ptlrpc_pinger_process_import()) da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:27.0:1507245293.173677:0:7834:0:(recover.c:58:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:27.0:1507245293.173679:0:7834:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:27.0:1507245293.173683:0:7834:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 4296875997
00000100:00080000:27.0:1507245293.173687:0:7834:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
10000000:01000000:53.0:1507245293.173688:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-client' failed (-5).
00000020:02020000:53.0:1507245293.173694:0:7926:0:(obd_mount.c:114:lustre_process_log()) 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
10000000:01000000:53.0:1507245293.173704:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log nbp1-cliir
10000000:01000000:53.0:1507245293.173706:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log params
10000000:01000000:53.0:1507245293.173708:0:7926:0:(mgc_request.c:501:config_log_end()) end config log nbp1-client (0)
00000080:02000400:53.0:1507245293.173836:0:7926:0:(llite_lib.c:1135:ll_put_super()) Unmounted nbp1-client
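To make the timing in the trace easier to read, here is a minimal Python sketch that splits two of the entries above into fields and measures how long the client waited between enqueueing the nbp1-client config log and the request timing out. It is a reading aid, not part of the original report or of any Lustre tooling. The regular expression and the field names (`subsys`, `mask`, `cpu`, and so on) are assumptions inferred from the layout of the entries shown, and the timeout message is shortened. The helper `decode_resid` is hypothetical; it only illustrates that the resid values in the trace read as the log name packed into little-endian ASCII bytes.

```python
import re

# Assumed layout of one debug entry, inferred from the trace above:
# subsys:mask:cpu:timestamp:0:pid:0:(file:line:function()) message
ENTRY = re.compile(
    r"(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>\d+\.\d+F?):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_entry(text):
    """Split one debug entry into a dict of fields, or return None."""
    match = ENTRY.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = float(fields["ts"])  # seconds since the epoch
    return fields


def decode_resid(value):
    """Hypothetical helper: read a resid as little-endian ASCII bytes."""
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "little").decode("ascii")


# Two entries copied from the trace (timeout message shortened).
enqueue = parse_entry(
    "10000000:01000000:10.0:1507244951.168937:0:7926:0:"
    "(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-client (res 0x3170626e)"
)
timeout = parse_entry(
    "00000100:00000400:53.0F:1507245293.173620:0:7926:0:"
    "(client.c:2111:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for slow reply"
)

# Seconds between the enqueue and the timeout.
gap = timeout["ts"] - enqueue["ts"]
print(round(gap))
print(decode_resid(0x3170626E), decode_resid(0x736D61726170))
```

The gap between the two entries is about 342 seconds, which matches the difference between the `sent 1507244951` and `dl 1507245293` values in the timed-out request.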