Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10090

unable to mount client through router

    XMLWordPrintable

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Lustre routers: 2.10.1 and 2.9
      Clients Lustre 2.10.1 and 2.9
      Servers 2.7.1

      Some node will not mount lustre after a reboot. Once they get into this state even a reboot won't fix the mounting issue. We can ping and lctl ping the router. The mount request get to mgs and mgs replies. When the mount is attempted on the router we get this error

      Oct  5 16:11:56 elrtr8 kernel: [1507245116.628251] LNetError: 13746:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out RDMA with 10.149.4.164@o2ib313 (57): c: 58, oc: 0, rc: 63
      
      

      Here is the debug trace from the client:

      00000020:01200004:12.0:1507244951.155731:0:7926:0:(obd_mount.c:1383:lustre_fill_super()) VFS Op: sb ffff88102bae7000
      00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:845:lmd_print())   mount data:
      00000020:01000004:12.0:1507244951.155740:0:7926:0:(obd_mount.c:847:lmd_print()) profile: nbp1-client
      00000020:01000004:12.0:1507244951.155741:0:7926:0:(obd_mount.c:848:lmd_print()) device:  10.151.26.117@o2ib:/nbp1
      00000020:01000004:12.0:1507244951.155742:0:7926:0:(obd_mount.c:849:lmd_print()) flags:   2
      00000020:01000004:12.0:1507244951.155743:0:7926:0:(obd_mount.c:852:lmd_print()) options: flock
      00000020:01000004:12.0:1507244951.155744:0:7926:0:(obd_mount.c:1408:lustre_fill_super()) Mounting client nbp1-client
      00000020:01000004:12.0:1507244951.155799:0:7926:0:(obd_mount.c:335:lustre_start_mgc()) Start MGC 'MGC10.151.26.117@o2ib'
      00000020:00000080:12.0:1507244951.155803:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf005
      00000020:00000080:12.0:1507244951.155806:0:7926:0:(obd_config.c:1155:class_process_config()) adding mapping from uuid MGC10.151.26.117@o2ib_0 to nid 0x500000a971a75 (10.151.26.117@o2ib)
      00000020:01000004:12.0:1507244951.155825:0:7926:0:(obd_mount.c:190:lustre_start_simple()) Starting obd MGC10.151.26.117@o2ib (typ=mgc)
      00000020:00000080:12.0:1507244951.155827:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
      00000020:00000080:12.0:1507244951.155828:0:7926:0:(obd_config.c:358:class_attach()) attach type mgc name: MGC10.151.26.117@o2ib uuid: da6e980a-41dc-cc34-5d4d-3f8c4e7055bd
      00000020:00000080:12.0:1507244951.155881:0:7926:0:(genops.c:367:class_newdev()) Adding new device MGC10.151.26.117@o2ib (ffff88102ee7f048)
      00000020:00000080:12.0:1507244951.155884:0:7926:0:(obd_config.c:428:class_attach()) OBD: dev 0 attached type mgc with refcount 1
      00000020:00000080:12.0:1507244951.155887:0:7926:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
      00010000:00080000:12.0:1507244951.161768:0:7926:0:(ldlm_lib.c:108:import_set_conn()) imp ffff88103ac54000@MGC10.151.26.117@o2ib: add connection MGC10.151.26.117@o2ib_0 at head
      00000040:01000000:12.0:1507244951.161795:0:7926:0:(llog_obd.c:210:llog_setup()) obd MGC10.151.26.117@o2ib ctxt 1 is initialized
      10000000:01000000:13.0:1507244951.161847:0:7935:0:(mgc_request.c:590:mgc_requeue_thread()) Starting requeue thread
      00000020:00000080:12.0:1507244951.161853:0:7926:0:(obd_config.c:548:class_setup()) finished setup of obd MGC10.151.26.117@o2ib (uuid da6e980a-41dc-cc34-5d4d-3f8c4e7055bd)
      00000020:00000080:12.0:1507244951.161859:0:7926:0:(genops.c:1165:class_connect()) connect: client da6e980a-41dc-cc34-5d4d-3f8c4e7055bd, cookie 0x9b89528d2f304388
      00000100:00080000:12.0:1507244951.161862:0:7926:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from NEW to CONNECTING
      00000100:00080000:12.0:1507244951.161865:0:7926:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 0
      00000100:00080000:12.0:1507244951.161868:0:7926:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
      00000100:00080000:12.0:1507244951.161880:0:7926:0:(pinger.c:409:ptlrpc_pinger_add_import()) adding pingable import da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS
      00000080:01000000:12.0:1507244951.161903:0:7926:0:(llite_lib.c:107:ll_init_sbi()) generated uuid: b353adbd-7e36-52a1-59db-5414fb4a5659
      00000080:01000000:12.0:1507244951.161905:0:7926:0:(llite_lib.c:793:ll_options()) Parsing opts flock
      10000000:01000000:12.0:1507244951.161974:0:7926:0:(mgc_request.c:2148:mgc_process_config()) parse_log nbp1-client from 0
      10000000:01000000:12.0:1507244951.161976:0:7926:0:(mgc_request.c:323:config_log_add()) adding config log nbp1-client:ffff88102bae7000
      10000000:01000000:12.0:1507244951.161978:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-sptlrpc:          (null)
      10000000:01000000:12.0:1507244951.161980:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-sptlrpc to resid 0x3170626e/0x0 (nbp1)
      10000000:01000000:12.0:1507244951.161984:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-sptlrpc:          (null) from 1
      10000000:01000000:12.0:1507244951.161985:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-sptlrpc (res 0x3170626e)
      00000100:00080000:12.0:1507244951.161999:0:7926:0:(client.c:1561:ptlrpc_send_new_req()) @@@ req from PID 7926 waiting for recovery: (FULL != CONNECTING)  req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      00000100:00080000:27.0F:1507244951.162249:0:7777:0:(import.c:1043:ptlrpc_connect_interpret()) MGC10.151.26.117@o2ib: connect to target with instance 0
      00000100:00080000:27.0:1507244951.162257:0:7777:0:(import.c:918:ptlrpc_connect_set_flags()) MGC10.151.26.117@o2ib: Resetting ns_connect_flags to server flags: 0x11005000020
      10000000:01000000:27.0:1507244951.162261:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808005
      00000100:00080000:27.0:1507244951.162263:0:7777:0:(import.c:1146:ptlrpc_connect_interpret()) ffff88103ac54000 MGS: changing import state from CONNECTING to FULL
      10000000:01000000:27.0:1507244951.162266:0:7777:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808004
      00000100:00080000:27.0:1507244951.162271:0:7777:0:(pinger.c:171:ptlrpc_pinger_ir_up()) IR up
      00000100:00080000:27.0:1507244951.162276:0:7777:0:(recover.c:241:ptlrpc_wake_delayed()) @@@ waking (set ffff88103b7d3dc0):  req@ffff88103b4fa940 x1580459456725360/t0(0) o101->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      10000000:01000000:10.0F:1507244951.168915:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-sptlrpc' failed (-2).
      10000000:01000000:10.0:1507244951.168922:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log params:ffff88102bae7000
      10000000:01000000:10.0:1507244951.168925:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log params to resid 0x736d61726170/0x3 (params)
      10000000:01000000:10.0:1507244951.168928:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-client:ffff88102bae7000
      10000000:01000000:10.0:1507244951.168930:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-client to resid 0x3170626e/0x0 (nbp1)
      10000000:01000000:10.0:1507244951.168932:0:7926:0:(mgc_request.c:209:do_config_log_add()) do adding config log nbp1-cliir:ffff88102bae7000
      10000000:01000000:10.0:1507244951.168933:0:7926:0:(mgc_request.c:88:mgc_name2resid()) log nbp1-cliir to resid 0x3170626e/0x2 (nbp1)
      10000000:01000000:10.0:1507244951.168935:0:7926:0:(mgc_request.c:2014:mgc_process_log()) Process log nbp1-client:ffff88102bae7000 from 1
      10000000:01000000:10.0:1507244951.168937:0:7926:0:(mgc_request.c:1097:mgc_enqueue()) Enqueue for nbp1-client (res 0x3170626e)
      00000800:00000100:14.0:1507245116.707129:0:7765:0:(o2iblnd_cb.c:1915:kiblnd_close_conn_locked()) Closing conn to 10.149.25.204@o2ib313: error 0(waiting)
      00000100:00000400:53.0F:1507245293.173620:0:7926:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507244951/real 1507244951]  req@ffff88102bccbc80 x1580459456725424/t0(0) o503->MGC10.151.26.117@o2ib@10.151.26.117@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507245293 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      00000100:02020000:53.0:1507245293.173648:0:7926:0:(import.c:185:ptlrpc_set_import_discon()) 166-1: MGC10.151.26.117@o2ib: Connection to MGS (at 10.151.26.117@o2ib) was lost; in progress operations using this service will fail
      00000100:00080000:53.0:1507245293.173652:0:7926:0:(import.c:187:ptlrpc_set_import_discon()) ffff88103ac54000 MGS: changing import state from FULL to DISCONN
      10000000:01000000:53.0:1507245293.173655:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808001
      00000100:00080000:53.0:1507245293.173657:0:7926:0:(pinger.c:178:ptlrpc_pinger_ir_down()) IR down
      00000100:00080000:53.0:1507245293.173658:0:7926:0:(import.c:436:ptlrpc_fail_import()) import MGS@MGC10.151.26.117@o2ib_0 for MGC10.151.26.117@o2ib not replayable, auto-deactivating
      00000100:00080000:53.0:1507245293.173660:0:7926:0:(import.c:214:ptlrpc_deactivate_and_unlock_import()) setting import MGS INVALID
      10000000:01000000:53.0:1507245293.173663:0:7926:0:(mgc_request.c:1320:mgc_import_event()) import event 0x808002
      00000100:00080000:53.0:1507245293.173665:0:7926:0:(import.c:413:ptlrpc_pinger_force()) MGS: waking up pinger s:DISCONN
      00000100:00080000:27.0:1507245293.173673:0:7834:0:(pinger.c:217:ptlrpc_pinger_process_import()) da6e980a-41dc-cc34-5d4d-3f8c4e7055bd->MGS: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
      00000100:00080000:27.0:1507245293.173677:0:7834:0:(recover.c:58:ptlrpc_initiate_recovery()) MGS: starting recovery
      00000100:00080000:27.0:1507245293.173679:0:7834:0:(import.c:675:ptlrpc_connect_import()) ffff88103ac54000 MGS: changing import state from DISCONN to CONNECTING
      00000100:00080000:27.0:1507245293.173683:0:7834:0:(import.c:519:import_select_connection()) MGC10.151.26.117@o2ib: connect to NID 10.151.26.117@o2ib last attempt 4296875997
      00000100:00080000:27.0:1507245293.173687:0:7834:0:(import.c:597:import_select_connection()) MGC10.151.26.117@o2ib: import ffff88103ac54000 using connection MGC10.151.26.117@o2ib_0/10.151.26.117@o2ib
      10000000:01000000:53.0:1507245293.173688:0:7926:0:(mgc_request.c:2088:mgc_process_log()) MGC10.151.26.117@o2ib: configuration from log 'nbp1-client' failed (-5).
      00000020:02020000:53.0:1507245293.173694:0:7926:0:(obd_mount.c:114:lustre_process_log()) 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      10000000:01000000:53.0:1507245293.173704:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log nbp1-cliir
      10000000:01000000:53.0:1507245293.173706:0:7926:0:(mgc_request.c:146:config_log_put()) dropping config log params
      10000000:01000000:53.0:1507245293.173708:0:7926:0:(mgc_request.c:501:config_log_end()) end config log nbp1-client (0)
      00000080:02000400:53.0:1507245293.173836:0:7926:0:(llite_lib.c:1135:ll_put_super()) Unmounted nbp1-client
      
      
      

      Attachments

        Activity

          People

            sharmaso Sonia Sharma (Inactive)
            mhanafi Mahmoud Hanafi
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: