  1. Lustre
  2. LU-13200

hang in lnet_wait_known_routerstate

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.3
    • None
    • 3
    • 9223372036854775807

    Description

      We have a hang on production systems running 2.12.3 where LNet never finishes setting up on a server if some routers are bad (modprobe lustre hangs, no lnet service).

      Approximate backtrace:

      #0 __schedule
      #1 schedule
      #2 schedule_timeout
      #3 lnet_router_post_mt_start
      #4 lnet_monitor_thr_start
      #5 LNetNIInit
      #6 ptlrpc_ni_init
      #7 ptlrpc_init_portals
      #8 init_module
      #9 do_one_initcall
      #10 load_module
      #11 sys_finit_module
      #12 system_call_fastpath
      

      But it's really stuck in the loop that waits for rtr->lpni_alive_count to be non-zero on all LNet routers (the loop over &the_lnet.ln_routers).
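      To make the failure mode concrete, here is a minimal userspace sketch of that kind of wait loop. It is not the Lustre code: struct model_router, learns_at_poll, and max_polls are hypothetical names introduced for illustration, and the bounded poll count exists only so the model terminates. It assumes a router's counter stays at zero until some state update arrives for it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a router entry (not the real Lustre struct). */
struct model_router {
    int alive_count;    /* models rtr->lpni_alive_count; 0 = state never learned */
    int learns_at_poll; /* poll index at which an update arrives; < 0 = never */
};

/* True once every router in the list has a non-zero alive_count. */
static bool all_router_states_known(const struct model_router *rtrs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rtrs[i].alive_count == 0)
            return false;
    }
    return true;
}

/*
 * Poll until every router state is known.
 * Returns the poll index at which the wait finished, or -1 if the
 * states were still unknown after max_polls iterations. Without such a
 * bound, a router that never reports keeps the caller waiting forever.
 */
static int wait_known_router_state(struct model_router *rtrs, size_t n,
                                   int max_polls)
{
    for (int poll = 0; poll < max_polls; poll++) {
        /* Simulate asynchronous state updates arriving from routers. */
        for (size_t i = 0; i < n; i++) {
            if (rtrs[i].learns_at_poll >= 0 &&
                rtrs[i].learns_at_poll <= poll &&
                rtrs[i].alive_count == 0)
                rtrs[i].alive_count = 1;
        }
        if (all_router_states_known(rtrs, n))
            return poll;
        /* A kernel loop would sleep here (cf. schedule_timeout above). */
    }
    return -1;
}
```

      In this model a single router with learns_at_poll < 0 is enough to make the wait never complete, which matches the symptom of modprobe never returning.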

      I'm not sure why we don't always get stuck (there are always a couple of routers down), yet it did get stuck last time. From the few traces I have left (I didn't get a full crash dump for this one, unfortunately), it looks a lot like LU-13001, except that 2.12.3 doesn't have LU-11297, so the patch for that one doesn't make sense here.

      OTOH, LU-11298 changes that check to (rtr->lp_state & LNET_PEER_DISCOVERED) instead; that sounds like it could be a good idea? I honestly can't say without a dump at hand. Unfortunately, I will need to try to reproduce this somewhere more practical.
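      For comparison, a flag-based version of the "is every router state known" predicate could look like the sketch below. This is only an illustration of the shape of such a check: struct model_peer and the bit value of MODEL_PEER_DISCOVERED are hypothetical, and whether a bad router ever gets the real LNET_PEER_DISCOVERED flag set is exactly the open question here, which this sketch does not answer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit standing in for LNET_PEER_DISCOVERED; value is assumed. */
#define MODEL_PEER_DISCOVERED (1u << 0)

/* Hypothetical stand-in for a router peer (not the real Lustre struct). */
struct model_peer {
    unsigned int state; /* models rtr->lp_state */
};

/* True once every router in the list has the discovered bit set. */
static bool all_routers_discovered(const struct model_peer *rtrs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((rtrs[i].state & MODEL_PEER_DISCOVERED) == 0)
            return false;
    }
    return true;
}
```

      The wait still only terminates if every router eventually reaches the tested condition, so the change helps only if the flag is set for unreachable routers too.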

          People

            pjones Peter Jones
            martinetd Dominique Martinet (Inactive)