Lustre / LU-8511

mdc stuck in EVICTED state


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      On our 2.8 DNE testbed, we see MDCs get stuck in the EVICTED state with some regularity. The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo).

      The MDC seems to be permanently stuck. See the following example:

      [root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
      current_state: EVICTED
      state_history:
       - [ 1470865592, DISCONN ]
       - [ 1470865612, CONNECTING ]
       - [ 1470865667, DISCONN ]
       - [ 1470865687, CONNECTING ]
       - [ 1470865742, DISCONN ]
       - [ 1470865762, CONNECTING ]
       - [ 1470865762, DISCONN ]
       - [ 1470865771, CONNECTING ]
       - [ 1470865771, REPLAY ]
       - [ 1470865771, REPLAY_LOCKS ]
       - [ 1470865771, REPLAY_WAIT ]
       - [ 1470865831, RECOVER ]
       - [ 1470865831, FULL ]
       - [ 1470950043, DISCONN ]
       - [ 1470950043, CONNECTING ]
       - [ 1470950043, EVICTED ]
      [root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
      1471481367
      

      Note that the import appears to have stopped trying to reconnect after the eviction, and that eviction was over six days ago (the arithmetic is below).
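
      For reference, the gap between the EVICTED transition in the state history and the "date +%s" check works out to a little over six days. This is plain shell arithmetic on the two timestamps shown above:

      # seconds between the EVICTED transition and the date check
      echo $(( 1471481367 - 1470950043 ))    # -> 531324
      echo $(( 531324 / 86400 ))             # -> 6 (days, rounded down)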

      On the client's console I see:

      2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:35:08 [706044.254083] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
      2016-08-17 17:39:41 [706317.503378] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
      2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      

      The "lquake-MDT000a_UUID: rc = -110 waiting for callback" repeats every ten minutes.

      PID: 27781  TASK: ffff880f6be4e780  CPU: 5   COMMAND: "ll_sa_25078"
       #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
       #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
       #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
       #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
       #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
       #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
       #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
       #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
       #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
       #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
      #10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
      #11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
      #12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
      #13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
      #14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
      #15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658
      

      Another client node, which has had a single MDC (out of 16) stuck in the EVICTED state for nearly seven days, also has a single statahead (SA) thread stuck:

      crash> bt -sx 166895
      PID: 166895  TASK: ffff880fdcd6a280  CPU: 1   COMMAND: "ll_sa_166412"
       #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975
       #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049
       #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75
       #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc]
       #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc]
       #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc]
       #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc]
       #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf
       #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c
       #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc]
      #10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv]
      #11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv]
      #12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre]
      #13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre]
      #14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f
      #15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658
      

      I can't necessarily say that statahead is implicated, though. It could simply be that the SA thread is hanging because someone discovered the problem by running "ls". (A way to rule statahead out is sketched below.)
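
      If we want to rule statahead in or out, it can be disabled per client. A minimal sketch using the standard llite tunable (the parameter is stock Lustre, not specific to our branch; the 2.8 default is 32):

      # 0 disables statahead on all mounts on this client; restore the
      # previous value (default 32) to re-enable it
      lctl set_param llite.*.statahead_max=0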

            People

              Assignee: Zhenyu Xu (bobijam)
              Reporter: Christopher Morrone (Inactive) (morrone)
              Votes: 0
              Watchers: 3
