Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.8.0
    • 3

    Description

      On our 2.8 DNE testbed, we fairly often see MDCs get stuck in the EVICTED state. The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo).

      The MDC seems to be permanently stuck. See the following example:

      [root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
      current_state: EVICTED
      state_history:
       - [ 1470865592, DISCONN ]
       - [ 1470865612, CONNECTING ]
       - [ 1470865667, DISCONN ]
       - [ 1470865687, CONNECTING ]
       - [ 1470865742, DISCONN ]
       - [ 1470865762, CONNECTING ]
       - [ 1470865762, DISCONN ]
       - [ 1470865771, CONNECTING ]
       - [ 1470865771, REPLAY ]
       - [ 1470865771, REPLAY_LOCKS ]
       - [ 1470865771, REPLAY_WAIT ]
       - [ 1470865831, RECOVER ]
       - [ 1470865831, FULL ]
       - [ 1470950043, DISCONN ]
       - [ 1470950043, CONNECTING ]
       - [ 1470950043, EVICTED ]
      [root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
      1471481367
      

      Note that it appears to have stopped trying to connect after the eviction, and that was apparently over six days ago.
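
      The "over six days" figure can be checked directly from the state file. The following is a minimal Python sketch, assuming the file keeps the layout shown above. parse_state and seconds_in_current_state are hypothetical helper names, not part of Lustre, and only the last three history entries are reproduced in the sample.

```python
import re

# Sample taken from the tail of the state file shown above (assumed format).
STATE_TEXT = """\
current_state: EVICTED
state_history:
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
"""


def parse_state(text):
    """Return (current_state, [(epoch_seconds, state), ...])."""
    current = None
    history = []
    for line in text.splitlines():
        m = re.match(r"\s*current_state:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*-\s*\[\s*(\d+),\s*(\w+)\s*\]", line)
        if m:
            history.append((int(m.group(1)), m.group(2)))
    return current, history


def seconds_in_current_state(text, now):
    """Seconds elapsed between the last recorded transition and `now`."""
    current, history = parse_state(text)
    if not history:
        return current, None
    last_ts, _last_state = history[-1]
    return current, now - last_ts


# `now` is the value printed by `date +%s` in the listing above.
current, stuck = seconds_in_current_state(STATE_TEXT, now=1471481367)
print(current, stuck, round(stuck / 86400, 2))
```

      With the values from this report, the import has sat in EVICTED for 531324 seconds, or about 6.15 days.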

      On the client's console I see:

      2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:35:08 [706044.254083] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
      2016-08-17 17:39:41 [706317.503378] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
      2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
      2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
      2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      

      The "lquake-MDT000a_UUID: rc = -110 waiting for callback" repeats every ten minutes.
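
      The ten-minute cadence can be confirmed from the kernel timestamps in the console output. The following is a rough Python sketch. callback_intervals is a hypothetical helper name, and the sample holds only the three "waiting for callback" lines quoted above.

```python
import re

# The three "waiting for callback" lines from the console log above.
LOG = """\
2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
"""

# Capture the kernel uptime stamp in brackets on matching lines only.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\]\s+LustreError:.*waiting for callback")


def callback_intervals(log_text):
    """Return the gaps, in whole seconds, between successive matching lines."""
    stamps = [float(m.group(1)) for m in PATTERN.finditer(log_text)]
    return [round(later - earlier) for earlier, later in zip(stamps, stamps[1:])]


intervals = callback_intervals(LOG)
print(intervals)
```

      Both gaps come out at 600 seconds, which matches the ten-minute repeat noted above.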

      PID: 27781  TASK: ffff880f6be4e780  CPU: 5   COMMAND: "ll_sa_25078"
       #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
       #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
       #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
       #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
       #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
       #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
       #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
       #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
       #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
       #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
      #10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
      #11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
      #12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
      #13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
      #14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
      #15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658
      

      Another client node, which has had a single MDC (out of 16) stuck in the EVICTED state for nearly seven days, also has a single statahead (SA) thread stuck:

      crash> bt -sx 166895
      PID: 166895  TASK: ffff880fdcd6a280  CPU: 1   COMMAND: "ll_sa_166412"
       #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975
       #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049
       #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75
       #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc]
       #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc]
       #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc]
       #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc]
       #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf
       #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c
       #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc]
      #10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv]
      #11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv]
      #12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre]
      #13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre]
      #14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f
      #15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658
      

      I can't necessarily say that SA is implicated though. It could simply be that SA is hanging because someone discovered the problem by running "ls".

          Activity

            [LU-8511] mdc stuck in EVICTED state
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]
            pjones Peter Jones made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment - Duplicate of LU-7434
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-10 [ JFC-10 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-10 [ JFC-10 ]
            morrone Christopher Morrone (Inactive) made changes -
            Labels Original: llnl topllnl New: llnl

            morrone Christopher Morrone (Inactive) added a comment - When can we expect the port to be reviewed?
            bobijam Zhenyu Xu made changes -
            Link New: This issue is related to LU-7434 [ LU-7434 ]
            bobijam Zhenyu Xu added a comment - I think it relates to LU-7434, and the relevant patch port has been pushed at http://review.whamcloud.com/20230

            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)