Lustre / LU-6273

Hard Failover replay-dual test_17: Failover OST mount hang

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: client and server: lustre-master build # 2856, zfs
    • Severity: 3

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0429703c-ba58-11e4-8053-5254006e85c2.

      The sub-test test_17 failed with the following error:

      test failed to respond and timed out
      

      OST dmesg:

      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 120 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565647/real 1424565647]  req@ffff880070449080 x1493765169611180/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565672 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565712/real 1424565712]  req@ffff880070449680 x1493765169611316/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565737 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.156@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 304 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565842/real 1424565842]  req@ffff880070449c80 x1493765169611592/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
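
      Both stack dumps above are identical: mount.lustre is parked in
      wait_for_completion(), reached from llog_process_or_fork() while the
      configuration llog is replayed (client_lwp_config_process is on the
      stack), and the repeated o38 (MDS_CONNECT) failures against
      10.2.4.158@tcp show why the forked llog thread never signals
      completion. A minimal C sketch of that blocking pattern follows; the
      names (llog_job, lwp_llog_thread, lwp_connect_rpc, mount_path) are
      hypothetical illustrations, not the actual Lustre symbols.

      /*
       * Minimal sketch of the hang: the mount thread forks a worker to
       * replay the config llog and sleeps in wait_for_completion(); the
       * worker stalls on a connect RPC to the failed-over MDT, so the
       * completion is never signalled.
       */
      #include <linux/kthread.h>
      #include <linux/completion.h>
      #include <linux/err.h>

      struct llog_job {
              struct completion lj_done;      /* signalled when replay ends */
      };

      static void lwp_connect_rpc(void)
      {
              /* Stands in for the o38 (MDS_CONNECT) RPC that keeps failing
               * while 10.2.4.158@tcp is down; it retries indefinitely. */
      }

      static int lwp_llog_thread(void *arg)   /* plays the role of the forked
                                               * llog-processing thread */
      {
              struct llog_job *job = arg;

              lwp_connect_rpc();              /* never returns in the bad case */
              complete(&job->lj_done);
              return 0;
      }

      static int mount_path(void)             /* plays llog_process_or_fork() */
      {
              struct llog_job job;
              struct task_struct *task;

              init_completion(&job.lj_done);
              task = kthread_run(lwp_llog_thread, &job, "llog_process");
              if (IS_ERR(task))
                      return PTR_ERR(task);
              wait_for_completion(&job.lj_done);  /* mount.lustre parked here */
              return 0;
      }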
      

Attachments

Issue Links

Activity

            Peter Jones made changes:
            Link Original: This issue is related to LDEV-18 [ LDEV-18 ]
            Peter Jones made changes:
            Link New: This issue is related to LDEV-19 [ LDEV-19 ]
            Peter Jones made changes:
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            Peter Jones added a comment:

            Landed for 2.8


            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16303/
            Subject: LU-6273 lwp: notify LWP users in dedicated thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b1848aa5b23fd332362e9ae3d5aab31d8dd9d920
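
            A hedged reading of this patch subject: the fix moves the
            LWP-user notification off the mount/config path into its own
            thread, so a notification callback that blocks (for instance
            behind a connect RPC to a failed-over MDT) can no longer wedge
            mount.lustre. Below is a minimal sketch of such a hand-off,
            written with a workqueue rather than whatever thread mechanism
            the landed patch actually uses; lwp_notify_work, lwp_notify_fn,
            lwp_notify_async and notify_lwp_users are illustrative names.

            #include <linux/kernel.h>
            #include <linux/workqueue.h>
            #include <linux/slab.h>

            struct obd_device;                        /* opaque stand-in for Lustre's type */
            void notify_lwp_users(struct obd_device *lwp); /* hypothetical callback fan-out */

            struct lwp_notify_work {
                    struct work_struct  lnw_work;
                    struct obd_device  *lnw_lwp;      /* LWP device to announce */
            };

            static void lwp_notify_fn(struct work_struct *ws)
            {
                    struct lwp_notify_work *w =
                            container_of(ws, struct lwp_notify_work, lnw_work);

                    notify_lwp_users(w->lnw_lwp);     /* may block; harmless off-path */
                    kfree(w);
            }

            static int lwp_notify_async(struct obd_device *lwp)
            {
                    struct lwp_notify_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

                    if (w == NULL)
                            return -ENOMEM;
                    w->lnw_lwp = lwp;
                    INIT_WORK(&w->lnw_work, lwp_notify_fn);
                    schedule_work(&w->lnw_work);      /* config thread returns at once */
                    return 0;
            }

            The trade-off of deferring the notification is that the config
            thread no longer observes errors from the callbacks; that is the
            price of guaranteeing it cannot block on them.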

            Sarah Liu made changes:
            Remote Link Original: This issue links to "Page (HPDD Community Wiki)" [ 15346 ] New: This issue links to "Page (HPDD Community Wiki)" [ 15346 ]
            Christopher Morrone (Inactive) made changes:
            Link Original: This issue is blocking LU-6843 [ LU-6843 ]
            Mikhail Pershin made changes:
            Status Original: Open [ 1 ] New: In Progress [ 3 ]
            Christopher Morrone (Inactive) made changes:
            Link New: This issue is blocking LU-6843 [ LU-6843 ]

            Gerrit Updater added a comment:

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16304
            Subject: LU-6273 lwp: notify LWP is ready after llog processing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1be148d882a209137e92e525bc69b601e114646c
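
            The second patch's subject suggests an ordering fix: declare the
            LWP ready, and notify its users, only after the config llog has
            been fully processed. A minimal sketch under that assumption
            (lwp_process_config_llog, lwp_set_ready and lwp_setup are
            hypothetical names):

            struct obd_device;                                    /* opaque stand-in */
            int lwp_process_config_llog(struct obd_device *lwp);  /* hypothetical */
            void lwp_set_ready(struct obd_device *lwp);           /* hypothetical */

            static int lwp_setup(struct obd_device *lwp)
            {
                    int rc;

                    /* Finish replaying the config llog first ... */
                    rc = lwp_process_config_llog(lwp);
                    if (rc != 0)
                            return rc;

                    /* ... and only then announce the LWP as ready, so users
                     * cannot observe (and block on) a half-initialized LWP. */
                    lwp_set_ready(lwp);
                    return 0;
            }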


            People

              Assignee: Mikhail Pershin
              Reporter: Maloo
              Votes: 0
              Watchers: 12
