Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      One MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
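      For context on where the mount is stuck: the trace shows mount.lustre blocked in wait_for_completion() under llog_process_or_fork(), i.e. the mount thread has handed config-llog processing to a helper thread and is waiting for it to signal completion. Below is a minimal sketch of that fork-and-wait pattern (simplified kernel-style C, not the actual Lustre implementation; llog_worker, process_llog_and_wait and struct llog_work are illustrative names) showing why the mount thread stays in uninterruptible sleep (state D) if the helper never finishes:

      /*
       * Minimal sketch (NOT the actual Lustre source) of the fork-and-wait
       * pattern visible in the trace above: the mount thread starts a worker
       * to walk the configuration llog, then blocks in wait_for_completion()
       * until the worker signals it.  If the worker never finishes -- for
       * example it is stuck reading a log -- the mount thread stays in D state.
       */
      #include <linux/kthread.h>
      #include <linux/completion.h>
      #include <linux/err.h>

      struct llog_work {
              struct completion done;   /* signalled when the worker is finished */
              int rc;                   /* result of processing the llog */
      };

      static int llog_worker(void *arg)
      {
              struct llog_work *w = arg;

              /* ... walk the llog records here; this may block on I/O ... */
              w->rc = 0;
              complete(&w->done);       /* wake the waiting mount thread */
              return 0;
      }

      static int process_llog_and_wait(void)
      {
              struct llog_work w;
              struct task_struct *task;

              init_completion(&w.done);
              task = kthread_run(llog_worker, &w, "llog_process_thread");
              if (IS_ERR(task))
                      return PTR_ERR(task);

              /* This wait is where mount.lustre sits in the trace above. */
              wait_for_completion(&w.done);
              return w.rc;
      }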
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

        Issue Links

          Activity

            [LU-13469] MDS hung during mount
            sarah Sarah Liu added a comment -

            Hi Alex, yes, it is better than before. Right now soak is running b2_12 testing; I will try the patches when that testing is done.

            bzzz Alex Zhuravlev added a comment - - edited

            9 MDS failovers completed before the failure

            is it any better than before?

            also, full logs are very much appreciated.
            there are two patches which can help cure the corrupted-llog problem:
            https://review.whamcloud.com/#/c/38385/
            https://review.whamcloud.com/#/c/38387/

            sarah Sarah Liu added a comment -

            9 MDS failovers completed before the failure


            bzzz Alex Zhuravlev added a comment -

            how many successful failovers had it done?
            sarah Sarah Liu added a comment -

            Here is the update: one MDS hit an LBUG again, but a different one, when running with a master build that includes the fix from LU-13402.

            soak-8 console

            [11360.312904] Lustre: Failing over soaked-MDT0001
            [11360.312929] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -19
            [11360.312937] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            [11360.349350] LustreError: 4582:0:(import.c:705:ptlrpc_connect_import_locked()) can't connect to a closed import
            [11360.853248] Lustre: server umount soaked-MDT0001 complete
            [11361.443391] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.102@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            [11361.463245] LustreError: Skipped 93 previous similar messages
            [11363.753573] Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.109@o2ib, removing former export from 0@lo
            [11363.765487] Lustre: Skipped 1 previous similar message
            [11393.837251] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
            [11393.853109] LustreError: 4733:0:(llog_osd.c:263:llog_osd_read_header()) soaked-MDT0001-osp-MDT0000: bad log [0x240000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
            [11393.871120] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) ASSERTION( ctxt != ((void *)0) ) failed: 
            [11393.882886] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) LBUG
            [11393.891077] Pid: 4733, comm: lod0000_rec0001 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Thu Apr 23 02:14:20 UTC 2020
            [11393.902641] Call Trace:
            [11393.905394]  [<ffffffffc0ad10cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [11393.912739]  [<ffffffffc0ad117c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [11393.919677]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11393.926941]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11393.934759]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11393.940234]  [<ffffffffa518dd37>] ret_from_fork_nospec_end+0x0/0x39
            [11393.947259]  [<ffffffffffffffff>] 0xffffffffffffffff
            [11393.953057] LustreError: dumping log to /tmp/lustre-log.1590658382.4733
            [11520.707452] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11520.715960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11520.724727] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11520.732665] Call Trace:
            [11520.735438]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11520.741684]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11520.747261]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11520.754205]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11520.761429]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11520.769234]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11520.776734]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11520.782200]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11520.789020]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11520.796318]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11550.607865] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reached hard limit: 900, extend: 1
            [11550.622272] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) Skipped 31 previous similar messages
            [11640.796481] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11640.804564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11640.813320] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11640.821289] Call Trace:
            [11640.824059]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11640.830319]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11640.835909]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11640.842839]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11640.850079]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11640.857882]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11640.865394]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11640.870885]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11640.877723]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11640.885019]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11760.885726] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11760.893864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11760.902634] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
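
            Two details in this trace explain the hang that follows: llog_osd_read_header() rejects the update llog [0x240000401:0x1:0x0] because its header magic is 0x0 rather than the expected 0x10645539 (the on-disk llog header is corrupted or was never written), and lod_sub_prep_llog() then trips ASSERTION( ctxt != NULL ) and LBUGs. The hung-task backtraces show the thread parked in lbug_with_loc()/schedule(), which is why pid 4733 keeps reappearing in the "blocked for more than 120 seconds" reports and recovery on MDT0000 cannot complete. A minimal sketch of the magic check follows (illustrative kernel-style C, not the actual Lustre code; LLOG_HDR_MAGIC and check_llog_header_magic are made-up names, and the constant is taken from the console message above):

            #include <linux/printk.h>
            #include <linux/errno.h>

            /* Expected llog header magic, taken from the console message above. */
            #define LLOG_HDR_MAGIC 0x10645539u

            /*
             * Reject an on-disk llog whose header magic does not match; the
             * caller is then left without a usable update log for the target,
             * and recovery for that target stalls.
             */
            static int check_llog_header_magic(unsigned int magic)
            {
                    if (magic != LLOG_HDR_MAGIC) {
                            pr_err("bad log header magic: %#x (expected %#x)\n",
                                   magic, LLOG_HDR_MAGIC);
                            return -EIO;
                    }
                    return 0;
            }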
            
            pjones Peter Jones added a comment -

            bzzz I believe that the testing has been delayed due to hardware issues


            bzzz Alex Zhuravlev added a comment -

            hello, any updates on this issue?

            bzzz Alex Zhuravlev added a comment -

            sarah I think you should try with the recent master which has LU-13402
            sarah Sarah Liu added a comment -

            Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS didn't fail back the device. Please check the two attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120

            sarah Sarah Liu added a comment -

            Ok, I will restart the tests and post logs.
            The quoted log seems hardware-related, not expected during the test.


            People

              bzzz Alex Zhuravlev
              sarah Sarah Liu
              Votes: 0
              Watchers: 7
