Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version: None
    • Affects Version: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      One MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

        Issue Links

          Activity

            [LU-13469] MDS hung during mount
            sarah Sarah Liu added a comment -

            Here is the update. One MDS hit an LBUG again, but a different one, when running with a master build that includes the fix from LU-13402.

            soak-8 console

            [11360.312904] Lustre: Failing over soaked-MDT0001
            [11360.312929] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -19
            [11360.312937] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            [11360.349350] LustreError: 4582:0:(import.c:705:ptlrpc_connect_import_locked()) can't connect to a closed import
            [11360.853248] Lustre: server umount soaked-MDT0001 complete
            [11361.443391] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.102@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            [11361.463245] LustreError: Skipped 93 previous similar messages
            [11363.753573] Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.109@o2ib, removing former export from 0@lo
            [11363.765487] Lustre: Skipped 1 previous similar message
            [11393.837251] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
            [11393.853109] LustreError: 4733:0:(llog_osd.c:263:llog_osd_read_header()) soaked-MDT0001-osp-MDT0000: bad log  [0x240000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
            [11393.871120] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) ASSERTION( ctxt != ((void *)0) ) failed: 
            [11393.882886] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) LBUG
            [11393.891077] Pid: 4733, comm: lod0000_rec0001 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Thu Apr 23 02:14:20 UTC 2020
            [11393.902641] Call Trace:
            [11393.905394]  [<ffffffffc0ad10cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [11393.912739]  [<ffffffffc0ad117c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [11393.919677]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11393.926941]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11393.934759]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11393.940234]  [<ffffffffa518dd37>] ret_from_fork_nospec_end+0x0/0x39
            [11393.947259]  [<ffffffffffffffff>] 0xffffffffffffffff
            [11393.953057] LustreError: dumping log to /tmp/lustre-log.1590658382.4733
            [11520.707452] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11520.715960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11520.724727] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11520.732665] Call Trace:
            [11520.735438]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11520.741684]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11520.747261]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11520.754205]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11520.761429]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11520.769234]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11520.776734]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11520.782200]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11520.789020]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11520.796318]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11550.607865] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reached hard limit: 900, extend: 1
            [11550.622272] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) Skipped 31 previous similar messages
            [11640.796481] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11640.804564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11640.813320] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11640.821289] Call Trace:
            [11640.824059]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11640.830319]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11640.835909]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11640.842839]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11640.850079]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11640.857882]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11640.865394]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11640.870885]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11640.877723]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11640.885019]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11760.885726] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11760.893864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11760.902634] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            
            pjones Peter Jones added a comment -

            bzzz I believe that the testing has been delayed due to hardware issues


            bzzz Alex Zhuravlev added a comment -

            Hello, any updates on this issue?


            bzzz Alex Zhuravlev added a comment -

            sarah I think you should try with the recent master, which has LU-13402.

            sarah Sarah Liu added a comment -

            Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS didn't fail back the device. Please check the two attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120.

            sarah Sarah Liu added a comment -

            OK, I will restart the tests and post logs.
            The quoted log seems hardware-related, which is not expected during the test.


            bzzz Alex Zhuravlev added a comment -

            Sorry, it's not quite enough information.
            It would be very helpful if you could start the test and then grab logs (let's start with messages and/or consoles) from all the nodes.
            One interesting thing from the attached log:

            [ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
            [ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
            [ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
            [ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            [ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3) 
            [ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
            [ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
            [ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
            [ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            

            I guess this shouldn't happen during this test?
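
            For reference, a minimal sketch of collecting the requested messages/console logs from the servers is below; the node list and file naming are illustrative assumptions, not the soak framework's actual tooling.

            #!/bin/bash
            # Hedged sketch: gather system logs from each server after a test run.
            # NODES and the naming convention are assumptions for illustration only.
            NODES="soak-8 soak-9 soak-10 soak-11"
            STAMP=$(date +%Y%m%d)

            for n in $NODES; do
                # system log that contains the Lustre console messages
                ssh "$n" cat /var/log/messages > "messages.${n}.${STAMP}"
                # current kernel ring buffer as well
                ssh "$n" dmesg > "dmesg.${n}.${STAMP}"
            done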

            sarah Sarah Liu added a comment -

            There are 2 kinds of MDS fault injections; I think when the crash happened, it was in the middle of mds_failover (sketched below).

            1. mds1 failover
               reboot mds1
               mount the disks on the failover pair, mds2
               after mds1 is up, fail the disks back to mds1

            2. mds restart
               similar to mds failover, except the disks are not mounted on the failover pair; instead, wait and mount the disks back once the server is up again
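
            For illustration, the mds1 failover sequence above might look roughly like the following sketch; the hostnames, device path, and mount point are assumptions, not the actual soak configuration.

            #!/bin/bash
            # Hedged sketch of the "mds1 failover" fault injection; soak-8/soak-9,
            # the device path and the mount point are illustrative assumptions only.
            PRIMARY=soak-8                  # mds1
            SECONDARY=soak-9                # its HA failover pair (mds2)
            DEV=/dev/mapper/soaked-mdt1     # shared MDT device (assumed)
            MNT=/mnt/soaked-mdt1

            ssh "$PRIMARY" reboot -f || true                   # take down mds1
            ssh "$SECONDARY" mount -t lustre "$DEV" "$MNT"     # mount the MDT on the failover pair
            until ssh -o ConnectTimeout=5 "$PRIMARY" true; do  # wait for mds1 to come back
                sleep 10
            done
            ssh "$SECONDARY" umount "$MNT"                     # fail the disk back to mds1
            ssh "$PRIMARY" mount -t lustre "$DEV" "$MNT"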


            bzzz Alex Zhuravlev added a comment -

            Thanks. Looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; a regular failover shouldn't cause this.
            Can you please explain what the test is doing?

            sarah Sarah Liu added a comment -

            I just uploaded the lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed.


            bzzz Alex Zhuravlev added a comment -

            sarah I don't think there is any relation here. I think you can either modify the source or set panic_on_lbug=0, either in the scripts or in the modules conf file.
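
            For reference, two hedged ways to do that; the parameter name is assumed to be libcfs_panic_on_lbug in the libcfs module, so please verify against the installed build.

            # persistently, via the modules conf file (assumed parameter name):
            echo 'options libcfs libcfs_panic_on_lbug=0' >> /etc/modprobe.d/lustre.conf

            # or at runtime, on a node where libcfs is already loaded:
            echo 0 > /sys/module/libcfs/parameters/libcfs_panic_on_lbug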


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: