Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

          Activity

            [LU-13469] MDS hung during mount

            bzzz Alex Zhuravlev added a comment -

            hello, any updates on this issue?

            bzzz Alex Zhuravlev added a comment -

            sarah I think you should try with the recent master, which has LU-13402.
            sarah Sarah Liu added a comment -

            Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS did not fail back the device. Please check the two attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120.
            sarah Sarah Liu added a comment -

            Ok, I will restart the tests and post logs.
            The quoted log seems hardware-related and is not something expected during the test.

            bzzz Alex Zhuravlev added a comment -

            Sorry, that is not quite enough information. It would be very helpful if you could start the test and then grab logs (let's start with messages and/or consoles) from all the nodes.
            One interesting thing from the attached log:

            [ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
            [ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
            [ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
            [ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            [ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3) 
            [ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
            [ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
            [ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
            [ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            

            I guess this shouldn't happen during this test?

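            A throwaway sketch of the kind of collection loop that would cover this request; node names and paths are placeholders for whichever nodes are in the soak run, not the harness's actual tooling:

              # pull syslog from every node after starting the test
              for node in soak-8 soak-9 soak-11; do
                  mkdir -p /tmp/soak-logs/$node
                  scp $node:/var/log/messages /tmp/soak-logs/$node/
              done
              # console output is whatever the admin node's console capture
              # (conserver / IPMI SOL) produces; copy those files alongside the syslogs
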
            sarah Sarah Liu added a comment -

            There are 2 kinds of MDS fault injections; I think when the crash happened, it was in the middle of mds_failover (see the sketch below).

            1. mds1 failover
               reboot mds1
               mount the disks on the failover pair mds2
               after mds1 is back up, fail the disks back to mds1

            2. mds restart
               this is similar to mds failover, except the disks are not mounted on the failover pair; instead, wait and mount the disks back once the server is up again
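            A rough shell sketch of that mds1 failover sequence, only to make the steps concrete; the hostnames, device path, and mount point are placeholders, not the soak framework's actual commands:

              # 1. power-cycle / reboot the primary MDS
              ssh soak-9 reboot

              # 2. while it is down, import its MDT on the failover partner
              ssh soak-11 "mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mds1"

              # 3. once the primary is back, fail the target back to it
              ssh soak-11 "umount /mnt/lustre-mds1"
              ssh soak-9  "mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mds1"

            The hang in this ticket's description would correspond to one of those mount steps.
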

            bzzz Alex Zhuravlev added a comment -

            Thanks. Looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; a regular failover shouldn't cause this.
            Can you please explain what the test is doing?
            sarah Sarah Liu added a comment -

            I just uploaded the Lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed.

            bzzz Alex Zhuravlev added a comment -

            sarah I don't think there is any relation here. You can either modify the source or set panic_on_lbug=0 in the scripts, or in the modules conf file.
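            A minimal sketch of the modules-conf-file route mentioned above, assuming the underlying libcfs module parameter is named libcfs_panic_on_lbug (worth verifying against the installed build before relying on it):

              # persist the setting across reboots on each server (assumed parameter name)
              echo "options libcfs libcfs_panic_on_lbug=0" > /etc/modprobe.d/lustre-panic.conf

              # runtime equivalent, as used elsewhere in this ticket
              lctl set_param panic_on_lbug=0
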
            sarah Sarah Liu added a comment -

            Hi Alex,

            I am having a weird issue when setting panic_on_lbug=0 permanently on soak-8 (MGS). Here is what I did (see also the verification sketch below):
            1. lctl set_param -P panic_on_lbug=0
            2. unmounted and remounted as ldiskfs and checked the config log; the value was set to 0
            3. mounted Lustre and checked again; panic_on_lbug was still 1, it didn't change.

            I am not sure if this is related to the llog issue here; can you please check? Do you need any log for this? If it is unrelated, I will create a new ticket, and I may need to delete the bad records and restart.

            Thanks
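            A small verification sketch for the steps above, assuming the usual layout where "lctl set_param -P" records land in the params llog on the MGT; the device and mount point below are placeholders:

              # after remounting Lustre, check the live value
              lctl get_param panic_on_lbug

              # on the MGS, with the MGT mounted as ldiskfs (read-only is enough),
              # dump the params config log that "lctl set_param -P" writes to
              mount -t ldiskfs -o ro /dev/mapper/mgt /mnt/mgt-ldiskfs
              llog_reader /mnt/mgt-ldiskfs/CONFIGS/params
              umount /mnt/mgt-ldiskfs
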
            sarah Sarah Liu added a comment -

            Hi Alex, I will restart with debug enabled.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 7
