[LU-4325] Failover configured between 2 Lustre servers: after simulating a crash of one server, the other server crashed unexpectedly while taking over for it

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.0
    • None
    • 2 Lustre servers + 1 client
    • 3
    • 11824

    Description

      1. Mount 1 MDT and 4 OSTs on Lustre Server1.
      2. Mount 4 OSTs on Lustre Server2.
      3. Configure Lustre failover between the 2 Lustre servers.
      4. Mount the Lustre file system on the Lustre client.
      5. Write and read data on the Lustre client.
      6. Simulate a crash of Lustre Server1.
      7. Lustre Server2 crashed unexpectedly when it took over for Lustre Server1. The call trace is as follows:
      LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 192.168.22.202@tcp (no target)
      LustreError: Skipped 3 previous similar messages
      LDISKFS-fs (sde): recovery complete
      LDISKFS-fs (sde): mounted filesystem with ordered data mode. quota=on. Opts:
      LustreError: 10026:0:(genops.c:320:class_newdev()) Device MGC192.168.22.50@tcp already exists at 2, won't add
      LustreError: 10026:0:(obd_config.c:374:class_attach()) Cannot create device MGC192.168.22.50@tcp of type mgc : -17
      LustreError: 10026:0:(obd_mount.c:196:lustre_start_simple()) MGC192.168.22.50@tcp attach error -17
      LustreError: 10026:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
      LustreError: 10026:0:(obd_mount_server.c:1426:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
      LustreError: 10026:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-MDT0000
      LustreError: 10026:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-MDT0000 not registered
      LustreError: 10026:0:(genops.c:1570:obd_exports_barrier()) ASSERTION( list_empty(&obd->obd_exports) ) failed:
      LustreError: 10026:0:(genops.c:1570:obd_exports_barrier()) LBUG
      Pid: 10026, comm: mount.lustre

      Call Trace:
      [<ffffffffa070f8a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa070feb7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [<ffffffffa0818d91>] obd_exports_barrier+0x181/0x190 [obdclass]
      [<ffffffffa0f23886>] mgs_device_fini+0xf6/0x5c0 [mgs]
      [<ffffffffa0843837>] class_cleanup+0x817/0xe00 [obdclass]
      [<ffffffffa081ce2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
      [<ffffffffa0847e9b>] class_process_config+0x1b6b/0x2f60 [obdclass]
      [<ffffffffa0710b90>] ? cfs_alloc+0x30/0x60 [libcfs]
      [<ffffffffa0849723>] class_manual_cleanup+0x493/0xe80 [obdclass]
      [<ffffffff8147a1fe>] ? _read_unlock+0xe/0x10
      [<ffffffffa081ce2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
      [<ffffffffa0884b9d>] server_put_super+0x42d/0x2580 [obdclass]
      [<ffffffffa0887440>] server_fill_super+0x750/0x1580 [obdclass]
      [<ffffffffa0854c98>] lustre_fill_super+0x1d8/0x530 [obdclass]
      [<ffffffffa0854ac0>] ? lustre_fill_super+0x0/0x530 [obdclass]
      [<ffffffff8114d21f>] get_sb_nodev+0x5f/0xa0
      [<ffffffffa084c3f5>] lustre_get_sb+0x25/0x30 [obdclass]
      [<ffffffff8114c74b>] vfs_kern_mount+0x7b/0x1b0
      [<ffffffff8114c8f2>] do_kern_mount+0x52/0x130
      [<ffffffff81168912>] do_mount+0x2d2/0x8c0
      [<ffffffff81168f90>] sys_mount+0x90/0xe0
      [<ffffffff81002f5b>] system_call_fastpath+0x16/0x1b

      Message from syslogd@50:B3:42:00:01:01 at Sep 22 12:53:12 ..
      Kernel panic - not syncing: LBUG
      Pid: 10026, comm: mount.lustre Tainted: PF --------------- 2.6.32-358.6.2.l2.08 #2
      Call Trace:
      [<ffffffff81476fa7>] ? panic+0xa1/0x163
      [<ffffffffa070ff0b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      [<ffffffffa0818d91>] ? obd_exports_barrier+0x181/0x190 [obdclass]
      [<ffffffffa0f23886>] ? mgs_device_fini+0xf6/0x5c0 [mgs]
      [<ffffffffa0843837>] ? class_cleanup+0x817/0xe00 [obdclass]
      [<ffffffffa081ce2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
      [<ffffffffa0847e9b>] ? class_process_config+0x1b6b/0x2f60 [obdclass]
      [<ffffffffa0710b90>] ? cfs_alloc+0x30/0x60 [libcfs]
      [<ffffffffa0849723>] ? class_manual_cleanup+0x493/0xe80 [obdclass]
      [<ffffffff8147a1fe>] ? _read_unlock+0xe/0x10
      [<ffffffffa081ce2c>] ? class_name2dev+0x7c/0xe0 [obdclass]
      [<ffffffffa0884b9d>] ? server_put_super+0x42d/0x2580 [obdclass]
      [<ffffffffa0887440>] ? server_fill_super+0x750/0x1580 [obdclass]
      [<ffffffffa0854c98>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
      [<ffffffffa0854ac0>] ? lustre_fill_super+0x0/0x530 [obdclass]
      [<ffffffff8114d21f>] ? get_sb_nodev+0x5f/0xa0
      [<ffffffffa084c3f5>] ? lustre_get_sb+0x25/0x30 [obdclass]
      [<ffffffff8114c74b>] ? vfs_kern_mount+0x7b/0x1b0
      [<ffffffff8114c8f2>] ? do_kern_mount+0x52/0x130
      [<ffffffff81168912>] ? do_mount+0x2d2/0x8c0
      [<ffffffff81168f90>] ? sys_mount+0x90/0xe0
      [<ffffffff81002f5b>] ? system_call_fastpath+0x16/0x1b
      *******show para for nt_memcpy16*******
      src: ffff880285fc4f00, dst: ffffc90112030e70, len: 56
      *******show para for panic done*******
      ODSP:MSG:BUGON: This stack is bug.
      ODSP:MSG:BUGON: Local was taken over by peer. Suspend CPU.
      ODSP:MSG:BUGON: Local was taken over by peer. Suspend CPU.
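
The negative values in the log above (-17 from class_attach() and lustre_start_simple(), rc=-2 from server_put_super()) follow the kernel convention of returning negated errno codes. A minimal sketch for decoding them, assuming the standard Linux errno numbering; `describe_rc` is a hypothetical helper for illustration and is not part of Lustre:

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Return 'NAME (message)' for a kernel-style negative return code."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names such as 'EEXIST'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message for the code.
    return f"{name} ({os.strerror(code)})"


# Return codes that appear in the log in this ticket.
for rc in (-17, -2):
    print(rc, describe_rc(rc))
```

Under that assumption, -17 decodes to EEXIST, which is consistent with the "already exists at 2, won't add" message for the MGC device, and -2 decodes to ENOENT.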

        Activity

          yueyuling yueyuling added a comment -

          Thank you for your attention to the two problems.
          You are right, the two problems have the same stack trace. I paid more attention to the symptoms of the problem than to the stack trace. Because the symptoms of the two problems differ, I considered them to be two different problems.

          green Oleg Drokin added a comment -

          How is this ticket different from LU-4190 that you filed on Oct 30th and that has exactly the same stack trace?
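
Whether two tickets share a stack trace can be checked mechanically by reducing each trace to its sequence of function names, ignoring addresses and offsets. The following is a rough sketch of that idea, not a tool used in this ticket; `frame_signature` and `SAMPLE` are hypothetical names, and the sample frames are copied from the call trace in this ticket (the LU-4190 trace is not reproduced here):

```python
import re

# Matches kernel stack frames such as:
#   [<ffffffffa0818d91>] ? obd_exports_barrier+0x181/0x190 [obdclass]
# The optional "?" marker and the optional "[module]" suffix are both handled.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def frame_signature(trace_text):
    """Return a list of (function, module) pairs for each frame found."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            func, module = match.groups()
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((func, module or "kernel"))
    return frames


SAMPLE = """
[<ffffffffa0818d91>] obd_exports_barrier+0x181/0x190 [obdclass]
[<ffffffffa0f23886>] ? mgs_device_fini+0xf6/0x5c0 [mgs]
[<ffffffff81168f90>] sys_mount+0x90/0xe0
"""

for func, module in frame_signature(SAMPLE):
    print(f"{module}:{func}")
```

Two traces can then be compared with `frame_signature(a) == frame_signature(b)`. Offsets are deliberately dropped, so this only shows that the same call path was taken, not that the builds are identical.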


          People

            wc-triage WC Triage
            yueyuling yueyuling
            Votes: 0
            Watchers: 3
