  Lustre / LU-5148

OSTs won't mount following upgrade to 2.4.2

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • None
    • 3
    • 14209

    Description

      A production Lustre cluster, "porter", was upgraded from 2.4.0-28chaos to lustre-2.4.2-11chaos today. The OSTs now will not start.

      # porter1 /root > /etc/init.d/lustre start
      Stopping snmpd:                                            [  OK  ]
      Shutting down cerebrod:                                    [  OK  ]
      Mounting porter1/lse-ost0 on /mnt/lustre/local/lse-OST0001
      mount.lustre: mount porter1/lse-ost0 at /mnt/lustre/local/lse-OST0001 failed: Input/output error
      Is the MGS running?
      # porter1 /root > 
      
      Lustre: Lustre: Build Version: 2.4.2-11chaos-11chaos--PRISTINE-2.6.32-431.17.2.1chaos.ch5.2.x86_64
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.115.67@o2ib10 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.38@o2ib7 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.101@o2ib7 (no target)
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003336/t0(0) o253->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 5426:0:(obd_mount_server.c:1140:server_register_target()) lse-OST0001: error registering with the MGS: rc = -5 (not fatal)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.116.205@o2ib5 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.114.162@o2ib5 (no target)
      LustreError: Skipped 19 previous similar messages
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003340/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.162@o2ib7 (no target)
      LustreError: Skipped 23 previous similar messages
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003344/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 15c-8: MGC172.19.1.165@o2ib100: The configuration from log 'lse-OST0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 5426:0:(obd_mount_server.c:1273:server_start_targets()) failed to start server lse-OST0001: -5
      Lustre: lse-OST0001: Unable to start target: -5
      LustreError: 5426:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lse-MDT0000-lwp-OST0001: Can't end config log lse-client.
      LustreError: 5426:0:(obd_mount_server.c:1442:server_put_super()) lse-OST0001: failed to disconnect lwp. (rc=-2)
      LustreError: 5426:0:(obd_mount_server.c:1472:server_put_super()) no obd lse-OST0001
      Lustre: server umount lse-OST0001 complete
      LustreError: 5426:0:(obd_mount.c:1290:lustre_fill_super()) Unable to mount  (-5)
      
      # porter1 /root > lctl ping 172.19.1.165@o2ib100 # <-- MGS NID
      12345-0@lo
      12345-172.19.1.165@o2ib100
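
      For reference, a minimal sketch of gathering a Lustre debug trace around the failing mount; the debug flags and output path here are illustrative choices, not part of the original report:

      lctl ping 172.19.1.165@o2ib100      # LNet-level check of the MGS NID (succeeds above)
      lctl set_param debug=+rpctrace      # add RPC tracing to the debug mask
      lctl set_param debug=+config        # add configuration-log tracing
      mount -t lustre porter1/lse-ost0 /mnt/lustre/local/lse-OST0001
      lctl dk /tmp/lse-OST0001-mount.dk   # dump the kernel debug log for the attempt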
      

      Attachments

        Issue Links

          Activity

            [LU-5148] OSTs won't mount following upgrade to 2.4.2

            nedbass Ned Bass (Inactive) added a comment -

            Closing as stale.

            jfc John Fuchs-Chesney (Inactive) added a comment - edited

            Hello Ned,

            Do you have any update for us on this elderly ticket? Has this issue been resolved by use of later versions, for example?

            We would like to mark it as resolved, if you have no objection.

            Thanks,
            ~ jfc.


            cliffw Cliff White (Inactive) added a comment -

            I attempted to reproduce this on Hyperion by starting with 2.4.0 and upgrading to 2.4.2 after running some IO tests. I could not reproduce the failure; however, I was using the Whamcloud 2.4.2 release, which may be different.

            green Oleg Drokin added a comment -

            Looking at the stack traces in the logs, it seems every thread is either blocked waiting on the transaction commit inside ZFS, or blocked on a semaphore held by somebody that is waiting on the transaction commit.

            So to me it really looks like some sort of in-ZFS wait. There's no dump of all thread stacks here, so I wonder if you have one that shows whether there is a Lustre-induced deadlock of some sort above ZFS?

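            For reference, the all-thread stack dump asked for above can be captured with the standard kernel sysrq facility; the output path is only illustrative:

            echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
            echo t > /proc/sysrq-trigger       # dump the stacks of all tasks to the kernel log
            dmesg > /tmp/all-task-stacks.txt   # save the dump for attachment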

            hongchao.zhang Hongchao Zhang added a comment -

            This debug patch changes the IR (Imperative Recovery) operations on the MGS to update asynchronously. If the issue does not occur again with it applied, we can isolate the problem as related to slow synchronization in ZFS, just as in the problem shown in LU-2887, and then try to create corresponding patches to fix it.
            Thanks.

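            As a side note, if slow ZFS transaction-group commits are the suspect, their timing can be watched on the OSS while reproducing; the pool name comes from the log above, and the kstat path assumes a ZFS-on-Linux build that exposes it:

            cat /proc/spl/kstat/zfs/porter1/txgs   # per-txg open/quiesce/sync times for the pool
            zpool iostat -v porter1 5              # per-vdev I/O statistics every 5 seconds
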
            pjones Peter Jones added a comment -

            Hongchao

            Could you please elaborate as to how this patch works?

            Thanks

            Peter


            hongchao.zhang Hongchao Zhang added a comment -

            Hi,

            Could you please try the debug patch at http://review.whamcloud.com/#/c/10869/ to check whether this issue occurs again?

            Thanks very much!

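            For anyone revisiting this, a rough sketch of pulling a change from the Whamcloud Gerrit and building it; the clone URL, change ref, and patchset number (1) are assumptions based on the usual Gerrit layout, so check the change page for the exact download command:

            git clone https://review.whamcloud.com/fs/lustre-release
            cd lustre-release
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/69/10869/1
            git checkout FETCH_HEAD
            sh autogen.sh && ./configure && make rpms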

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 8
