[LU-9838] target registration mount fails with -108 but then succeeds if retried immediately

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.0
    • Environment: Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10

    Description

      Occasionally, when we try to register a target (i.e. mount it for the first time), it fails with:

      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)
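
      For reference, the registration in question is just an ordinary first-time target mount. A minimal sketch of the kind of commands involved is below; the device, index, and mount-point names are hypothetical and not taken from this report (which uses osd-zfs targets), while the MGS NID is the one from the log above:

      # Format the target once, pointing it at the MGS (hypothetical device name).
      mkfs.lustre --ost --fsname=testfs --index=1 --mgsnode=10.14.83.68@tcp /dev/sdb

      # The first mount of the target registers it with the MGS; this is the
      # step that intermittently fails with -108.
      mkdir -p /mnt/ost1
      mount -t lustre /dev/sdb /mnt/ost1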
      
      
      

      But if we then try the exact same mount command a second time, immediately following the failure, it succeeds:

      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16
      
      

      The MGS's syslog around the failure time shows:

      Aug  5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)
      
      

      where 10.14.83.71 is lotus-58vm18.

      I've set this issue to Critical because, while a persistent and curious human user might simply try a second time and discover that it works, it wreaks havoc on automated systems like IML and Pacemaker, which expect commands that should succeed to actually succeed.
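
      As an illustration of the kind of workaround an automated system ends up needing, here is a rough retry wrapper around the mount; the retry count, sleep, and device/mount-point names are illustrative only, and this is not IML's or Pacemaker's actual logic:

      #!/bin/sh
      # Illustrative only: retry the registration mount, since a second attempt
      # made immediately after a -108 failure typically succeeds.
      dev=/dev/sdb          # hypothetical target device
      mnt=/mnt/ost1         # hypothetical mount point

      for attempt in 1 2 3; do
          mount -t lustre "$dev" "$mnt" && exit 0
          rc=$?
          echo "mount attempt $attempt failed with rc=$rc, retrying" >&2
          sleep 2
      done

      echo "giving up after 3 attempts" >&2
      exit "$rc"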

      Attachments

        1. mds-debug.log
          64 kB
        2. mds-debug.log.bz2
          8.54 MB
        3. oss-debug.log
          101 kB
        4. oss-debug.log.bz2
          7.10 MB


          Activity


            adilger Andreas Dilger added a comment -

            Close old issue related to 2.10 servers. Should be reproduced on a modern system to confirm whether the issue still exists.

            javed javed shaikh (Inactive) added a comment -

            Lustre versions on the server and clients have been upgraded; the above issue is no longer relevant to us.

            bhoagland Brad Hoagland (Inactive) added a comment -

            Working around this for IML 4.0 GA.

            brian Brian Murrell (Inactive) added a comment -

            Is this something that is reasonable to fix on b2_10?
            green Oleg Drokin added a comment -

            So the way it unfolds is this:

            00000100:00000040:1.0:1503245974.332685:0:1172:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00000040:1.0:1503245974.332731:0:1172:0:(niobuf.c:896:ptl_send_rpc()) @@@ send flg=0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
            00000100:00000200:1.0:1503245974.332898:0:1168:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
            10000000:01000000:0.0:1503245974.355790:0:19104:0:(mgc_request.c:2139:mgc_process_log()) MGC10.14.80.248@tcp: configuration from log 'testfs-client' succeeded (0).
            00000100:00000040:0.0:1503245974.355897:0:19104:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff8800449dea00 x1576258375780672/t0(0) o253->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
            00000100:00000200:0.0:1503245974.355926:0:19104:0:(client.c:1181:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8800449dea00 x1576258375780672/t0(0) o253->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000400:0.0:1503246010.355433:0:1172:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1503245974/real 1503245974]  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 1 dl 1503246010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            

            So, at least in this instance, it looks like we tried to connect (o250) to the MGS, but it did not respond to us. Yet we already had a cached config log from somewhere, so we went ahead and sent the registration (o253), and because MGC imports are obd_no_recov until they are actually connected, we bail right out instead of waiting for the import to become active first.

            This whole strange situation with the MGC, where it is not recoverable and has odd no-resend policies, has bitten us again; it's probably time to take a good hard look at it.
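
            (From the admin side, one check that at least rules out plain network trouble before the registration mount is an LNet-level ping of the MGS. A small hedged sketch follows; the NID is the one from the debug log above and should be substituted for the cluster in question, and note that this only proves LNet reachability, not that the MGS responded to the o250 connect:)

            # Verify the MGS NID answers over LNet before attempting registration.
            lctl ping 10.14.80.248@tcp || echo "MGS NID not reachable over LNet" >&2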

            brian Brian Murrell (Inactive) added a comment - - edited

            That's strange.

            In any case, I tried to attach two more logs but ran into DCO-7458, so they are on the FTP site in /uploads/LU-9838/ in files mds-debug.log.bz2 and /tmp/oss-debug.log.bz2.

            green Oleg Drokin added a comment -

            Well, this time it failed with -114 ("Operation already in progress"), not -108 ("Cannot send after transport endpoint shutdown"), so I am not sure whether this is the same problem or not.

            00000020:00000001:0.0:1503141275.035516:0:16570:0:(obd_mount_server.c:1753:osd_start()) Process entered
            00000020:01000004:0.0:1503141275.035517:0:16570:0:(obd_mount_server.c:1757:osd_start()) Attempting to start testfs-OST0000, type=osd-zfs, lsifl=200002, mountfl=0
            00000020:01000004:0.0:1503141275.035522:0:16570:0:(obd_mount_server.c:1776:osd_start()) testfs-OST0000-osd already started
            00000020:00000001:0.0:1503141275.035523:0:16570:0:(obd_mount_server.c:1783:osd_start()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
            00000020:00020000:0.0:1503141275.035525:0:16570:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk3/testfs-OST0000: -114
            00000020:00000001:0.0:1503141275.044922:0:16570:0:(obd_mount.c:657:lustre_put_lsi()) Process entered
            00000020:01000004:0.0:1503141275.044924:0:16570:0:(obd_mount.c:661:lustre_put_lsi()) put ffff88003d7e1000 1
            

            This was even before any talking to the MGS, so I guess it is something else.
            It sounds to me like the filesystem was still mounted when you tried to mount it again?
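
            (A quick way to confirm whether the target was in fact still mounted at that point, assuming standard tooling on the node:)

            # List any Lustre filesystems currently mounted on this node.
            mount -t lustre
            # or equivalently:
            grep lustre /proc/mounts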


            brian Brian Murrell (Inactive) added a comment -

            Debug logs attached: mds-debug.log and oss-debug.log.
            joe.grund Joe Grund added a comment -

            Hitting intermittent failures in IML, being tracked under: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/107

            green Oleg Drokin added a comment - - edited

            Basically, I want you to do this before the registration, but after the modules are loaded, on both the MGS and the target:

            lctl set_param debug=-1
            lctl set_param debug_mb=200
            
            

            Then do your registration attempt and, immediately after that, on both nodes:

            lctl dk >/tmp/debug.log
            
            

            If there was a failure, collect those two files and make them available.
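
            Putting the steps above together, a small wrapper of the sort that could be run on each node around the registration attempt might look like the following; the output path and the commented-out mount line are placeholders, while the lctl invocations are the ones requested above:

            #!/bin/sh
            # Run on both the MGS node and the target node, after the Lustre
            # modules are loaded but before the registration attempt.
            lctl set_param debug=-1
            lctl set_param debug_mb=200

            # ... perform the registration mount attempt here, on the target node only ...
            # mount -t lustre <target-device> <mount-point>

            # Immediately afterwards, dump the kernel debug buffer on each node.
            lctl dk > /tmp/debug.log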


            People

              Assignee: wc-triage WC Triage
              Reporter: brian Brian Murrell (Inactive)
              Votes: 1
              Watchers: 8
