Lustre / LU-9838

target registration mount fails with -108 but then succeeds if retried immediately


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Environment: Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Occasionally, when we try to register a target (i.e. mount it for the first time), the mount fails with:

      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)
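
      For reference, -108 is ESHUTDOWN on Linux ("Cannot send after transport endpoint shutdown"), which is consistent with the eviction/reconnection messages above: the MGC's connection to the MGS appears to have been dropped while the registration request was in flight. The errno mapping can be sanity-checked with a trivial, non-Lustre-specific snippet:

      import errno
      import os

      # -108 in the Lustre logs is the negated Linux errno value.
      assert errno.ESHUTDOWN == 108
      print(os.strerror(errno.ESHUTDOWN))  # Cannot send after transport endpoint shutdown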
      
      
      

      But if we then try the exact same mount command a second time, immediately following the failure, it succeeds:

      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16
      
      

      In the MGS's syslog, around the time of the failure, we see:

      Aug  5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)
      
      

      where 10.14.83.71 is lotus-58vm18.

      I've set this issue to Critical because, while it may be merely disconcerting for a human user, who might be persistent and curious enough to try a second time and discover that it works, it wreaks havoc on automated systems like IML and Pacemaker, which expect commands that should succeed to actually succeed.
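
      As a stop-gap for automation, a wrapper that retries the mount once after a short delay works around the symptom, since in our testing the second attempt succeeds immediately. A minimal sketch of such a wrapper is below; the device and mount point paths are hypothetical placeholders, not taken from this system:

      #!/usr/bin/env python3
      # Workaround sketch only: retry a Lustre target mount once if the first
      # registration attempt fails (e.g. with rc = -108).
      import subprocess
      import sys
      import time

      DEVICE = "/dev/sdb"          # hypothetical target device
      MOUNT_POINT = "/mnt/ost1"    # hypothetical mount point
      ATTEMPTS = 2                 # first try plus one retry
      RETRY_DELAY = 5              # seconds to wait before retrying


      def mount_target(device, mount_point):
          # Standard Lustre server mount; returns the mount(8) exit status.
          return subprocess.call(["mount", "-t", "lustre", device, mount_point])


      rc = 1
      for attempt in range(1, ATTEMPTS + 1):
          rc = mount_target(DEVICE, MOUNT_POINT)
          if rc == 0:
              break
          sys.stderr.write("mount attempt %d failed with rc=%d\n" % (attempt, rc))
          if attempt < ATTEMPTS:
              time.sleep(RETRY_DELAY)

      sys.exit(rc)

      Obviously this only papers over the race; the point of this ticket is that the first registration attempt should not fail in the first place.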

      Attachments

        1. mds-debug.log
          64 kB
        2. mds-debug.log.bz2
          8.54 MB
        3. oss-debug.log
          101 kB
        4. oss-debug.log.bz2
          7.10 MB


            People

              Assignee: WC Triage
              Reporter: Brian Murrell (Inactive)
              Votes: 1
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: