Lustre / LU-1797

MDS deactivates/reactivates OSTs

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.3.0
    • None
    • Lustre 2.2.93, LLNL/Hyperion
    • 3
    • 8543

    Description

      While running the SWL test suite (a mixture of I/O jobs spread across all clients), I am seeing the following on multiple OSTs:

      Aug 28 05:11:41 ehyperion-dit34 kernel: Lustre: lustre-OST000d: Client lustre-MDT0000-mdtlov_UUID (at 192.168.127.6@o2ib1) reconnecting
      Aug 28 05:11:41 ehyperion-dit34 kernel: Lustre: lustre-OST000d: received MDS connection from 192.168.127.6@o2ib1
      Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: 3947:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346155795/real 1346155795]  req@ffff880093a2d000 x1411378033913049/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346155901 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
      Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
      Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans
      Aug 28 05:11:55 ehyperion-rst6 kernel: LustreError: 3947:0:(osc_create.c:169:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff880093a2d000 x1411378033913049/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346155901 ref 1 fl Interpret:RXN/0/ffffffff rc -107/-1
      Aug 28 05:12:31 ehyperion-rst6 kernel: LustreError: 12495:0:(mds_lov.c:883:__mds_lov_synchronize()) lustre-OST000d_UUID failed at mds_lov_clear_orphans: -5
      Aug 28 05:12:31 ehyperion-rst6 kernel: LustreError: 12495:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST000d_UUID sync failed -5, deactivating
      Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: 3948:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346156476/real 1346156476]  req@ffff8801fa613400 x1411378034235068/t0(0) o400->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1346156587 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
      Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Aug 28 05:23:07 ehyperion-dit34 kernel: Lustre: lustre-OST000d: Client lustre-MDT0000-mdtlov_UUID (at 192.168.127.6@o2ib1) reconnecting
      Aug 28 05:23:07 ehyperion-dit34 kernel: Lustre: lustre-OST000d: received MDS connection from 192.168.127.6@o2ib1
      Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
      Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans
      
      

      In this case, the OST recovered without intervention. In other cases, running 'lctl --device NN recover' fixed the problem and produced the following messages:

      Aug 28 07:56:04 ehyperion-dit33 kernel: Lustre: 7426:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff8802dbaaf800 - ffff8802cf6ff000
      Aug 28 07:56:04 ehyperion-dit33 kernel: Lustre: 7426:0:(llog_net.c:162:llog_receptor_accept()) Skipped 1 previous similar message
      

      This does not appear to be causing SWL tests to fail at this time.
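
      To see how long an OST stays deactivated in logs like the first excerpt, the "sync failed ..., deactivating" and "now active, resetting orphans" messages can be paired per OST. The following is a minimal sketch only, not a Lustre tool: the function names are hypothetical, the sample lines are copied from the excerpt, and the year is an assumption because syslog timestamps omit it (it only matters if a window spans a year boundary).

```python
import re
from datetime import datetime

# Patterns taken from the MDS messages quoted in this report.
DEACTIVATED = re.compile(r"(\S+)_UUID sync failed (-?\d+), deactivating")
REACTIVATED = re.compile(r"(\S+)_UUID now active, resetting orphans")
TIMESTAMP = re.compile(r"^\s*(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})")


def parse_time(line, year=2012):
    """Return the syslog timestamp of a line, or None if it has none.

    The year is an assumption supplied by the caller.
    """
    m = TIMESTAMP.match(line)
    if not m:
        return None
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def deactivation_windows(lines, year=2012):
    """Pair each deactivation with the next reactivation of the same OST.

    Returns a list of (ost_name, error_code, seconds_deactivated).
    """
    down_since = {}
    windows = []
    for line in lines:
        when = parse_time(line, year)
        if when is None:
            continue
        m = DEACTIVATED.search(line)
        if m:
            # Keep the first deactivation time if the message repeats.
            down_since.setdefault(m.group(1), (when, int(m.group(2))))
            continue
        m = REACTIVATED.search(line)
        if m and m.group(1) in down_since:
            start, rc = down_since.pop(m.group(1))
            windows.append((m.group(1), rc, (when - start).total_seconds()))
    return windows


SAMPLE = [
    "Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans",
    "Aug 28 05:12:31 ehyperion-rst6 kernel: LustreError: 12495:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST000d_UUID sync failed -5, deactivating",
    "Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans",
]

if __name__ == "__main__":
    for ost, rc, seconds in deactivation_windows(SAMPLE):
        print(f"{ost}: rc {rc}, deactivated for {seconds:.0f} s")
```

      On the sample lines this reports lustre-OST000d as deactivated for 636 seconds (05:12:31 to 05:23:07). A "now active" message with no preceding deactivation, such as the one at 05:11:41, is ignored.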

          People

            green Oleg Drokin
            cliffw Cliff White (Inactive)