Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2327

Clients stuck in mount

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.0
    • Sequoia, 2.3.54-2chaos, github.com/chaos/lustre
    • 3
    • 5558

    Description

      We have another instance of clients getting stuck at mount time. Out of 768 client, 766 mounted successfully, and 2 appear to be stuck. So far it has been several minutes. They both had the same soft lockup messages on the console:

      LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
      Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160
      REGS: c0000003ca60fba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      MSR: 0000000080029000 <EE,ME,CE>  CR: 84228424  XER: 20000000
      TASK = c0000003ca51e600[3244] 'll_ping' THREAD: c0000003ca60c000 CPU: 33
      GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78
      GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000
      GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160
      GPR12: 8000000003aebda8 c00000000075d200
      NIP [c00000000042e190] ._spin_lock+0x30/0x44
      LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
      Call Trace:
      [c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
      [c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      Instruction dump:
      38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0
      4c00012c 2fa00000 4dfe0020 7c210b78 <80030000> 2fa00000 40defff4 7c421378
      BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175]
      Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160
      REGS: c0000003e9bf3900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      MSR: 0000000080029000 <EE,ME,CE>  CR: 84228444  XER: 20000000
      TASK = c0000003ea1f0740[3175] 'ptlrpcd_rcv' THREAD: c0000003e9bf0000 CPU: 49
      GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78
      GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
      GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160
      GPR12: 8000000003aebda8 c00000000076a200
      NIP [c00000000042e198] ._spin_lock+0x38/0x44
      LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
      Call Trace:
      [c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
      [c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
      [c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
      [c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      Instruction dump:
      812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000
      4dfe0020 7c210b78 80030000 2fa00000 <40defff4> 7c421378 4bffffc8 7c0802a6
      

      Both nodes are pretty much idle. I'll see what other information I can gather.

      Attachments

        1. lustre-log.1357681275.3261.txt
          3.02 MB
        2. seqio162_console.txt
          1.58 MB
        3. seqio162_lustre_log.txt
          3.06 MB
        4. seqio542_console_sysrq.log
          1.66 MB
        5. seqio542_lustre.log
          7.99 MB
        6. seqio652_lustre.log
          8.00 MB
        7. seqio652_lustre2.log
          4.02 MB

        Issue Links

          Activity

            People

              bzzz Alex Zhuravlev
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: