Lustre / LU-1469

Hyperion DAT - failures with multiple loads (simul+mdtest+ior)

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Labels: None
    • Severity: 3
    • Rank: 3978

    Description

      Divided the 1012 DAT nodes into three groups of ~337 nodes each:
      Group 1 - iorfpp
      Group 2 - simul
      Group 3 - mdtestfpp
      Each group was run separately, and each passed. When all three groups run simultaneously, one or more of the tests fail.
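The node split described above can be sketched as follows. This is only an illustration of the arithmetic, not the script used on Hyperion: the node names and the helper `split_into_groups` are hypothetical, since the report does not list hostnames or say how the groups were assigned.

```python
def split_into_groups(nodes, n_groups):
    """Split nodes into n_groups contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(nodes), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups absorb the remainder, one node each.
        size = base + (1 if i < extra else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups


# Hypothetical node names; the report does not give the real hostnames.
nodes = [f"node{i:04d}" for i in range(1012)]
workloads = ["iorfpp", "simul", "mdtestfpp"]
groups = dict(zip(workloads, split_into_groups(nodes, len(workloads))))

for name, members in groups.items():
    print(name, len(members))
```

With 1012 nodes and three workloads, this gives groups of 338, 337, and 337 nodes, which matches the "~337 nodes" figure in the description.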
      Server-side errors start with:

      Jun 1 18:29:00 ehyperion-dit29 kernel: LustreError: 13446:0:(ldlm_request.c:91:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1338600340, 200s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0013_UUID lock: ffff8802a31d4b40/0x676dc3d48e273bbd lrc: 3/0,1 mode: -/PW res: 96739/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x80004000 remote: 0x0 expref: -99 pid: 13446 timeout 0
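For triage across many server logs, the fields of a message like the one above can be pulled out with a small parser. The sketch below is an assumption about how one might do this, not a Lustre tool: the regular expression is written against this single sample line, and the function name `parse_lock_timeout` and the `whole_object` flag are hypothetical.

```python
import re

# Sample message copied from the report (syslog prefix and trailing fields dropped).
SAMPLE = (
    "LustreError: 13446:0:(ldlm_request.c:91:ldlm_expired_completion_wait()) "
    "### lock timed out (enqueued at 1338600340, 200s ago); not entering "
    "recovery in server code, just going back to sleep "
    "ns: filter-lustre-OST0013_UUID lock: ffff8802a31d4b40/0x676dc3d48e273bbd "
    "lrc: 3/0,1 mode: -/PW res: 96739/0 rrc: 2 type: EXT "
    "[0->18446744073709551615]"
)

# Pattern derived from the sample line only; other Lustre versions may format it differently.
TIMEOUT_RE = re.compile(
    r"lock timed out \(enqueued at (?P<enqueued>\d+), (?P<waited>\d+)s ago\)"
    r".*?\bns: (?P<ns>\S+)"
    r".*?\bmode: (?P<mode>\S+)"
    r".*?\bres: (?P<res>\S+)"
    r".*?\[(?P<start>\d+)->(?P<end>\d+)\]"
)


def parse_lock_timeout(line):
    """Return a dict of fields from a lock-timeout message, or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("enqueued", "waited", "start", "end"):
        info[key] = int(info[key])
    # An extent of [0, 2**64 - 1] covers the entire object.
    info["whole_object"] = info["start"] == 0 and info["end"] == 2**64 - 1
    return info


print(parse_lock_timeout(SAMPLE))
```

Grouping the parsed results by `ns` and `res` would show whether the timeouts concentrate on a few OSTs or objects, which is one way to narrow down which of the three workloads is contending for the lock.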

      Server debug logs, the debug log from one failing client, and the messages files are on the FTP site in DAT2.tar.gz.

          People

            Assignee: Oleg Drokin (green)
            Reporter: Cliff White (cliffw, Inactive)