  1. Lustre
  2. LU-4584

Lock revocation process fails consistently

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • None
    • 3
    • 12530

    Description

      Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.

      This isn't necessarily the simplest reproducer, but the following reproducer is completely consistent:

      1. set directory striping default count to 48
      2. touch a file on client A
      3. rm file on client B
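
      The three steps above can be sketched as a small script. This is only an illustration under stated assumptions: the ticket does not say which commands were actually used, so `lfs setstripe -c` is assumed as the way to set the default stripe count, and the helper names (`setstripe_command`, `touch`, `timed_unlink`) are hypothetical. Steps 1 and 2 would run on client A and step 3 on client B.

      ```python
      import os
      import time

      STRIPE_COUNT = 48  # default stripe count from step 1 of the reproducer


      def setstripe_command(directory, stripe_count=STRIPE_COUNT):
          """Step 1: build the argv that sets a directory's default stripe count.

          Assumes the standard `lfs setstripe -c <count> <dir>` form; the
          command is only constructed here, not executed.
          """
          return ["lfs", "setstripe", "-c", str(stripe_count), directory]


      def touch(path):
          """Step 2 (client A): create the file if missing and update its times."""
          with open(path, "a"):
              os.utime(path, None)


      def timed_unlink(path):
          """Step 3 (client B): remove the file and return the elapsed seconds."""
          start = time.monotonic()
          os.unlink(path)
          return time.monotonic() - start
      ```

      On client A one might run `subprocess.run(setstripe_command(d), check=True)` followed by `touch(os.path.join(d, "f"))`, then call `timed_unlink` on the same path from client B. An elapsed time of just over 100 seconds would match the behavior described in this ticket.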

      The clients are running 2.4.0-19chaos; the servers are at 2.4.0-21chaos and use ZFS as the backend.

      I have some Lustre logs that I will share and discuss in additional posts to this ticket. Essentially, it looks like the server always times out on an AST to client A, which would explain the 100-second delay. It is not yet clear to me why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.

      Attachments

        1. 172.16.66.4@tcp.log.bz2
          40 kB
        2. 172.16.66.5@tcp.log.bz2
          53 kB
        3. 172.20.20.201@o2ib500.log.bz2
          8.52 MB
        4. client_log_20140206.txt
          375 kB
        5. inflames.log
          2.40 MB

            People

              bfaccini Bruno Faccini (Inactive)
              morrone Christopher Morrone (Inactive)