
Lock revocation process fails consistently

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.

      This may not be the simplest reproducer, but the following one fails completely consistently:

      1. set the directory's default stripe count to 48
      2. touch a file on client A
      3. rm the file on client B

      The clients are running 2.4.0-19chaos, servers are at 2.4.0-21chaos. The servers are using zfs as the backend.

      I have some lustre logs that I will share and talk about in additional posts to this ticket. But essentially it looks like the server always times out on an AST to client A (explaining the 100 second delay). It is not really clear to me yet why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.

      Attachments

        1. 172.16.66.4@tcp.log.bz2
          40 kB
        2. 172.16.66.5@tcp.log.bz2
          53 kB
        3. 172.20.20.201@o2ib500.log.bz2
          8.52 MB
        4. client_log_20140206.txt
          375 kB
        5. inflames.log
          2.40 MB

        Issue Links

          Activity

            [LU-4584] Lock revocation process fails consistently
            green Oleg Drokin added a comment -

            Patch http://review.whamcloud.com/#/c/9488/ has now been fixed to eliminate the assertion in mdt_intent_lock_replace().
            Also, 2.4 deployments would need to carry patch http://review.whamcloud.com/6511 from LU-3428. This should address all related woes in b2_4. 2.6+ will be fixed by the patches from LU-2827 and friends. As for 2.5, I think we'll still go with LU-2827 as the more generic solution.


            morrone Christopher Morrone (Inactive) added a comment -

            Yes, that was my belief. I would like Intel to enumerate the failure modes that users can expect to be fixed, and those that will not be fixed by LU-4584.

            simmonsja James A Simmons added a comment -

            I have several different reproducers of this problem. What I found was that the LU-4584 patch addresses some of the reproducers, but not all of them. The patch for LU-2827 addressed more of my reproducers, but I still had client evictions.

            morrone Christopher Morrone (Inactive) added a comment -

            Could you please provide an explanation of what operations will not be fixed by the LU-4584 patch, as compared with the more general LU-2827 fix?

            morrone Christopher Morrone (Inactive) added a comment -

            FYI, we put the LU-4584 patch into production, and it didn't take long before we hit the assertion that James reported. I opened LU-5525 for that bug.

            I am beginning to suspect that the patch either caused the assertion or made it more common.

            simmonsja James A Simmons added a comment -

            I have been testing with the LU-4584 patch and I am still seeing client evictions. Would it be possible to get the LU-2827 patch working on 2.4?

            simmonsja James A Simmons added a comment -

            It was my bad. For the last test shot we used our 2.4 production file system, which didn't have the patch from here, so the breakage above is expected. We are in the process of testing this at larger scale on a 500-node production machine. Yes, ORNL has created a public git tree:

            https://github.com/ORNL-TechInt/lustre

            so people can examine our special sauce.

            morrone Christopher Morrone (Inactive) added a comment -

            James, can you share the patch stack you are using? That might help us figure out whether you are reporting the same issue or something else. If it isn't exactly the same issue, we really need you to report it in another ticket.

            simmonsja James A Simmons added a comment -

            Just finished a test shot with Cray 2.5 clients to see if the client evictions stopped. Their default client, some 2.5 version with many, many patches, lacked the LU-2827 and LU-4861 patches that I found helped with 2.5.2. So I applied the patches from LU-2827 and LU-4861, but still had client evictions. I collected the logs from the server side and have placed them here:

            ftp.whamcloud.com/uploads/LU-4584/atlas2_testshot_Jul_29_2014_debug_logs.tar.gz

            bfaccini Bruno Faccini (Inactive) added a comment -

            BTW, I forgot to indicate here that my b2_4 patch/back-port for LU-2827 (http://review.whamcloud.com/10902) still has some problems and needs rework, because the MDS crashes with "(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed" when running the LLNL reproducer from LU-4584 or recovery-small/test_53 in auto-tests.
            More to come; the crash dump is under investigation, but we can still use http://review.whamcloud.com/9488 as a fix for b2_4.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Merged b2_4 backport of both the #5978 and #10378 master changes from LU-2827 is at http://review.whamcloud.com/10902.

            People

              bfaccini Bruno Faccini (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 1
              Watchers: 29
