Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5514

After upgrade from 1.8.7 to 2.5.2 stack trace cfs_hash_bd_lookup_intent

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • None
    • None
    • 3
    • 15371

    Description

      We have upgraded from 1.8.7 to 2.5.2

      After mounting the FS at approximately 17:29

      At 18:49 we see the following

      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15310 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Pid: 15310, comm: mdt01_000
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Call Trace:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057ffd5>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06bdeab>] lu_object_find_at+0xab/0x360 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81282965>] ? _atomic_dec_and_lock+0x55/0x80
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06be19f>] lu_object_find_slice+0x1f/0x80 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0f99370>] mdd_object_find+0x10/0x70 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa97ba>] mdd_is_parent+0xaa/0x3a0 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa9c0c>] mdd_is_subdir+0x15c/0x200 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e6d509>] mdt_reint_rename+0x1049/0x1c20 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e48940>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0819f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0586f76>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06ddc00>] ? lu_ucred+0x20/0x30 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e684a1>] mdt_reint_rec+0x41/0xe0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dcb3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dfb4>] mdt_reint+0x44/0xe0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e5158a>] mdt_handle_common+0x52a/0x1470 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e8d755>] mds_regular_handle+0x15/0x20 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0853bc5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa056b4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057c3cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa084b2a9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854f2d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854440>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109ab56>] kthread+0x96/0xa0
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LustreError: dumping log to /tmp/lustre-log.1408470549.15310
      Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15470 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: Pid: 15470, comm: mdt03_004

      Attachments

        Issue Links

          Activity

            [LU-5514] After upgrade from 1.8.7 to 2.5.2 stack trace cfs_hash_bd_lookup_intent
            pjones Peter Jones added a comment -

            ok fair enough then. I just did not want you to move forward with patches for something that was a rare edge case

            pjones Peter Jones added a comment - ok fair enough then. I just did not want you to move forward with patches for something that was a rare edge case

            There's a number of processes/applications that tend to create files with a temporary extension and than rename them to the proper name, to avoid any race conditions where processing starts before the data is there, so not that uncommon. And considering that we first saw it with basically only ior test runs, I'd say, something is doing it quite frequently...

            Cheers,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment - There's a number of processes/applications that tend to create files with a temporary extension and than rename them to the proper name, to avoid any race conditions where processing starts before the data is there, so not that uncommon. And considering that we first saw it with basically only ior test runs, I'd say, something is doing it quite frequently... Cheers, Frederik
            pjones Peter Jones added a comment -

            Frederik

            Is rename commonly used during regular production usage or is this an issue that is more likely to hit in a stressing test run?

            Peter

            pjones Peter Jones added a comment - Frederik Is rename commonly used during regular production usage or is this an issue that is more likely to hit in a stressing test run? Peter
            laisiyao Lai Siyao added a comment -

            Yes, it's sufficient with these 3 patches. Once those two patches for LU-4725 are landed on 2.5.3, you can use 2.5.3 directly, but currently you have to apply these patches yourself.

            laisiyao Lai Siyao added a comment - Yes, it's sufficient with these 3 patches. Once those two patches for LU-4725 are landed on 2.5.3, you can use 2.5.3 directly, but currently you have to apply these patches yourself.

            Peter, Lai,

            thanks for clarifying this.

            We'll look into compiling lustre 2.5.2 + those 3 patches, but if there is an easier way for us to get the correct compiled versions, let us know.

            I assume, initially it should be sufficient to run this patches version on the MDS, correct?

            Kind regards,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment - Peter, Lai, thanks for clarifying this. We'll look into compiling lustre 2.5.2 + those 3 patches, but if there is an easier way for us to get the correct compiled versions, let us know. I assume, initially it should be sufficient to run this patches version on the MDS, correct? Kind regards, Frederik
            laisiyao Lai Siyao added a comment -

            This turns out to be a duplicate of LU-4725, and two patches are in review phase: http://review.whamcloud.com/#/c/10916/ http://review.whamcloud.com/#/c/10917/.

            This is the same issue as http://review.whamcloud.com/#/c/10816/, but this happens in rename, while that's in getattr. It's a race in lookup the same object twice while object is destroyed in the mean time. I'd suggest you apply these three patches to fully solve such issue.

            laisiyao Lai Siyao added a comment - This turns out to be a duplicate of LU-4725 , and two patches are in review phase: http://review.whamcloud.com/#/c/10916/ http://review.whamcloud.com/#/c/10917/ . This is the same issue as http://review.whamcloud.com/#/c/10816/ , but this happens in rename, while that's in getattr. It's a race in lookup the same object twice while object is destroyed in the mean time. I'd suggest you apply these three patches to fully solve such issue.
            laisiyao Lai Siyao added a comment -

            racer.sh test script is the reproducer for http://review.whamcloud.com/#/c/10816/, which you may use as well.

            laisiyao Lai Siyao added a comment - racer.sh test script is the reproducer for http://review.whamcloud.com/#/c/10816/ , which you may use as well.
            pjones Peter Jones added a comment -

            Frederik

            Lai is suggesting that you should apply the patch - which is missing from 2.5.2 but will be in the upcoming 2.5.3 release.

            Regards

            Peter

            pjones Peter Jones added a comment - Frederik Lai is suggesting that you should apply the patch - which is missing from 2.5.2 but will be in the upcoming 2.5.3 release. Regards Peter

            Lai,

            I'm not sure I fully understand your comment, did you want us (Diamond) to verify if that patch is included in 2.5.2?

            Also, after looking at the patch and issues referenced in the patch, I'm still not clear what might have happened? I assume it needs special circumstances during a rename/getattr to trigger this. Are there any more details know? Any hints where we should start looking for a reproducer?

            FYI, the file system is not yet seeing normal use as we're currently running tests before letting the users back on...

            Cheers,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment - Lai, I'm not sure I fully understand your comment, did you want us (Diamond) to verify if that patch is included in 2.5.2? Also, after looking at the patch and issues referenced in the patch, I'm still not clear what might have happened? I assume it needs special circumstances during a rename/getattr to trigger this. Are there any more details know? Any hints where we should start looking for a reproducer? FYI, the file system is not yet seeing normal use as we're currently running tests before letting the users back on... Cheers, Frederik
            laisiyao Lai Siyao added a comment -

            I don't see anything particular in the log. And there is a patch for a similar issue: http://review.whamcloud.com/#/c/10816/, however the patch looks not included in 2.5.2, could you verify this?

            Besides, http://review.whamcloud.com/#/c/10816/ may not be able to fix this issue because that's in getattr context, but this in rename. There may be a similar error in rename.

            laisiyao Lai Siyao added a comment - I don't see anything particular in the log. And there is a patch for a similar issue: http://review.whamcloud.com/#/c/10816/ , however the patch looks not included in 2.5.2, could you verify this? Besides, http://review.whamcloud.com/#/c/10816/ may not be able to fix this issue because that's in getattr context, but this in rename. There may be a similar error in rename.

            People

              laisiyao Lai Siyao
              davebond-diamond Dave Bond (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: