Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5514

After upgrade from 1.8.7 to 2.5.2 stack trace cfs_hash_bd_lookup_intent

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • None
    • None
    • 3
    • 15371

    Description

      We have upgraded from 1.8.7 to 2.5.2

      After mounting the FS at approximately 17:29

      At 18:49 we see the following

      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15310 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Pid: 15310, comm: mdt01_000
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Call Trace:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057ffd5>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06bdeab>] lu_object_find_at+0xab/0x360 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81282965>] ? _atomic_dec_and_lock+0x55/0x80
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06be19f>] lu_object_find_slice+0x1f/0x80 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0f99370>] mdd_object_find+0x10/0x70 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa97ba>] mdd_is_parent+0xaa/0x3a0 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa9c0c>] mdd_is_subdir+0x15c/0x200 [mdd]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e6d509>] mdt_reint_rename+0x1049/0x1c20 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e48940>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0819f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0586f76>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06ddc00>] ? lu_ucred+0x20/0x30 [obdclass]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e684a1>] mdt_reint_rec+0x41/0xe0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dcb3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dfb4>] mdt_reint+0x44/0xe0 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e5158a>] mdt_handle_common+0x52a/0x1470 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e8d755>] mds_regular_handle+0x15/0x20 [mdt]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0853bc5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa056b4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057c3cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa084b2a9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854f2d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854440>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109ab56>] kthread+0x96/0xa0
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
      Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LustreError: dumping log to /tmp/lustre-log.1408470549.15310
      Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15470 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: Pid: 15470, comm: mdt03_004

      Attachments

        Issue Links

          Activity

            [LU-5514] After upgrade from 1.8.7 to 2.5.2 stack trace cfs_hash_bd_lookup_intent
            adilger Andreas Dilger made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            End date New: 21/Aug/14
            Start date New: 20/Aug/14
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-4725 [ LU-4725 ]
            pjones Peter Jones added a comment -

            ok fair enough then. I just did not want you to move forward with patches for something that was a rare edge case

            pjones Peter Jones added a comment - ok fair enough then. I just did not want you to move forward with patches for something that was a rare edge case

            There's a number of processes/applications that tend to create files with a temporary extension and than rename them to the proper name, to avoid any race conditions where processing starts before the data is there, so not that uncommon. And considering that we first saw it with basically only ior test runs, I'd say, something is doing it quite frequently...

            Cheers,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment - There's a number of processes/applications that tend to create files with a temporary extension and than rename them to the proper name, to avoid any race conditions where processing starts before the data is there, so not that uncommon. And considering that we first saw it with basically only ior test runs, I'd say, something is doing it quite frequently... Cheers, Frederik
            pjones Peter Jones added a comment -

            Frederik

            Is rename commonly used during regular production usage or is this an issue that is more likely to hit in a stressing test run?

            Peter

            pjones Peter Jones added a comment - Frederik Is rename commonly used during regular production usage or is this an issue that is more likely to hit in a stressing test run? Peter
            laisiyao Lai Siyao added a comment -

            Yes, it's sufficient with these 3 patches. Once those two patches for LU-4725 are landed on 2.5.3, you can use 2.5.3 directly, but currently you have to apply these patches yourself.

            laisiyao Lai Siyao added a comment - Yes, it's sufficient with these 3 patches. Once those two patches for LU-4725 are landed on 2.5.3, you can use 2.5.3 directly, but currently you have to apply these patches yourself.

            Peter, Lai,

            thanks for clarifying this.

            We'll look into compiling lustre 2.5.2 + those 3 patches, but if there is an easier way for us to get the correct compiled versions, let us know.

            I assume, initially it should be sufficient to run this patches version on the MDS, correct?

            Kind regards,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment - Peter, Lai, thanks for clarifying this. We'll look into compiling lustre 2.5.2 + those 3 patches, but if there is an easier way for us to get the correct compiled versions, let us know. I assume, initially it should be sufficient to run this patches version on the MDS, correct? Kind regards, Frederik
            laisiyao Lai Siyao added a comment -

            This turns out to be a duplicate of LU-4725, and two patches are in review phase: http://review.whamcloud.com/#/c/10916/ http://review.whamcloud.com/#/c/10917/.

            This is the same issue as http://review.whamcloud.com/#/c/10816/, but this happens in rename, while that's in getattr. It's a race in lookup the same object twice while object is destroyed in the mean time. I'd suggest you apply these three patches to fully solve such issue.

            laisiyao Lai Siyao added a comment - This turns out to be a duplicate of LU-4725 , and two patches are in review phase: http://review.whamcloud.com/#/c/10916/ http://review.whamcloud.com/#/c/10917/ . This is the same issue as http://review.whamcloud.com/#/c/10816/ , but this happens in rename, while that's in getattr. It's a race in lookup the same object twice while object is destroyed in the mean time. I'd suggest you apply these three patches to fully solve such issue.
            laisiyao Lai Siyao added a comment -

            racer.sh test script is the reproducer for http://review.whamcloud.com/#/c/10816/, which you may use as well.

            laisiyao Lai Siyao added a comment - racer.sh test script is the reproducer for http://review.whamcloud.com/#/c/10816/ , which you may use as well.

            People

              laisiyao Lai Siyao
              davebond-diamond Dave Bond (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: