
osc_cache.c:899:osc_extent_wait() timeout quite often

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0, Lustre 2.4.2
    • Labels: None
    • Environment: RHEL6
    • Severity: 2
    • Rank (Obsolete): 12438

    Description

      We hit client hangs quite often on all the login nodes, with the following Lustre error message printed. The client does not recover until it is rebooted.

      Jan 22 17:23:23 ff01 kernel: LustreError: 84026:0,(osc_cache.c:899:osc_extent_wait()) extent ffff8831a49b0678@{[0 > 0/255], [3|0|+|rpc|wihY|ffff88283005bc48], [4096|1|+||ffff8828fb76b228|256|ffff88319695e040]} home2-OST000b-osc-ffff883fdbbd8800: wait ext to 0 timedout, recovery in progress?
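
      When many such messages have to be compared across the attached syslog files, it can help to split each line into fields. The following is a minimal Python sketch, not part of the original report: the function name parse_extent_timeout and the regular expressions are illustrative assumptions. It parses only the envelope of the message (PID, source location, extent address, OSC device) and keeps the bracketed extent dump verbatim, because the layout of that dump is not described in this ticket.

```python
import re

# Matches only the envelope of the message. The bracketed extent dump is
# captured as-is; its field layout is not described in this ticket.
TIMEOUT_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+[:,]"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"extent (?P<extent>[0-9a-f]+)@\{(?P<dump>.*)\} "
    r"(?P<device>\S+): (?P<message>.*)$"
)
# Assumption: the OSC device name embeds the target as "-OSTxxxx-osc-".
OST_RE = re.compile(r"-(OST[0-9a-fA-F]{4})-osc-")


def parse_extent_timeout(line):
    """Return a dict of fields for an osc_extent_wait() message, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    ost = OST_RE.search(fields["device"])
    fields["ost"] = ost.group(1) if ost else None
    return fields


# The message quoted in the description, copied verbatim.
SAMPLE = (
    "Jan 22 17:23:23 ff01 kernel: LustreError: 84026:0,"
    "(osc_cache.c:899:osc_extent_wait()) extent ffff8831a49b0678@"
    "{[0 > 0/255], [3|0|+|rpc|wihY|ffff88283005bc48], "
    "[4096|1|+||ffff8828fb76b228|256|ffff88319695e040]} "
    "home2-OST000b-osc-ffff883fdbbd8800: "
    "wait ext to 0 timedout, recovery in progress?"
)
```

      Calling parse_extent_timeout(SAMPLE) yields the OSC device home2-OST000b-osc-ffff883fdbbd8800, which makes it easy to group timeouts by OST when scanning a whole messages file.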
      

      Attachments

        1. lctl.dk.23.17.tgz
          1.40 MB
        2. lctl.dk.after.tgz
          885 kB
        3. lctl.dk1.tgz
          0.2 kB
        4. messages.after_call_trace
          1.75 MB
        5. messages.after_osc_msg
          2.51 MB
        6. messages.before_call_trace
          1.01 MB

          Activity

            [LU-4552] osc_cache.c:899:osc_extent_wait() timeout quite often
            pjones Peter Jones added a comment -

            Thanks Ihara


            ihara Shuichi Ihara (Inactive) added a comment -

            Bruno, yes. As far as we have tested, I think this is a duplicate of LU-4300. Please close this ticket, LU-4552.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Shuichi, do you agree if I close this ticket as a dup of LU-4300?

            ihara Shuichi Ihara (Inactive) added a comment -

            Thanks, Bruno!
            After setting "echo 0 > /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel", the problem was not reproduced, so this looks like the same problem as LU-4300. We tried a couple of times, but nothing happened and the installer finished without errors.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Shuichi,
            After having a look at the back-traces (I still need to review the Lustre debug logs!), your problem seems similar to the one reported in LU-4300.
            Also, could you try to run the same 100% reproducer on a node where ELC (early lock cancel) has been disabled? I think this can be set with "echo 0 > /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel".
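
            A note on the command quoted above: the path contains a wildcard, and in bash a redirection target that expands to more than one file is rejected as an ambiguous redirect, so on a client with several LDLM namespaces the setting presumably has to be written to each file in turn. The following is a minimal Python sketch of that per-namespace write. The function name set_early_lock_cancel and its pattern parameter are illustrative assumptions; only the /proc path comes from this ticket. Writing to the real path requires root on a Lustre client.

```python
import glob

# Path taken from the comment above; the wildcard matches every namespace.
ELC_GLOB = "/proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel"


def set_early_lock_cancel(value, pattern=ELC_GLOB):
    """Write 0 (disable) or 1 (enable) to every file matching pattern.

    Returns the sorted list of paths that were written.
    """
    if value not in (0, 1):
        raise ValueError("early_lock_cancel must be 0 or 1")
    written = []
    for path in sorted(glob.glob(pattern)):
        # One write per namespace, which a single shell redirect cannot do.
        with open(path, "w") as f:
            f.write(f"{value}\n")
        written.append(path)
    return written
```

            Running set_early_lock_cancel(0) as root would disable ELC for all namespaces currently present on the node. The pattern argument exists so the function can be pointed at a scratch directory for testing. On a live system, "lctl set_param ldlm.namespaces.*.early_lock_cancel=0" should have the same effect, since lctl set_param accepts wildcards.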

            ihara Shuichi Ihara (Inactive) added a comment -

            Here are the backtraces taken 1) before the call trace dump, 2) after the call trace dump, and 3) after the osc_extent_wait() timeout messages were printed.

            ihara Shuichi Ihara (Inactive) added a comment -

            Jinshan, Hongchao,
            No, this is not an RPM package, and it is not open-source software, so we can't share it. It's a Java-based installer. I don't think it is doing anything very specific, but I could not reproduce this problem any other way in our lab either.
            However, we can reproduce this problem 100% of the time with that Java installer at the customer site. If you want any additional information, please let me know. We will collect whatever information you want and reproduce the problem.
            hongchao.zhang Hongchao Zhang added a comment - - edited

            Hi Ihara,

            Do you use rpm as the software installer?
            I ran " while [ true ]; do du -a /mnt/lustre >/dev/null 2>&1 ; done &" in the background and continuously ran "rpm -ivh", but couldn't reproduce it.
            Both 2.5.0 and 2.4.2 were tested.

            Thanks
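
            The test pattern described above (a continuous metadata walk in the background while an installer writes files in the foreground) can be sketched in Python as follows. This is only an illustration of the pattern; the names walk_once and run_with_background_walk are hypothetical, and, as reported above, this kind of load did not reproduce the hang.

```python
import os
import threading


def walk_once(root):
    """Stat every entry under root, roughly what `du -a` does.

    Returns the total size in bytes of the entries seen.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # entry disappeared between listing and stat
    return total


def run_with_background_walk(root, foreground):
    """Call foreground() while a thread walks root in a loop.

    Returns (result of foreground, number of completed walks).
    """
    stop = threading.Event()
    passes = [0]

    def loop():
        while not stop.is_set():
            walk_once(root)
            passes[0] += 1

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    try:
        result = foreground()
    finally:
        stop.set()
        worker.join()
    return result, passes[0]
```

            Here foreground would be any callable that performs the installation step, for example one that invokes the installer through subprocess, and root would be the Lustre mount point.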


            jay Jinshan Xiong (Inactive) added a comment -

            Is it easy to reproduce? If so, it would be a good idea to share the reproducer program with us.

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi Jinshan,
            We don't have it yet, but we can reproduce the same problem and get a backtrace. I will collect them soon. In the meantime, I'm uploading the Lustre debug log that we got.

            People

              hongchao.zhang Hongchao Zhang
              ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 9
