Details

    • Technical task
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.1.1
    • None
    • 9747

    Description

      Hit the following LBUG:

      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lock@ffff8802615024d8[4 2 0 0 0 00000005] P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0] {
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     vvp@ffff88025946e9e8: 
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lov@ffff880431e80cf8: 2
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     0 0: ---
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     1 1: lock@ffff8803c465faf8[1 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100a80000:0x1b3b212:0x0] {
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lovsub@ffff8801f6d645a0: [1 ffff880431e80cf8 P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0]] 
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     osc@ffff8803c4694b50: ffff88010cd766c0 00101001 0xe12a56a3ad7ca7fa 3 ffff880428397e48 size: 0 mtime: 1335700029 atime: 1335700029 ctime: 1335700029 blocks: 0
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8803c465faf8
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8802615024d8
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) Delete CLS_HELD lock
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:383:lovsub_lock_delete_one()) Impossible state: 2
      2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:384:lovsub_lock_delete_one()) LBUG
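
When triaging many console logs for this LBUG, it can help to split each line into its fields and pull out the reported lock state. The following is a minimal sketch, not a Lustre tool: the line layout is inferred only from the excerpt above, and the names `LogRecord`, `parse_line`, `find_impossible_state` and `STATE_NAMES` are hypothetical. The only state name included is the one this issue's title gives for value 2.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Layout inferred from the excerpt above (an assumption, not a spec):
#   <date> <time> LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r" ?(?P<msg>.*)$"
)

# Only the mapping stated in the issue title; other values are left unnamed.
STATE_NAMES = {2: "CLS_ENQUEUED"}


class LogRecord(NamedTuple):
    timestamp: str
    pid: int
    source_file: str
    source_line: int
    func: str
    msg: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None if the line does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogRecord(
        timestamp=f"{m['date']} {m['time']}",
        pid=int(m["pid"]),
        source_file=m["file"],
        source_line=int(m["line"]),
        func=m["func"],
        msg=m["msg"],
    )


def find_impossible_state(records: Iterable[LogRecord]) -> Optional[int]:
    """Return the number from the first 'Impossible state: N' message, if any."""
    for rec in records:
        m = re.fullmatch(r"Impossible state: (\d+)", rec.msg)
        if m:
            return int(m.group(1))
    return None
```

Typical use would be to run `parse_line` over each line of a saved console log, drop the `None` results, and look up the value returned by `find_impossible_state` in `STATE_NAMES`.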
      

        Activity

          [LU-1355] LBUG in lovsub_lock_delete_one -- Impossible state: 2 (CLS_ENQUEUED)
          jay Jinshan Xiong (Inactive) added a comment - Fixed in LU-1299

          nedbass Ned Bass (Inactive) added a comment - Also note the originally reported occurrences involved rm.

          jay Jinshan Xiong (Inactive) added a comment - Yes, the original code can trigger a fake OOM because ll_fault() returned the wrong error code, so I expect the OOMs you have seen to go away once LU-1299 is applied.

          From the log, it looks very much like a glimpse of the file size was interrupted by a signal. Since you mention it was hit during normal file system usage such as ls, there must be another path that produces the same backtrace, because ls does not issue signals as far as I know.

          In any case, please apply this patch and try to reproduce the problem again; that way we can get more information and move forward.

          nedbass Ned Bass (Inactive) added a comment - Thanks, we'll give the patch a try.

          I'm not sure that your comment about LU-1299 concealing this bug is consistent with what we've seen. In particular, AFAIK we've only hit LU-1299 by running a truncated executable, which should be rare. On the other hand, we run into this bug during normal filesystem usage.

          Although, now that I write that, I realize we have a local bug open about OOMs occurring even though the application is using little real memory. So perhaps we have unknowingly been hitting LU-1299 after all.

          jay Jinshan Xiong (Inactive) added a comment - Ned, I don't need the debug log, since I now know every detail of this bug. Thanks.

          jay Jinshan Xiong (Inactive) added a comment - Actually, my previous comment was inaccurate. This bug is concealed by LU-1299, because LU-1299 would be hit first if it were not fixed.

          Please check the patch at http://review.whamcloud.com/2632 for a fix.

          This is still a signal handling problem.

          nedbass Ned Bass (Inactive) added a comment - Do you still need the debug log? Now that I'm trying to reproduce it, of course it is being elusive.

          jay Jinshan Xiong (Inactive) added a comment - I confirm that this bug was introduced by LU-1299. I will cook up a patch tomorrow.

          jay Jinshan Xiong (Inactive) added a comment - I need more time on this bug.

          Ned, if you can reproduce this issue fairly often, would it be possible to collect a debug log on the client side? I need DLMTRACE to be set.

          nedbass Ned Bass (Inactive) added a comment - That's great news! We hit it very often. I just unintentionally reproduced it running ls on a login node. I wonder if we landed something that makes it more likely to hit since we never saw it before.

          People

            jay Jinshan Xiong (Inactive)
            prakash Prakash Surya (Inactive)