[LU-2557] osc_page_delete()) Trying to teardown failed: -16 (EBUSY)

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre client 2.3 rc6, SLES11 2.6.32.59 or SLES11 3.0.38
      Lustre server 2.3 rc6, SLES11 2.6.32.59 (direct-attached) or CentOS 2.6.32 (external server)
    • 2
    • 5983

    Description

      The "teardown failure" is handled with an assertion failure.

      The stack trace from the assertion failure call:
      [<ffffffffa06a1e61>] osc_page_delete+0x2d1/0x2e0 [osc]
      [<ffffffffa02d33fd>] cl_page_delete0+0xcd/0x4f0 [obdclass]
      [<ffffffffa02d3862>] cl_page_delete+0x42/0x120 [obdclass]
      [<ffffffffa0826e9d>] ll_invalidatepage+0x8d/0x170 [lustre]
      [<ffffffffa081e264>] ll_page_mkwrite+0x7c4/0x840 [lustre]
      [<ffffffff810ef21b>] __do_fault+0xbb/0x4c0
      [<ffffffff810f2c1b>] handle_mm_fault+0x1db/0xe50
      [<ffffffff810289f7>] do_page_fault+0x147/0x2c0
      [<ffffffff812d90df>] page_fault+0x1f/0x30

      An extract from the Lustre debug log for pid 617:

      00000008:00020000:21.0:1353473247.507160:0:617:0:(osc_page.c:411:osc_page_delete()) page@ffff88080aa2f180[1 ffff8801ff1e1508:51 ^ffff88080aa2f240_(null) 4 0 1 (null) (null) 0x0]
      00000008:00020000:21.0:1353473247.520125:0:617:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880809658be0(0:0:0) vm@ffffea001bfd54d8 e000000000000e3 8:0 0 51 lru
      00000008:00020000:21.0:1353473247.532044:0:617:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff88080b1c4b58
      00000008:00020000:21.0:1353473247.539101:0:617:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff8808176944b0: 1< 0x845fed 258 0 + - > 2< 208896 0 4096 0x0 0x520 | (null) ffff88041c1a6778 ffff8802072be888 > 3< + ffff8801fd445100 0 0 0 > 4< 0 0 8 31752192 - | - - + - > 5< - - + - | 0 - | 17 - ->
      00000008:00020000:21.0:1353473247.562323:0:617:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080aa2f240
      00000008:00020000:21.0:1353473247.569375:0:617:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
      00000008:00040000:21.0:1353473247.576859:0:617:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
      00000008:00040000:21.0:1353473247.583736:0:617:0:(osc_page.c:412:osc_page_delete()) LBUG
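
The -16 in the title and in the log is a negated errno value, which is how kernel code conventionally reports failures; on Linux, errno 16 is EBUSY. The following is a minimal user-space sketch that maps the return codes mentioned in this ticket to their symbolic names. The helper name `errno_name` is hypothetical and is not part of Lustre or the kernel.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: translate a kernel-style return code (zero or a
 * negated errno) into the symbolic name used in this ticket.
 */
static const char *errno_name(int rc)
{
    assert(rc <= 0); /* kernel-style codes are zero or negative */

    switch (-rc) {
    case 0:
        return "OK";
    case EBUSY:
        return "EBUSY";
    case ENODATA:
        return "ENODATA";
    default:
        return "unknown";
    }
}
```

Calling `errno_name(-EBUSY)` yields the string "EBUSY", matching the annotation in the issue title.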

      Customer commentary:
      > The symptom is similar to LU-1442/LU-1680 but we have the patch and apparently the problem still exists.

      A dump is in:
      ftp.cray.com:/outbound/791555-osc_cache_writeback_range-assert.tar.bz2
      More are available on request.

      Reference: LU-1030 – apparent source of problem

          Activity

            pjones Peter Jones made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            duplicate of LU-2720

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2720 [ LU-2720 ]

            bkorb Bruce Korb (Inactive) added a comment -

            CF: LU-2720

            Vitaly Fertman added a comment - 31/Jan/13 7:53 AM

            http://review.whamcloud.com/5222

            hongchao.zhang Hongchao Zhang added a comment -

            Hi, is the previous patch (removing CILR_PEEK) still applied together with the removal of write_one_page?

            bkorb Bruce Korb (Inactive) added a comment -

            Sorry – should have read code before commenting. I knew Linus strongly favored locking and unlocking in the same function, ...

            Anyway, according to Cray, removing the "write_one_page" call still hangs:

            Wally Wang updated LELUS-103:
            -----------------------------

                Attachment: lelus103-sp1-nowrite-one-page.tar.bz2

            After removing the write_one_page(), it still hangs. See attachment
            hongchao.zhang Hongchao Zhang added a comment - - edited

            The page is locked before write_one_page() is called, and write_one_page() unlocks the page before it returns. The modified code looks like this:

            longer snippet from ll_page_mkwrite0
                    if (result == 0 || result == -ENODATA) {
                            lock_page(vmpage);
                            if (vmpage->mapping == NULL) {
                                    unlock_page(vmpage);
                                    if (result == 0)
                                            result = -ENODATA;
                            } else if (result == -ENODATA) {
                                    if (vmpage->mapping != NULL) {
                                            ll_invalidate_page(vmpage);
                                            LASSERT(vmpage->mapping == NULL);
                                    }
                                    unlock_page(vmpage);
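
The locking contract described in this comment (the caller holds the page lock when write_one_page() is entered, and the lock has been dropped by the time it returns) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_page` and every `model_*` function are hypothetical stand-ins, not kernel or Lustre APIs, and a non-blocking try-lock replaces lock_page() so that a would-be self-deadlock shows up as a `false` return instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a page with a non-recursive lock. */
struct model_page {
    bool locked;      /* models the page lock bit */
    bool has_mapping; /* models vmpage->mapping != NULL */
};

/* Stand-in for lock_page(): returns false where the real call would
 * sleep forever because this thread already holds the lock. */
static bool model_trylock_page(struct model_page *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

static void model_unlock_page(struct model_page *p)
{
    assert(p->locked); /* unlocking an unlocked page is a bug */
    p->locked = false;
}

/* Models the contract from the comment: locked on entry, unlocked on return. */
static void model_write_one_page(struct model_page *p)
{
    assert(p->locked);
    model_unlock_page(p);
}

/* Models the invalidation step: needs the lock, detaches the page. */
static void model_invalidate_page(struct model_page *p)
{
    assert(p->locked);
    p->has_mapping = false;
}

/*
 * Walks the -ENODATA branch of the snippet. Returns true if the branch
 * completes with the page unlocked; false marks a sequence that would
 * either self-deadlock or touch the page without holding its lock.
 */
static bool model_enodata_branch(struct model_page *p,
                                 bool call_write_one_page, bool relock)
{
    if (!model_trylock_page(p))      /* outer lock_page(vmpage) */
        return false;
    if (call_write_one_page)
        model_write_one_page(p);     /* returns with the page unlocked */
    if (relock && !model_trylock_page(p))
        return false;                /* lock already held: self-deadlock */
    if (!p->locked)
        return false;                /* page lock was dropped and not retaken */
    if (p->has_mapping)
        model_invalidate_page(p);
    model_unlock_page(p);
    return true;
}
```

Under this model, the original sequence (write, then relock) and the modified sequence (neither call) both complete. Relocking without the intervening write_one_page() is the case that would block on a lock the thread already holds.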
            bkorb Bruce Korb (Inactive) made changes -
            Attachment New: 9999-current.diff [ 12217 ]

            bkorb Bruce Korb (Inactive) added a comment -

            do not lock a page that is already locked
            bkorb Bruce Korb (Inactive) added a comment -

            longer snippet from ll_page_mkwrite0
                    if (result == 0 || result == -ENODATA) {
                            lock_page(vmpage);
                            if (vmpage->mapping == NULL) {
                                    unlock_page(vmpage);
                                    if (result == 0)
                                            result = -ENODATA;
                            } else if (result == -ENODATA) {
                                    write_one_page(vmpage, 1);
                                    lock_page(vmpage);
                                    if (vmpage->mapping != NULL) {
                                            ll_invalidate_page(vmpage);
                                            LASSERT(vmpage->mapping == NULL);
                                    }
                                    unlock_page(vmpage);

            OK, I don't understand how the write_one_page call is not within a "lock_page".
            In fact, it looks to me as though the lock_page() call that follows it is wrong.
            I'm guessing that the thread is self-deadlocked.

            People

              hongchao.zhang Hongchao Zhang
              bkorb Bruce Korb (Inactive)