Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5393

LBUG: (ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.4.3
    • RHEL 6 w/ patched kernel for Lustre
    • 3
    • 15009

    Description

      We are hitting this LBUG on one of our production systems recently updated to Lustre 2.4.3 (on 4th of June).

      LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed:
      LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) LBUG
      Pid: 11020, comm: ll_ost_io03_084
      
      Call Trace:
       [<ffffffffa07e6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa07e6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa04a0d44>] ost_brw_read+0x12d4/0x1340 [ost]
       [<ffffffff81282c09>] ? cpumask_next_and+0x29/0x50
       [<ffffffff8105bf64>] ? find_busiest_group+0x244/0x9f0
       [<ffffffffa0abbf0c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
       [<ffffffffa0abc068>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
       [<ffffffffa04a8038>] ost_handle+0x2ac8/0x48e0 [ost]
       [<ffffffffa0ac2c4b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
       [<ffffffffa0acb428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa07e75de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa07f8d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0ac2789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
       [<ffffffffa0acc7be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      Kernel panic - not syncing: LBUG
      

      The assertion is here:

       753 static int ost_brw_read(struct ptlrpc_request *req, struct obd_trans_info *oti)
       754 {
      ...
       879                 if (page_rc != local_nb[i].len) { /* short read */
       880                         /* All subsequent pages should be 0 */
       881                         while(++i < npages)
       882                                 LASSERT(local_nb[i].rc == 0);
       883                         break;
       884                 }
      

      I was able to get the content of local_nb from the crash dump.

      crash> struct ost_thread_local_cache 0xffff88105da00000
      struct ost_thread_local_cache {
        local = {{
            lnb_file_offset = 6010437632,
            lnb_page_offset = 0,
            len = 4096,
            flags = 0,
            page = 0xffffea0037309c10,
            dentry = 0xffff8808048c9a80,
            lnb_grant_used = 0,
            rc = 4096
          }, {
      ...
          }, {
            lnb_file_offset = 6010757120,
            lnb_page_offset = 0,
            len = 4096,
            flags = 0,
            page = 0xffffea00372ccf80,
            dentry = 0xffff8808048c9a80,
            lnb_grant_used = 0,
            rc = 512   <======== local_nb[i].rc != local_nb[i].len /* short read */
          }, {
            lnb_file_offset = 6010761216,
            lnb_page_offset = 0,
            len = 4096,
            flags = 0,
            page = 0xffffea0037176e98,
            dentry = 0xffff8808048c9a80,
            lnb_grant_used = 0,
            rc = 4096
          }, {
            lnb_file_offset = 1411710976,
            lnb_page_offset = 0,
            len = 4096,
            flags = 1120,
            page = 0x0,
            dentry = 0xffff8803690c2780,
            lnb_grant_used = 0,
            rc = 0
          }, {
            lnb_file_offset = 1411715072,
            lnb_page_offset = 0,
            len = 4096,
            flags = 1120,
            page = 0x0,
            dentry = 0xffff8803690c2780,
            lnb_grant_used = 0,
            rc = 0
          }, {
      ...
      

      This LBUG occurred 5 times since 06/06/14. Each time, we have a short read followed by some non-empty pages (rc != 0). You will find attached some output for the first 3 occurences.

      FYI, we have Lustre routers between the servers and clients as they are on a different LNET. Clients, routers and servers are running Lustre 2.4.3.

      Attachments

        1. 1407594033.bf26.debug.log
          8 kB
        2. 1407598675.bf27.debug.log
          23 kB
        3. 1407800513.bf33.debug.log
          7 kB
        4. MTS1351.tgz
          204 kB

        Issue Links

          Activity

            [LU-5393] LBUG: (ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed
            pjones Peter Jones added a comment -

            Landed for 2.8

            pjones Peter Jones added a comment - Landed for 2.8

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13707/
            Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2267fc79d94dea3778cb4600027698e0155f1af3

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13707/ Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2267fc79d94dea3778cb4600027698e0155f1af3

            http://review.whamcloud.com/13707 definitely appears to be the right way to fix, so I have abandoned http://review.whamcloud.com/11633.

            bfaccini Bruno Faccini (Inactive) added a comment - http://review.whamcloud.com/13707 definitely appears to be the right way to fix, so I have abandoned http://review.whamcloud.com/11633 .

            After having a look in the different crash-dumps for this problem where we have found that :
            _ inode i_size value, at the time of the crash-dump, is each time much higher than the offset where the short-read has occured.
            _ the length that has been read at short-read time/position is very frequently 512 but not always
            _ the number of pages being read in addition is frequently one, but not always, there is a case where the short-read has occured in the 1st page mapped by local_nb[] and then hundreds of page have been read ...

            According to the code of ldiskfs:osd_[read,write]_prep() and underlying/callee routines this can easily occur during a "legal" (both are using ofd_read_lock() protection) read vs write race, where the writer is growing object/i_size beyond current position that the reader also has reached, and because osd_read_prep() is using multiple (currently 4x when de-assembly indicates there are only 3 effectives, and one/"i_size_read(inode) - lnb[i].lnb_file_offset" has been optimized to re-use value, BTW this explais why we have never found local_nb[i].rc high values in crahs-dumps ...) calls of i_size_read() to implement its algorithm/loop regardless of a possible change/grow of i_size.

            Here is a patch/fix attempt, at http://review.whamcloud.com/13707, that only i_size_read() once in osd_read_prep().

            bfaccini Bruno Faccini (Inactive) added a comment - After having a look in the different crash-dumps for this problem where we have found that : _ inode i_size value, at the time of the crash-dump, is each time much higher than the offset where the short-read has occured. _ the length that has been read at short-read time/position is very frequently 512 but not always _ the number of pages being read in addition is frequently one, but not always, there is a case where the short-read has occured in the 1st page mapped by local_nb[] and then hundreds of page have been read ... According to the code of ldiskfs:osd_ [read,write] _prep() and underlying/callee routines this can easily occur during a "legal" (both are using ofd_read_lock() protection) read vs write race, where the writer is growing object/i_size beyond current position that the reader also has reached, and because osd_read_prep() is using multiple (currently 4x when de-assembly indicates there are only 3 effectives, and one/"i_size_read(inode) - lnb [i] .lnb_file_offset" has been optimized to re-use value, BTW this explais why we have never found local_nb [i] .rc high values in crahs-dumps ...) calls of i_size_read() to implement its algorithm/loop regardless of a possible change/grow of i_size. Here is a patch/fix attempt, at http://review.whamcloud.com/13707 , that only i_size_read() once in osd_read_prep().

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13707
            Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bb93b6ed2cb72e894ce1c41e0a1f98321ad5cb4b

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13707 Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: bb93b6ed2cb72e894ce1c41e0a1f98321ad5cb4b

            Hi Bruno,

            Not yet. I should be able to build lustre with your debug patch in the next few days. I'll keep you in touch.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment - Hi Bruno, Not yet. I should be able to build lustre with your debug patch in the next few days. I'll keep you in touch.

            Did you have some time to give a try to my debug patch ? From my side, it has passed auto-tests and does not introduce any regression nor cause flooding msgs ...

            OTOH, my reproducer attempts, based on our findings+thought is still unsuccessful ...

            bfaccini Bruno Faccini (Inactive) added a comment - Did you have some time to give a try to my debug patch ? From my side, it has passed auto-tests and does not introduce any regression nor cause flooding msgs ... OTOH, my reproducer attempts, based on our findings+thought is still unsuccessful ...

            Yes, you are right and thanks to add this since I have forgotten to mention it. And hopefully this is the way we can use now to find in-memory/crash inode for new occurrences of problem!

            BTW, as already discussed, I have also pushed a debug patch (http://review.whamcloud.com/13448) that may allow to better highlight the possible race during short-read condition which we strongly suspect to cause the assert. Let's wait for it to be exposed to our auto-tests and see if it is not too intrusive. I have submitted it on top of current b2_5 but it should apply easily in 2.5.3.

            I will also try to work in-house on a reproducer, based on our latest findings/thoughts during our joint on-site debug and analysis.

            bfaccini Bruno Faccini (Inactive) added a comment - Yes, you are right and thanks to add this since I have forgotten to mention it. And hopefully this is the way we can use now to find in-memory/crash inode for new occurrences of problem! BTW, as already discussed, I have also pushed a debug patch ( http://review.whamcloud.com/13448 ) that may allow to better highlight the possible race during short-read condition which we strongly suspect to cause the assert. Let's wait for it to be exposed to our auto-tests and see if it is not too intrusive. I have submitted it on top of current b2_5 but it should apply easily in 2.5.3. I will also try to work in-house on a reproducer, based on our latest findings/thoughts during our joint on-site debug and analysis.

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13448
            Subject: LU-5393 osd-ldiskfs: debug short-read vs possible race
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: f8805056590b6e0ea5cacf32179c0cf995811b3a

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13448 Subject: LU-5393 osd-ldiskfs: debug short-read vs possible race Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: f8805056590b6e0ea5cacf32179c0cf995811b3a

            I found an other way to have the inode pointer who do the LBUG
            On the niobuf_local structure the dentry field is finaly a pointer
            of osd_object structure and the field oo_inode in the osd_object
            is THE inode pointer.
            CF: lustre!osd_ldiskfs/osd_io.c: 472 lnb->dentry = (void *) obj;

            apercher Antoine Percher added a comment - I found an other way to have the inode pointer who do the LBUG On the niobuf_local structure the dentry field is finaly a pointer of osd_object structure and the field oo_inode in the osd_object is THE inode pointer. CF: lustre!osd_ldiskfs/osd_io.c: 472 lnb->dentry = (void *) obj;

            People

              bfaccini Bruno Faccini (Inactive)
              bruno.travouillon Bruno Travouillon (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: