[LU-5393] LBUG: (ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed Created: 22/Jul/14  Updated: 06/Dec/16  Resolved: 20/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Bruno Travouillon (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: p4b
Environment:

RHEL 6 w/ patched kernel for Lustre


Attachments: File 1407594033.bf26.debug.log     File 1407598675.bf27.debug.log     File 1407800513.bf33.debug.log     File MTS1351.tgz    
Issue Links:
Duplicate
is duplicated by LU-7322 (ost_handler.c:896:ost_brw_read()) AS... Resolved
Severity: 3
Rank (Obsolete): 15009

 Description   

We are hitting this LBUG on one of our production systems, which was recently updated to Lustre 2.4.3 (on June 4th).

LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed:
LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) LBUG
Pid: 11020, comm: ll_ost_io03_084

Call Trace:
 [<ffffffffa07e6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa07e6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa04a0d44>] ost_brw_read+0x12d4/0x1340 [ost]
 [<ffffffff81282c09>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8105bf64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffffa0abbf0c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
 [<ffffffffa0abc068>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
 [<ffffffffa04a8038>] ost_handle+0x2ac8/0x48e0 [ost]
 [<ffffffffa0ac2c4b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0acb428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa07e75de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa07f8d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0ac2789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
 [<ffffffffa0acc7be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG

The assertion is here:

 753 static int ost_brw_read(struct ptlrpc_request *req, struct obd_trans_info *oti)
 754 {
...
 879                 if (page_rc != local_nb[i].len) { /* short read */
 880                         /* All subsequent pages should be 0 */
 881                         while(++i < npages)
 882                                 LASSERT(local_nb[i].rc == 0);
 883                         break;
 884                 }

I was able to get the content of local_nb from the crash dump.

crash> struct ost_thread_local_cache 0xffff88105da00000
struct ost_thread_local_cache {
  local = {{
      lnb_file_offset = 6010437632,
      lnb_page_offset = 0,
      len = 4096,
      flags = 0,
      page = 0xffffea0037309c10,
      dentry = 0xffff8808048c9a80,
      lnb_grant_used = 0,
      rc = 4096
    }, {
...
    }, {
      lnb_file_offset = 6010757120,
      lnb_page_offset = 0,
      len = 4096,
      flags = 0,
      page = 0xffffea00372ccf80,
      dentry = 0xffff8808048c9a80,
      lnb_grant_used = 0,
      rc = 512   <======== local_nb[i].rc != local_nb[i].len /* short read */
    }, {
      lnb_file_offset = 6010761216,
      lnb_page_offset = 0,
      len = 4096,
      flags = 0,
      page = 0xffffea0037176e98,
      dentry = 0xffff8808048c9a80,
      lnb_grant_used = 0,
      rc = 4096
    }, {
      lnb_file_offset = 1411710976,
      lnb_page_offset = 0,
      len = 4096,
      flags = 1120,
      page = 0x0,
      dentry = 0xffff8803690c2780,
      lnb_grant_used = 0,
      rc = 0
    }, {
      lnb_file_offset = 1411715072,
      lnb_page_offset = 0,
      len = 4096,
      flags = 1120,
      page = 0x0,
      dentry = 0xffff8803690c2780,
      lnb_grant_used = 0,
      rc = 0
    }, {
...

This LBUG has occurred 5 times since 06/06/14. Each time, we see a short read followed by some non-empty pages (rc != 0). You will find attached some output for the first 3 occurrences.

FYI, we have Lustre routers between the servers and clients as they are on a different LNET. Clients, routers and servers are running Lustre 2.4.3.



 Comments   
Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ]

Bruno,
What was the debug level at the time of these crashes? And would it be possible for you to extract the debug-log content from one of the crashes and make it available?

Comment by Andreas Dilger [ 22/Jul/14 ]

Is it possible to find the OST object ID for this file and then run debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/$ostdev for the affected object? ldiskfs itself isn't able to allocate blocks smaller than 4096 bytes, unless you have formatted the OST by hand with a smaller blocksize? Even then, the minimum ext4/ldiskfs blocksize is 1024 bytes, so it isn't at all clear to me why or how a read could complete with only 512 bytes unless there is some kind of error on the media.

Once you have extracted the block mapping for the object using debugfs (please also include it in a comment here), the file offset 6010757120 bytes = logical block offset 1467470 in the file, and that can be mapped to a physical block offset on the disk. You might try dd if=/dev/$ostdev of=/tmp/badblock bs=4k skip=$phys_block count=1 to see if it is possible to read it directly from the disk.
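
As a quick sanity check of the arithmetic above, here is a tiny standalone sketch (the object id below is a made-up placeholder, and the 4096-byte blocksize is the one confirmed later in this ticket):

#include <stdio.h>

int main(void)
{
        unsigned long long file_offset = 6010757120ULL;  /* failing lnb_file_offset */
        unsigned long      blocksize   = 4096;           /* ldiskfs blocksize       */
        unsigned long long objid       = 1234567ULL;     /* hypothetical object id  */

        /* 6010757120 / 4096 = 1467470, the logical block quoted above */
        printf("logical block: %llu\n", file_offset / blocksize);

        /* object directory layout used by the debugfs command above */
        printf("debugfs path:  O/0/d%llu/%llu\n", objid % 32, objid);
        return 0;
}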

As for fixing this - it is never good to have an LASSERT() that depends on data from the disk or network. In this case, we've had this same LASSERT in place for many, many years (it first appeared in Lustre 0.7rc1) and I don't recall ever seeing this problem before (it doesn't appear in bugzilla either). So it is curious that this is just appearing now.

Comment by Bruno Travouillon (Inactive) [ 23/Jul/14 ]

Bruno,

The debug level is 0. Do you have any advice on debug settings to better trace this issue?

Andreas,

The block size is 4096.

It appears that I was unclear in the description of the issue. We hit this LASSERT 5 times on 4 different OSS nodes:

  • crash1:
    • oss: gl68
    • date: 2014-06-06 21:01:22
    • ost: store0-OST0080
    • diskarray.lun: storage82.volume8
  • crash2:
    • oss: gl68
    • date: 2014-06-18 03:29:07
    • ost: store0-OST0085
    • diskarray.lun: storage83.volume9
  • crash3:
    • oss: gl41
    • date: 2014-07-11 10:33:24
    • ost: store0-OST0238
    • diskarray.lun: storage48.vol8
  • crash4:
    • oss: gl27
    • date: 2014-07-19 13:56:49
    • ost: store0-OST02d5
    • diskarray.lun: storage31.volume1
  • crash5:
    • oss: gl31
    • date: 2014-07-21 10:55:29
    • ost: store0-OST02b6
    • diskarray.lun: storage35.volume6

Each time, it occurred on a different backend storage array. Moreover, all our volumes are RAID10 protected with a pre-read redundancy check of the parity. I can't see anything in the disk array logs at the time of the crashes.

I was able to check the objects on the relevant OSTs, but all of the files had been modified after the crash. I was able to read the faulty blocks successfully with dd. I then checked the content of the 4k blocks with xxd, and I can say that there is data throughout the block.

If you really need the block mapping, please let me know. The customer site is a black site, so it can take a few days to make data available.

Otherwise, we can see in the code that lnb[i].rc is set to i_size_read(inode) - lnb[i].lnb_file_offset in the short-read case (see code below).

 876 static int osd_read_prep() {
...
 898         cfs_gettimeofday(&start);
 899         for (i = 0; i < npages; i++) {
 900
 901             if (i_size_read(inode) <= lnb[i].lnb_file_offset)
 902                         /* If there's no more data, abort early.
 903                          * lnb->rc == 0, so it's easy to detect later. */
 904                         break;
 905
 906             if (i_size_read(inode) <
 907                 lnb[i].lnb_file_offset + lnb[i].len - 1)
 908                     lnb[i].rc = i_size_read(inode) - lnb[i].lnb_file_offset;
 909             else
 910                     lnb[i].rc = lnb[i].len;
 911             m += lnb[i].len;
 912
 913             lprocfs_counter_add(osd->od_stats, LPROC_OSD_CACHE_ACCESS, 1);
 914             if (PageUptodate(lnb[i].page)) {
 915                     lprocfs_counter_add(osd->od_stats,
 916                                         LPROC_OSD_CACHE_HIT, 1);
 917             } else {
 918                     lprocfs_counter_add(osd->od_stats,
 919                                         LPROC_OSD_CACHE_MISS, 1);
 920                     osd_iobuf_add_page(iobuf, lnb[i].page);
 921             }
 922             if (cache == 0)
 923                     generic_error_remove_page(inode->i_mapping,lnb[i].page);
 924         }
 925         cfs_gettimeofday(&end);

I'm wondering whether adding a break at the end of the for loop in the short-read case (line 906) would avoid this LASSERT. All subsequent pages should have rc == 0 at that point.
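
For illustration, here is a minimal sketch of that suggestion applied to the osd_read_prep() loop quoted above (this only shows the idea under discussion, not the patch that was eventually landed):

        for (i = 0; i < npages; i++) {

                if (i_size_read(inode) <= lnb[i].lnb_file_offset)
                        /* no data at all for this page: abort early, rc stays 0 */
                        break;

                if (i_size_read(inode) <
                    lnb[i].lnb_file_offset + lnb[i].len - 1) {
                        /* short read: fill this last, partial page ... */
                        lnb[i].rc = i_size_read(inode) - lnb[i].lnb_file_offset;
                        /* ... and stop here so all remaining lnb[].rc stay 0 */
                        break;
                }
                lnb[i].rc = lnb[i].len;
                /* cache accounting as in the excerpt above */
        }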

Am I wrong?

Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ]

Bruno,
If I check the local_nb[] content from the crash dump you attached in the issue description, in the case of a short read we should hit the break in the next loop iteration at line 904. So maybe a fix would be to replace this break with setting "lnb[i].rc = 0;" and continuing to loop over the remaining pages?

About the debug level for the next occurrence, "inode super ext2 neterror page warning error emerg vfstrace console" would be nice to have, at a minimum.

Comment by Bruno Faccini (Inactive) [ 29/Aug/14 ]

I have pushed a b2_4 patch to fix this as per my previous comment, at http://review.whamcloud.com/11633.

Comment by Bruno Travouillon (Inactive) [ 03/Sep/14 ]

We had 4 more occurrences during August. I should be able to provide the debug logs in the next few days.

Comment by Bruno Travouillon (Inactive) [ 07/Oct/14 ]

You will find attached the debug logs from 3 crashes.

For each, it seems that a read process starts while a write process is still running on the inode.

Comment by Bruno Faccini (Inactive) [ 31/Oct/14 ]

Hello Bruno,
At the very beginning of each of the 3 debug logs, you added debugfs info about a specific inode; I wonder if it is the one you identified as being used by the process that triggered the LBUG?

Comment by Bruno Travouillon (Inactive) [ 31/Oct/14 ]

Hi Bruno,

Yes it is.

Comment by Bruno Faccini (Inactive) [ 27/Nov/14 ]

Hello Bruno,
Sorry to be late on this.
I am afraid that the latest 3 logs you provided do not help identify the problem's root cause.
BTW, I wonder if you were able to set the minimum "inode super ext2 neterror page warning error emerg vfstrace console" debug mask I had requested before?
I also wonder how you identified the inode you point to at the beginning of each log?
OTOH, I would like the inode in question to be extracted from a crash dump, so is the first one still available? Or maybe one of the latest?
This looks a bit tricky to do because the inode was retrieved in earlier routines/stages, starting from the obd_preprw()/ofd_preprw() call in ost_brw_read(), but no reference to it was kept. With luck we may retrieve it by browsing the whole LBUG thread stack and trying to resolve the stacked pointers/values as an ldiskfs inode, but if that is unsuccessful we will need to mimic the whole algorithm, including the hashing of the lu_object!

Comment by Bruno Faccini (Inactive) [ 28/Nov/14 ]

As a shortcut method to find the inode in memory, one could also verify that all previous lnb[0..(i-1)].page entries are linked to the same mapping/address_space, which will hopefully belong to the same ldiskfs inode.

Also, I forgot to ask whether the unexpected non-zero lnb[i].rc value is always 512 in all the crash dumps?

Comment by Bruno Travouillon (Inactive) [ 18/Dec/14 ]

We hit the same LBUG today, on the same filesystem, but with 2.5.3 this time. I can confirm that the debug mask was set as requested for the already provided debug logs.

The lnb[i].rc is not always 512. We had 396 in a previous crash. This time, rc = 1152.

I should be able to look deeper tomorrow.

Comment by Bruno Faccini (Inactive) [ 19/Dec/14 ]

Bruno, can you check whether you can resolve/find the inode in question in crash-dump memory using the method I described above?

Comment by Bruno Travouillon (Inactive) [ 19/Dec/14 ]

I tried to, but I was not able to find the inode in memory. The customer site will be closed for the next two weeks. I will give it another try with Antoine next year.

> I also wonder how you identified the inode you point at the beginning of each log ?

We were able to find the OST and the oid of the inode in the stack. We then used debugfs on the OST to stat O/0/d(oid%32)/oid to get the inode.

Comment by Bruno Faccini (Inactive) [ 13/Jan/15 ]

After the on-site visit, it appears that the method I thought would allow easy retrieval of the inode from the crash dump was irrelevant, because the OSTs are configured with read_cache disabled and thus none of the pages involved in the IO/niobufs are cached or mapped to the inode.
Thus, to find the inode in memory we had to do a mass search of its ino/number across all the inode lists, and we found it on the default backing_device dirty list. And BTW, the inode size is quite far from the current pages' range.
Since we still suspect a race to be the cause of the problem, I am currently working on a debug patch to help investigate more closely.

Comment by Antoine Percher [ 16/Jan/15 ]

I found another way to get the pointer of the inode involved in the LBUG.
In the niobuf_local structure, the dentry field is actually a pointer to an osd_object structure, and the oo_inode field in that osd_object is THE inode pointer.
Cf. lustre/osd-ldiskfs/osd_io.c:472: lnb->dentry = (void *) obj;
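
As a self-contained sketch of that trick (the structure definitions below are simplified stand-ins, not the real Lustre headers):

#include <stdio.h>

/* Simplified stand-ins for the real Lustre structures; only the fields
 * needed to illustrate the trick described above are shown. */
struct inode { unsigned long i_ino; };
struct osd_object { struct inode *oo_inode; };
struct niobuf_local { void *dentry; /* really an osd_object *, see osd_io.c:472 */ };

/* Recover the inode a niobuf_local refers to, as described in the comment. */
static struct inode *lnb_to_inode(const struct niobuf_local *lnb)
{
        struct osd_object *obj = (struct osd_object *)lnb->dentry;

        return obj->oo_inode;
}

int main(void)
{
        struct inode        ino = { .i_ino = 42 };       /* dummy inode        */
        struct osd_object   obj = { .oo_inode = &ino };  /* dummy osd object   */
        struct niobuf_local lnb = { .dentry = &obj };    /* as osd_io.c stores */

        printf("ino = %lu\n", lnb_to_inode(&lnb)->i_ino);
        return 0;
}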

Comment by Gerrit Updater [ 18/Jan/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13448
Subject: LU-5393 osd-ldiskfs: debug short-read vs possible race
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: f8805056590b6e0ea5cacf32179c0cf995811b3a

Comment by Bruno Faccini (Inactive) [ 18/Jan/15 ]

Yes, you are right, and thanks for adding this, since I forgot to mention it. Hopefully this is the way we can now use to find the in-memory inode in a crash dump for new occurrences of the problem!

BTW, as already discussed, I have also pushed a debug patch (http://review.whamcloud.com/13448) that may help better highlight the possible race during the short-read condition, which we strongly suspect causes the assert. Let's wait for it to go through our auto-tests and see whether it is too intrusive. I submitted it on top of the current b2_5, but it should apply easily to 2.5.3.

I will also try to work in-house on a reproducer, based on our latest findings and thoughts from our joint on-site debugging and analysis.

Comment by Bruno Faccini (Inactive) [ 27/Jan/15 ]

Did you have some time to give my debug patch a try? On my side, it has passed auto-tests and does not introduce any regressions nor cause message flooding ...

OTOH, my reproducer attempts, based on our findings and thoughts, are still unsuccessful ...

Comment by Bruno Travouillon (Inactive) [ 29/Jan/15 ]

Hi Bruno,

Not yet. I should be able to build Lustre with your debug patch in the next few days. I'll keep you posted.

Comment by Gerrit Updater [ 10/Feb/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13707
Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb93b6ed2cb72e894ce1c41e0a1f98321ad5cb4b

Comment by Bruno Faccini (Inactive) [ 10/Feb/15 ]

After taking a look at the different crash dumps for this problem, we found that:
  • the inode's i_size value at the time of the crash dump is, each time, much higher than the offset where the short read occurred;
  • the length read at the short-read position is very frequently 512, but not always;
  • the number of pages read in addition is frequently one, but not always; there is a case where the short read occurred in the first page mapped by local_nb[] and then hundreds of pages were read ...

According to the code of the ldiskfs osd_[read,write]_prep() routines and their callees, this can easily occur during a "legal" read vs. write race (both paths use ofd_read_lock() protection), where the writer grows the object's i_size beyond the position the reader has reached. This is because osd_read_prep() calls i_size_read() multiple times (currently 4 in the source, although disassembly shows only 3 are effective, since the "i_size_read(inode) - lnb[i].lnb_file_offset" value has been optimized to be reused; BTW, this explains why we have never found high local_nb[i].rc values in the crash dumps ...) to implement its loop, regardless of a possible concurrent growth of i_size.

Here is a patch/fix attempt, at http://review.whamcloud.com/13707, that calls i_size_read() only once in osd_read_prep().
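
For illustration, a rough sketch of that idea applied to the osd_read_prep() loop quoted earlier in this ticket (only an illustration of the approach; the authoritative change is the Gerrit patch above):

        /* sample i_size exactly once, so a concurrent writer growing the
         * file cannot change the value between the "past EOF" test and the
         * short-read computation below */
        loff_t isize = i_size_read(inode);

        for (i = 0; i < npages; i++) {

                if (isize <= lnb[i].lnb_file_offset)
                        /* no more data: abort early, lnb[i].rc stays 0 */
                        break;

                if (isize < lnb[i].lnb_file_offset + lnb[i].len - 1)
                        lnb[i].rc = isize - lnb[i].lnb_file_offset; /* short read */
                else
                        lnb[i].rc = lnb[i].len;

                /* cache accounting as in the excerpt quoted earlier */
        }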

Comment by Bruno Faccini (Inactive) [ 25/Feb/15 ]

http://review.whamcloud.com/13707 definitely appears to be the right way to fix this, so I have abandoned http://review.whamcloud.com/11633.

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13707/
Subject: LU-5393 osd-ldiskfs: read i_size once to protect against race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2267fc79d94dea3778cb4600027698e0155f1af3

Comment by Peter Jones [ 20/May/15 ]

Landed for 2.8
