[LU-5393] LBUG: (ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed Created: 22/Jul/14 Updated: 06/Dec/16 Resolved: 20/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Bruno Travouillon (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b |
| Environment: | RHEL 6 w/ patched kernel for Lustre |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 15009 |
| Description |
|
We are hitting this LBUG on one of our production systems recently updated to Lustre 2.4.3 (on 4th of June).

LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed:
LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) LBUG
Pid: 11020, comm: ll_ost_io03_084

Call Trace:
 [<ffffffffa07e6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa07e6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa04a0d44>] ost_brw_read+0x12d4/0x1340 [ost]
 [<ffffffff81282c09>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8105bf64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffffa0abbf0c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
 [<ffffffffa0abc068>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
 [<ffffffffa04a8038>] ost_handle+0x2ac8/0x48e0 [ost]
 [<ffffffffa0ac2c4b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0acb428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa07e75de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa07f8d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0ac2789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
 [<ffffffffa0acc7be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG

The assertion is here:

753 static int ost_brw_read(struct ptlrpc_request *req, struct obd_trans_info *oti)
754 {
...
879                 if (page_rc != local_nb[i].len) { /* short read */
880                         /* All subsequent pages should be 0 */
881                         while(++i < npages)
882                                 LASSERT(local_nb[i].rc == 0);
883                         break;
884                 }

I was able to get the content of local_nb from the crash dump.

crash> struct ost_thread_local_cache 0xffff88105da00000
struct ost_thread_local_cache {
local = {{
lnb_file_offset = 6010437632,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea0037309c10,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 4096
}, {
...
}, {
lnb_file_offset = 6010757120,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea00372ccf80,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 512 <======== local_nb[i].rc != local_nb[i].len /* short read */
}, {
lnb_file_offset = 6010761216,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea0037176e98,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 4096
}, {
lnb_file_offset = 1411710976,
lnb_page_offset = 0,
len = 4096,
flags = 1120,
page = 0x0,
dentry = 0xffff8803690c2780,
lnb_grant_used = 0,
rc = 0
}, {
lnb_file_offset = 1411715072,
lnb_page_offset = 0,
len = 4096,
flags = 1120,
page = 0x0,
dentry = 0xffff8803690c2780,
lnb_grant_used = 0,
rc = 0
}, {
...
This LBUG occurred 5 times since 06/06/14. Each time, we have a short read followed by some non-empty pages (rc != 0). You will find attached some output for the first 3 occurrences. FYI, we have Lustre routers between the servers and clients as they are on a different LNET. Clients, routers and servers are running Lustre 2.4.3. |
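For reference, the failing check can be replayed in user space against the values dumped above. This is only an illustrative sketch (simplified struct, values copied from the crash output, everything else made up), not Lustre code:

/* replay of the ost_brw_read() short-read check against the dumped
 * local_nb[] values; the struct only mirrors the fields of interest */
#include <assert.h>
#include <stdio.h>

struct nb { unsigned long long file_offset; int len; int rc; };

int main(void)
{
        /* excerpt of the dumped local_nb[] around the short read */
        struct nb local_nb[] = {
                { 6010437632ULL, 4096, 4096 },
                { 6010757120ULL, 4096,  512 },  /* short read: rc != len */
                { 6010761216ULL, 4096, 4096 },  /* non-zero rc after the short read */
                { 1411710976ULL, 4096,    0 },
        };
        int npages = sizeof(local_nb) / sizeof(local_nb[0]);
        int i;

        for (i = 0; i < npages; i++) {
                if (local_nb[i].rc != local_nb[i].len) {        /* short read */
                        /* all subsequent pages should be 0 */
                        while (++i < npages)
                                assert(local_nb[i].rc == 0);    /* aborts at rc == 4096,
                                                                 * like the LBUG */
                        break;
                }
        }
        printf("check passed\n");       /* never reached with the values above */
        return 0;
}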
| Comments |
| Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ] |
|
Bruno, |
| Comment by Andreas Dilger [ 22/Jul/14 ] |
|
Is it possible to find the OST object ID for this file and then run debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/$ostdev for the affected object?

ldiskfs itself isn't able to allocate blocks smaller than 4096 bytes, unless you have formatted the OST by hand with a smaller blocksize? Even then, the minimum ext4/ldiskfs blocksize is 1024 bytes, so it isn't at all clear to me why or how a read could complete with only 512 bytes unless there is some kind of error on the media.

Once you have extracted the block mapping for the object using debugfs (please also include it in a comment here), the file offset 6010757120 bytes = logical block offset 1467470 in the file, and that can be mapped to a physical block offset on the disk. You might try dd if=/dev/$ostdev of=/tmp/badblock bs=4k skip=$phys_block count=1 to see if it is possible to read it directly from the disk.

As for fixing this - it is never good to have an LASSERT() that depends on data from the disk or network. In this case, we've had this same LASSERT in place for many, many years (it first appeared in Lustre 0.7rc1) and I don't recall ever seeing this problem before (it doesn't appear in bugzilla either). So it is curious that this is just appearing now. |
| Comment by Bruno Travouillon (Inactive) [ 23/Jul/14 ] |
|
Bruno,

The debug level is 0. Do you have any setting advice to better trace this issue?

Andreas,

The block size is 4096. It appears that I have been unclear in the description of the issue. We hit this LASSERT 5 times on 4 OSS:

Each time, it occurred on a different backend storage. Moreover, all our volumes are RAID10 protected with a pre-read redundancy check of the parity. I can't see anything in the logs of the disk arrays at the time of the crash.

However, I was able to check the objects on the relevant OSTs, but all files have been modified after the crash. I was able to successfully read the faulty blocks with dd. I then checked the content of the 4k blocks with xxd and I can say that there is data everywhere in the block. If you really need the block mapping, please let me know. The customer is a black site and it can take some days to get data available.

Otherwise, we can see in the code that lnb[i].rc is the result of i_size_read(inode) - lnb[i].lnb_file_offset in case of a short read (see code below).

876 static int osd_read_prep() {
...
898         cfs_gettimeofday(&start);
899         for (i = 0; i < npages; i++) {
900
901                 if (i_size_read(inode) <= lnb[i].lnb_file_offset)
902                         /* If there's no more data, abort early.
903                          * lnb->rc == 0, so it's easy to detect later. */
904                         break;
905
906                 if (i_size_read(inode) <
907                     lnb[i].lnb_file_offset + lnb[i].len - 1)
908                         lnb[i].rc = i_size_read(inode) - lnb[i].lnb_file_offset;
909                 else
910                         lnb[i].rc = lnb[i].len;
911                 m += lnb[i].len;
912
913                 lprocfs_counter_add(osd->od_stats, LPROC_OSD_CACHE_ACCESS, 1);
914                 if (PageUptodate(lnb[i].page)) {
915                         lprocfs_counter_add(osd->od_stats,
916                                             LPROC_OSD_CACHE_HIT, 1);
917                 } else {
918                         lprocfs_counter_add(osd->od_stats,
919                                             LPROC_OSD_CACHE_MISS, 1);
920                         osd_iobuf_add_page(iobuf, lnb[i].page);
921                 }
922                 if (cache == 0)
923                         generic_error_remove_page(inode->i_mapping,lnb[i].page);
924         }
925         cfs_gettimeofday(&end);

I'm wondering if adding a break at the end of the for loop in case of a short read (line 906) could avoid this LASSERT? All subsequent pages should be 0 here. Am I wrong? |
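A rough sketch of that suggestion, assuming the osd_read_prep() loop shown above (hypothetical and untested; the statistics/iobuf handling in the middle of the loop body is elided):

        for (i = 0; i < npages; i++) {
                int short_read = 0;

                if (i_size_read(inode) <= lnb[i].lnb_file_offset)
                        break;                  /* no more data, lnb->rc stays 0 */

                if (i_size_read(inode) <
                    lnb[i].lnb_file_offset + lnb[i].len - 1) {
                        lnb[i].rc = i_size_read(inode) - lnb[i].lnb_file_offset;
                        short_read = 1;         /* remember the short read */
                } else {
                        lnb[i].rc = lnb[i].len;
                }

                /* ... cache statistics and iobuf handling unchanged ... */

                if (short_read)
                        break;                  /* all remaining lnb[].rc stay 0 */
        }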
| Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ] |
|
Bruno, About the debug level for the next occurrence, "inode super ext2 neterror page warning error emerg vfstrace console" would be nice to have, at least. |
| Comment by Bruno Faccini (Inactive) [ 29/Aug/14 ] |
|
I have pushed a b2_4 patch to fix this as per my previous comment, at http://review.whamcloud.com/11633. |
| Comment by Bruno Travouillon (Inactive) [ 03/Sep/14 ] |
|
We had 4 more occurrences during August. I should be able to provide the debug logs in the next few days. |
| Comment by Bruno Travouillon (Inactive) [ 07/Oct/14 ] |
|
You will find attached the debug logs from 3 crashes. For each, it seems that a read process starts while a write process is still running on the inode. |
| Comment by Bruno Faccini (Inactive) [ 31/Oct/14 ] |
|
Hello Bruno, |
| Comment by Bruno Travouillon (Inactive) [ 31/Oct/14 ] |
|
Hi Bruno, Yes it is. |
| Comment by Bruno Faccini (Inactive) [ 27/Nov/14 ] |
|
Hello Bruno, |
| Comment by Bruno Faccini (Inactive) [ 28/Nov/14 ] |
|
As a short-cut method to find the inode in memory, one could also verify that all previous lnb[0..(i-1)].page are linked to the same mapping/address_space, which will hopefully belong to the same ldiskfs inode. Also, I forgot to ask whether the unexpected non-null lnb[i].rc value is always 512 in all crash-dumps? |
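In code, that check would look roughly like the sketch below (hypothetical; in a crash dump the same walk can be done by hand with "struct page <addr>" and "struct address_space <addr>"):

        /* all pages already filled in local_nb[] should share one
         * address_space, whose ->host is the inode being read */
        struct address_space *mapping = local_nb[0].page->mapping;
        struct inode *inode = mapping ? mapping->host : NULL;
        int j;

        for (j = 1; j < i; j++)
                LASSERT(local_nb[j].page != NULL &&
                        local_nb[j].page->mapping == mapping);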
| Comment by Bruno Travouillon (Inactive) [ 18/Dec/14 ] |
|
We hit the same LBUG today, on the same filesystem, but with 2.5.3 this time. I can confirm that the debug mask was set as requested for the debug logs already provided. The lnb[i].rc is not always 512. We had 396 in a previous crash. This time, rc = 1152. I should be able to look deeper tomorrow. |
| Comment by Bruno Faccini (Inactive) [ 19/Dec/14 ] |
|
Bruno, Can you check whether you can resolve/find the inode concerned in crash-dump memory using the method I have described above? |
| Comment by Bruno Travouillon (Inactive) [ 19/Dec/14 ] |
|
I tried to, but I was not able to find the inode in memory. The customer site will be closed for the next two weeks. I will give it another try with Antoine next year.

> I also wonder how you identified the inode you point to at the beginning of each log?

We were able to find the OST and the oid of the inode in the stack. We then used debugfs on the OST to stat O/0/d(oid%32)/oid to get the inode. |
| Comment by Bruno Faccini (Inactive) [ 13/Jan/15 ] |
|
After the on-site visit, it appears that the method I thought would allow easy retrieval of the inode from a crash dump is irrelevant, because the OSTs are configured with read_cache disabled and thus none of the pages concerned by the IO/niobufs are cached or mapped to the inode. |
| Comment by Antoine Percher [ 16/Jan/15 ] |
|
I found another way to get the pointer of the inode that triggered the LBUG. |
| Comment by Gerrit Updater [ 18/Jan/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13448 |
| Comment by Bruno Faccini (Inactive) [ 18/Jan/15 ] |
|
Yes, you are right, and thanks for adding this since I had forgotten to mention it. Hopefully this is the way we can now use to find the in-memory inode in a crash dump for new occurrences of the problem! BTW, as already discussed, I have also pushed a debug patch (http://review.whamcloud.com/13448) that may allow us to better highlight the possible race during the short-read condition which we strongly suspect to cause the assert. Let's wait for it to be exposed to our auto-tests and see if it is not too intrusive. I have submitted it on top of current b2_5 but it should apply easily to 2.5.3. I will also try to work in-house on a reproducer, based on our latest findings/thoughts during our joint on-site debug and analysis. |
| Comment by Bruno Faccini (Inactive) [ 27/Jan/15 ] |
|
Did you have some time to give my debug patch a try? On my side, it has passed auto-tests and does not introduce any regression nor cause flooding messages... OTOH, my reproducer attempts, based on our findings and thoughts, are still unsuccessful... |
| Comment by Bruno Travouillon (Inactive) [ 29/Jan/15 ] |
|
Hi Bruno, Not yet. I should be able to build Lustre with your debug patch in the next few days. I'll keep you posted. |
| Comment by Gerrit Updater [ 10/Feb/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13707 |
| Comment by Bruno Faccini (Inactive) [ 10/Feb/15 ] |
|
After having a look at the different crash dumps for this problem, we have found that, according to the code of the ldiskfs osd_[read,write]_prep() routines and their callees, this can easily occur during a "legal" read vs. write race (both paths use ofd_read_lock() protection), where the writer grows the object's i_size beyond the position the reader has already reached. osd_read_prep() uses multiple calls to i_size_read() to implement its loop, regardless of a possible concurrent growth of i_size (nominally 4 calls, though disassembly shows only 3 are effective, since the "i_size_read(inode) - lnb[i].lnb_file_offset" computation has been optimized to re-use the value; BTW this explains why we have never found high local_nb[i].rc values in the crash dumps). Here is a patch/fix attempt, at http://review.whamcloud.com/13707, that calls i_size_read() only once in osd_read_prep(). |
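A small user-space simulation of that race, for illustration only: the page offsets are those of the short-read object in the crash dump, while the old and grown i_size values are assumptions consistent with the dumped rc values; the i_size_read() stub and all other names are made up. Because i_size is sampled again on every iteration, a concurrent grow of the file produces exactly the short read followed by a full page seen in the dumps; sampling i_size once before the loop, as the patch does, cannot:

#include <stdio.h>

#define NPAGES  3
#define PAGE_SZ 4096

/* pretend a writer extends the file between iterations i == 0 and i == 1 */
static long long i_size_read(int iteration)
{
        return iteration < 1 ? 6010757632LL     /* old size: 512 bytes into the first page */
                             : 6010765312LL;    /* grown size: covers the second page fully */
}

int main(void)
{
        long long offset = 6010757120LL;        /* lnb_file_offset of the short-read page */
        int rc[NPAGES] = { 0 };
        int i;

        /* racy variant: i_size sampled again on every iteration, as before the fix */
        for (i = 0; i < NPAGES; i++) {
                long long size = i_size_read(i);
                long long off  = offset + (long long)i * PAGE_SZ;

                if (size <= off)
                        break;                  /* rc stays 0 */
                if (size < off + PAGE_SZ - 1)
                        rc[i] = size - off;     /* short read */
                else
                        rc[i] = PAGE_SZ;
        }

        for (i = 0; i < NPAGES; i++)
                printf("rc[%d] = %d\n", i, rc[i]);  /* prints 512, 4096, 0 as in the dump */
        return 0;
}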
| Comment by Bruno Faccini (Inactive) [ 25/Feb/15 ] |
|
http://review.whamcloud.com/13707 definitely appears to be the right way to fix, so I have abandoned http://review.whamcloud.com/11633. |
| Comment by Gerrit Updater [ 03/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13707/ |
| Comment by Peter Jones [ 20/May/15 ] |
|
Landed for 2.8 |