<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:40:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4192] NFS server crash while fsx runs on clients</title>
                <link>https://jira.whamcloud.com/browse/LU-4192</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running Lustre 2.5.0-RC1 with one NFS server exporting the Lustre file system to four clients, each running an instance of fsx: &quot;./fsx&quot;, &quot;./fsx&quot;, &quot;./fsx -c 3 -l 8000000 -r 75 -w 4096 /misc/export/nfs_x_4&quot;, and &quot;./fsx -c 5 -l 500000 -o 2048&quot;. The export options are rw,async,insecure. After a few hours of running fsx, the NFS server crashed.&lt;/p&gt;

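&lt;p&gt;For reference, a rough sketch of this setup. The export options, the fsx command lines, the server name c06-ib.lab.opensfs.org, and the /misc/export path are taken from this report; the /etc/exports host specification, the server-side export path, and the pairing of fsx instances with the nfs_x_* files are assumptions for illustration only:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# On the NFS server (c06): re-export the mounted Lustre file system.
# /etc/exports -- the exported path and host spec below are assumed.
/misc/export  *(rw,async,insecure)

# On each NFS client: mount the export, then start one fsx instance.
mount -t nfs c06-ib.lab.opensfs.org:/misc/export /misc/export
./fsx -c 3 -l 8000000 -r 75 -w 4096 /misc/export/nfs_x_4   # one client
./fsx -c 5 -l 500000 -o 2048 /misc/export/nfs_x_2          # another client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
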
&lt;p&gt;From the console of the NFS server (c06):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Message from syslogd@c06 at Oct 30 14:52:40 ...
 kernel:LustreError: 6653:0:(rw.c:695:ll_read_ahead_pages()) ASSERTION( page_idx &amp;gt; ria-&amp;gt;ria_stoff ) failed: Invalid page_idx 1025rs 1025 re 1534 ro 1279 rl 256 rp 1

Message from syslogd@c06 at Oct 30 14:52:40 ...
 kernel:LustreError: 6653:0:(rw.c:695:ll_read_ahead_pages()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From vmcore-dmesg.txt on the NFS server (c06):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;0&amp;gt;LustreError: 6653:0:(rw.c:695:ll_read_ahead_pages()) ASSERTION( page_idx &amp;gt; ri
a-&amp;gt;ria_stoff ) failed: Invalid page_idx 1025rs 1025 re 1534 ro 1279 rl 256 rp 1
&amp;lt;0&amp;gt;LustreError: 6653:0:(rw.c:695:ll_read_ahead_pages()) LBUG
&amp;lt;4&amp;gt;Pid: 6653, comm: nfsd
&amp;lt;4&amp;gt;
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa044f895&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa044fe97&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a607d6&amp;gt;] ll_readahead+0x10c6/0x10f0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8127d847&amp;gt;] ? radix_tree_insert+0x1c7/0x220
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a8cd75&amp;gt;] vvp_io_read_page+0x305/0x340 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d5c7d&amp;gt;] cl_io_read_page+0x8d/0x170 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05c9a77&amp;gt;] ? cl_page_assume+0xf7/0x220 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a5f176&amp;gt;] ll_readpage+0x96/0x1a0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811b0cf8&amp;gt;] __generic_file_splice_read+0x3a8/0x560
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81168712&amp;gt;] ? kmem_cache_alloc+0x182/0x190
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0464e47&amp;gt;] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0464e47&amp;gt;] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05cd6db&amp;gt;] ? cl_lock_fits_into+0x6b/0x190 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09b38d9&amp;gt;] ? lov_lock_fits_into+0x409/0x560 [lov]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811aeb90&amp;gt;] ? spd_release_page+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811b0efa&amp;gt;] generic_file_splice_read+0x4a/0x90
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a8e905&amp;gt;] vvp_io_read_start+0x3c5/0x470 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d37ea&amp;gt;] cl_io_start+0x6a/0x140 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d7ef4&amp;gt;] cl_io_loop+0xb4/0x1b0 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a30f9f&amp;gt;] ll_file_io_generic+0x33f/0x610 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a319c0&amp;gt;] ll_file_splice_read+0xb0/0x1d0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811af15b&amp;gt;] do_splice_to+0x6b/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811af45f&amp;gt;] splice_direct_to_actor+0xaf/0x1c0
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03aa3b0&amp;gt;] ? nfsd_direct_splice_actor+0x0/0x20 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03aae70&amp;gt;] nfsd_vfs_read+0x1a0/0x1c0 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03ac4a0&amp;gt;] nfsd_read_file+0x90/0xb0 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03c00ef&amp;gt;] nfsd4_encode_read+0x13f/0x240 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03c5dd4&amp;gt;] ? nfs4_preprocess_stateid_op+0x284/0x310 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03b9d35&amp;gt;] nfsd4_encode_operation+0x75/0x180 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03b7d35&amp;gt;] nfsd4_proc_compound+0x195/0x490 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a543e&amp;gt;] nfsd_dispatch+0xfe/0x240 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa02d0614&amp;gt;] svc_process_common+0x344/0x640 [sunrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81063410&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa02d0c50&amp;gt;] svc_process+0x110/0x160 [sunrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a5b62&amp;gt;] nfsd+0xc2/0x160 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a5aa0&amp;gt;] ? nfsd+0x0/0x160 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81096a36&amp;gt;] kthread+0x96/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff810969a0&amp;gt;] ? kthread+0x0/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;
&amp;lt;0&amp;gt;Kernel panic - not syncing: LBUG
&amp;lt;4&amp;gt;Pid: 6653, comm: nfsd Not tainted 2.6.32-358.18.1.el6.x86_64 #1
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8150da18&amp;gt;] ? panic+0xa7/0x16f
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa044feeb&amp;gt;] ? lbug_with_loc+0x9b/0xb0 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a607d6&amp;gt;] ? ll_readahead+0x10c6/0x10f0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8127d847&amp;gt;] ? radix_tree_insert+0x1c7/0x220
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a8cd75&amp;gt;] ? vvp_io_read_page+0x305/0x340 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d5c7d&amp;gt;] ? cl_io_read_page+0x8d/0x170 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05c9a77&amp;gt;] ? cl_page_assume+0xf7/0x220 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a5f176&amp;gt;] ? ll_readpage+0x96/0x1a0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811b0cf8&amp;gt;] ? __generic_file_splice_read+0x3a8/0x560
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81168712&amp;gt;] ? kmem_cache_alloc+0x182/0x190
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0464e47&amp;gt;] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0464e47&amp;gt;] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05cd6db&amp;gt;] ? cl_lock_fits_into+0x6b/0x190 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09b38d9&amp;gt;] ? lov_lock_fits_into+0x409/0x560 [lov]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811aeb90&amp;gt;] ? spd_release_page+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811b0efa&amp;gt;] ? generic_file_splice_read+0x4a/0x90
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a8e905&amp;gt;] ? vvp_io_read_start+0x3c5/0x470 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d37ea&amp;gt;] ? cl_io_start+0x6a/0x140 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa05d7ef4&amp;gt;] ? cl_io_loop+0xb4/0x1b0 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a30f9f&amp;gt;] ? ll_file_io_generic+0x33f/0x610 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0a319c0&amp;gt;] ? ll_file_splice_read+0xb0/0x1d0 [lustre]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811af15b&amp;gt;] ? do_splice_to+0x6b/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff811af45f&amp;gt;] ? splice_direct_to_actor+0xaf/0x1c0
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03aa3b0&amp;gt;] ? nfsd_direct_splice_actor+0x0/0x20 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03aae70&amp;gt;] ? nfsd_vfs_read+0x1a0/0x1c0 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03ac4a0&amp;gt;] ? nfsd_read_file+0x90/0xb0 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03c00ef&amp;gt;] ? nfsd4_encode_read+0x13f/0x240 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03c5dd4&amp;gt;] ? nfs4_preprocess_stateid_op+0x284/0x310 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03b9d35&amp;gt;] ? nfsd4_encode_operation+0x75/0x180 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03b7d35&amp;gt;] ? nfsd4_proc_compound+0x195/0x490 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a543e&amp;gt;] ? nfsd_dispatch+0xfe/0x240 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa02d0614&amp;gt;] ? svc_process_common+0x344/0x640 [sunrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81063410&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa02d0c50&amp;gt;] ? svc_process+0x110/0x160 [sunrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a5b62&amp;gt;] ? nfsd+0xc2/0x160 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa03a5aa0&amp;gt;] ? nfsd+0x0/0x160 [nfsd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81096a36&amp;gt;] ? kthread+0x96/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c0ca&amp;gt;] ? child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff810969a0&amp;gt;] ? kthread+0x0/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From dmesg on the MDS/MGS (c04):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: MGS: haven&apos;t heard from client 69d18973-27a2-0441-1720-b772711670c2 (at 192.168.2.106@o2ib) in 236 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880664582400, cur 1383170178 expire 1383170028 last 1383169942
Lustre: scratch-MDT0000: haven&apos;t heard from client 63f330f5-0403-f077-ec5e-2a25cace70a6 (at 192.168.2.106@o2ib) in 239 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88082e2c9c00, cur 1383170203 expire 1383170053 last 138316996
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;dmesg on one of the clients (c03) running fsx:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
nfs: server c06-ib.lab.opensfs.org not responding, still trying
nfs: server c06-ib.lab.opensfs.org OK
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;dmesg on another client (c11) running fsx:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
INFO: task fsx:7077 blocked for more than 120 seconds.
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
fsx           D 0000000000000002     0  7077   6831 0x00000080
 ffff880822b21c28 0000000000000082 ffff880822b21ba8 ffffffffa02cbcd0
 ffff880823ec7c00 ffff880822b21bd8 ffff880823ec7cb0 ffff880823161408
 ffff88082482a638 ffff880822b21fd8 000000000000fb88 ffff88082482a638
Call Trace:
 [&amp;lt;ffffffffa02cbcd0&amp;gt;] ? rpc_execute+0x50/0xa0 [sunrpc]
 [&amp;lt;ffffffff810a2431&amp;gt;] ? ktime_get_ts+0xb1/0xf0
 [&amp;lt;ffffffff81119e10&amp;gt;] ? sync_page+0x0/0x50
 [&amp;lt;ffffffff8150e8c3&amp;gt;] io_schedule+0x73/0xc0
 [&amp;lt;ffffffff81119e4d&amp;gt;] sync_page+0x3d/0x50
 [&amp;lt;ffffffff8150f27f&amp;gt;] __wait_on_bit+0x5f/0x90
 [&amp;lt;ffffffff8111a083&amp;gt;] wait_on_page_bit+0x73/0x80
 [&amp;lt;ffffffff81096de0&amp;gt;] ? wake_bit_function+0x0/0x50
 [&amp;lt;ffffffff8112f0e5&amp;gt;] ? pagevec_lookup_tag+0x25/0x40
 [&amp;lt;ffffffff8111a4ab&amp;gt;] wait_on_page_writeback_range+0xfb/0x190
 [&amp;lt;ffffffff8111a56f&amp;gt;] filemap_fdatawait+0x2f/0x40
 [&amp;lt;ffffffff8111ab94&amp;gt;] filemap_write_and_wait+0x44/0x60
 [&amp;lt;ffffffffa032ddfd&amp;gt;] nfs_getattr+0x10d/0x120 [nfs]
 [&amp;lt;ffffffff81186d21&amp;gt;] vfs_getattr+0x51/0x80
 [&amp;lt;ffffffff81186fdf&amp;gt;] vfs_fstat+0x3f/0x60
 [&amp;lt;ffffffff81187024&amp;gt;] sys_newfstat+0x24/0x40
 [&amp;lt;ffffffff810dc937&amp;gt;] ? audit_syscall_entry+0x1d7/0x200
 [&amp;lt;ffffffff810dc685&amp;gt;] ? __audit_syscall_exit+0x265/0x290
 [&amp;lt;ffffffff8100b072&amp;gt;] system_call_fastpath+0x16/0x1b
nfs: server c06-ib.lab.opensfs.org not responding, still trying
INFO: task fsx:7077 blocked for more than 120 seconds.
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
fsx           D 0000000000000002     0  7077   6831 0x00000080
 ffff880822b21c28 0000000000000082 ffff880822b21ba8 ffffffffa02cbcd0
 ffff880823ec7c00 ffff880822b21bd8 ffff880823ec7cb0 ffff880823161408
 ffff88082482a638 ffff880822b21fd8 000000000000fb88 ffff88082482a638
Call Trace:
 [&amp;lt;ffffffffa02cbcd0&amp;gt;] ? rpc_execute+0x50/0xa0 [sunrpc]
 [&amp;lt;ffffffff810a2431&amp;gt;] ? ktime_get_ts+0xb1/0xf0
 [&amp;lt;ffffffff81119e10&amp;gt;] ? sync_page+0x0/0x50
 [&amp;lt;ffffffff8150e8c3&amp;gt;] io_schedule+0x73/0xc0
 [&amp;lt;ffffffff81119e4d&amp;gt;] sync_page+0x3d/0x50
 [&amp;lt;ffffffff8150f27f&amp;gt;] __wait_on_bit+0x5f/0x90
 [&amp;lt;ffffffff8111a083&amp;gt;] wait_on_page_bit+0x73/0x80
 [&amp;lt;ffffffff81096de0&amp;gt;] ? wake_bit_function+0x0/0x50
 [&amp;lt;ffffffff8112f0e5&amp;gt;] ? pagevec_lookup_tag+0x25/0x40
 [&amp;lt;ffffffff8111a4ab&amp;gt;] wait_on_page_writeback_range+0xfb/0x190
 [&amp;lt;ffffffff8111a56f&amp;gt;] filemap_fdatawait+0x2f/0x40
 [&amp;lt;ffffffff8111ab94&amp;gt;] filemap_write_and_wait+0x44/0x60
 [&amp;lt;ffffffffa032ddfd&amp;gt;] nfs_getattr+0x10d/0x120 [nfs]
 [&amp;lt;ffffffff81186d21&amp;gt;] vfs_getattr+0x51/0x80
 [&amp;lt;ffffffff81186fdf&amp;gt;] vfs_fstat+0x3f/0x60
 [&amp;lt;ffffffff81187024&amp;gt;] sys_newfstat+0x24/0x40
 [&amp;lt;ffffffff810dc937&amp;gt;] ? audit_syscall_entry+0x1d7/0x200
 [&amp;lt;ffffffff810dc685&amp;gt;] ? __audit_syscall_exit+0x265/0x290
 [&amp;lt;ffffffff8100b072&amp;gt;] system_call_fastpath+0x16/0x1b
INFO: task fsx:7077 blocked for more than 120 seconds.
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
fsx           D 0000000000000002     0  7077   6831 0x00000080
 ffff880822b21c28 0000000000000082 ffff880822b21ba8 ffffffffa02cbcd0
 ffff880823ec7c00 ffff880822b21bd8 ffff880823ec7cb0 ffff880823161408
 ffff88082482a638 ffff880822b21fd8 000000000000fb88 ffff88082482a638
Call Trace:
 [&amp;lt;ffffffffa02cbcd0&amp;gt;] ? rpc_execute+0x50/0xa0 [sunrpc]
 [&amp;lt;ffffffff810a2431&amp;gt;] ? ktime_get_ts+0xb1/0xf0
 [&amp;lt;ffffffff81119e10&amp;gt;] ? sync_page+0x0/0x50
 [&amp;lt;ffffffff8150e8c3&amp;gt;] io_schedule+0x73/0xc0
 [&amp;lt;ffffffff81119e4d&amp;gt;] sync_page+0x3d/0x50
 [&amp;lt;ffffffff8150f27f&amp;gt;] __wait_on_bit+0x5f/0x90
 [&amp;lt;ffffffff8111a083&amp;gt;] wait_on_page_bit+0x73/0x80
 [&amp;lt;ffffffff81096de0&amp;gt;] ? wake_bit_function+0x0/0x50
 [&amp;lt;ffffffff8112f0e5&amp;gt;] ? pagevec_lookup_tag+0x25/0x40
 [&amp;lt;ffffffff8111a4ab&amp;gt;] wait_on_page_writeback_range+0xfb/0x190
 [&amp;lt;ffffffff8111a56f&amp;gt;] filemap_fdatawait+0x2f/0x40
 [&amp;lt;ffffffff8111ab94&amp;gt;] filemap_write_and_wait+0x44/0x60
 [&amp;lt;ffffffffa032ddfd&amp;gt;] nfs_getattr+0x10d/0x120 [nfs]
 [&amp;lt;ffffffff81186d21&amp;gt;] vfs_getattr+0x51/0x80
 [&amp;lt;ffffffff81186fdf&amp;gt;] vfs_fstat+0x3f/0x60
 [&amp;lt;ffffffff81187024&amp;gt;] sys_newfstat+0x24/0x40
 [&amp;lt;ffffffff810dc937&amp;gt;] ? audit_syscall_entry+0x1d7/0x200
 [&amp;lt;ffffffff810dc685&amp;gt;] ? __audit_syscall_exit+0x265/0x290
 [&amp;lt;ffffffff8100b072&amp;gt;] system_call_fastpath+0x16/0x1b
nfs: server c06-ib.lab.opensfs.org OK
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The two clients above that were running fsx also printed fsx output on their consoles.&lt;/p&gt;

&lt;p&gt;From c11:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;21166179: 1383169965.726634 TRUNCATE DOWN       from 0x3f79b to 0x1faa0
21166180: 1383169965.727509 WRITE    0xf5 thru 0xe785 (0xe691 bytes)
21166181: 1383169965.728551 READ     0xcc40 thru 0x1048d (0x384e bytes)
21166182: 1383169965.728579 TRUNCATE UP from 0x1faa0 to 0x3d30f
21166183: 1383169965.729270 WRITE    0x1efa1 thru 0x1ff22 (0xf82 bytes)
21166184: 1383169965.729569 MAPWRITE 0x24b11 thru 0x3137f (0xc86f bytes)
21166185: 1383169965.732139 WRITE    0x3d715 thru 0x3ffff (0x28eb bytes) HOLE
21166186: 1383169965.732569 READ     0x36ae3 thru 0x3bebc (0x53da bytes)
21166187: 1383169965.732607 WRITE    0x2bde3 thru 0x2edce (0x2fec bytes)
21166188: 1383169965.733019 TRUNCATE DOWN       from 0x40000 to 0x22c5a
21166189: 1383169965.733905 READ     0x5aa2 thru 0x111fd (0xb75c bytes)
21166190: 1383169965.733984 MAPREAD  0x8954 thru 0xf80c (0x6eb9 bytes)
21166191: 1383169965.734046 TRUNCATE UP from 0x22c5a to 0x2d1ff
21166192: 1383169965.734796 TRUNCATE DOWN       from 0x2d1ff to 0x14b1c
21166193: 1383169965.735632 READ     0x13752 thru 0x14b1b (0x13ca bytes)
21166194: 1383169965.735643 READ     0xec90 thru 0x14b1b (0x5e8c bytes)
21166195: 1383169965.735682 MAPWRITE 0x16fcc thru 0x22e5b (0xbe90 bytes)
21166196: 1383169965.738445 TRUNCATE DOWN       from 0x22e5c to 0x946a
21166197: 1383169965.739279 MAPWRITE 0x1f64b thru 0x23e6a (0x4820 bytes)
21166198: 1383169965.742263 MAPREAD  0x14df4 thru 0x1855c (0x3769 bytes)
21166199: 1383169965.742303 MAPWRITE 0x11972 thru 0x21497 (0xfb26 bytes)
Correct content saved for comparison
(maybe hexdump &quot;/misc/export/nfs_x_2&quot; vs &quot;/misc/export/nfs_x_2.fsxgood&quot;)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On c03, a very long listing ending in:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&#8230;
2170216: 1383169960.525813 MAPWRITE 0x32e000 thru 0x33c2c1 (0xe2c2 bytes)
2170217: 1383169960.526777 TRUNCATE DOWN	from 0x743f40 to 0x1fe27c
2170218: 1383169960.531574 MAPREAD  0x1c02b4 thru 0x1cc215 (0xbf62 bytes)
2170219: 1383169960.531687 CLOSE/OPEN
2170220: 1383169960.532293 TRUNCATE UP	from 0x1fe27c to 0x68dab9
2170221: 1383169960.533532 MAPWRITE 0x343000 thru 0x34c4d4 (0x94d5 bytes)
2170222: 1383169960.800438 CLOSE/OPEN
2170223: 1383169960.800480 MAPREAD  0x61c245 thru 0x620e19 (0x4bd5 bytes)
2170224: 1383169960.800536 CLOSE/OPEN
2170225: 1383169960.800563 WRITE    0x7c000 thru 0x8b4de (0xf4df bytes)
2170226: 1383169960.801359 CLOSE/OPEN
2170227: 1383169960.801386 TRUNCATE DOWN	from 0x68dab9 to 0x206e69
Correct content saved for comparison
(maybe hexdump &quot;/misc/export/nfs_x_4&quot; vs &quot;/misc/export/nfs_x_4.fsxgood&quot;)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</description>
                <environment>Lustre 2.5.0-RC1 build #2, RHEL 6, OpenSFS cluster with one MDS/MGS, one OST with two OSSs, one NFS server, four clients.</environment>
        <key id="21744">LU-4192</key>
            <summary>NFS server crash while fsx runs on clients</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Wed, 30 Oct 2013 23:07:56 +0000</created>
                <updated>Fri, 7 Oct 2016 20:56:56 +0000</updated>
                            <resolved>Fri, 7 Oct 2016 20:56:56 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="74143" author="phils@dugeo.com" created="Sat, 28 Dec 2013 23:57:01 +0000"  >&lt;p&gt;We&apos;re seeing this &amp;#8211; often &amp;#8211; in a non-NFS-server situation, just with standard pread64() calls.&lt;/p&gt;

&lt;p&gt;LustreError: 18743:0:(rw.c:695:ll_read_ahead_pages()) ASSERTION( page_idx &amp;gt; ria-&amp;gt;ria_stoff ) failed: invalid page_idx 176690rs 176690 re 176895 ro 176779 rl 89 rp 1&lt;br/&gt;
LustreError: 18743:0:(rw.c:695:ll_read_ahead_pages()) LBUG&lt;/p&gt;

&lt;p&gt;Call Trace:&lt;br/&gt;
 [&amp;lt;ffffffffa00c4905&amp;gt;] libcfs_debug_dumpstack&lt;br/&gt;
 [&amp;lt;ffffffffa00c4f07&amp;gt;] lbug_with_loc&lt;br/&gt;
 [&amp;lt;ffffffffa06ba4c6&amp;gt;] ll_readahead+0x10c6/0x10f0 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffff8126efe7&amp;gt;] ? radix_tree_insert&lt;br/&gt;
 [&amp;lt;ffffffffa06e6d45&amp;gt;] vvp_io_read_page+0x305/0x340 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffffa023ba8d&amp;gt;] cl_io_read_page+0x8d/0x170 [obdclass]&lt;br/&gt;
 [&amp;lt;ffffffffa022f887&amp;gt;] ? cl_read_page_assume+0xf7/0x220 [obdclass]&lt;br/&gt;
 [&amp;lt;ffffffffa06b8e66&amp;gt;] ll_readpage+0x96/0x1a0 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffff811117ec&amp;gt;] generic_file_aio_read+0x1fc/0x700&lt;br/&gt;
 [&amp;lt;ffffffffa06e8767&amp;gt;] vvp_io_read_start+0x257/0x470 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffffa02395fa&amp;gt;] cl_io_start+0x6a/0x140 [obdclass]&lt;br/&gt;
 [&amp;lt;ffffffffa023dd04&amp;gt;] cl_io_loop+0xb4/0x1b0 [obdclass]&lt;br/&gt;
 [&amp;lt;ffffffffa068a3cf&amp;gt;] ll_file_io_generic+0x3ff/0x610 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffffa068a7df&amp;gt;] ll_file_aio_read+0x13f/0x2c0 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffffa068aeac&amp;gt;] ll_file_read+0x16c/0x2a0 [lustre]&lt;br/&gt;
 [&amp;lt;ffffffff81176cb5&amp;gt;] vfs_read&lt;br/&gt;
 [&amp;lt;ffffffff81176fe2&amp;gt;] sys_pread64&lt;/p&gt;

&lt;p&gt;One particular application seems to reproduce it fairly reliably; in 8 hours running on 35 nodes, 4 nodes crashed in this way.&lt;/p&gt;

&lt;p&gt;I don&apos;t think this is very minor :)&lt;/p&gt;

&lt;p&gt;Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64&lt;br/&gt;
Linux hnod0032 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux&lt;/p&gt;</comment>
                            <comment id="74145" author="phils@dugeo.com" created="Sun, 29 Dec 2013 01:06:07 +0000"  >&lt;p&gt;We&apos;ve lost 10 more nodes to this.&lt;/p&gt;

&lt;p&gt;I&apos;m going to try setting max_read_ahead_mb to 0, in the meantime, just to see if it keeps the cluster alive.&lt;/p&gt;</comment>
                            <comment id="74156" author="phils@dugeo.com" created="Mon, 30 Dec 2013 10:01:59 +0000"  >&lt;p&gt;Well, the good news is that it does indeed stop crashing if I disable readahead.  I guess that&apos;s predictable.&lt;/p&gt;

&lt;p&gt;The bad news is our apps were getting a lot of juice from readahead: a part that used to take 4 minutes and represent about 3% of the runtime now takes 80 minutes and represents 40% of runtime.  So that&apos;s not really tenable...&lt;/p&gt;</comment>
                            <comment id="82608" author="hdoreau" created="Mon, 28 Apr 2014 14:39:43 +0000"  >&lt;p&gt;Was this further investigated? We see it regularly at CEA on a 2.1 instance which is re-exported to NFS.&lt;/p&gt;</comment>
                            <comment id="83506" author="dtascione" created="Thu, 8 May 2014 14:26:08 +0000"  >&lt;p&gt;We are seeing the same problem daily on our Lustre 2.4.2 setup. This is also in a non-NFS situation, and we are just doing standard pread calls.&lt;/p&gt;

&lt;p&gt;The interesting part is that it appears to only be happening for us when we access one particular file. Every time a client panics, it is in relation to reading from this file. However, it doesn&apos;t seem to happen every time we access the file, just occasionally. Only a single node reads or writes to this file at a time, and the read/write load (for that file) is typically very small.&lt;/p&gt;

&lt;p&gt;I haven&apos;t tried turning off read ahead yet &amp;#8211; that has had a very negative impact on performance for us in the past.&lt;/p&gt;</comment>
                            <comment id="87868" author="bobijam" created="Tue, 1 Jul 2014 08:45:16 +0000"  >&lt;p&gt;Would you please try this patch &lt;a href=&quot;http://review.whamcloud.com/10915&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10915&lt;/a&gt; , it can also apply to 2.5 lustre code.&lt;/p&gt;</comment>
                            <comment id="91766" author="dtascione" created="Fri, 15 Aug 2014 19:24:10 +0000"  >&lt;p&gt;We applied that patch here (on 2.4.2) and have been running for a few weeks &amp;#8211; so far, so good!&lt;/p&gt;</comment>
                            <comment id="168730" author="adilger" created="Fri, 7 Oct 2016 20:56:56 +0000"  >&lt;p&gt;Fixed as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5263&quot; title=&quot;ll_read_ahead_pages() ASSERTION( page_idx &amp;gt; ria-&amp;gt;ria_stoff )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5263&quot;&gt;&lt;del&gt;LU-5263&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="25340">LU-5263</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw7fb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11347</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>