Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.10.6
-
None
-
3
-
9223372036854775807
Description
obdfilter-survey test_1c hangs. Looking at the logs at https://testing.whamcloud.com/test_sets/77a3ca20-ef6b-11e8-86c0-52540065bddc , the last output in the client test_log is
== obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 (1543000148) CMD: trevis-54vm3 lctl dl | grep obdfilter CMD: trevis-54vm3 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @ + NETTYPE=tcp thrlo=32 nobjhi=1 thrhi=32 size=8192 case=disk rslt_loc=/tmp targets="10.9.6.102:lustre-OST0000 10.9.6.102:lustre-OST0001 10.9.6.102:lustre-OST0002 10.9.6.102:lustre-OST0003 10.9.6.102:lustre-OST0004 10.9.6.102:lustre-OST0005 10.9.6.102:lustre-OST0006 10.9.6.102:lustre-OST0007" /usr/bin/obdfilter-survey Fri Nov 23 19:09:22 UTC 2018 Obdfilter-survey for case=disk from trevis-54vm1.trevis.whamcloud.com ost 8 sz 67108864K rsz 1024K obj 8 thr 256 write
Looking at the OSS (vm3) console log, we see lctl blocked and it looks like the/a problem started with test 1a
[128050.794039] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 18:04:49 \(1542996289\) [128050.999311] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 18:04:49 (1542996289) [128051.201155] Lustre: DEBUG MARKER: lctl dl | grep obdfilter [128051.547065] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @ [131760.431733] INFO: task lctl:20810 blocked for more than 120 seconds. [131760.433054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [131760.434345] lctl D ffff95bdb658bf40 0 20810 20695 0x00000080 [131760.435608] Call Trace: [131760.436105] [<ffffffffb8d18f39>] schedule+0x29/0x70 [131760.437209] [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs] [131760.438430] [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30 [131760.439484] [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs] [131760.440721] [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd] [131760.441782] [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40 [131760.442759] [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0 [131760.443851] [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230 [131760.444968] [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho] [131760.446338] [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho] [131760.447782] [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [131760.449040] [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass] [131760.450245] [<ffffffffb8706640>] ? futex_wake+0x90/0x180 [131760.451231] [<ffffffffb87caefd>] ? handle_mm_fault+0x39d/0x9b0 [131760.452288] [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20 [131760.453313] [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass] [131760.454479] [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550 [131760.455413] [<ffffffffb8d2056c>] ? __do_page_fault+0x1bc/0x4f0 [131760.456411] [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0 [131760.457266] [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146 [131760.458374] [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27 [131760.459380] [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146 [131760.460481] INFO: task lctl:20813 blocked for more than 120 seconds. [131760.461538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [131760.462838] lctl D ffff95bdb658eeb0 0 20813 20695 0x00000080 [131760.464077] Call Trace: [131760.464576] [<ffffffffb8d18f39>] schedule+0x29/0x70 [131760.465481] [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs] [131760.466745] [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30 [131760.467729] [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs] [131760.468876] [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd] [131760.469860] [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40 [131760.470831] [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0 [131760.472017] [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230 [131760.473021] [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho] [131760.474345] [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho] [131760.475556] [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [131760.476755] [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass] [131760.477947] [<ffffffffb87caefd>] ? handle_mm_fault+0x39d/0x9b0 [131760.478941] [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20 [131760.479986] [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass] [131760.481127] [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550 [131760.482055] [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0 [131760.482916] [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146 [131760.484012] [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27 [131760.485016] [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146 [131760.486125] INFO: task lctl:20815 blocked for more than 120 seconds. [131760.487186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [131760.488468] lctl D ffff95bd48ab8fd0 0 20815 20695 0x00000080 [131760.489714] Call Trace: [131760.490151] [<ffffffffb8d18f39>] schedule+0x29/0x70 [131760.490993] [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs] [131760.492185] [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30 [131760.493174] [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs] [131760.494309] [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd] [131760.495314] [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40 [131760.496277] [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0 [131760.497298] [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230 [131760.498209] [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho] [131760.499522] [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho] [131760.500725] [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [131760.501935] [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass] [131760.503119] [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20 [131760.504107] [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass] [131760.505205] [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550 [131760.506134] [<ffffffffb8d2056c>] ? __do_page_fault+0x1bc/0x4f0 [131760.507129] [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0 [131760.507990] [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146 [131760.509101] [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27 [131760.510154] [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146 [131906.715976] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null [131907.490234] Lustre: DEBUG MARKER: rc=0; [131907.490234] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); [131907.490234] if [[ $? -eq 0 && $val -ne 0 ]]; then [131907.490234] echo $(hostname -s): $val; [131907.490234] rc=$val; [131907.490234] fi; [131907.490234] exit $rc [131908.529978] Lustre: DEBUG MARKER: dmesg [131909.803409] Lustre: DEBUG MARKER: /usr/sbin/lctl mark SKIP: obdfilter-survey test_1b skipping ALWAYS excluded test 1b [131910.013295] Lustre: DEBUG MARKER: SKIP: obdfilter-survey test_1b skipping ALWAYS excluded test 1b [131910.263609] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 \(1543000148\) [131910.479290] Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 (1543000148) [131910.678546] Lustre: DEBUG MARKER: lctl dl | grep obdfilter [131911.110787] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @ [132377.393312] SysRq : Changing Loglevel [132377.394371] Loglevel set to 8 [132605.646160] SysRq : Show State [132605.647102] task PC stack pid father [132605.648386] systemd S ffff95bdbc140000 0 1 0 0x00000000 [132605.650160] Call Trace: [132605.650768] [<ffffffffb8d18f39>] schedule+0x29/0x70 [132605.651958] [<ffffffffb8d17fdd>] schedule_hrtimeout_range_clock+0x12d/0x150 [132605.653477] [<ffffffffb886bdc9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0 [132605.655070] [<ffffffffb8d18013>] schedule_hrtimeout_range+0x13/0x20 [132605.656442] [<ffffffffb886c05e>] ep_poll+0x23e/0x360 [132605.657646] [<ffffffffb86d2010>] ? wake_up_state+0x20/0x20 [132605.658875] [<ffffffffb886d50d>] SyS_epoll_wait+0xed/0x120 [132605.660172] [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146 [132605.661598] [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27 [132605.662973] [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146 … [132606.070217] jbd2/vda1-8 D ffff95bd7612af70 0 262 2 0x00000000 [132606.071560] Call Trace: [132606.072001] [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50 [132606.072969] [<ffffffffb8d18f39>] schedule+0x29/0x70 [132606.073822] [<ffffffffb8d168a9>] schedule_timeout+0x239/0x2c0 [132606.074887] [<ffffffffb866a14e>] ? kvm_clock_get_cycles+0x1e/0x20 [132606.075920] [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50 [132606.076878] [<ffffffffb8d1844d>] io_schedule_timeout+0xad/0x130 [132606.077880] [<ffffffffb8d184e8>] io_schedule+0x18/0x20 [132606.078903] [<ffffffffb8d16ed1>] bit_wait_io+0x11/0x50 [132606.079799] [<ffffffffb8d169f7>] __wait_on_bit+0x67/0x90 [132606.080788] [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50 [132606.081665] [<ffffffffb8d16b61>] out_of_line_wait_on_bit+0x81/0xb0 [132606.082798] [<ffffffffb86befd0>] ? wake_bit_function+0x40/0x40 [132606.083808] [<ffffffffb885621a>] __wait_on_buffer+0x2a/0x30 [132606.357324] [<ffffffffc00cd7f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2] [132606.359015] [<ffffffffb86cc7f0>] ? finish_task_switch+0x50/0x170 [132606.368501] [<ffffffffc00d2ab9>] kjournald2+0xc9/0x260 [jbd2] [132606.369630] [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30 [132606.370614] [<ffffffffc00d29f0>] ? commit_timeout+0x10/0x10 [jbd2] [132606.371738] [<ffffffffb86bdf21>] kthread+0xd1/0xe0 [132606.372563] [<ffffffffb86bde50>] ? insert_kthread_work+0x40/0x40 [132606.373577] [<ffffffffb8d255f7>] ret_from_fork_nospec_begin+0x21/0x21 [132606.374762] [<ffffffffb86bde50>] ? insert_kthread_work+0x40/0x40
We have seen a similar OSS call trace where lctl is blocked in LU-10872, but that ticket is for servers running ZFS.
I can’t find any other examples of this failure for the past three months.