Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11707

obdfilter-survey test 1c hangs with lctl blocked

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.6
    • None
    • 3
    • 9223372036854775807

    Description

      obdfilter-survey test_1c hangs. Looking at the logs at https://testing.whamcloud.com/test_sets/77a3ca20-ef6b-11e8-86c0-52540065bddc , the last output in the client test_log is

      == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 (1543000148)
      CMD: trevis-54vm3 lctl dl | grep obdfilter
      CMD: trevis-54vm3 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @
      + NETTYPE=tcp thrlo=32 nobjhi=1 thrhi=32 size=8192 case=disk rslt_loc=/tmp targets="10.9.6.102:lustre-OST0000 10.9.6.102:lustre-OST0001 10.9.6.102:lustre-OST0002 10.9.6.102:lustre-OST0003 10.9.6.102:lustre-OST0004 10.9.6.102:lustre-OST0005 10.9.6.102:lustre-OST0006 10.9.6.102:lustre-OST0007" /usr/bin/obdfilter-survey
      Fri Nov 23 19:09:22 UTC 2018 Obdfilter-survey for case=disk from trevis-54vm1.trevis.whamcloud.com
      ost  8 sz 67108864K rsz 1024K obj    8 thr  256 write 
      

      Looking at the OSS (vm3) console log, we see lctl blocked and it looks like the/a problem started with test 1a

      [128050.794039] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 18:04:49 \(1542996289\)
      [128050.999311] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 18:04:49 (1542996289)
      [128051.201155] Lustre: DEBUG MARKER: lctl dl | grep obdfilter
      [128051.547065] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @
      [131760.431733] INFO: task lctl:20810 blocked for more than 120 seconds.
      [131760.433054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [131760.434345] lctl            D ffff95bdb658bf40     0 20810  20695 0x00000080
      [131760.435608] Call Trace:
      [131760.436105]  [<ffffffffb8d18f39>] schedule+0x29/0x70
      [131760.437209]  [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs]
      [131760.438430]  [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30
      [131760.439484]  [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs]
      [131760.440721]  [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd]
      [131760.441782]  [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40
      [131760.442759]  [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0
      [131760.443851]  [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230
      [131760.444968]  [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho]
      [131760.446338]  [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho]
      [131760.447782]  [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [131760.449040]  [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass]
      [131760.450245]  [<ffffffffb8706640>] ? futex_wake+0x90/0x180
      [131760.451231]  [<ffffffffb87caefd>] ? handle_mm_fault+0x39d/0x9b0
      [131760.452288]  [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20
      [131760.453313]  [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
      [131760.454479]  [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550
      [131760.455413]  [<ffffffffb8d2056c>] ? __do_page_fault+0x1bc/0x4f0
      [131760.456411]  [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0
      [131760.457266]  [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146
      [131760.458374]  [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27
      [131760.459380]  [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146
      [131760.460481] INFO: task lctl:20813 blocked for more than 120 seconds.
      [131760.461538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [131760.462838] lctl            D ffff95bdb658eeb0     0 20813  20695 0x00000080
      [131760.464077] Call Trace:
      [131760.464576]  [<ffffffffb8d18f39>] schedule+0x29/0x70
      [131760.465481]  [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs]
      [131760.466745]  [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30
      [131760.467729]  [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs]
      [131760.468876]  [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd]
      [131760.469860]  [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40
      [131760.470831]  [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0
      [131760.472017]  [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230
      [131760.473021]  [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho]
      [131760.474345]  [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho]
      [131760.475556]  [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [131760.476755]  [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass]
      [131760.477947]  [<ffffffffb87caefd>] ? handle_mm_fault+0x39d/0x9b0
      [131760.478941]  [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20
      [131760.479986]  [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
      [131760.481127]  [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550
      [131760.482055]  [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0
      [131760.482916]  [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146
      [131760.484012]  [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27
      [131760.485016]  [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146
      [131760.486125] INFO: task lctl:20815 blocked for more than 120 seconds.
      [131760.487186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [131760.488468] lctl            D ffff95bd48ab8fd0     0 20815  20695 0x00000080
      [131760.489714] Call Trace:
      [131760.490151]  [<ffffffffb8d18f39>] schedule+0x29/0x70
      [131760.490993]  [<ffffffffc0c54577>] osd_do_bio.isra.25+0x717/0x8d0 [osd_ldiskfs]
      [131760.492185]  [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30
      [131760.493174]  [<ffffffffc0c54a17>] osd_read_prep+0x2e7/0x3f0 [osd_ldiskfs]
      [131760.494309]  [<ffffffffc0d90979>] ofd_preprw+0x809/0x1170 [ofd]
      [131760.495314]  [<ffffffffb879bf3e>] ? __get_free_pages+0xe/0x40
      [131760.496277]  [<ffffffffb87f835e>] ? kmalloc_order_trace+0x2e/0xa0
      [131760.497298]  [<ffffffffb87fbf01>] ? __kmalloc+0x211/0x230
      [131760.498209]  [<ffffffffc0f7d17a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho]
      [131760.499522]  [<ffffffffc0f84ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho]
      [131760.500725]  [<ffffffffc07bd7a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [131760.501935]  [<ffffffffc07a8609>] class_handle_ioctl+0x1929/0x1dc0 [obdclass]
      [131760.503119]  [<ffffffffb88d4bfe>] ? security_capable+0x1e/0x20
      [131760.504107]  [<ffffffffc078d5d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
      [131760.505205]  [<ffffffffb8834100>] do_vfs_ioctl+0x360/0x550
      [131760.506134]  [<ffffffffb8d2056c>] ? __do_page_fault+0x1bc/0x4f0
      [131760.507129]  [<ffffffffb8834391>] SyS_ioctl+0xa1/0xc0
      [131760.507990]  [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146
      [131760.509101]  [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27
      [131760.510154]  [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146
      [131906.715976] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [131907.490234] Lustre: DEBUG MARKER: rc=0;
      [131907.490234] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [131907.490234] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [131907.490234] 				echo $(hostname -s): $val;
      [131907.490234] 				rc=$val;
      [131907.490234] 			fi;
      [131907.490234] 			exit $rc
      [131908.529978] Lustre: DEBUG MARKER: dmesg
      [131909.803409] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  SKIP: obdfilter-survey test_1b skipping ALWAYS excluded test 1b
      [131910.013295] Lustre: DEBUG MARKER: SKIP: obdfilter-survey test_1b skipping ALWAYS excluded test 1b
      [131910.263609] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 \(1543000148\)
      [131910.479290] Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 19:09:08 (1543000148)
      [131910.678546] Lustre: DEBUG MARKER: lctl dl | grep obdfilter
      [131911.110787] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @
      [132377.393312] SysRq : Changing Loglevel
      [132377.394371] Loglevel set to 8
      [132605.646160] SysRq : Show State
      [132605.647102]   task                        PC stack   pid father
      [132605.648386] systemd         S ffff95bdbc140000     0     1      0 0x00000000
      [132605.650160] Call Trace:
      [132605.650768]  [<ffffffffb8d18f39>] schedule+0x29/0x70
      [132605.651958]  [<ffffffffb8d17fdd>] schedule_hrtimeout_range_clock+0x12d/0x150
      [132605.653477]  [<ffffffffb886bdc9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
      [132605.655070]  [<ffffffffb8d18013>] schedule_hrtimeout_range+0x13/0x20
      [132605.656442]  [<ffffffffb886c05e>] ep_poll+0x23e/0x360
      [132605.657646]  [<ffffffffb86d2010>] ? wake_up_state+0x20/0x20
      [132605.658875]  [<ffffffffb886d50d>] SyS_epoll_wait+0xed/0x120
      [132605.660172]  [<ffffffffb8d256d5>] ? system_call_after_swapgs+0xa2/0x146
      [132605.661598]  [<ffffffffb8d2579b>] system_call_fastpath+0x22/0x27
      [132605.662973]  [<ffffffffb8d256e1>] ? system_call_after_swapgs+0xae/0x146
      …
      [132606.070217] jbd2/vda1-8     D ffff95bd7612af70     0   262      2 0x00000000
      [132606.071560] Call Trace:
      [132606.072001]  [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50
      [132606.072969]  [<ffffffffb8d18f39>] schedule+0x29/0x70
      [132606.073822]  [<ffffffffb8d168a9>] schedule_timeout+0x239/0x2c0
      [132606.074887]  [<ffffffffb866a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [132606.075920]  [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50
      [132606.076878]  [<ffffffffb8d1844d>] io_schedule_timeout+0xad/0x130
      [132606.077880]  [<ffffffffb8d184e8>] io_schedule+0x18/0x20
      [132606.078903]  [<ffffffffb8d16ed1>] bit_wait_io+0x11/0x50
      [132606.079799]  [<ffffffffb8d169f7>] __wait_on_bit+0x67/0x90
      [132606.080788]  [<ffffffffb8d16ec0>] ? bit_wait+0x50/0x50
      [132606.081665]  [<ffffffffb8d16b61>] out_of_line_wait_on_bit+0x81/0xb0
      [132606.082798]  [<ffffffffb86befd0>] ? wake_bit_function+0x40/0x40
      [132606.083808]  [<ffffffffb885621a>] __wait_on_buffer+0x2a/0x30
      [132606.357324]  [<ffffffffc00cd7f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
      [132606.359015]  [<ffffffffb86cc7f0>] ? finish_task_switch+0x50/0x170
      [132606.368501]  [<ffffffffc00d2ab9>] kjournald2+0xc9/0x260 [jbd2]
      [132606.369630]  [<ffffffffb86bef10>] ? wake_up_atomic_t+0x30/0x30
      [132606.370614]  [<ffffffffc00d29f0>] ? commit_timeout+0x10/0x10 [jbd2]
      [132606.371738]  [<ffffffffb86bdf21>] kthread+0xd1/0xe0
      [132606.372563]  [<ffffffffb86bde50>] ? insert_kthread_work+0x40/0x40
      [132606.373577]  [<ffffffffb8d255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [132606.374762]  [<ffffffffb86bde50>] ? insert_kthread_work+0x40/0x40
      

      We have seen a similar OSS call trace where lctl is blocked in LU-10872, but that ticket is for servers running ZFS.

      I can’t find any other examples of this failure for the past three months.

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: