Lustre / LU-547

1.8<->2.1 / 2.3<->2.4 interop: dbench kept doing cleanup

Details


    Description

      While running the sanity-benchmark test suite, the dbench test kept doing cleanup and could not stop:

      <~snip~>
         1    737819     5.51 MB/sec  execute 598 sec  latency 38819.862 ms
         1    737819     5.50 MB/sec  execute 599 sec  latency 39820.034 ms
         1  cleanup 600 sec
         1  cleanup 601 sec
         ......
         1  cleanup 2897 sec
         1  cleanup 2898 sec
         ......
      

      On client node fat-amd-3-ib:

      # ps auxww
      <~snip~>
      500      18628  0.0  0.0 107264  2552 pts/0    S+   00:50   0:00 bash rundbench -D /mnt/lustre/d0.fat-amd-3-ib 1
      500      18636  0.0  0.0   6092   596 pts/0    S+   00:50   0:00 dbench -c client.txt 1
      500      18637  2.2  0.0   8140  2540 pts/0    S+   00:50   2:00 dbench -c client.txt 1
      
      # echo t > /proc/sysrq-trigger
      <~snip~>
      dbench        S 0000000000000009     0 18636  18628 0x00000080
       ffff880219239e78 0000000000000086 0000000000000000 ffffffff812688c5
       ffff880323c50f98 ffff880218872068 0000000000000000 0000000100b91979
       ffff880218af3a78 ffff880219239fd8 000000000000f598 ffff880218af3a78
      Call Trace:
       [<ffffffff812688c5>] ? rb_insert_color+0x125/0x160
       [<ffffffff8106b325>] do_wait+0x1c5/0x250
       [<ffffffff8106b453>] sys_wait4+0xa3/0x100
       [<ffffffff810699b0>] ? child_wait_callback+0x0/0x70
       [<ffffffff8100b425>] ? int_check_syscall_exit_work+0x34/0x3d
       [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
      dbench        S 000000000000000c     0 18637  18636 0x00000080
       ffff880419933a98 0000000000000082 0000000000000000 000000004e31176c
       ffff880419933a78 ffffffffa04a2cac ffff880419933a78 ffff8803191cb000
       ffff8804197b05f8 ffff880419933fd8 000000000000f598 ffff8804197b05f8
      Call Trace:
       [<ffffffffa04a2cac>] ? ptlrpc_at_adj_net_latency+0x7c/0x230 [ptlrpc]
       [<ffffffffa049c00d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
       [<ffffffffa046c15f>] ? lock_res_and_lock+0x5f/0xe0 [ptlrpc]
       [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
       [<ffffffffa048a565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
       [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
       [<ffffffffa048e074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
       [<ffffffffa0677edd>] ll_file_flock+0x47d/0x6b0 [lustre]
       [<ffffffffa049b9f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
       [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
       [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
       [<ffffffff8107ff06>] ? group_send_sig_info+0x26/0x70
       [<ffffffff81080212>] ? sys_kill+0x92/0x190
       [<ffffffff811845f7>] sys_fcntl+0x197/0x530
       [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
      

      Maloo report: https://maloo.whamcloud.com/test_sets/4f04bb24-b8f5-11e0-8bdf-52540025f9af

      Please see the attached dbench-debug_log.tar.bz2 for debug logs.

      Attachments

        Activity


          simmonsja James A Simmons added a comment -

          Old blocker for unsupported version
          sarah Sarah Liu added a comment - - edited

          Looks like the same issue seen in 2.1.3<->2.4 interop testing:
          https://maloo.whamcloud.com/test_sets/07ffc460-397f-11e2-9fda-52540035b04c

          server: 2.3 RHEL6
          client: lustre master build #1065

          test log shows:

             1  cleanup 7149 sec
             1  cleanup 7150 sec
             1  cleanup 7151 sec
             1  cleanup 7152 sec
             1  cleanup 7153 sec
          

          client console shows:

          17:38:39:Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench == 17:38:24 (1353980304)
          17:39:21:Lustre: 3001:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345]  req@ffff880074163400 x1419751045755358/t0(0) o2->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 440/432 e 0 to 1 dl 1353980352 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
          17:39:21:Lustre: lustre-OST0002-osc-ffff8800749b5c00: Connection to lustre-OST0002 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
          17:39:21:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345]  req@ffff880075a1d000 x1419751045755356/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980352 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          17:39:21:Lustre: lustre-OST0005-osc-ffff8800749b5c00: Connection to lustre-OST0005 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
          17:39:22:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980350/real 0]  req@ffff880075a1d800 x1419751045755403/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980357 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
          17:39:22:Lustre: lustre-OST0003-osc-ffff8800749b5c00: Connection to lustre-OST0003 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
          17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980352/real 0]  req@ffff880075de8000 x1419751045755406/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980358 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
          17:39:22:Lustre: lustre-OST0004-osc-ffff8800749b5c00: Connection to lustre-OST0004 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
          17:39:22:Lustre: Skipped 3 previous similar messages
          17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0]  req@ffff880037692000 x1419751045755416/t0(0) o8->lustre-OST0004-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980364 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 25 previous similar messages
          17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0]  req@ffff880037730000 x1419751045755415/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
          17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980368/real 0]  req@ffff880037739000 x1419751045755422/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980379 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 1 previous similar message
          17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980393/real 1353980394]  req@ffff8800376fd000 x1419751045755451/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980418 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
          17:41:07:INFO: task dbench:8110 blocked for more than 120 seconds.
          17:41:07:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          17:41:07:dbench        D 0000000000000000     0  8110   8109 0x00000080
          17:41:07: ffff880075d8ba58 0000000000000082 ffff880075d8ba08 ffff8800746ac0d0
          17:41:07: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
          17:41:07: ffff880076192638 ffff880075d8bfd8 000000000000fb88 ffff880076192638
          17:41:07:Call Trace:
          17:41:07: [<ffffffffa04367da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
          17:41:07: [<ffffffff814fe7d5>] schedule_timeout+0x215/0x2e0
          17:41:08: [<ffffffffa077bfbc>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
          17:41:08: [<ffffffffa0791c30>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
          17:41:08: [<ffffffff814fe453>] wait_for_common+0x123/0x180
          17:41:08: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
          17:41:08: [<ffffffff814fe56d>] wait_for_completion+0x1d/0x20
          17:41:08: [<ffffffffa099b1d4>] osc_io_setattr_end+0xc4/0x1a0 [osc]
          17:41:08: [<ffffffffa0a1dc50>] ? lov_io_end_wrapper+0x0/0x100 [lov]
          17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
          17:41:08: [<ffffffffa0629350>] ? cl_io_start+0x0/0x140 [obdclass]
          17:41:08: [<ffffffffa0a1dd41>] lov_io_end_wrapper+0xf1/0x100 [lov]
          17:41:08: [<ffffffffa0a1d7ce>] lov_io_call+0x8e/0x130 [lov]
          17:41:08: [<ffffffffa0a1f42c>] lov_io_end+0x4c/0x110 [lov]
          17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
          17:41:08: [<ffffffffa062dce2>] cl_io_loop+0xc2/0x1b0 [obdclass]
          17:41:08: [<ffffffffa0ad3148>] cl_setattr_ost+0x208/0x2d0 [lustre]
          17:41:08: [<ffffffffa0aa2472>] ll_setattr_raw+0x752/0xfd0 [lustre]
          17:41:08: [<ffffffffa0aa2d4b>] ll_setattr+0x5b/0xf0 [lustre]
          17:41:08: [<ffffffff81197368>] notify_change+0x168/0x340
          17:41:08: [<ffffffff81187e25>] ? putname+0x35/0x50
          17:41:08: [<ffffffff811aacdc>] utimes_common+0xdc/0x1b0
          17:41:08: [<ffffffff811aae8e>] do_utimes+0xde/0xf0
          17:41:08: [<ffffffff811ab023>] sys_utime+0x23/0x90
          17:41:08: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980453/real 1353980454]  req@ffff8800376fd000 x1419751045755487/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
          17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1353980543/real 1353980543]  req@ffff8800756cc400 x1419751045755544/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
          17:43:07:INFO: task dbench:8110 blocked for more than 120 seconds.
          
          green Oleg Drokin added a comment -

          Since we don't enable flock support by default, let's not include mounting with -o flock in interop testing for now.
          I don't think this is such an important use case anyway. We can add some extra interop code in 1.8.7 if this poses a problem for real customers in the future.

          yujian Jian Yu added a comment - - edited

          Hi Oleg,
          I reproduced the issue and gathered more debug logs while running the dbench test with "PTLDEBUG=-1":
          https://maloo.whamcloud.com/test_sets/94f6dce0-be57-11e0-8bdf-52540025f9af
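          For context, PTLDEBUG=-1 in the test framework sets the full Lustre debug mask. Outside the framework, roughly the same data can be gathered by hand with lctl (a sketch; the dump path is illustrative):

```shell
# enable all debug subsystems on the client (equivalent of PTLDEBUG=-1)
lctl set_param debug=-1

# ... reproduce the dbench hang ...

# dump the kernel debug buffer for attachment to the ticket
lctl dk /tmp/lustre-debug.log
```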

          yujian Jian Yu added a comment -

          Unfortunately the debug logs don't shed enough light on the situation. The client log is empty.
          The MDS log only lists the current lock that we are blocking due to conflicts, but I see no other lock activity, so I have no way to know what it was conflicting with.

          The issue can be reproduced consistently with the 1.8.6-wc1 client and 2.0.66.0 server configuration. Once the Toro cluster is back, I'll reproduce the issue and gather more debug logs.

          I wonder if this dbench test works between a 1.8 client and a 1.8 server if you mount the client with -o flock? (I don't think we mount with flock by default in either 1.8 or 2.1.)

          The "flock" option is used by default in our test suite on both the b1_8 and master branches:
          MOUNTOPT=${MOUNTOPT:-"-o user_xattr,acl,flock"}

          and the dbench test passed on 1.8.6-wc1 with the flock option:
          https://maloo.whamcloud.com/test_sets/7deb786c-a269-11e0-aee5-52540025f9af
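          If flock were dropped from interop runs as suggested earlier in the ticket, the override would be a one-line change to this variable (a sketch; the MGS NID and mount point below are illustrative, not from this ticket):

```shell
# Override the test framework's default client mount options, leaving out
# "flock" (the default shown above is "-o user_xattr,acl,flock").
MOUNTOPT="-o user_xattr,acl"

# The framework would then mount the client without flock support, e.g.:
echo mount -t lustre ${MOUNTOPT} mgs@tcp0:/lustre /mnt/lustre
```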

          green Oleg Drokin added a comment -

          I now think this is in fact different from LU-451 and probably related to LU-104 interop, even though that seemed to work well for me at the time.

          Unfortunately the debug logs don't shed enough light on the situation. The client log is empty.
          The MDS log only lists the current lock that we are blocking due to conflicts, but I see no other lock activity, so I have no way to know what it was conflicting with.

          I wonder if this dbench test works between a 1.8 client and a 1.8 server if you mount the client with -o flock? (I don't think we mount with flock by default in either 1.8 or 2.1.)

          pjones Peter Jones added a comment -

          Oleg will look into this one. He suspects the same root cause as LU-451.

          yujian Jian Yu added a comment -

          sanity-quota test 8 also hit the same issue:
          https://maloo.whamcloud.com/test_sets/700c103e-b8cc-11e0-8bdf-52540025f9af

          People

            green Oleg Drokin
            yujian Jian Yu
            Votes: 0
            Watchers: 4
