LU-9429

parallel-scale test_parallel_grouplock: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • Severity: 3

    Description

Seen in every test session where test_parallel_grouplock ran during tag 56 (2.9.56) testing:
      https://testing.hpdd.intel.com/test_sessions/732dde3a-7e28-437b-8865-c350e9438ee4
      https://testing.hpdd.intel.com/test_sessions/f14b71e9-9eda-4053-814d-fdf644925d29
      https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607
      https://testing.hpdd.intel.com/test_sessions/30cc75b6-594f-4255-accf-24fe11bdd565
      https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824
      https://testing.hpdd.intel.com/test_sessions/4f7e260b-bce2-4834-b77c-a1b47527d05a

      With tag 56 testing, parallel-scale had two subtests (test_cascading_rw & test_parallel_grouplock) that failed 6 of 6 times.

With tag 52-55 testing, some test_cascading_rw failures were seen, but test_parallel_grouplock passed 100% of the time.

In all six failures, we saw this sequence:

      test_cascading_rw: cascading_rw failed! 1 (covered by LU-9367):

      From test_log:

      /usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
      22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
      [onyx-48vm1:12185] *** Process received signal ***
      [onyx-48vm1:12185] Signal: Floating point exception (8)
      [onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
      [onyx-48vm1:12185] Failing at address: 0x4024c8
      [onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
      [onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
      [onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
      [onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
      [onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
      [onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
      [onyx-48vm1:12185] *** End of error message ***
      [onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      --------------------------------------------------------------------------
      mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
      --------------------------------------------------------------------------
       parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
        = /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
        = /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
        = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
        = /usr/lib64/lustre/tests/parallel-scale.sh:132:main()
      

      The test_cascading_rw failure was then followed by:

      test_parallel_grouplock: test failed to respond and timed out

      From test_log:

      parallel_grouplock subtests -t 11 PASS
      

Note: subtest 11 passed only on non-DNE configurations.

      Also from test_log:

      CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
      + /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
      + chmod 0777 /mnt/lustre
      drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
      + su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
      /usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
      23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
      23:38:55:	Beginning subtest 12
      

      This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs.

Activity


adilger Andreas Dilger added a comment -

Bobijam, could you please provide a brief description of what is needed to fix this problem, and how much work it will be to implement?
            bobijam Zhenyu Xu added a comment -

I tried it on my VM; it shows that the hang is not caused by the PFL issue. It looks like the write threads are waiting for extent locks that are blocked by the group lock.

            parallel_grou S 0000000000000001     0 21418  21416 0x00000080
             ffff8800296cf788 0000000000000086 ccf27c28fd06e26f 000006b4ffffff9d
             593905fb000053aa 0000000000000000 ffff880000000001 00000000fffffffc
             ffff8800296cf798 ffff88001b92b0c4 ffff88001cf59060 ffff8800296cffd8
            Call Trace:
             [<ffffffffa064e38d>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
             [<ffffffff810640e0>] ? default_wake_function+0x0/0x20
             [<ffffffffa0648506>] ldlm_cli_enqueue_fini+0x936/0xe30 [ptlrpc]
             [<ffffffffa06699d1>] ? ptlrpc_set_destroy+0x2d1/0x450 [ptlrpc]
             [<ffffffffa064c88d>] ldlm_cli_enqueue+0x3ad/0x7d0 [ptlrpc]
             [<ffffffffa064dd10>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
             [<ffffffffa098fab0>] ? osc_ldlm_blocking_ast+0x0/0x3c0 [osc]
             [<ffffffffa098f510>] ? osc_ldlm_glimpse_ast+0x0/0x340 [osc]
             [<ffffffffa097edff>] osc_enqueue_base+0x1ff/0x630 [osc]
             [<ffffffffa099056d>] osc_lock_enqueue+0x2bd/0xa00 [osc]
             [<ffffffffa0991f90>] ? osc_lock_upcall+0x0/0x530 [osc]
             [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
             [<ffffffffa0407e17>] lov_lock_enqueue+0x97/0x140 [lov]
             [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
             [<ffffffffa04d958b>] cl_lock_request+0x7b/0x200 [obdclass]
             [<ffffffffa04dd301>] cl_io_lock+0x381/0x3d0 [obdclass]
             [<ffffffffa04dd466>] cl_io_loop+0x116/0xb20 [obdclass]
             [<ffffffffa0663066>] ? interval_insert+0x296/0x410 [ptlrpc]
             [<ffffffffa0e7d9e1>] ll_file_io_generic+0x231/0xaa0 [lustre]
             [<ffffffffa0e8021d>] ll_file_aio_write+0x13d/0x280 [lustre]
             [<ffffffffa0e8049a>] ll_file_write+0x13a/0x270 [lustre]
             [<ffffffff81189ef8>] vfs_write+0xb8/0x1a0
             [<ffffffff8118b40f>] ? fget_light_pos+0x3f/0x50
             [<ffffffff8118aa31>] sys_write+0x51/0xb0
             [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
            
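            For context, the conflict described here can be exercised directly through the group-lock ioctl interface. Below is a minimal sketch (not from this ticket): LL_IOC_GROUP_LOCK and LL_IOC_GROUP_UNLOCK are the actual Lustre ioctls, while the path, group id, and sleep durations are arbitrary illustration values.

            /* Sketch: one process holds a Lustre group lock while a second
             * process's write on the same file blocks on its extent lock. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <sys/ioctl.h>
            #include <sys/wait.h>
            #include <unistd.h>
            #include <lustre/lustre_user.h>  /* header location varies by release */

            #define DEMO_PATH "/mnt/lustre/grouplock_demo"  /* hypothetical file */
            #define DEMO_GID  7743                          /* arbitrary group id */

            int main(void)
            {
                    if (fork() == 0) {
                            /* Child: take the group lock and hold it a while. */
                            int fd = open(DEMO_PATH, O_CREAT | O_RDWR, 0644);

                            if (fd < 0 || ioctl(fd, LL_IOC_GROUP_LOCK, DEMO_GID) < 0) {
                                    perror("child setup");
                                    _exit(1);
                            }
                            sleep(5);
                            ioctl(fd, LL_IOC_GROUP_UNLOCK, DEMO_GID);
                            _exit(0);
                    }

                    sleep(1);  /* let the child acquire the group lock first */

                    /* Parent: a plain write needs an extent lock that conflicts
                     * with the group lock, so it blocks until the child unlocks. */
                    int fd = open(DEMO_PATH, O_CREAT | O_RDWR, 0644);

                    if (fd >= 0 && write(fd, "x", 1) == 1)
                            printf("write completed after group lock release\n");
                    wait(NULL);
                    return 0;
            }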

            adilger Andreas Dilger added a comment -

            Will this test program be fixed by patch https://review.whamcloud.com/26646 "LU-9344 test: hung with sendfile_grouplock test12()", given that group locking always instantiates the PFL components?
            bobijam Zhenyu Xu added a comment -

            I don't think the group lock test hang will be fixed by patch #27183.

            Partial OST object instantiation triggers a layout change when an uninitialized component extent is written; the layout is then fetched and refreshed so that the IO can continue. However, taking the group lock beforehand increments lov_io::lo_active_ios (decremented in cl_put_grouplock), so the layout-refreshing IO encounters this nonzero lo_active_ios in lov_conf_set() and waits for it to drop to 0. This implies there must be no race between taking the group lock and IO that cares about layout changes (vvp_io_init() calls ll_layout_refresh() if !io->ci_ignore_layout).

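            To make that sequence concrete, here is a minimal sketch of the described self-deadlock, under stated assumptions: the file must be a pre-created PFL file (e.g. via "lfs setstripe -E 1M -c 1 -E -1 -c 2 <file>"), and the path and group id are illustrative. Per the analysis above, on affected builds the write into the uninstantiated component hangs in the layout refresh path while the held group lock keeps lo_active_ios nonzero.

            /* Sketch: group lock held, then a write forcing PFL component
             * instantiation (a layout change) on the same file. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <sys/ioctl.h>
            #include <unistd.h>
            #include <lustre/lustre_user.h>  /* header location varies by release */

            int main(void)
            {
                    const char *path = "/mnt/lustre/pfl_demo";  /* hypothetical PFL file */
                    int gid = 7743;                             /* arbitrary group id */
                    char byte = 'x';
                    int fd = open(path, O_RDWR);

                    if (fd < 0 || ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {
                            perror("setup");
                            return 1;
                    }
                    /* Writing at 2MB is past the 1MB first component, so the
                     * second component must be instantiated: a layout change. */
                    if (pwrite(fd, &byte, 1, 2 * 1024 * 1024) != 1)
                            perror("pwrite");
                    ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);
                    return 0;
            }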

            adilger Andreas Dilger added a comment -

            Will this MPI application failure be fixed by the LU-9490 patch https://review.whamcloud.com/27183 "LU-9490 llite: return v1/v3 layout for legacy app"? We really shouldn't be breaking MPI applications due to PFL, and I think a workaround in the Lustre code is warranted.
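            For reference, the kind of "legacy application" layout query that the LU-9490 patch addresses looks roughly like the sketch below. llapi_file_get_stripe() is the real liblustreapi call; the path and buffer sizing are illustrative. A composite (PFL) layout cannot fit in the v1/v3 lov_user_md such applications allocate, which is why the patch returns a v1/v3 layout to them.

            /* Sketch: pre-PFL striping query as a legacy MPI-IO style
             * application would issue it. */
            #include <stdio.h>
            #include <stdlib.h>
            #include <lustre/lustreapi.h>

            int main(void)
            {
                    const char *path = "/mnt/lustre/d0.parallel_grouplock";  /* example */
                    /* Legacy applications size the buffer for a plain v3 layout. */
                    size_t len = sizeof(struct lov_user_md_v3) +
                                 LOV_MAX_STRIPE_COUNT *
                                 sizeof(struct lov_user_ost_data_v1);
                    struct lov_user_md *lum = calloc(1, len);

                    if (lum && llapi_file_get_stripe(path, lum) == 0)
                            printf("lmm_magic %#x, stripe_count %u\n",
                                   lum->lmm_magic, lum->lmm_stripe_count);
                    else
                            perror("llapi_file_get_stripe");
                    free(lum);
                    return 0;
            }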

            jcasper James Casper (Inactive) added a comment -

            parallel-scale test_parallel_grouplock TIMEOUTs in the last 16 months (on master):
            > 2016-01-01 to 2017-02-02: 0 occurrences
            > 2017-02-03 to 2017-04-05: 18 occurrences
            > since 2017-04-05: 87 occurrences
            >
            > tag 55 test (b3550): 2017-04-05 (parallel_grouplock 100% passing)
            >
            > landed 2017-04-08:
            > LU-8998 pfl: Basic data structures for composite layout — jinshan.xiong / gitweb
            > LU-8998 pfl: enhance PFID EA for PFL — jinshan.xiong / gitweb
            > LU-8998 pfl: layout LFSCK handles PFL file — jinshan.xiong / gitweb
            > LU-8998 pfl: test cases for lfsck on PFL — jinshan.xiong / gitweb
            > LU-9008 pfl: dynamic layout modification with write/truncate — jinshan.xiong / gitweb
            > LU-9165 pfl: MDS handling of write intent IT_LAYOUT RPC — jinshan.xiong / gitweb
            >
            > LU-8998 clio: Client side implementation for PFL — jinshan.xiong / gitweb
            > LU-8998 lfs: user space tools for PFL — jinshan.xiong / gitweb
            > LU-8998 docs: man pages for tools of PFL — jinshan.xiong / gitweb
            > LU-8998 tests: test scripts for PFL — jinshan.xiong / gitweb
            >
            > tag 56 test (b3565): 2017-04-23 (parallel_grouplock 100% failing)


            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26943
            Subject: LU-9429 mpi: parallel_grouplock.c group_test4 hung
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cd96ac0012419916adb473cb9e2a7e21c6f8c963

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise on this one?

            Thanks

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 9