Lustre / LU-11183

sanity test 244 hangs with no information in the logs

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Lustre 2.12.0

    Description

      sanity test 244 hangs in recent testing. The last thing seen in the test_log is the hang during subtest 10 or 11:

      == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)
      35+0 records in
      35+0 records out
      36700160 bytes (37 MB) copied, 0.482129 s, 76.1 MB/s
      Starting test test10 at 1531334788
      
      

      In all cases, the stack_dump is empty, and the only things seen in the console logs and dmesg are the test starting, the node rebooting, and sanity-sec starting. There is about a one-hour gap between when the test last reports in and when the node is rebooted:

      [ 5769.663115] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 \(1531334787\)
      [ 5769.858388] Lustre: DEBUG MARKER: == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)
      
      <ConMan> Console [trevis-12vm4] disconnected from <trevis-12:6003> at 07-11 19:49.
      
      <ConMan> Console [trevis-12vm4] connected to <trevis-12:6003> at 07-11 19:49.
      

      It looks like there is no information on why this test hung.

      We have several instances of this, with logs at:
      https://testing.whamcloud.com/test_sets/fca99f92-6fcd-11e8-aa24-52540065bddc
      https://testing.whamcloud.com/test_sets/553d0058-80cd-11e8-b441-52540065bddc
      https://testing.whamcloud.com/test_sets/f52eaf70-8d67-11e8-87f3-52540065bddc

      Attachments

        Issue Links

          Activity

            [LU-11183] sanity test 244 hangs with no information in the logs

            paf0186 Patrick Farrell added a comment - Probably a dupe of LU-11128

            adilger Andreas Dilger added a comment - This looks like a duplicate of LU-11128, which has a fix about to land. I've also got an additional debugging patch that should land, in case Alex's patch doesn't fix this, and to improve debuggability beyond "deafening silence in the logs".

            master review-ldiskfs

            https://testing.whamcloud.com/test_sets/8eb7ed38-b7c4-11e8-a7de-52540065bddc

            The console log for client 1 (onyx-30vm1) contains the stack traces of all running processes, including sendfile_grouplock:

            [ 9676.429140] sendfile_groupl S ffff974e39416eb0     0 22342  22160 0x00000080
            [ 9676.429921] Call Trace:
            [ 9676.430178]  [<ffffffffb8714029>] schedule+0x29/0x70
            [ 9676.430681]  [<ffffffffb87118d4>] schedule_timeout+0x174/0x2c0
            [ 9676.431389]  [<ffffffffb80a3750>] ? internal_add_timer+0x70/0x70
            [ 9676.431985]  [<ffffffffc09c2650>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
            [ 9676.432705]  [<ffffffffc09cc3b0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
            [ 9676.433440]  [<ffffffffb80cf670>] ? wake_up_state+0x20/0x20
            [ 9676.433990]  [<ffffffffc09cc73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [ 9676.434666]  [<ffffffffc09b16a2>] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
            [ 9676.435414]  [<ffffffffc09ac7c0>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
            [ 9676.436210]  [<ffffffffc0ba2d50>] ? osc_lock_lockless_cancel+0xe0/0xe0 [osc]
            [ 9676.436929]  [<ffffffffc0ba1ad0>] ? osc_lock_upcall+0x580/0x580 [osc]
            [ 9676.437653]  [<ffffffffc0b98965>] osc_enqueue_base+0x2b5/0x6a0 [osc]
            [ 9676.438288]  [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
            [ 9676.438979]  [<ffffffffc0ba37bb>] osc_lock_enqueue+0x38b/0x840 [osc]
            [ 9676.439738]  [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
            [ 9676.440443]  [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [ 9676.441176]  [<ffffffffc0bfc285>] lov_lock_enqueue+0x95/0x150 [lov]
            [ 9676.441817]  [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [ 9676.442567]  [<ffffffffc0803327>] cl_lock_request+0x67/0x1f0 [obdclass]
            [ 9676.443227]  [<ffffffffc080721b>] cl_io_lock+0x2bb/0x3d0 [obdclass]
            [ 9676.443866]  [<ffffffffc08075ab>] cl_io_loop+0x11b/0xc70 [obdclass]
            [ 9676.444634]  [<ffffffffc0c54db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
            [ 9676.445301]  [<ffffffffc0695395>] ? cfs_trace_unlock_tcd+0x35/0x90 [libcfs]
            [ 9676.445990]  [<ffffffffb8357e01>] ? vsnprintf+0x1c1/0x6a0
            [ 9676.446644]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
            [ 9676.447271]  [<ffffffffc0c55b32>] ll_file_aio_write+0x372/0x540 [lustre]
            [ 9676.447945]  [<ffffffffc0c55da4>] ll_file_write+0xa4/0x170 [lustre]
            [ 9676.448648]  [<ffffffffb821c082>] __kernel_write+0x72/0x140
            [ 9676.449194]  [<ffffffffb824ccfe>] write_pipe_buf+0x5e/0xa0
            [ 9676.449748]  [<ffffffffb824c386>] splice_from_pipe_feed+0x86/0x130
            [ 9676.450446]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
            [ 9676.451054]  [<ffffffffb824c96e>] __splice_from_pipe+0x6e/0x90
            [ 9676.451639]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
            [ 9676.452363]  [<ffffffffb824e34e>] splice_from_pipe+0x5e/0x90
            [ 9676.452960]  [<ffffffffb824e399>] default_file_splice_write+0x19/0x30
            [ 9676.453610]  [<ffffffffb824d190>] do_splice_from+0xb0/0xf0
            [ 9676.454208]  [<ffffffffb824d1f0>] direct_splice_actor+0x20/0x30
            [ 9676.454793]  [<ffffffffb824cf27>] splice_direct_to_actor+0xd7/0x200
            [ 9676.455486]  [<ffffffffb824d1d0>] ? do_splice_from+0xf0/0xf0
            [ 9676.456039]  [<ffffffffb824d0b2>] do_splice_direct+0x62/0x90
            [ 9676.456611]  [<ffffffffb821bb38>] do_sendfile+0x1d8/0x3c0
            [ 9676.457212]  [<ffffffffb821d1ea>] SyS_sendfile64+0x9a/0xb0
            [ 9676.457764]  [<ffffffffb87206d5>] ? system_call_after_swapgs+0xa2/0x146
            [ 9676.458508]  [<ffffffffb8720795>] system_call_fastpath+0x1c/0x21
            [ 9676.459099]  [<ffffffffb87206e1>] ? system_call_after_swapgs+0xae/0x146
            utopiabound Nathaniel Clark added a comment

            adilger Andreas Dilger added a comment - 16 failures in the past 4 weeks, of 1160 test runs, or 0.13% failure rate. Definitely not one of the top 10 at this point.
            tappro Mikhail Pershin added a comment - +1 on master https://testing.whamcloud.com/test_sets/b74c176a-aacc-11e8-80f7-52540065bddc
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/3ef30c02-97ed-11e8-b0aa-52540065bddc

            colmstea Charlie Olmstead added a comment - Stack dumps were empty due to a bug in AT (ATM-1046). Are there new instances of this issue after 7/23?

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 6
