Lustre / LU-12985

sanity test_60g: Timeout occurred after 140 mins


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Severity: 3

    Description

      This issue was created by maloo for liuying <emoly.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b24183b8-0ab1-11ea-bbc3-52540065bddc

      test_60g failed with the following error:

      Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests
      

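      As the console log below shows, the test drives this failure by
      repeatedly setting fail_loc=0x8000019a through lctl. In that value,
      the high bit 0x80000000 is the libcfs CFS_FAIL_ONCE flag (the
      injection point fires a single time and then disarms, which is why
      the test re-arms it with set_param before each operation), while the
      low bits 0x19a select the injection site checked in the llog code.
      A rough, self-contained user-space model of the mechanism (a
      hypothetical stand-in, not the actual libcfs implementation):

      #include <stdio.h>

      /* Hypothetical model of the libcfs fail_loc mechanism; the real
       * code lives in libcfs_fail.h / fail.c. */
      #define FAIL_ONCE     0x80000000u  /* fire once, then disarm */
      #define FAIL_MASK_LOC 0x0000ffffu  /* low bits select the site id */

      static unsigned int fail_loc;      /* set via "lctl set_param fail_loc=..." */

      /* Return 1 if injection site `id` should fail this time. */
      static int fail_check(unsigned int id)
      {
              if ((fail_loc & FAIL_MASK_LOC) != id)
                      return 0;
              if (fail_loc & FAIL_ONCE)
                      fail_loc = 0;      /* one-shot: disarm after the first hit */
              return 1;
      }

      int main(void)
      {
              fail_loc = 0x8000019a;     /* what the test sets */
              printf("first check:  %d\n", fail_check(0x19a)); /* 1: injected */
              printf("second check: %d\n", fail_check(0x19a)); /* 0: disarmed */
              return 0;
      }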
      The dmesg log on MDS1 (https://testing.whamcloud.com/test_logs/bd780c0c-0ab1-11ea-bbc3-52540065bddc/show_text) shows:

      [ 4806.548607] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 05:13:15 (1574140395)
      [ 4806.746423] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4807.187510] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 4807.774430] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4808.259070] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 4808.472999] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4809.183645] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4809.491198] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 4809.492745] Lustre: Skipped 1 previous similar message
      [ 4809.493924] LustreError: 16969:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
      [ 4809.496232] LustreError: 16969:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
      [ 4809.880666] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4810.586256] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4810.762596] LustreError: 19797:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
      [ 4810.765173] LustreError: 19797:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
      [ 4811.290016] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 4811.828688] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 4811.829600] Lustre: Skipped 2 previous similar messages
      [ 4811.842495] LustreError: 21702:0:(llog_osd.c:248:llog_osd_read_header()) lustre-MDT0001-osp-MDT0002: error reading [0x240000402:0x3:0x0] log header size 92: rc = -14
      ...
      [ 4851.890821] Lustre: mdt00_003: service thread pid 21702 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 4851.893891] Pid: 21702, comm: mdt00_003 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
      [ 4851.895471] Call Trace:
      [ 4851.896004]  [<ffffffffa1788ce8>] call_rwsem_down_read_failed+0x18/0x30
      [ 4851.897198]  [<ffffffffc0ae01de>] llog_cat_prep_log+0x33e/0x360 [obdclass]
      [ 4851.898438]  [<ffffffffc0ae0279>] llog_cat_declare_add_rec+0x79/0x260 [obdclass]
      [ 4851.899728]  [<ffffffffc0ad7178>] llog_declare_add+0x78/0x1a0 [obdclass]
      [ 4851.900974]  [<ffffffffc0e85cde>] top_trans_start+0x17e/0x940 [ptlrpc]
      [ 4851.902438]  [<ffffffffc12f32b4>] lod_trans_start+0x34/0x40 [lod]
      [ 4851.903651]  [<ffffffffc13aef9a>] mdd_trans_start+0x1a/0x20 [mdd]
      [ 4851.904822]  [<ffffffffc13932f2>] mdd_create+0xbe2/0x1630 [mdd]
      [ 4851.905840]  [<ffffffffc121ce84>] mdt_create+0xb54/0x10e0 [mdt]
      [ 4851.906916]  [<ffffffffc121d57b>] mdt_reint_create+0x16b/0x360 [mdt]
      [ 4851.908040]  [<ffffffffc1222ab3>] mdt_reint_rec+0x83/0x210 [mdt]
      [ 4851.909115]  [<ffffffffc11fc9e0>] mdt_reint_internal+0x7b0/0xba0 [mdt]
      [ 4851.910272]  [<ffffffffc12086d7>] mdt_reint+0x67/0x140 [mdt]
      [ 4851.911292]  [<ffffffffc0e6ffda>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
      [ 4851.912527]  [<ffffffffc0e16ab6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 4851.913969]  [<ffffffffc0e1a5ec>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [ 4851.915083]  [<ffffffffa14c2e81>] kthread+0xd1/0xe0
      [ 4851.915998]  [<ffffffffa1b77c37>] ret_from_fork_nospec_end+0x0/0x39
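
      The llog cancel errors above decode to -EIO (-5), -ESTALE (-116) and
      -EFAULT (-14). The watchdog then catches mdt00_003 blocked in
      call_rwsem_down_read_failed under llog_cat_prep_log, i.e. the thread
      is waiting for the read side of a rw_semaphore that is never granted,
      which is consistent with an earlier error path leaving the write side
      held after the injected llog failure. A minimal user-space model of
      that failure mode (a hypothetical sketch using POSIX rwlocks, not the
      actual MDT code path; build with "cc demo.c -lpthread"):

      #include <errno.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <time.h>

      static pthread_rwlock_t log_sem = PTHREAD_RWLOCK_INITIALIZER;

      /* Hypothetical writer: takes the write side, hits an injected error
       * and returns early WITHOUT unlocking -- the bug class the stack
       * trace hints at. */
      static int prep_log_buggy(int inject_error)
      {
              pthread_rwlock_wrlock(&log_sem);
              if (inject_error)
                      return -5;         /* -EIO; the lock is leaked here */
              pthread_rwlock_unlock(&log_sem);
              return 0;
      }

      /* A later request thread blocks taking the read side, like
       * llog_cat_prep_log() blocking in down_read(); a bounded wait
       * stands in for the kernel's 40-second watchdog. */
      static void *later_request(void *arg)
      {
              struct timespec deadline;

              (void)arg;
              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += 2;      /* stand-in for the watchdog delay */

              if (pthread_rwlock_timedrdlock(&log_sem, &deadline) == ETIMEDOUT)
                      printf("service thread inactive: stuck in down_read()\n");
              else
                      pthread_rwlock_unlock(&log_sem);
              return NULL;
      }

      int main(void)
      {
              pthread_t t;

              prep_log_buggy(1);         /* leak the write lock */
              pthread_create(&t, NULL, later_request, NULL);
              pthread_join(t, NULL);
              return 0;
      }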
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_60g - Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests


People

    Assignee: WC Triage (wc-triage)
    Reporter: Maloo (maloo)
    Votes: 0
    Watchers: 7
