Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13813

conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.2
    • Lustre 2.14.0, Lustre 2.15.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for vik <vlapa@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c03460d0-489d-4fbb-84e5-d7df9a59d9dc

      test_112 failed with the following error:

      import is not in FULL state
      

      Investigation showed that MDS was not connected during start_mds test step in _wait_import_state()

      can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
      

      debug log from vm2 contains messages about import state changes.
      MDT0000_UUID: changing import state from DISCONN to CONNECTING and
      MDT0000_UUID: changing import state from CONNECTING to DISCONN

      00000100:00080000:1.0:1594563480.058266:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:1.0:1594563480.058358:0:25586:0:(pinger.c:242:ptlrpc_pinger_process_import()) 2d768d9f-aa8c-42e0-a75c-fa93c720357c->lustre-MDT0000_UUID: level DISCONN/3 force 0 force_next 0 deactive 0 pingable 1 suppress 0
      00000100:00080000:1.0:1594563480.058361:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:1.0:1594563480.059863:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN
      00000100:00080000:1.0:1594563480.059906:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN
      00000001:02000400:0.0:1594563481.248258:0:24754:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
      00000001:02000400:1.0:1594563481.249585:0:24755:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
      
      Started lustre-MDT0000
      CMD: onyx-42vm1.onyx.whamcloud.com,onyx-42vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
      onyx-42vm1: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990)
      onyx-42vm2: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990)
      onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm1: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm2: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm1: CMD: onyx-42vm1.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      onyx-42vm1: onyx-42vm1.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-42vm2: onyx-42vm2.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com lctl get_param -n at_max
      onyx-42vm2:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN 
      onyx-42vm2:   Trace dump:
      onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:6216:error()
      onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7395:_wait_import_state()
      onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7417:wait_import_state()
      onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7426:wait_import_state_mount()
      onyx-42vm2:   = rpc.sh:18:main()
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_112 - import is not in FULL state

      Attachments

        Issue Links

          Activity

            [LU-13813] conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48021/
            Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 0331d908abcb026fb450241ac8566d1a2bf425bc

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48021/ Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111 Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 0331d908abcb026fb450241ac8566d1a2bf425bc
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48022/
            Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0109cee2610b8dfeaaca25c3eb1e805e033c593d

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48022/ Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0109cee2610b8dfeaaca25c3eb1e805e033c593d

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48022
            Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 77c40a5864758698bf92f74ff419b00491d402c8

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48022 Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 77c40a5864758698bf92f74ff419b00491d402c8

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48021
            Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: f6ad9782eed05eaa9b2af29e70080652e0d03e10

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48021 Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: f6ad9782eed05eaa9b2af29e70080652e0d03e10
            yujian Jian Yu added a comment -

            The failure can be reproduced by running test 112 after test 111. It passed in review test sessions because test 111 was skipped with SLOW=no.
            I'm working on a patch to fix the test script.
             

            yujian Jian Yu added a comment - The failure can be reproduced by running test 112 after test 111. It passed in review test sessions because test 111 was skipped with SLOW=no. I'm working on a patch to fix the test script.  
            adilger Andreas Dilger added a comment - - edited

            It seems all PASS cases are with DNE mode, and all failed for single MDS.

            Should we just add an ((MDSCOUNT >= 2)) check? I don't know why this is failing only for single-MDS tests and passing for DNE, but at least we wouldn't have continuous failures.

            adilger Andreas Dilger added a comment - - edited It seems all PASS cases are with DNE mode, and all failed for single MDS. Should we just add an ((MDSCOUNT >= 2)) check? I don't know why this is failing only for single-MDS tests and passing for DNE, but at least we wouldn't have continuous failures.
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/7fb8c261-d3c5-4f43-af97-6cb1e10724ad

            It looks like we’re seeing this issue or something similar for 2.14.0-RC3 ldiskfs non-DNE testing; https://testing.whamcloud.com/test_sets/14efbbbc-a9fd-4ed7-999c-71e1f38af9a6 . If this is a new issue, I can close this ticket and open a new one.

            The client error messages look the same, but on the MDS (vm4), we see the DISCONN to CONNECTING state changes:

            00000100:00080000:0.0:1613560151.829979:0:3006202:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN
            00000100:00000001:1.0:1613560151.829981:0:3128009:0:(pinger.c:158:ptlrpc_ping()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:1.0:1613560151.829983:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=276728730 : 276728730 : 107e8b9a)
            00000020:00000001:1.0:1613560151.829984:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=18446744073561016351 : -148535265 : fffffffff725881f)
            00000100:00080000:0.0:1613560151.829984:0:3006202:0:(import.c:1440:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.2.8.108@tcp failed (-110)
            00000100:00000040:1.0:1613560151.829985:0:3128009:0:(pinger.c:322:ptlrpc_pinger_main()) next wakeup in 5 (51652)
            00000100:00000001:0.0:1613560151.829985:0:3006202:0:(import.c:1446:ptlrpc_connect_interpret()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
            00000100:00080000:1.0:1613560151.829986:0:3128009:0:(pinger.c:249:ptlrpc_pinger_process_import()) lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 0 suppress 1
            00000100:00000040:0.0:1613560151.829986:0:3006202:0:(lustre_net.h:2470:ptlrpc_rqphase_move()) @@@ move request phase from Interpret to Complete  req@00000000e533b5b3 x1691936139685568/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.8.108@tcp:28/4 lens 520/544 e 0 to 1 dl 1613560151 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:1.0'
            00000100:00000001:1.0:1613560151.829988:0:3128009:0:(recover.c:56:ptlrpc_initiate_recovery()) Process entered
            00000100:00080000:1.0:1613560151.829989:0:3128009:0:(recover.c:58:ptlrpc_initiate_recovery()) lustre-OST0000_UUID: starting recovery
            00000100:00000001:1.0:1613560151.829990:0:3128009:0:(import.c:697:ptlrpc_connect_import_locked()) Process entered
            00000100:00080000:1.0:1613560151.829990:0:3128009:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
            
            jamesanunez James Nunez (Inactive) added a comment - It looks like we’re seeing this issue or something similar for 2.14.0-RC3 ldiskfs non-DNE testing; https://testing.whamcloud.com/test_sets/14efbbbc-a9fd-4ed7-999c-71e1f38af9a6 . If this is a new issue, I can close this ticket and open a new one. The client error messages look the same, but on the MDS (vm4), we see the DISCONN to CONNECTING state changes: 00000100:00080000:0.0:1613560151.829979:0:3006202:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN 00000100:00000001:1.0:1613560151.829981:0:3128009:0:(pinger.c:158:ptlrpc_ping()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:1.0:1613560151.829983:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=276728730 : 276728730 : 107e8b9a) 00000020:00000001:1.0:1613560151.829984:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=18446744073561016351 : -148535265 : fffffffff725881f) 00000100:00080000:0.0:1613560151.829984:0:3006202:0:(import.c:1440:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.2.8.108@tcp failed (-110) 00000100:00000040:1.0:1613560151.829985:0:3128009:0:(pinger.c:322:ptlrpc_pinger_main()) next wakeup in 5 (51652) 00000100:00000001:0.0:1613560151.829985:0:3006202:0:(import.c:1446:ptlrpc_connect_interpret()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92) 00000100:00080000:1.0:1613560151.829986:0:3128009:0:(pinger.c:249:ptlrpc_pinger_process_import()) lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 0 suppress 1 00000100:00000040:0.0:1613560151.829986:0:3006202:0:(lustre_net.h:2470:ptlrpc_rqphase_move()) @@@ move request phase from Interpret to Complete req@00000000e533b5b3 x1691936139685568/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.8.108@tcp:28/4 lens 520/544 e 0 to 1 dl 1613560151 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:1.0' 00000100:00000001:1.0:1613560151.829988:0:3128009:0:(recover.c:56:ptlrpc_initiate_recovery()) Process entered 00000100:00080000:1.0:1613560151.829989:0:3128009:0:(recover.c:58:ptlrpc_initiate_recovery()) lustre-OST0000_UUID: starting recovery 00000100:00000001:1.0:1613560151.829990:0:3128009:0:(import.c:697:ptlrpc_connect_import_locked()) Process entered 00000100:00080000:1.0:1613560151.829990:0:3128009:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING

            Fixed via LU-13184.

            adilger Andreas Dilger added a comment - Fixed via LU-13184 .

            People

              yujian Jian Yu
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: