Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7617

replay-single test_70b hangs ASSERTION( lo_header->loh_attr & LOHA_EXISTS ) in rename

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.8.0, Lustre 2.9.0
    • None
    • autotest review-dne-part-2
    • 3
    • 9223372036854775807

    Description

      replay-single test_70b hangs with the following error from the test_log

      shadow-23vm1:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY 
      shadow-23vm1:   Trace dump:
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5860:_wait_import_state()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5882:wait_import_state()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5891:wait_import_state_mount()
      shadow-23vm1:   = rpc.sh:20:main()
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1451294686 
      shadow-23vm5:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY 
      shadow-23vm5:   Trace dump:
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5860:_wait_import_state()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5882:wait_import_state()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5891:wait_import_state_mount()
      shadow-23vm5:   = rpc.sh:20:main()
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1451294686 
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com uname -n
      shadow-23vm1: CMD: shadow-23vm3 uname -n
      shadow-23vm1: Dumping lctl log to /tmp/test_logs/1451294686/rpc..*.1451296167.log
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1451294686/rpc..debug_log.\$(hostname -s).1451296167.log;
      shadow-23vm1:          dmesg > /tmp/test_logs/1451294686/rpc..dmesg.\$(hostname -s).1451296167.log
      shadow-23vm1: shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
      shadow-23vm1: shadow-23vm4: open(dump_kernel) failed: No such file or directory
      shadow-23vm5: CMD: shadow-23vm3 uname -n
      shadow-23vm5: Dumping lctl log to /tmp/test_logs/1451294686/rpc..*.1451296168.log
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1451294686/rpc..debug_log.\$(hostname -s).1451296168.log;
      shadow-23vm5:          dmesg > /tmp/test_logs/1451294686/rpc..dmesg.\$(hostname -s).1451296168.log
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 rsync -az /tmp/test_logs/1451294686/rpc..*.1451296167.log shadow-23vm1.shadow.whamcloud.com:/tmp/test_logs/1451294686
      shadow-23vm5: shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
      shadow-23vm5: shadow-23vm4: open(dump_kernel) failed: No such file or directory
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 rsync -az /tmp/test_logs/1451294686/rpc..*.1451296168.log shadow-23vm5.shadow.whamcloud.com:/tmp/test_logs/1451294686
      replay-single test_70b: @@@@@@ FAIL: import is not in FULL state 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
        = /usr/lib64/lustre/tests/test-framework.sh:6114:wait_clients_import_state()
        = /usr/lib64/lustre/tests/test-framework.sh:2621:fail()
        = /usr/lib64/lustre/tests/replay-single.sh:2088:test_70b()
        = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4947:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:2099:main()
      Dumping lctl log to /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.*.1451296171.log
      CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.debug_log.\$(hostname -s).1451296171.log;
               dmesg > /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.dmesg.\$(hostname -s).1451296171.log
      shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
      shadow-23vm4: open(dump_kernel) failed: No such file or directory
      

      From the console on MDS0, we see the following stack trace:

      02:24:52:LustreError: 8467:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
      02:24:52:LustreError: 8467:0:(lu_object.h:862:lu_object_attr()) LBUG
      02:24:52:Pid: 8467, comm: mdt00_000
      02:24:52:
      02:24:52:Call Trace:
      02:24:52: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      02:24:52: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      02:24:52: [<ffffffffa0fb3ed3>] mdd_rename+0x1713/0x1dc0 [mdd]
      02:24:52: [<ffffffffa0fa2ecc>] ? __mdd_lookup+0x28c/0x450 [mdd]
      02:24:52: [<ffffffffa0e66c45>] mdt_reint_rename_internal+0x1305/0x1a60 [mdt]
      02:24:52: [<ffffffffa075da76>] ? ldlm_lock_enqueue+0x2d6/0x970 [ptlrpc]
      02:24:52: [<ffffffffa0e6759d>] mdt_reint_rename_or_migrate+0x1fd/0x7e0 [mdt]
      02:24:52: [<ffffffffa077d7a0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      02:24:52: [<ffffffffa077f110>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
      02:24:52: [<ffffffffa0e5bdaf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
      02:24:52: [<ffffffffa0e67bb3>] mdt_reint_rename+0x13/0x20 [mdt]
      02:24:52: [<ffffffffa0e5fa5d>] mdt_reint_rec+0x5d/0x200 [mdt]
      02:24:52: [<ffffffffa0e4b84b>] mdt_reint_internal+0x62b/0xb80 [mdt]
      02:24:52: [<ffffffffa0e4c23b>] mdt_reint+0x6b/0x120 [mdt]
      02:24:52: [<ffffffffa08174cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
      02:24:52: [<ffffffffa07beb41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      02:24:52: [<ffffffffa07bdd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      02:24:52: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      02:24:52: [<ffffffff8100c28a>] child_rip+0xa/0x20
      02:24:52: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      02:24:52: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      02:24:52:
      02:24:52:Kernel panic - not syncing: LBUG
      02:24:52:Pid: 8467, comm: mdt00_000 Not tainted 2.6.32-573.8.1.el6_lustre.gbd3d354.x86_64 #1
      

      We’ve seen this failure about eight times since early October. Logs for this failure are at:
      2015-10-23 18:45:53 - https://testing.hpdd.intel.com/test_sets/f4909c1c-7a10-11e5-b618-5254006e85c2
      2015-10-24 09:06:03 - https://testing.hpdd.intel.com/test_sets/2e3667b2-7a75-11e5-b618-5254006e85c2
      2015-11-27 08:05:42 - https://testing.hpdd.intel.com/test_sets/77835c1e-9522-11e5-bdeb-5254006e85c2
      2015-11-29 04:56:08 - https://testing.hpdd.intel.com/test_sets/139e4958-96a2-11e5-aba1-5254006e85c2
      2015-12-10 13:56:36 - https://testing.hpdd.intel.com/test_sets/ccf059d4-9f7e-11e5-9f74-5254006e85c2
      2015-12-14 03:15:43 - https://testing.hpdd.intel.com/test_sets/0dd3b23a-a258-11e5-afd0-5254006e85c2
      2015-12-20 06:59:17 - https://testing.hpdd.intel.com/test_sets/5517883e-a72f-11e5-a6c1-5254006e85c2
      2015-12-28 08:43:43 - https://testing.hpdd.intel.com/test_sets/21213400-ad81-11e5-bf32-5254006e85c2

      There are several tickets open for replay-single test 70b failing with the same assertion, but the stack trace for those include mdd_migrate and were seen running racer. Two examples are LU-5163 and LU-6963.

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: