[LU-7617] replay-single test_70b hangs ASSERTION( lo_header->loh_attr & LOHA_EXISTS ) in rename Created: 29/Dec/15  Updated: 22/Dec/21  Resolved: 22/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

autotest review-dne-part-2


Issue Links:
Related
is related to LU-6963 racer test_1: lu_object_attr() ASSERT... Open
is related to LU-7145 mdd_object_type() uses need audit Open
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
is related to LU-8004 replay-single test_70b: ASSERTION( ((... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_70b hangs, with the following errors in the test log:

shadow-23vm1:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY 
shadow-23vm1:   Trace dump:
shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5860:_wait_import_state()
shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5882:wait_import_state()
shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5891:wait_import_state_mount()
shadow-23vm1:   = rpc.sh:20:main()
shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1451294686 
shadow-23vm5:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY 
shadow-23vm5:   Trace dump:
shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5860:_wait_import_state()
shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5882:wait_import_state()
shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5891:wait_import_state_mount()
shadow-23vm5:   = rpc.sh:20:main()
shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1451294686 
shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com uname -n
shadow-23vm1: CMD: shadow-23vm3 uname -n
shadow-23vm1: Dumping lctl log to /tmp/test_logs/1451294686/rpc..*.1451296167.log
shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1451294686/rpc..debug_log.\$(hostname -s).1451296167.log;
shadow-23vm1:          dmesg > /tmp/test_logs/1451294686/rpc..dmesg.\$(hostname -s).1451296167.log
shadow-23vm1: shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
shadow-23vm1: shadow-23vm4: open(dump_kernel) failed: No such file or directory
shadow-23vm5: CMD: shadow-23vm3 uname -n
shadow-23vm5: Dumping lctl log to /tmp/test_logs/1451294686/rpc..*.1451296168.log
shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1451294686/rpc..debug_log.\$(hostname -s).1451296168.log;
shadow-23vm5:          dmesg > /tmp/test_logs/1451294686/rpc..dmesg.\$(hostname -s).1451296168.log
shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 rsync -az /tmp/test_logs/1451294686/rpc..*.1451296167.log shadow-23vm1.shadow.whamcloud.com:/tmp/test_logs/1451294686
shadow-23vm5: shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
shadow-23vm5: shadow-23vm4: open(dump_kernel) failed: No such file or directory
shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 rsync -az /tmp/test_logs/1451294686/rpc..*.1451296168.log shadow-23vm5.shadow.whamcloud.com:/tmp/test_logs/1451294686
replay-single test_70b: @@@@@@ FAIL: import is not in FULL state 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
  = /usr/lib64/lustre/tests/test-framework.sh:6114:wait_clients_import_state()
  = /usr/lib64/lustre/tests/test-framework.sh:2621:fail()
  = /usr/lib64/lustre/tests/replay-single.sh:2088:test_70b()
  = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4947:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:2099:main()
Dumping lctl log to /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.*.1451296171.log
CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.debug_log.\$(hostname -s).1451296171.log;
         dmesg > /logdir/test_logs/2015-12-28/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__36516__-70048159267740-004235/replay-single.test_70b.dmesg.\$(hostname -s).1451296171.log
shadow-23vm4: open /proc/sys/lnet/dump_kernel failed: No such file or directory
shadow-23vm4: open(dump_kernel) failed: No such file or directory
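
The failure above is the test framework timing out because the client import for lustre-MDT0001 never advances past REPLAY to FULL, which is what wait_import_state_mount() polls for. As a rough manual check (the parameter name is taken from the error message above; output format is the usual "server UUID followed by import state", though details may vary by version), the import state can be read on a client with:

lctl get_param mdc.lustre-MDT0001-mdc-*.mds_server_uuid
# typically prints something like:
#   mdc.lustre-MDT0001-mdc-ffff88....mds_server_uuid=lustre-MDT0001_UUID FULL
# in these runs the state stays at REPLAY for the entire 1475 sec timeout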

From the console on MDS0, we see the following stack trace:

02:24:52:LustreError: 8467:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
02:24:52:LustreError: 8467:0:(lu_object.h:862:lu_object_attr()) LBUG
02:24:52:Pid: 8467, comm: mdt00_000
02:24:52:
02:24:52:Call Trace:
02:24:52: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
02:24:52: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
02:24:52: [<ffffffffa0fb3ed3>] mdd_rename+0x1713/0x1dc0 [mdd]
02:24:52: [<ffffffffa0fa2ecc>] ? __mdd_lookup+0x28c/0x450 [mdd]
02:24:52: [<ffffffffa0e66c45>] mdt_reint_rename_internal+0x1305/0x1a60 [mdt]
02:24:52: [<ffffffffa075da76>] ? ldlm_lock_enqueue+0x2d6/0x970 [ptlrpc]
02:24:52: [<ffffffffa0e6759d>] mdt_reint_rename_or_migrate+0x1fd/0x7e0 [mdt]
02:24:52: [<ffffffffa077d7a0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
02:24:52: [<ffffffffa077f110>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
02:24:52: [<ffffffffa0e5bdaf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
02:24:52: [<ffffffffa0e67bb3>] mdt_reint_rename+0x13/0x20 [mdt]
02:24:52: [<ffffffffa0e5fa5d>] mdt_reint_rec+0x5d/0x200 [mdt]
02:24:52: [<ffffffffa0e4b84b>] mdt_reint_internal+0x62b/0xb80 [mdt]
02:24:52: [<ffffffffa0e4c23b>] mdt_reint+0x6b/0x120 [mdt]
02:24:52: [<ffffffffa08174cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
02:24:52: [<ffffffffa07beb41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
02:24:52: [<ffffffffa07bdd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
02:24:52: [<ffffffff810a0fce>] kthread+0x9e/0xc0
02:24:52: [<ffffffff8100c28a>] child_rip+0xa/0x20
02:24:52: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
02:24:52: [<ffffffff8100c280>] ? child_rip+0x0/0x20
02:24:52:
02:24:52:Kernel panic - not syncing: LBUG
02:24:52:Pid: 8467, comm: mdt00_000 Not tainted 2.6.32-573.8.1.el6_lustre.gbd3d354.x86_64 #1
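
The assertion fires in lu_object_attr() (lu_object.h:862), reached from mdd_rename(): the check requires that the object header already carries LOHA_EXISTS, i.e. that the object is known to exist on disk, so mdd_rename() is apparently reading attributes of an object that was never marked as existing. A minimal paraphrase of the check behind the LBUG (a sketch only; the exact code varies by Lustre version) is:

/* Sketch of the failing check in lu_object.h (paraphrased, not verbatim):
 * attributes may only be read from an object whose header has LOHA_EXISTS
 * set, i.e. the object was found on disk. */
static inline __u32 lu_object_attr(const struct lu_object *o)
{
        /* LBUGs here when the caller holds an object never marked existing */
        LASSERT((o->lo_header->loh_attr & LOHA_EXISTS) != 0);
        return o->lo_header->loh_attr;
}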

We’ve seen this failure about eight times since early October. Logs for these failures are at:
2015-10-23 18:45:53 - https://testing.hpdd.intel.com/test_sets/f4909c1c-7a10-11e5-b618-5254006e85c2
2015-10-24 09:06:03 - https://testing.hpdd.intel.com/test_sets/2e3667b2-7a75-11e5-b618-5254006e85c2
2015-11-27 08:05:42 - https://testing.hpdd.intel.com/test_sets/77835c1e-9522-11e5-bdeb-5254006e85c2
2015-11-29 04:56:08 - https://testing.hpdd.intel.com/test_sets/139e4958-96a2-11e5-aba1-5254006e85c2
2015-12-10 13:56:36 - https://testing.hpdd.intel.com/test_sets/ccf059d4-9f7e-11e5-9f74-5254006e85c2
2015-12-14 03:15:43 - https://testing.hpdd.intel.com/test_sets/0dd3b23a-a258-11e5-afd0-5254006e85c2
2015-12-20 06:59:17 - https://testing.hpdd.intel.com/test_sets/5517883e-a72f-11e5-a6c1-5254006e85c2
2015-12-28 08:43:43 - https://testing.hpdd.intel.com/test_sets/21213400-ad81-11e5-bf32-5254006e85c2

There are several tickets open for replay-single test 70b failing with the same assertion, but the stack traces for those include mdd_migrate and the failures were seen while running racer. Two examples are LU-5163 and LU-6963.



 Comments   
Comment by James Nunez (Inactive) [ 31/Dec/15 ]

More failures on master:
2015-12-30 04:13:47 - https://testing.hpdd.intel.com/test_sets/dc7e85aa-aeed-11e5-9134-5254006e85c2
2016-01-25 21:46:23 - https://testing.hpdd.intel.com/test_sets/96d1b54c-c408-11e5-8866-5254006e85c2
2016-02-02 10:14:26 - https://testing.hpdd.intel.com/test_sets/380e4988-c9d7-11e5-9e6a-5254006e85c2

Comment by Andreas Dilger [ 04/Jan/16 ]

There are a number of different failures for replay-single test 70b, which may be related if there is some kind of random memory corruption during this test: LU-7309, LU-7298, LU-7265, LU-6935, LU-6919, LU-6844.
