[LU-3316] ASSERTION(list_empty(&ls->ls_los_list)) failure on test suite sanity-quota / test_7c Created: 10/May/13  Updated: 09/Dec/13  Resolved: 16/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3347 (local_storage.c:872:local_oid_storag... Resolved
Severity: 3
Rank (Obsolete): 8206

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/570b8f14-b713-11e2-bd0f-52540035b04c.

The sub-test test_7c failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-quota 7c

Console log from mds:

02:42:32:Lustre: DEBUG MARKER: == sanity-quota test 7c: Quota reintegration (restart mds during reintegration) == 02:41:58 (1367919718)
02:42:32:Lustre: DEBUG MARKER: lctl get_param -n osc.*MDT*.sync_*
02:42:32:Lustre: DEBUG MARKER: lctl set_param fail_val=0
02:42:32:Lustre: DEBUG MARKER: lctl set_param fail_loc=0
02:42:32:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=none
02:42:32:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=ug
02:42:32:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
02:42:32:Lustre: DEBUG MARKER: umount -d /mnt/mds1
02:42:32:Lustre: Failing over lustre-MDT0000
02:42:32:Lustre: Skipped 1 previous similar message
02:42:32:LustreError: 21170:0:(local_storage.c:184:ls_device_put()) ASSERTION( list_empty(&ls->ls_los_list) ) failed: 
02:42:32:LustreError: 21170:0:(local_storage.c:184:ls_device_put()) LBUG
02:42:32:Pid: 21170, comm: umount
02:42:32:
02:42:32:Call Trace:
02:42:32: [<ffffffffa0590895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
02:42:32: [<ffffffffa0590e97>] lbug_with_loc+0x47/0xb0 [libcfs]
02:42:32: [<ffffffffa06e4859>] ls_device_put+0x1a9/0x1e0 [obdclass]
02:42:32: [<ffffffffa06dc6a5>] llog_osd_cleanup+0xc5/0x140 [obdclass]
02:42:32: [<ffffffffa06b772a>] __llog_ctxt_put+0xca/0x140 [obdclass]
02:42:32: [<ffffffffa06b7854>] llog_cleanup+0xb4/0x440 [obdclass]
02:42:32: [<ffffffffa06d0f31>] ? lprocfs_remove+0x31/0x40 [obdclass]
02:42:32: [<ffffffffa06d13ed>] ? lprocfs_obd_cleanup+0x5d/0xb0 [obdclass]
02:42:32: [<ffffffffa0cd7ad5>] mgs_device_fini+0x1c5/0x5a0 [mgs]
02:42:32: [<ffffffffa06f1907>] class_cleanup+0x577/0xda0 [obdclass]
02:42:32: [<ffffffffa06c6ac6>] ? class_name2dev+0x56/0xe0 [obdclass]
02:42:32: [<ffffffffa06f31ec>] class_process_config+0x10bc/0x1c80 [obdclass]
02:42:32: [<ffffffffa06eca13>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
02:42:32: [<ffffffffa06f3f29>] class_manual_cleanup+0x179/0x6f0 [obdclass]
02:42:32: [<ffffffffa06c6ac6>] ? class_name2dev+0x56/0xe0 [obdclass]
02:42:32: [<ffffffffa072961d>] server_put_super+0x46d/0xf00 [obdclass]
02:42:32: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0
02:42:32: [<ffffffff81183436>] kill_anon_super+0x16/0x60
02:42:32: [<ffffffffa06f5d86>] lustre_kill_super+0x36/0x60 [obdclass]
02:42:32: [<ffffffff81183bd7>] deactivate_super+0x57/0x80
02:42:32: [<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110
02:42:32: [<ffffffff811a26bb>] sys_umount+0x7b/0x3a0
02:42:32: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
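For context, the assertion at local_storage.c:184 fires when the last reference on a local-storage device is dropped while local_oid_storage entries are still linked on its list. A minimal sketch of that check follows; only ls_los_list and ls_device_put() appear in the trace above, so the remaining names and the simplified body are assumptions rather than the exact Lustre source:

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/mutex.h>

/* LASSERT() is Lustre's libcfs assertion macro, which prints the
 * "ASSERTION( ... ) failed" line seen in the console log above. */
struct ls_device {
        atomic_t         ls_refcount;   /* users of this local-storage device */
        struct mutex     ls_los_mutex;  /* protects ls_los_list */
        struct list_head ls_los_list;   /* registered local_oid_storage entries */
};

static void ls_device_put(struct ls_device *ls)
{
        if (!atomic_dec_and_test(&ls->ls_refcount))
                return;
        /* every local_oid_storage entry must already be off the list
         * before the last device reference is dropped */
        LASSERT(list_empty(&ls->ls_los_list));
        /* ... unlink and free the device ... */
}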


 Comments   
Comment by Nathaniel Clark [ 10/May/13 ]

I can't find another failure quite like this one, but there are others on the same test with a different ASSERTION crash:

16:28:34:Lustre: DEBUG MARKER: == sanity-quota test 7c: Quota reintegration (restart mds during reintegration) == 16:27:54 (1364858874)
16:28:34:Lustre: DEBUG MARKER: lctl get_param -n osc.*MDT*.sync_*
16:28:34:Lustre: DEBUG MARKER: lctl set_param fail_val=0
16:28:34:Lustre: DEBUG MARKER: lctl set_param fail_loc=0
16:28:34:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=none
16:28:34:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=ug
16:28:34:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
16:28:34:Lustre: DEBUG MARKER: umount -d /mnt/mds1
16:28:34:Lustre: Failing over lustre-MDT0000
16:28:34:LustreError: 3036:0:(lod_dev.c:813:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: 
16:28:34:LustreError: 3036:0:(lod_dev.c:813:lod_device_free()) LBUG
16:28:34:Pid: 3036, comm: obd_zombid
16:28:34:
16:28:34:Call Trace:
16:28:34: [<ffffffffa05bd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
16:28:34: [<ffffffffa05bde97>] lbug_with_loc+0x47/0xb0 [libcfs]
16:28:34: [<ffffffffa0e434bb>] lod_device_free+0x1eb/0x220 [lod]
16:28:34: [<ffffffffa0725e4d>] class_decref+0x46d/0x580 [obdclass]
16:28:34: [<ffffffffa0703399>] obd_zombie_impexp_cull+0x309/0x5d0 [obdclass]
16:28:34: [<ffffffffa0703725>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
16:28:34: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
16:28:34: [<ffffffffa0703660>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
16:28:34: [<ffffffff8100c0ca>] child_rip+0xa/0x20
16:28:34: [<ffffffffa0703660>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
16:28:34: [<ffffffffa0703660>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
16:28:34: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

https://maloo.whamcloud.com/test_sets/4518d960-9d2d-11e2-a280-52540035b04c
https://maloo.whamcloud.com/test_sets/acbc4838-841c-11e2-b461-52540035b04c
https://maloo.whamcloud.com/test_sets/f8326724-8189-11e2-9f6b-52540035b04c
https://maloo.whamcloud.com/test_sets/327878b4-5eb3-11e2-ba27-52540035b04c

Comment by Jodi Levi (Inactive) [ 10/May/13 ]

Mike,
Could you please comment on this one?
Thank you!

Comment by Mikhail Pershin [ 14/May/13 ]

http://review.whamcloud.com/#change,6334

ls_device_put() may be called at the wrong time if the local_oid_storage struct has not yet been removed from the device's list because of a race (see the sketch below); that is what trips the list_empty() assertion.

As for the second call trace in comment #1, it doesn't look related.
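
A rough illustration of the race described above, building on the ls_device sketch in the description. The local_oid_storage fields and the fini helper here are assumptions for illustration only, not the literal change in http://review.whamcloud.com/#change,6334:

struct local_oid_storage {
        struct list_head los_list;      /* linkage into ls->ls_los_list */
        atomic_t         los_refcount;
};

/* If unlinking the entry from ls_los_list is not serialized against the
 * final ls_device_put(), the device can reach refcount 0 while the entry
 * is still on the list, and the LASSERT(list_empty(...)) fires.
 * Re-checking the refcount under the device mutex and unlinking before
 * the put closes that window. */
static void local_oid_storage_fini(struct ls_device *ls,
                                   struct local_oid_storage *los)
{
        if (!atomic_dec_and_test(&los->los_refcount))
                return;

        mutex_lock(&ls->ls_los_mutex);
        if (atomic_read(&los->los_refcount) == 0) {
                list_del_init(&los->los_list);  /* off the list before the put */
                /* ... release objects held by los and free it ... */
        }
        mutex_unlock(&ls->ls_los_mutex);

        ls_device_put(ls);                      /* may now drop the last reference */
}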

Comment by Andreas Dilger [ 08/Jul/13 ]

Recent failure:
https://maloo.whamcloud.com/test_sets/e2388b22-e6d0-11e2-8d9a-52540035b04c

Comment by Bruno Faccini (Inactive) [ 04/Aug/13 ]

+1 at https://maloo.whamcloud.com/test_sets/715bb308-fc05-11e2-9222-52540035b04c

Comment by Bob Glossman (Inactive) [ 14/Aug/13 ]

another: https://maloo.whamcloud.com/test_sets/c52a6b34-04e6-11e3-b035-52540035b04c

This test set was on o2ib, not tcp. I wonder if that is significant.

Comment by Mikhail Pershin [ 16/Sep/13 ]

Patch was landed.

Comment by Jian Yu [ 03/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/59/
Distro/Arch: RHEL6.4/x86_64

sanity-scrub test 0 also hit this failure:
https://maloo.whamcloud.com/test_sets/28af10f4-5aed-11e3-85e2-52540035b04c

Just back-ported the patch to Lustre b2_4 branch: http://review.whamcloud.com/8461
