[LU-11556] conf-sanity test 32b crashes on MDT umount with “ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1” Created: 22/Oct/18  Updated: 02/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

master clients with 2.11.0 or 2.10.5 servers


Issue Links:
Related
is related to LU-7872 conf-sanity: test_50i 'test failed to... Open
is related to LU-11558 t32_verify_quota() not being called b... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_32b crashes for master clients with 2.10.5 servers. Looking at the client test_log at https://testing.whamcloud.com/test_sets/b1b83d10-cedf-11e8-9238-52540065bddc, we see that the test crashes during the umount of mdt1:

CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
sed: -e expression #1, char 9: unknown option to `s'
pdsh@onyx-37vm9: no remote hosts specified
CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
sed: -e expression #1, char 9: unknown option to `s'
pdsh@onyx-37vm9: no remote hosts specified
CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt
CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt1

Looking at the MDS (vm12) console log, we see:

[23682.075589] Lustre: DEBUG MARKER: cat /tmp/t32/list2
[23682.368852] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
[23682.888492] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
[23682.889529] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Skipped 7 previous similar messages
[23688.214366] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
[23688.389563] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
[23688.557868] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
[23688.885389] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
[23690.182337] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
[23690.182337] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
[23690.182337] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[23690.526084] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param t32fs-MDT0000.mdc.max_rpcs_in_flight='14'
[23690.690187] Lustre: Modifying parameter t32fs-MDT0000-mdc.mdc.max_rpcs_in_flight in log t32fs-client
[23690.691354] Lustre: Skipped 6 previous similar messages
[23691.038484] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt
[23691.209348] Lustre: Failing over t32fs-MDT0000
[23691.377221] Lustre: server umount t32fs-MDT0000 complete
[23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
[23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
[23692.002188] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6989360[0x0, 1, [0x240000401:0x1:0x0] hash exist]{
[23692.002188] 
[23692.003646] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb69893b0mdt-object@ffff89bbb6989360( , writecount=0)
[23692.003646] 
[23692.005001] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bb9d517870mdd-object@ffff89bb9d517870(open_count=0, valid=0, cltime=0, flags=0)
[23692.005001] 
[23692.006540] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbe46fd5b0lod-object@ffff89bbe46fd5b0
[23692.006540] 
[23692.007729] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osp@ffff89bbe218f2d0osp-object@ffff89bbe218f280
[23692.007729] 
[23692.008942] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6989360
[23692.008942] 
[23692.009948] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6843510[0x0, 1, [0x280000401:0x1:0x0] hash exist]{
[23692.009948] 
[23692.011273] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb6843560mdt-object@ffff89bbb6843510( , writecount=0)
[23692.011273] 
[23692.012624] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bbe47c2e60mdd-object@ffff89bbe47c2e60(open_count=0, valid=0, cltime=0, flags=0)
[23692.012624] 
[23692.014163] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbf9a19b60lod-object@ffff89bbf9a19b60
[23692.014163] 
[23692.015351] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osd-ldiskfs@ffff89bbf93f8400osd-ldiskfs-object@ffff89bbf93f8400(i:ffff89bbdb2b9a38:25056/936018527)[plain]
[23692.015351] 
[23692.017030] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6843510
[23692.017030] 
[23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_de[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu

Looking at the kernel crash log, we can see the call trace:

[23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
[23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
[23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[23692.025214] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
[23692.025215] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) LBUG
[23692.025217] Pid: 17697, comm: umount 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Mon Aug 27 17:48:12 UTC 2018
[23692.025217] Call Trace:
[23692.025249]  [<ffffffffc09287cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[23692.025254]  [<ffffffffc092887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[23692.025266]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
[23692.025302]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
[23692.025318]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
[23692.025332]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[23692.025353]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
[23692.025376]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
[23692.025378]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
[23692.025393]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
[23692.025395]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
[23692.025397]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
[23692.025402]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
[23692.025404]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
[23692.025417]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
[23692.025426]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
[23692.025440]  [<ffffffffaa120ad8>] int_signal+0x12/0x17
[23692.025460]  [<ffffffffffffffff>] 0xffffffffffffffff
[23692.025463] Kernel panic - not syncing: LBUG
[23692.025469] CPU: 1 PID: 17697 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
[23692.025469] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[23692.025470] Call Trace:
[23692.025478]  [<ffffffffaa10e84e>] dump_stack+0x19/0x1b
[23692.025480]  [<ffffffffaa108b50>] panic+0xe8/0x21f
[23692.025486]  [<ffffffffc09288cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[23692.025493]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
[23692.025508]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
[23692.025522]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
[23692.025529]  [<ffffffffc0933bd7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[23692.025543]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[23692.025558]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
[23692.025561]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
[23692.025563]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
[23692.025577]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
[23692.025578]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
[23692.025580]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
[23692.025581]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
[23692.025583]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
[23692.025584]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
[23692.025586]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
[23692.025588]  [<ffffffffaa120ad8>] int_signal+0x12/0x17

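The LBUG fires in lu_device_fini(), which requires the device refcount to have dropped to zero before teardown; here a cached OSP object is apparently still holding a reference when the MDT is unmounted. A minimal bash sketch of that refcount discipline (the lu_device_* functions below are illustrative stand-ins for the kernel helpers, not real tools):

```shell
#!/bin/bash
# Sketch of the refcount check behind the assertion; names mirror the
# kernel helpers but this is only an illustration.
ld_ref=0

lu_device_get() { ld_ref=$((ld_ref + 1)); }   # taken when an object pins the device
lu_device_put() { ld_ref=$((ld_ref - 1)); }   # dropped when the object is released

lu_device_fini() {
    # The kernel version LBUGs (panics) on failure; we just report and fail.
    if [ "$ld_ref" -ne 0 ]; then
        echo "ASSERTION( ld_ref == 0 ) failed: Refcount is $ld_ref" >&2
        return 1
    fi
    echo "fini ok"
}

lu_device_get          # reference taken, e.g. by a cached OSP object
lu_device_put          # in the crash, a put like this is missed before umount
lu_device_fini
```

With the missing put, the balance never returns to zero and the fini check trips, which matches the "Refcount is 1" message in the trace above.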
We’ve seen conf-sanity test 32b crash for 2.11.0 servers as well, at https://testing.whamcloud.com/test_sets/fd4e6636-d44e-11e8-9238-52540065bddc.



 Comments   
Comment by Gerrit Updater [ 22/Oct/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33422
Subject: LU-11556 tests: fix set_persistent_param_and_check breakage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 319003a39aee9ec683341d79e1ab0bd681c03c2a

Comment by Andreas Dilger [ 22/Oct/18 ]

Note that the above patch is not intended to solve the crash; it just fixes another test-script problem I saw while looking at the logs. The breakage was introduced by patch https://review.whamcloud.com/30087 "LU-7004 tests: move from lctl conf_param to lctl set_param -P", which recently landed: it calls "set_persistent_param_and_check $node" when it should be calling "set_persistent_param_and_check $facet".
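The distinction matters because the helper resolves the facet name to a host internally. The stubs below are simplified stand-ins for the test-framework functions (signatures and the facet-to-host mapping are approximations, not verbatim copies of lustre/tests/test-framework.sh):

```shell
#!/bin/bash
# Illustrative stand-ins for test-framework helpers, showing why passing
# a node name where a facet is expected breaks the remote invocation.

# Maps a facet name (mds1, ost1, ...) to the node currently serving it.
facet_active_host() {
    case $1 in
        mds1) echo onyx-37vm12 ;;      # hypothetical facet->host mapping
        *)    echo "" ;;               # a raw hostname is not a known facet
    esac
}

set_persistent_param_and_check() {
    local facet=$1 param=$2
    local host
    host=$(facet_active_host "$facet")
    # Passing $node instead of $facet makes this lookup come back empty,
    # which is where pdsh's "no remote hosts specified" comes from.
    if [ -z "$host" ]; then
        echo "pdsh: no remote hosts specified" >&2
        return 1
    fi
    echo "pdsh -w $host lctl set_param -P $param=14"
}

set_persistent_param_and_check mds1 "mdc.t32fs-MDT0000*.max_rpcs_in_flight"          # correct: facet
set_persistent_param_and_check onyx-37vm12 "mdc.t32fs-MDT0000*.max_rpcs_in_flight" \
    || echo "buggy call with a node name failed as in the logs"
```

This reproduces the shape of the "pdsh@onyx-37vm9: no remote hosts specified" errors seen in the client test_log above.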

Comment by John Hammond [ 29/Oct/18 ]

This is a 2.10.5 bug, not a 2.12-to-2.10.5 interop bug. (But it may be that this 2.10.5 bug is still present in 2.12.)

Comment by Gerrit Updater [ 02/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33422/
Subject: LU-11556 tests: fix set_persistent_param_and_check breakage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf9745943f6e003ef207adcab039cd472e6f3068

Comment by Andreas Dilger [ 07/Feb/19 ]

+1 on b2_10: https://testing.whamcloud.com/test_sets/f0c5be5c-2af0-11e9-9e7f-52540065bddc

Comment by Minh Diep [ 22/Mar/19 ]

+1 on b2_12 https://testing.whamcloud.com/test_sessions/949cecde-fa29-47f8-8b0b-7469db2b2989

Comment by James Nunez (Inactive) [ 28/Mar/19 ]

I've hit this issue three times in a row for conf-sanity test 32b while testing patch https://review.whamcloud.com/#/c/33954/ with master (future 2.13.0) clients and 2.10.x servers.

A couple of the test crashes:
https://testing.whamcloud.com/test_sets/10949eb8-46f0-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/6984d84c-51a6-11e9-a256-52540065bddc

Comment by Chris Horn [ 07/May/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/7e1b1932-32a6-4df8-bcc2-cf205d60f7e9

Generated at Sat Feb 10 02:44:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.