Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11556

conf-sanity test 32b crashes on MDT umount with “ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1”

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4
    • None
    • master clients with 2.11.0 or 2.10.5 servers
    • 3
    • 9223372036854775807

    Description

      conf-sanity test_32b crashes for master clients and 2.10.5 servers. Looking at the client test_log at https://testing.whamcloud.com/test_sets/b1b83d10-cedf-11e8-9238-52540065bddc, we see that the test crashes during umount of mdt1

      CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
      sed: -e expression #1, char 9: unknown option to `s'
      pdsh@onyx-37vm9: no remote hosts specified
      CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
      sed: -e expression #1, char 9: unknown option to `s'
      pdsh@onyx-37vm9: no remote hosts specified
      CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt
      CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt1
      

      Looking at the MDS (vm12) console log, we see

      [23682.075589] Lustre: DEBUG MARKER: cat /tmp/t32/list2
      [23682.368852] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23682.888492] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23682.889529] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Skipped 7 previous similar messages
      [23688.214366] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
      [23688.389563] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23688.557868] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
      [23688.885389] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
      [23690.182337] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
      [23690.182337] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      [23690.182337] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [23690.526084] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param t32fs-MDT0000.mdc.max_rpcs_in_flight='14'
      [23690.690187] Lustre: Modifying parameter t32fs-MDT0000-mdc.mdc.max_rpcs_in_flight in log t32fs-client
      [23690.691354] Lustre: Skipped 6 previous similar messages
      [23691.038484] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt
      [23691.209348] Lustre: Failing over t32fs-MDT0000
      [23691.377221] Lustre: server umount t32fs-MDT0000 complete
      [23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
      [23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
      [23692.002188] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6989360[0x0, 1, [0x240000401:0x1:0x0] hash exist]{
      [23692.002188] 
      [23692.003646] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb69893b0mdt-object@ffff89bbb6989360( , writecount=0)
      [23692.003646] 
      [23692.005001] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bb9d517870mdd-object@ffff89bb9d517870(open_count=0, valid=0, cltime=0, flags=0)
      [23692.005001] 
      [23692.006540] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbe46fd5b0lod-object@ffff89bbe46fd5b0
      [23692.006540] 
      [23692.007729] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osp@ffff89bbe218f2d0osp-object@ffff89bbe218f280
      [23692.007729] 
      [23692.008942] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6989360
      [23692.008942] 
      [23692.009948] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6843510[0x0, 1, [0x280000401:0x1:0x0] hash exist]{
      [23692.009948] 
      [23692.011273] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb6843560mdt-object@ffff89bbb6843510( , writecount=0)
      [23692.011273] 
      [23692.012624] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bbe47c2e60mdd-object@ffff89bbe47c2e60(open_count=0, valid=0, cltime=0, flags=0)
      [23692.012624] 
      [23692.014163] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbf9a19b60lod-object@ffff89bbf9a19b60
      [23692.014163] 
      [23692.015351] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osd-ldiskfs@ffff89bbf93f8400osd-ldiskfs-object@ffff89bbf93f8400(i:ffff89bbdb2b9a38:25056/936018527)[plain]
      [23692.015351] 
      [23692.017030] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6843510
      [23692.017030] 
      [23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_de[    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      

      Looking at the kernel crash log, we can see the call trace

      [23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
      [23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
      [23692.002188] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6989360[0x0, 1, [0x240000401:0x1:0x0] hash exist]{
      
      [23692.003646] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb69893b0mdt-object@ffff89bbb6989360( , writecount=0)
      
      [23692.005001] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bb9d517870mdd-object@ffff89bb9d517870(open_count=0, valid=0, cltime=0, flags=0)
      
      [23692.006540] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbe46fd5b0lod-object@ffff89bbe46fd5b0
      
      [23692.007729] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osp@ffff89bbe218f2d0osp-object@ffff89bbe218f280
      
      [23692.008942] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6989360
      
      [23692.009948] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6843510[0x0, 1, [0x280000401:0x1:0x0] hash exist]{
      
      [23692.011273] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb6843560mdt-object@ffff89bbb6843510( , writecount=0)
      
      [23692.012624] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bbe47c2e60mdd-object@ffff89bbe47c2e60(open_count=0, valid=0, cltime=0, flags=0)
      
      [23692.014163] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbf9a19b60lod-object@ffff89bbf9a19b60
      
      [23692.015351] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osd-ldiskfs@ffff89bbf93f8400osd-ldiskfs-object@ffff89bbf93f8400(i:ffff89bbdb2b9a38:25056/936018527)[plain]
      
      [23692.017030] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6843510
      
      [23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      [23692.025214] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
      [23692.025215] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) LBUG
      [23692.025217] Pid: 17697, comm: umount 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Mon Aug 27 17:48:12 UTC 2018
      [23692.025217] Call Trace:
      [23692.025249]  [<ffffffffc09287cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [23692.025254]  [<ffffffffc092887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [23692.025266]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
      [23692.025302]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
      [23692.025318]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
      [23692.025332]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      [23692.025353]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
      [23692.025376]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
      [23692.025378]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
      [23692.025393]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
      [23692.025395]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
      [23692.025397]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
      [23692.025402]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
      [23692.025404]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
      [23692.025417]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
      [23692.025426]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
      [23692.025440]  [<ffffffffaa120ad8>] int_signal+0x12/0x17
      [23692.025460]  [<ffffffffffffffff>] 0xffffffffffffffff
      [23692.025463] Kernel panic - not syncing: LBUG
      [23692.025469] CPU: 1 PID: 17697 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [23692.025469] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [23692.025470] Call Trace:
      [23692.025478]  [<ffffffffaa10e84e>] dump_stack+0x19/0x1b
      [23692.025480]  [<ffffffffaa108b50>] panic+0xe8/0x21f
      [23692.025486]  [<ffffffffc09288cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [23692.025493]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
      [23692.025508]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
      [23692.025522]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
      [23692.025529]  [<ffffffffc0933bd7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [23692.025543]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      [23692.025558]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
      [23692.025561]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
      [23692.025563]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
      [23692.025577]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
      [23692.025578]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
      [23692.025580]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
      [23692.025581]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
      [23692.025583]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
      [23692.025584]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
      [23692.025586]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
      [23692.025588]  [<ffffffffaa120ad8>] int_signal+0x12/0x17
      

      We’ve seen conf-sanity test 32b crash for 2.11.0 servers at https://testing.whamcloud.com/test_sets/fd4e6636-d44e-11e8-9238-52540065bddc

      Attachments

        Issue Links

          Activity

            [LU-11556] conf-sanity test 32b crashes on MDT umount with “ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1”
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sessions/7e1b1932-32a6-4df8-bcc2-cf205d60f7e9

            I've hit this issue three times in a row for conf-sanity test 32b while testing patch https://review.whamcloud.com/#/c/33954/ with master, future 2.13.0, clients and 2.10.x servers.

            A couple of test crashes at
            https://testing.whamcloud.com/test_sets/10949eb8-46f0-11e9-8e92-52540065bddc
            https://testing.whamcloud.com/test_sets/6984d84c-51a6-11e9-a256-52540065bddc

            jamesanunez James Nunez (Inactive) added a comment - I've hit this issue three times in a row for conf-sanity test 32b while testing patch https://review.whamcloud.com/#/c/33954/ with master, future 2.13.0, clients and 2.10.x servers. A couple of test crashes at https://testing.whamcloud.com/test_sets/10949eb8-46f0-11e9-8e92-52540065bddc https://testing.whamcloud.com/test_sets/6984d84c-51a6-11e9-a256-52540065bddc
            mdiep Minh Diep added a comment - +1 on b2_12 https://testing.whamcloud.com/test_sessions/949cecde-fa29-47f8-8b0b-7469db2b2989
            adilger Andreas Dilger added a comment - +1 on b2_10: https://testing.whamcloud.com/test_sets/f0c5be5c-2af0-11e9-9e7f-52540065bddc

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33422/
            Subject: LU-11556 tests: fix set_persistent_param_and_check breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cf9745943f6e003ef207adcab039cd472e6f3068

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33422/ Subject: LU-11556 tests: fix set_persistent_param_and_check breakage Project: fs/lustre-release Branch: master Current Patch Set: Commit: cf9745943f6e003ef207adcab039cd472e6f3068
            jhammond John Hammond added a comment -

            This is a 2.10.5 bug not a 2.12-2.10.5 interop bug. (But it may be that this 2.10.5 bug is still present in 2.12.)

            jhammond John Hammond added a comment - This is a 2.10.5 bug not a 2.12-2.10.5 interop bug. (But it may be that this 2.10.5 bug is still present in 2.12.)

            Note that the above patch is not intended to solve the problem of the crash, it is just fixing another test script problem I saw while looking at the logs. It was introduced by patch https://review.whamcloud.com/30087 "LU-7004 tests: move from lctl conf_param to lctl set_param -P" which recently landed, because that is calling "set_persistent_param_and_check $node" when it should be calling "set_persistent_param_and_check $facet"

            adilger Andreas Dilger added a comment - Note that the above patch is not intended to solve the problem of the crash, it is just fixing another test script problem I saw while looking at the logs. It was introduced by patch https://review.whamcloud.com/30087 " LU-7004 tests: move from lctl conf_param to lctl set_param -P" which recently landed, because that is calling "set_persistent_param_and_check $node" when it should be calling "set_persistent_param_and_check $facet"

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33422
            Subject: LU-11556 tests: fix set_persistent_param_and_check breakage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 319003a39aee9ec683341d79e1ab0bd681c03c2a

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33422 Subject: LU-11556 tests: fix set_persistent_param_and_check breakage Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 319003a39aee9ec683341d79e1ab0bd681c03c2a

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: