Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.15.0
    • None
    • 3
    • 9223372036854775807

    Description

      https://review.whamcloud.com/#/c/43964/ seems to cause a crash in ll_migrate similar to what was reported in now landed LU-13157

      very reproducable on recent master-next

       
      [  628.058780] LustreError: 25573:0:(ldlm_resource.c:1124:ldlm_resource_complain()) Skipped 23 previous similar messages
      [  628.489874] LustreError: 14648:0:(osp_sync.c:1094:osp_sync_process_committed()) lustre-OST0003-osc-MDT0001: can't cancel 279 records: rc = -30
      [  628.599298] LustreError: 14648:0:(osp_sync.c:1094:osp_sync_process_committed()) Skipped 15 previous similar messages
      [  628.701810] Lustre: lustre-OST0000-osc-ffff88029a0e8008: Connection restored to 192.168.123.100@tcp (at 0@lo)
      [  629.004000] LustreError: 14050:0:(osp_sync.c:1079:osp_sync_process_committed()) lustre-OST0001-osc-MDT0001: can't cancel record: rc = -30
      [  629.035251] LustreError: 14050:0:(osp_sync.c:1079:osp_sync_process_committed()) Skipped 5 previous similar messages
      [  630.287812] LustreError: 25548:0:(llite_lib.c:1836:ll_md_setattr()) md_setattr fails: rc = -30
      [  630.894353] Lustre: mdt07_002: service thread pid 11113 was inactive for 66.027 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.[  630.930762] Lustre: Skipped 6 previous similar messages
      [  631.237306] LustreError: 11-0: lustre-MDT0001-mdc-ffff88009e7fb7e8: operation ldlm_enqueue to node 0@lo failed: rc = -30
      [  631.396604] LustreError: 25697:0:(llite_lib.c:1836:ll_md_setattr()) md_setattr fails: rc = -30
      [  631.768361] LustreError: 25697:0:(llite_lib.c:1836:ll_md_setattr()) Skipped 1 previous similar message[  632.634314] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  632.658290] IP: [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
      [  632.660785] PGD 800000028b3a6067 PUD 28b3a7067 PMD 0 
      [  632.662892] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [  632.665067] Modules linked in: loop zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) dm_flakey dm_mod libcfs(OE) crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon virtio_console pcspkr i2c_piix4 ip_tables rpcsec_gss_krb5 drm_kms_helper ttm ata_generic pata_acpi drm crct10dif_pclmul crct10dif_common crc32c_intel drm_panel_orientation_quirks ata_piix serio_raw virtio_blk i2c_core libata floppy[  632.698829] CPU: 6 PID: 25076 Comm: lfs Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.9-debug #2
      [  632.701212] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  632.702454] task: ffff8802912324f0 ti: ffff88024ad2c000 task.ti: ffff88024ad2c000
      [  632.705158] RIP: 0010:[<ffffffffa1051e52>]  [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
      [  632.708792] RSP: 0018:ffff88024ad2fbc8  EFLAGS: 00010206[  632.710770] RAX: 0000000000000000 RBX: ffff880253c01458 RCX: 0000000000000000
      [  632.713070] RDX: 0000000000000000 RSI: ffff880327331138 RDI: ffff880327331118
      [  632.715401] RBP: ffff88024ad2fc48 R08: ffff8802668e2058 R09: 0000000000000001
      [  632.717769] R10: 0000000000000000 R11: ffff88024ad2f5e6 R12: 0000000000000000
      [  632.720049] R13: ffff8800848d48e8 R14: ffff88029b7166d8 R15: 0000000000000030
      [  632.722352] FS:  00007fa279316740(0000) GS:ffff880331b80000(0000) knlGS:0000000000000000
      [  632.781764] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  632.784011] CR2: 0000000000000008 CR3: 0000000266662000 CR4: 00000000001607e0
      [  632.786316] Call Trace:
      [  632.788315]  [<ffffffff81242213>] ? __check_object_size+0x1c3/0x220
      [  632.790913]  [<ffffffffa103f561>] ll_dir_ioctl+0x5d01/0x6ed0 [lustre]
      [  632.793261]  [<ffffffff81411979>] ? do_raw_spin_unlock+0x49/0x90
      [  632.795476]  [<ffffffff8115260f>] ? delayacct_end+0x8f/0xb0
      [  632.817827]  [<ffffffff81152744>] ? __delayacct_blkio_end+0x34/0x60
      [  632.820259]  [<ffffffff817e0257>] ? io_schedule_timeout+0xe7/0x130
      [  632.822620]  [<ffffffff811b62dd>] ? find_get_pages_tag+0x10d/0x260
      [  632.824759]  [<ffffffff811c3691>] ? pagevec_lookup_tag+0x21/0x30
      [  632.827037]  [<ffffffff811b400e>] ? __filemap_fdatawait_range+0xbe/0x1a0
      [  632.830403]  [<ffffffff8125b3fd>] do_vfs_ioctl+0x40d/0x6c0
      [  632.833084]  [<ffffffff81264d2b>] ? iput+0x3b/0x180
      [  632.835354]  [<ffffffff8125b751>] SyS_ioctl+0xa1/0xc0
      [  632.838391]  [<ffffffff817ee00c>] system_call_fastpath+0x1f/0x24[  632.840775] Code: 03 49 89 45 50 48 8b 44 24 38 49 89 85 38 01 00 00 48 8b 43 20 41 81 8d 30 01 00 00 00 20 00 00 49 89 85 40 01 00 00 48 8b 43 18 <48> 8b 78 08 48 83 c7 18 e8 b1 13 79 e0 48 8b 43 18 48 8b 40 08 [  632.884740] RIP  [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
      [  632.891978]  RSP <ffff88024ad2fbc8>
      [  632.912110] CR2: 0000000000000008

      Attachments

        Issue Links

          Activity

            [LU-14988] crash in ll_migrate in racer

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52169
            Subject: LU-14988 tests: enable overstripe in racer
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c9069855175d5807f63db93ada1ecd5760722b39

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52169 Subject: LU-14988 tests: enable overstripe in racer Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c9069855175d5807f63db93ada1ecd5760722b39

            The next step for this issue would be to push a patch that changes:

            -RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-false}
            +RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-true}
            

            and then add enough "Test-Parameters: testlist=racer env=DURATION=900" runs to determine if the overstriping is still causing new failures or not. In the past 4 weeks on master there have been 479 racer runs, with 4 FAIL, 8 CRASH, 1 TIMEOUT:

            https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=2332800&status%5B%5D=FAIL&status%5B%5D=TIMEOUT&status%5B%5D=CRASH&test_set_script_id=9ab2e9f8-6aec-11e0-b32b-52540025f9af&sub_test_script_id=12c874f8-c308-11e0-9a20-52540025f9af&source=sub_tests#redirect

            So that is about 1/37 ~= 2.7% failure rate, so running a handful of racer runs would tell us first if this crash is still present/common, and then about 100 total runs would tell us if this induces more failures, or if it can be re-enabled by default (presumably bug has since been fixed).

            adilger Andreas Dilger added a comment - The next step for this issue would be to push a patch that changes: -RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-false} +RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-true} and then add enough " Test-Parameters: testlist=racer env=DURATION=900 " runs to determine if the overstriping is still causing new failures or not. In the past 4 weeks on master there have been 479 racer runs, with 4 FAIL, 8 CRASH, 1 TIMEOUT: https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=2332800&status%5B%5D=FAIL&status%5B%5D=TIMEOUT&status%5B%5D=CRASH&test_set_script_id=9ab2e9f8-6aec-11e0-b32b-52540025f9af&sub_test_script_id=12c874f8-c308-11e0-9a20-52540025f9af&source=sub_tests#redirect So that is about 1/37 ~= 2.7% failure rate, so running a handful of racer runs would tell us first if this crash is still present/common, and then about 100 total runs would tell us if this induces more failures, or if it can be re-enabled by default (presumably bug has since been fixed).

            Thank you, Andreas.
            Patch is updated.

            egryaznova Elena Gryaznova added a comment - Thank you, Andreas. Patch is updated.

            Definitely yes. That will allow the functionality to be landed, without it being enabled.

            adilger Andreas Dilger added a comment - Definitely yes. That will allow the functionality to be landed, without it being enabled.
            egryaznova Elena Gryaznova added a comment - - edited

            adilger,
            will the following change help to land the patch ?

            --- a/lustre/tests/racer.sh
            +++ b/lustre/tests/racer.sh
            @@ -70,7 +70,8 @@ RACER_ENABLE_PFL=${RACER_ENABLE_PFL:-true}
             RACER_ENABLE_DOM=${RACER_ENABLE_DOM:-true}
             RACER_ENABLE_FLR=${RACER_ENABLE_FLR:-true}
             RACER_ENABLE_SEL=${RACER_ENABLE_SEL:-true}
            -RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-true}
            +# set false, LU-14988
            +RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-false}
            
            egryaznova Elena Gryaznova added a comment - - edited adilger , will the following change help to land the patch ? --- a/lustre/tests/racer.sh +++ b/lustre/tests/racer.sh @@ -70,7 +70,8 @@ RACER_ENABLE_PFL=${RACER_ENABLE_PFL:- true } RACER_ENABLE_DOM=${RACER_ENABLE_DOM:- true } RACER_ENABLE_FLR=${RACER_ENABLE_FLR:- true } RACER_ENABLE_SEL=${RACER_ENABLE_SEL:- true } -RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:- true } +# set false , LU-14988 +RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:- false }

            Cory,
            it's true that the new test has brought out this new issue, and there is nothing particularly "wrong" with the patch itself, but I disagree that this implies the patch should land immediately. If that were to happen, then all or many racer test runs would trigger this same assertion, and we'd be flooded with a sea of test failures, potentially hiding other regressions that may be introduced in the meantime because it is hard to distinguish the existing "FAIL" or "TIMEOUT" from some new "FAIL", even if the root causes are different.

            Rather than landing the new test patch as-is, the new defect that the new test patch found needs to be fixed, verify that the new test is now passing, and then land both of them.

            adilger Andreas Dilger added a comment - Cory, it's true that the new test has brought out this new issue, and there is nothing particularly "wrong" with the patch itself, but I disagree that this implies the patch should land immediately. If that were to happen, then all or many racer test runs would trigger this same assertion, and we'd be flooded with a sea of test failures, potentially hiding other regressions that may be introduced in the meantime because it is hard to distinguish the existing "FAIL" or "TIMEOUT" from some new "FAIL", even if the root causes are different. Rather than landing the new test patch as-is, the new defect that the new test patch found needs to be fixed, verify that the new test is now passing, and then land both of them.
            spitzcor Cory Spitz added a comment -

            jamesanunez, what's going on with https://review.whamcloud.com/#/c/44992/ ? Any news? I think LU-14754 has been waiting on this to shake out, but there is nothing wrong with that test patch. If that test is bringing out this issue it is all the more reason to land it, don't you think?

            spitzcor Cory Spitz added a comment - jamesanunez , what's going on with https://review.whamcloud.com/#/c/44992/ ? Any news? I think LU-14754 has been waiting on this to shake out, but there is nothing wrong with that test patch. If that test is bringing out this issue it is all the more reason to land it, don't you think?
            gerrit Gerrit Updater added a comment - - edited

            "James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44992
            Subject: LU-14988 tests: racer overstripe crash
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 720a1c5dfa58236b7c3ca8f57c6f7d4862d16284

            gerrit Gerrit Updater added a comment - - edited "James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44992 Subject: LU-14988 tests: racer overstripe crash Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 720a1c5dfa58236b7c3ca8f57c6f7d4862d16284

            People

              wc-triage WC Triage
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: