[LU-14988] crash in ll_migrate in racer Created: 07/Sep/21  Updated: 29/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13157 migrate symlink with target name leng... Resolved
is related to LU-17048 Crash in lod_declare_update_extents Resolved
is related to LU-14754 add Overstripe support to racer Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://review.whamcloud.com/#/c/43964/ seems to cause a crash in ll_migrate similar to what was reported in now landed LU-13157

very reproducable on recent master-next

 
[  628.058780] LustreError: 25573:0:(ldlm_resource.c:1124:ldlm_resource_complain()) Skipped 23 previous similar messages
[  628.489874] LustreError: 14648:0:(osp_sync.c:1094:osp_sync_process_committed()) lustre-OST0003-osc-MDT0001: can't cancel 279 records: rc = -30
[  628.599298] LustreError: 14648:0:(osp_sync.c:1094:osp_sync_process_committed()) Skipped 15 previous similar messages
[  628.701810] Lustre: lustre-OST0000-osc-ffff88029a0e8008: Connection restored to 192.168.123.100@tcp (at 0@lo)
[  629.004000] LustreError: 14050:0:(osp_sync.c:1079:osp_sync_process_committed()) lustre-OST0001-osc-MDT0001: can't cancel record: rc = -30
[  629.035251] LustreError: 14050:0:(osp_sync.c:1079:osp_sync_process_committed()) Skipped 5 previous similar messages
[  630.287812] LustreError: 25548:0:(llite_lib.c:1836:ll_md_setattr()) md_setattr fails: rc = -30
[  630.894353] Lustre: mdt07_002: service thread pid 11113 was inactive for 66.027 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.[  630.930762] Lustre: Skipped 6 previous similar messages
[  631.237306] LustreError: 11-0: lustre-MDT0001-mdc-ffff88009e7fb7e8: operation ldlm_enqueue to node 0@lo failed: rc = -30
[  631.396604] LustreError: 25697:0:(llite_lib.c:1836:ll_md_setattr()) md_setattr fails: rc = -30
[  631.768361] LustreError: 25697:0:(llite_lib.c:1836:ll_md_setattr()) Skipped 1 previous similar message[  632.634314] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  632.658290] IP: [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
[  632.660785] PGD 800000028b3a6067 PUD 28b3a7067 PMD 0 
[  632.662892] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  632.665067] Modules linked in: loop zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) dm_flakey dm_mod libcfs(OE) crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon virtio_console pcspkr i2c_piix4 ip_tables rpcsec_gss_krb5 drm_kms_helper ttm ata_generic pata_acpi drm crct10dif_pclmul crct10dif_common crc32c_intel drm_panel_orientation_quirks ata_piix serio_raw virtio_blk i2c_core libata floppy[  632.698829] CPU: 6 PID: 25076 Comm: lfs Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.9-debug #2
[  632.701212] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  632.702454] task: ffff8802912324f0 ti: ffff88024ad2c000 task.ti: ffff88024ad2c000
[  632.705158] RIP: 0010:[<ffffffffa1051e52>]  [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
[  632.708792] RSP: 0018:ffff88024ad2fbc8  EFLAGS: 00010206[  632.710770] RAX: 0000000000000000 RBX: ffff880253c01458 RCX: 0000000000000000
[  632.713070] RDX: 0000000000000000 RSI: ffff880327331138 RDI: ffff880327331118
[  632.715401] RBP: ffff88024ad2fc48 R08: ffff8802668e2058 R09: 0000000000000001
[  632.717769] R10: 0000000000000000 R11: ffff88024ad2f5e6 R12: 0000000000000000
[  632.720049] R13: ffff8800848d48e8 R14: ffff88029b7166d8 R15: 0000000000000030
[  632.722352] FS:  00007fa279316740(0000) GS:ffff880331b80000(0000) knlGS:0000000000000000
[  632.781764] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  632.784011] CR2: 0000000000000008 CR3: 0000000266662000 CR4: 00000000001607e0
[  632.786316] Call Trace:
[  632.788315]  [<ffffffff81242213>] ? __check_object_size+0x1c3/0x220
[  632.790913]  [<ffffffffa103f561>] ll_dir_ioctl+0x5d01/0x6ed0 [lustre]
[  632.793261]  [<ffffffff81411979>] ? do_raw_spin_unlock+0x49/0x90
[  632.795476]  [<ffffffff8115260f>] ? delayacct_end+0x8f/0xb0
[  632.817827]  [<ffffffff81152744>] ? __delayacct_blkio_end+0x34/0x60
[  632.820259]  [<ffffffff817e0257>] ? io_schedule_timeout+0xe7/0x130
[  632.822620]  [<ffffffff811b62dd>] ? find_get_pages_tag+0x10d/0x260
[  632.824759]  [<ffffffff811c3691>] ? pagevec_lookup_tag+0x21/0x30
[  632.827037]  [<ffffffff811b400e>] ? __filemap_fdatawait_range+0xbe/0x1a0
[  632.830403]  [<ffffffff8125b3fd>] do_vfs_ioctl+0x40d/0x6c0
[  632.833084]  [<ffffffff81264d2b>] ? iput+0x3b/0x180
[  632.835354]  [<ffffffff8125b751>] SyS_ioctl+0xa1/0xc0
[  632.838391]  [<ffffffff817ee00c>] system_call_fastpath+0x1f/0x24[  632.840775] Code: 03 49 89 45 50 48 8b 44 24 38 49 89 85 38 01 00 00 48 8b 43 20 41 81 8d 30 01 00 00 00 20 00 00 49 89 85 40 01 00 00 48 8b 43 18 <48> 8b 78 08 48 83 c7 18 e8 b1 13 79 e0 48 8b 43 18 48 8b 40 08 [  632.884740] RIP  [<ffffffffa1051e52>] ll_migrate+0x9b2/0xec0 [lustre]
[  632.891978]  RSP <ffff88024ad2fbc8>
[  632.912110] CR2: 0000000000000008


 Comments   
Comment by Gerrit Updater [ 20/Sep/21 ]

"James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44992
Subject: LU-14988 tests: racer overstripe crash
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 720a1c5dfa58236b7c3ca8f57c6f7d4862d16284

Comment by Cory Spitz [ 12/Oct/21 ]

jamesanunez, what's going on with https://review.whamcloud.com/#/c/44992/ ? Any news? I think LU-14754 has been waiting on this to shake out, but there is nothing wrong with that test patch. If that test is bringing out this issue it is all the more reason to land it, don't you think?

Comment by Andreas Dilger [ 13/Oct/21 ]

Cory,
it's true that the new test has brought out this new issue, and there is nothing particularly "wrong" with the patch itself, but I disagree that this implies the patch should land immediately. If that were to happen, then all or many racer test runs would trigger this same assertion, and we'd be flooded with a sea of test failures, potentially hiding other regressions that may be introduced in the meantime because it is hard to distinguish the existing "FAIL" or "TIMEOUT" from some new "FAIL", even if the root causes are different.

Rather than landing the new test patch as-is, the new defect that the new test patch found needs to be fixed, verify that the new test is now passing, and then land both of them.

Comment by Elena Gryaznova [ 15/Oct/21 ]

adilger,
will the following change help to land the patch ?

--- a/lustre/tests/racer.sh
+++ b/lustre/tests/racer.sh
@@ -70,7 +70,8 @@ RACER_ENABLE_PFL=${RACER_ENABLE_PFL:-true}
 RACER_ENABLE_DOM=${RACER_ENABLE_DOM:-true}
 RACER_ENABLE_FLR=${RACER_ENABLE_FLR:-true}
 RACER_ENABLE_SEL=${RACER_ENABLE_SEL:-true}
-RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-true}
+# set false, LU-14988
+RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-false}
Comment by Andreas Dilger [ 16/Oct/21 ]

Definitely yes. That will allow the functionality to be landed, without it being enabled.

Comment by Elena Gryaznova [ 16/Oct/21 ]

Thank you, Andreas.
Patch is updated.

Comment by Andreas Dilger [ 14/Apr/23 ]

The next step for this issue would be to push a patch that changes:

-RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-false}
+RACER_ENABLE_OVERSTRIPE=${RACER_ENABLE_OVERSTRIPE:-true}

and then add enough "Test-Parameters: testlist=racer env=DURATION=900" runs to determine if the overstriping is still causing new failures or not. In the past 4 weeks on master there have been 479 racer runs, with 4 FAIL, 8 CRASH, 1 TIMEOUT:

https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=2332800&status%5B%5D=FAIL&status%5B%5D=TIMEOUT&status%5B%5D=CRASH&test_set_script_id=9ab2e9f8-6aec-11e0-b32b-52540025f9af&sub_test_script_id=12c874f8-c308-11e0-9a20-52540025f9af&source=sub_tests#redirect

So that is about 1/37 ~= 2.7% failure rate, so running a handful of racer runs would tell us first if this crash is still present/common, and then about 100 total runs would tell us if this induces more failures, or if it can be re-enabled by default (presumably bug has since been fixed).

Comment by Gerrit Updater [ 29/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52169
Subject: LU-14988 tests: enable overstripe in racer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9069855175d5807f63db93ada1ecd5760722b39

Generated at Sat Feb 10 03:14:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.