[LU-13645] Various data corruptions possible in lustre. Created: 08/Jun/20  Updated: 03/Mar/23  Resolved: 30/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12681 Data corruption - due incorrect KMS w... Resolved
is related to LU-11670 Incorrect size when using lockahead Resolved
is related to LU-13128 a race between glimpse and lock cance... Resolved
is related to LU-9479 sanity test 184d 244: don't instantia... Open
is related to LU-13759 sanity-dom sanityn_test_20 fails with... Resolved
is related to LU-14084 change 'lfs migrate' to use 'MIGRATIO... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Two groups of data corruption cases are possible in Lustre; both come down to a lock being cancelled without an osc object assigned to it.
This is possible for both the DoM and Lock Ahead cases.
The Lock Ahead bug has a partial fix in LU-11670/LUS-6747.

1) The first bug concerns the situation where the check_and_discard function finds a lock without l_ast_data assigned; this blocks discarding the pages from the page cache and leaves them as-is.
The next lock cancel finds this lock and skips the page discard due to the lack of an assigned osc object. The pages can later be read from the page cache by ll_do_fast_read, which relies on page flags and serves data straight from the cache.

For the Lock Ahead case there are no logs or other confirmation, but it looks possible.
For the DoM case, this is confirmed.
Second lock cancel trace:

      ldlm_bl_13-35551 [034] 164201.591130: funcgraph_entry:                   |  ll_dom_lock_cancel() {
      ldlm_bl_13-35551 [034] 164201.591132: funcgraph_entry:                   |    cl_env_get() {
      ldlm_bl_13-35551 [034] 164201.591132: funcgraph_entry:        0.054 us   |      _raw_read_lock();
      ldlm_bl_13-35551 [034] 164201.591132: funcgraph_entry:        0.039 us   |      lu_env_refill();
      ldlm_bl_13-35551 [034] 164201.591133: funcgraph_entry:        0.046 us   |      cl_env_init0();
      ldlm_bl_13-35551 [034] 164201.591133: funcgraph_entry:        0.035 us   |      lu_context_enter();
      ldlm_bl_13-35551 [034] 164201.591133: funcgraph_entry:        0.034 us   |      lu_context_enter();
      ldlm_bl_13-35551 [034] 164201.591134: funcgraph_exit:         1.811 us   |    }
      ldlm_bl_13-35551 [034] 164201.591134: funcgraph_entry:                   |    cl_object_flush() {
      ldlm_bl_13-35551 [034] 164201.591134: funcgraph_entry:                   |      lov_object_flush() {
      ldlm_bl_13-35551 [034] 164201.591134: funcgraph_entry:        0.115 us   |        down_read();
      ldlm_bl_13-35551 [034] 164201.591135: funcgraph_entry:                   |        lov_flush_composite() {
      ldlm_bl_13-35551 [034] 164201.591135: funcgraph_entry:                   |          cl_object_flush() {
      ldlm_bl_13-35551 [034] 164201.591135: funcgraph_entry:                   |            mdc_object_flush() {
      ldlm_bl_13-35551 [034] 164201.591136: funcgraph_entry:                   |              mdc_dlm_blocking_ast0() {
      ldlm_bl_13-35551 [034] 164201.591136: funcgraph_entry:                   |                lock_res_and_lock() {
      ldlm_bl_13-35551 [034] 164201.591136: funcgraph_entry:        0.114 us   |                  _raw_spin_lock();
      ldlm_bl_13-35551 [034] 164201.591136: funcgraph_entry:        0.030 us   |                  _raw_spin_lock();
      ldlm_bl_13-35551 [034] 164201.591137: funcgraph_exit:         0.677 us   |                }
      ldlm_bl_13-35551 [034] 164201.591137: funcgraph_entry:        0.031 us   |                unlock_res_and_lock();
      ldlm_bl_13-35551 [034] 164201.591137: funcgraph_exit:         1.363 us   |              }
      ldlm_bl_13-35551 [034] 164201.591137: funcgraph_exit:         1.674 us   |            }
      ldlm_bl_13-35551 [034] 164201.591137: funcgraph_exit:         2.207 us   |          }
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_exit:         2.596 us   |        }
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_entry:        0.042 us   |        up_read();
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_exit:         3.714 us   |      }
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_exit:         4.279 us   |    }
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_entry:                   |    cl_env_put() {
      ldlm_bl_13-35551 [034] 164201.591138: funcgraph_entry:        0.034 us   |      lu_context_exit();
      ldlm_bl_13-35551 [034] 164201.591139: funcgraph_entry:        0.030 us   |      lu_context_exit();
      ldlm_bl_13-35551 [034] 164201.591139: funcgraph_entry:        0.030 us   |      _raw_read_lock();
      ldlm_bl_13-35551 [034] 164201.591139: funcgraph_exit:         0.990 us   |    }
      ldlm_bl_13-35551 [034] 164201.591140: funcgraph_exit:         8.253 us   |  }

It is easy to see that mdc_dlm_blocking_ast0 exits right at the beginning, which means the lock isn't granted or has no l_ast_data (i.e. no osc object) assigned. The data was obtained from the page cache later:

          <...>-40843 [000] 164229.430007: funcgraph_entry:                   |  ll_do_fast_read() {
           <...>-40843 [000] 164229.430009: funcgraph_entry:                   |    generic_file_read_iter() {
           <...>-40843 [000] 164229.430010: funcgraph_entry:        0.044 us   |      _cond_resched();
           <...>-40843 [000] 164229.430010: funcgraph_entry:                   |      pagecache_get_page() {
           <...>-40843 [000] 164229.430010: funcgraph_entry:        0.706 us   |        find_get_entry();
           <...>-40843 [000] 164229.430011: funcgraph_exit:         1.078 us   |      }
           <...>-40843 [000] 164229.430012: funcgraph_entry:                   |      mark_page_accessed() {
           <...>-40843 [000] 164229.430012: funcgraph_entry:        0.088 us   |        activate_page();
           <...>-40843 [000] 164229.430012: funcgraph_entry:        0.143 us   |        workingset_activation();
           <...>-40843 [000] 164229.430013: funcgraph_exit:         0.925 us   |      }
           <...>-40843 [000] 164229.430014: funcgraph_entry:        0.032 us   |      _cond_resched();
           <...>-40843 [000] 164229.430014: funcgraph_entry:                   |      pagecache_get_page() {
           <...>-40843 [000] 164229.430014: funcgraph_entry:        0.070 us   |        find_get_entry();
           <...>-40843 [000] 164229.430014: funcgraph_exit:         0.401 us   |      }
           <...>-40843 [000] 164229.430015: funcgraph_entry:                   |      mark_page_accessed() {
           <...>-40843 [000] 164229.430015: funcgraph_entry:        0.037 us   |        activate_page();
           <...>-40843 [000] 164229.430015: funcgraph_entry:        0.039 us   |        workingset_activation();
           <...>-40843 [000] 164229.430015: funcgraph_exit:         0.649 us   |      }
....

Short description of how it was hit:
getattr_by_name provides a "DoM" bit in the response while the client already holds a DoM lock, but no I/O has ever run under this lock.
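
To make the failure mode concrete, here is a minimal self-contained C model of the pattern described above. It is not the actual Lustre code: the types and function names (model_lock, cancel_lock, read_file) are invented for illustration, standing in for the blocking-AST cancel path and the ll_do_fast_read path.

#include <stdio.h>

/* Hypothetical, simplified model of the race; not Lustre code. */
struct model_page { int uptodate; const char *data; };
struct model_lock { void *l_ast_data; };        /* osc object, or NULL */

static struct model_page cache = { 1, "old" };  /* read-on-open filled it */
static const char *server_data = "old";         /* authoritative file data */

/* Stands in for the blocking AST/cancel path: with no l_ast_data there
 * is no osc object to walk, so the page discard is silently skipped. */
static void cancel_lock(struct model_lock *lock)
{
        if (lock->l_ast_data == NULL)
                return;                 /* pages stay Uptodate in the cache */
        cache.uptodate = 0;             /* normal path: discard the pages */
}

/* Stands in for ll_do_fast_read(): trusts the Uptodate flag, no DLM. */
static const char *read_file(void)
{
        if (cache.uptodate)
                return cache.data;      /* served straight from page cache */
        return server_data;             /* real I/O under a fresh lock */
}

int main(void)
{
        struct model_lock lock = { .l_ast_data = NULL };

        cancel_lock(&lock);             /* lock revoked before any I/O used it */
        server_data = "new";            /* another client updates the file */
        printf("client reads: %s\n", read_file()); /* prints "old": stale data */
        return 0;
}

The patches later in this ticket attack exactly this window from both sides: flush the read-on-open pages, and don't use/trust a lock without l_ast_data.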

2) DoM read-on-open corruption.
The scenario is nearly the same as above. Open provides data which is moved into the page cache very early, with the Uptodate flag set, but no osc object is assigned to the lock.
The data is read with ll_do_fast_read, so there is no real I/O, and the lock is matched in mdc_enqueue_send().
The lock is then cancelled without flushing the pages, and the client continues to read stale data via ll_do_fast_read.
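
The trace above already shows why this stays invisible to the DLM: ll_do_fast_read() goes straight into generic_file_read_iter(). Below is a hedged sketch of that path - a paraphrase of the idea, not the exact llite code, which has additional checks:

#include <linux/fs.h>
#include <linux/uio.h>

/* Simplified paraphrase of the fast-read idea; not the exact llite code. */
static ssize_t fast_read_sketch(struct kiocb *iocb, struct iov_iter *iter)
{
        ssize_t result;

        /* IOCB_NOWAIT makes the generic path return -EAGAIN instead of
         * issuing real I/O when a page is absent or not Uptodate, so
         * everything served here comes from cached Uptodate pages, with
         * no cl_io built and no DLM lock matched or re-verified. */
        iocb->ki_flags |= IOCB_NOWAIT;
        result = generic_file_read_iter(iocb, iter);
        iocb->ki_flags &= ~IOCB_NOWAIT;

        return result;  /* on 0/-EAGAIN the caller falls back to full I/O */
}

On this path a stale Uptodate page is indistinguishable from a valid one: correctness depends entirely on the cancel path having discarded the pages.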

...



 Comments   
Comment by Alexey Lyashkov [ 09/Jun/20 ]

Several other corruption cases are related to the "lock without l_ast_data assigned" situation. These came up in the review discussion of the KMS bug (LU-12681 osc: wrong cache of LVB attrs).

1) Layout change vs. lock cancel. A layout change disconnects the locks from their object and expects them to be picked up again at lock enqueue time. A lock cancel run has no chance to flush pages in this case (the shared window is sketched after this list).

2) Inode destroy. Inode destroy also causes an AST disconnect, but inode recreation can find an old lock without l_ast_data assigned during the check_and_discard run, so a page flush is not possible.

3) Layout change vs. DoM lock cancel. An MD lock can be downgraded to lose all bits except DoM, so it goes through lov to flush the data, but a situation where lov can't find a DoM component is possible. In that case the pages remain in the page cache.

4) It looks like tiny write (LU-9409 llite: Add tiny write support) is affected by this as well. Since it has no synchronization with Lustre internals, we can lose a cache flush when a lock without l_ast_data exists.
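
All four cases share the same window, sketched below with invented names (a model reusing the model_lock from the earlier sketch, not the real code): the object pointer is detached from the lock and is expected to be re-attached at the next enqueue, so any cancel landing inside that window cannot flush.

/* Hypothetical model of the shared window; names are invented. */
struct model_lock { void *l_ast_data; };

/* Layout change / inode destroy detach the object from its locks and
 * rely on a later enqueue to re-attach it. */
static void detach_object(struct model_lock *lock)
{
        lock->l_ast_data = NULL;        /* re-attached at next lock enqueue */
}

/* A cancel arriving inside the window finds no osc object: there is
 * nothing to walk, so the page flush silently does not happen. */
static int cancel_can_flush(const struct model_lock *lock)
{
        return lock->l_ast_data != NULL;
}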

Comment by Alexey Lyashkov [ 19/Jun/20 ]

I can drop some of these cases after research with Vitaly.
A layout lock change drops the whole client cache for the object; this is good for data correctness but too bad for SEL, as extending the layout then wants to flush all of the object's dirty memory, which takes noticeable time on a loaded cluster.
A simple reproducer verifies this situation:

--- a/lustre/tests/sanity-pfl.sh
+++ b/lustre/tests/sanity-pfl.sh
@@ -855,8 +855,10 @@ test19_io_base() {
                        error "Create $comp_file failed"
        fi
+       dd if=/dev/zero of=$comp_file bs=100K count=1 conv=notrunc ||
+               error "dd to extend failed"
        # write past end of first component, so it is extended
-       dd if=/dev/zero of=$comp_file bs=1M count=1 seek=127 conv=notrunc ||
+       dd if=/dev/zero of=$comp_file bs=100K count=1 seek=1270 conv=notrunc ||
                error "dd to extend failed"
        local ost_idx1=$($LFS getstripe -I1 -i $comp_file)

The result is:

== sanity-pfl test 19a: Simple test of extension behavior ============================================ 18:07:26 (1592492846)
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0327618 s, 32.0 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0535577 s, 19.6 MB/s
Pass!

Layout change vs. DoM cancel is still possible but very hard to reach. I think adding an LASSERT in this place would be good to confirm the data isn't corrupted; once the assert hits, the DoM blocking callback for "complex" ibit locks will need to be reworked.

LU-13128 "osc: glimpse and lock cancel race" has a side effect in its mdc changes: it fixes the lock conversion bug with the DoM bit, where an osc object was removed from the lock early. That caused stale data in the cache, as the DoM cancel was skipped due to the lost osc object.

Vitaly's investigation of the group lock problem says it is hard to reproduce, but the logic is not 100% the same as expected for extent locks. An additional problem is group ID generation for layout swap: a random ID is used in this case, but it is not a unique value across a large cluster and should be avoided if possible.

So currently we can focus on two confirmed bugs.

1) The mdc check_and_discard function can skip an object discard due to a lock without an osc object assigned
(similar to LU-11670/LUS-6747). Patch ready - submitting soon.

2) Fixing DoM read-on-open, which puts uptodate pages into the page cache while the LDLM lock has no osc object assigned and therefore no way to flush any data. Patch in progress.

Comment by Alexey Lyashkov [ 03/Jul/20 ]

It looks like these bugs affect any Lustre version that includes the DoM and Lock Ahead features. Initial testing says most of the bugs can be fixed with two low-risk patches. Some problems with group locks/unprotected layout change are being investigated separately.

Patch submission is blocked due to a master branch build breakage with the RedHat debug kernel, caused by James S.'s backports for xarray.

Comment by Cory Spitz [ 06/Jul/20 ]

From Alexey in Linux Lustre Client slack:

it's EASY to replicate. The reproducer is IOR with rewrite - the bug hits within 1-5h from start. Fix verification - no corruption in 24h under load. But in general it only takes a getattr after open. (edited)
Once getattr returns ibits 0x40+0x1b, the bug will hit.
As for the Lock Ahead part - this is just because LA locks are the same as DoM locks: both have no osc object assigned before use, so the bugs are similar - I think the glimpse bug would be the same for DoM, with LA fixed earlier.
And you are wrong that this is an SEL-only bug - since SEL is just PFL, PFL and other layout modifications are under attack as well. Currently Vitaly confirms only the problems with deleting layout components.
For SEL the page cache is flushed - so the risk of this bug is very low.

Comment by Gerrit Updater [ 08/Jul/20 ]

Alexey Lyashkov (alexey.lyashkov@hpe.com) uploaded a new patch: https://review.whamcloud.com/39319
Subject: LU-13645 llite: flush an read-on-open pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 10c3c4af36b716a6c8d8c683e6030ca5d070cefa

Comment by Gerrit Updater [ 16/Jul/20 ]

Vitaly Fertman (vitaly.fertman@hpe.com) uploaded a new patch: https://review.whamcloud.com/39405
Subject: LU-13645 ldlm: re-process ldlm lock cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1e450528a954bfaa7fe4bc9d72e46f16c3f5efa4

Comment by Gerrit Updater [ 16/Jul/20 ]

Vitaly Fertman (vitaly.fertman@hpe.com) uploaded a new patch: https://review.whamcloud.com/39406
Subject: LU-13645 ldlm: group locks for DOM ibit lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c7d44c3c1791b9d67912946c7454aa40402808a4

Comment by Gerrit Updater [ 13/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39405/
Subject: LU-13645 ldlm: re-process ldlm lock cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d7e6b6d2ab8718b55271be56afc4ee5f2beae84b

Comment by Gerrit Updater [ 19/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39318/
Subject: LU-13645 ldlm: don't use a locks without l_ast_data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a6798c5806088dc1892dd752012a54f0ec8f1798

Comment by Gerrit Updater [ 30/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39406/
Subject: LU-13645 ldlm: group locks for DOM IBIT lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 06740440363424bff6cfdb467fcc5544e42cabc1

Comment by Gerrit Updater [ 30/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39878/
Subject: LU-13645 ldlm: extra checks for DOM locks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0a3c72f13045309573f74f2e02771035d734cc05

Comment by Peter Jones [ 30/Oct/20 ]

All patches landed for 2.14

Comment by Gerrit Updater [ 03/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50199
Subject: LU-13645 ldlm: re-process ldlm lock cleanup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7d64d487ad53f365b900408155a340b9385cf54f

Comment by Gerrit Updater [ 03/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50200
Subject: LU-13645 ldlm: don't use a locks without l_ast_dat
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 722dda219811cae47816f9928aea9348fa1f2bd6
