[LU-13809] PFL file lost during recovery
Created: 21/Jul/20  Updated: 12/Jan/21  Resolved: 02/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug
Priority: Critical
Reporter: Alexander Zarochentsev
Assignee: Alexander Zarochentsev
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14325 Interop: replay-single test 134 fails... Open
is related to LU-12040 File lost during recovery Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A recurrence of LU-12040, this time with PFL files. Replaying the create/open of a PFL file that uses an OST pool fails silently during recovery, so the file is lost.



 Comments   
Comment by Gerrit Updater [ 21/Jul/20 ]

Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/39468
Subject: LU-13809 tests: improve replay-single test_134
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3ecde8b6e83d4c9dfa4d78ea0cbd203e197e3c0

Comment by Alexander Zarochentsev [ 21/Jul/20 ]

The test in https://review.whamcloud.com/39468 demonstrates the file loss:

Failing mds1 on devvm1
Stopping /mnt/lustre-mds1 (opts:) on devvm1
reboot facets: mds1
Failover mds1 to devvm1
mount facets: mds1
Starting mds1:   /dev/mapper/mds1_flakey /mnt/lustre-mds1
Started lustre-MDT0000
devvm1: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
 replay-single test_134: @@@@@@ FAIL: pfl file does not exist 
  Trace dump:
  = ./../tests/test-framework.sh:6216:error()
  = replay-single.sh:4910:test_134()
  = ./../tests/test-framework.sh:6519:run_one()
  = ./../tests/test-framework.sh:6568:run_one_logged()
  = ./../tests/test-framework.sh:6393:run_test()
  = replay-single.sh:4912:main()
Dumping lctl log to /tmp/test_logs/1595329100/replay-single.test_134.*.1595329165.log
Dumping logs only on local client.
Resetting fail_loc on all nodes...done.
Destroy the created pools: pool_134
lustre.pool_134
OST lustre-OST0001_UUID removed from pool lustre.pool_134
Pool lustre.pool_134 destroyed
FAIL 134 (40s)
[root@devvm1 tests]#
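
For readers without the patch at hand, the scenario behind the log above can be sketched roughly as follows. This is a dry-run outline, not the actual test_134 code: the pool name and OST come from the log, while the directory, the file name, the single-component layout, and the `run` wrapper are assumptions made for illustration. `replay_barrier` and `fail` are helpers from test-framework.sh and only exist inside the Lustre test environment.

```shell
#!/bin/bash
# Dry-run sketch of the pooled PFL replay scenario.
# By default every step is only printed, nothing is executed.

FSNAME=${FSNAME:-lustre}
POOL=${POOL:-pool_134}            # pool name as seen in the log
DIR=${DIR:-/mnt/lustre/d134}      # hypothetical test directory

run() {
    # Print the command unless DRY_RUN=0 is set (disposable test cluster only).
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pfl_replay_sketch() {
    # Create the pool and add one OST to it.
    run lctl pool_new "$FSNAME.$POOL"
    run lctl pool_add "$FSNAME.$POOL" "$FSNAME-OST0001"
    # Give the directory a default PFL layout whose component uses the pool
    # (assumed layout: one component up to EOF).
    run mkdir -p "$DIR/B"
    run lfs setstripe -E eof -p "$POOL" "$DIR/B"
    # Everything after the barrier must be replayed after the MDS failover.
    run replay_barrier mds1
    run touch "$DIR/B/f134"
    run fail mds1
    # With the bug present, this check is the one that fails.
    run test -f "$DIR/B/f134"
}
```

Calling `pfl_replay_sketch` prints the eight steps in order. The final existence check corresponds to the "pfl file does not exist" failure reported in the log.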

Comment by Gerrit Updater [ 13/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39468/
Subject: LU-13809 mdc: fix lovea for replay
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 72d45e1d344c5559d7620102a86a83bbf095796b

Comment by Cory Spitz [ 01/Sep/20 ]

zam, is there any work remaining for this ticket? If not, I think we can resolve it for 2.14.0 with the landing of https://review.whamcloud.com/#/c/39468/.

Comment by Alexander Zarochentsev [ 02/Sep/20 ]

spitzcor,
> is there any work remaining for this ticket?
No work remains except porting to 2.12.

Comment by Cory Spitz [ 02/Oct/20 ]

zam, were you going to push to b2_12 then?
