[LU-5900] replay-dual test_11: rm: cannot remove `/mnt/lustre/f11.replay-dual-[1-5]': No such file or directory
Created: 11/Nov/14  Updated: 01/Dec/14  Resolved: 01/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.4
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/100/
Distro/Arch: RHEL6.5/x86_64


Issue Links:
Related
is related to LU-5079 conf-sanity test_47 timeout Resolved
Severity: 3
Rank (Obsolete): 16486

 Description   

replay-dual test 11 failed as follows:

rm: cannot remove `/mnt/lustre/f11.replay-dual-[1-5]': No such file or directory
 replay-dual test_11: @@@@@@ FAIL: test_11 failed with 1

Dmesg on client node:

Lustre: 15249:0:(client.c:2752:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff88006a414400 x1484130260889212/t515396075526(515396075526) o36->lustre-MDT0000-mdc-ffff88006aa09400@10.1.4.66@tcp:12/10 lens 520/416 e 1 to 0 dl 1415377944 ref 2 fl Interpret:R/4/0 rc -75/-75
LustreError: 15249:0:(client.c:2740:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
LustreError: 167-0: lustre-MDT0000-mdc-ffff880037fb4400: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
LustreError: 1678:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -5
Lustre: lustre-MDT0000-mdc-ffff880037fb4400: Connection restored to lustre-MDT0000 (at 10.1.4.66@tcp)
LustreError: 1678:0:(dir.c:378:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -5
LustreError: 1678:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -5
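
The rc values in the client log above are negated Linux errno numbers: -75 corresponds to EOVERFLOW and -5 to EIO. The following is a minimal sketch, not part of the original report, of how such codes could be pulled out of log lines and given symbolic names. The helper name `decode_return_codes`, the regular expression, and the hard-coded errno table are assumptions made for illustration only.

```python
import re

# Linux errno numbers that appear in this report. They are hard-coded
# (an assumption of this sketch) so the result does not depend on the
# platform the script runs on.
ERRNO_NAMES = {5: "EIO", 75: "EOVERFLOW"}

# Matches "rc -75/-75", "rc -5" and a trailing ": -5".
# Hypothetical pattern, tuned only to the log lines quoted in this report.
RC_PATTERN = re.compile(r"(?:\brc |: )(-\d+)(?:/(-\d+))?")


def decode_return_codes(line):
    """Return the symbolic errno names for the return codes found in one log line."""
    names = []
    for match in RC_PATTERN.finditer(line):
        for group in match.groups():
            if group is not None:
                code = -int(group)  # log shows negated errno values
                names.append(ERRNO_NAMES.get(code, f"errno {code}"))
    return names


if __name__ == "__main__":
    sample = [
        "fl Interpret:R/4/0 rc -75/-75",
        "LustreError: 1678:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -5",
        "LustreError: 1678:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -5",
    ]
    for line in sample:
        print(decode_return_codes(line), "<-", line)
```

Read this way, the replayed request failed with EOVERFLOW, and the later lock enqueue and directory read failed with EIO after the client was evicted.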

Dmesg on MDS node:

Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: 18536:0:(ldlm_lib.c:2092:target_recovery_thread()) too long recovery - read logs
LustreError: dumping log to /tmp/lustre-log.1415377850.18536

Maloo report: https://testing.hpdd.intel.com/test_sets/a6c1b3de-68c5-11e4-a63a-5254006e85c2



 Comments   
Comment by Jian Yu [ 11/Nov/14 ]

The same failure also occurred on SLES11SP3/x86_64 client + RHEL6.5/x86_64 server test session:
https://testing.hpdd.intel.com/test_sets/bdf623d0-6872-11e4-acbe-5254006e85c2

It's a regression failure introduced by Lustre b2_5 build #100.

Unfortunately, replay-dual is not in the autotest review test groups, so the failure was not detected in patch review testing.

Comment by Jian Yu [ 11/Nov/14 ]

Here is a for-test-only patch trying to reproduce the failure on Lustre b2_5 build #100: http://review.whamcloud.com/11611

Comment by Jian Yu [ 11/Nov/14 ]

Another instance on Lustre b2_5 build #100:
https://testing.hpdd.intel.com/test_sets/23fe8258-69d1-11e4-8f09-5254006e85c2

Comment by Jian Yu [ 12/Nov/14 ]

The same regression failure also occurred on master branch:
https://testing.hpdd.intel.com/test_sets/174878bc-5aad-11e4-8200-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b126fa0a-6a32-11e4-b203-5254006e85c2

Comment by Jian Yu [ 12/Nov/14 ]

The regressions were caused by the patches for LU-5079: http://review.whamcloud.com/11213 (master) and http://review.whamcloud.com/12365 (b2_5).

Comment by Peter Jones [ 01/Dec/14 ]

As per Yu Jian, this can be closed as a duplicate of LU-5079.
