[LU-6144] Rolling downgrade: sanity test_160 keeps running and cannot finish Created: 20/Jan/15  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

before downgrade: server and client lustre-master build #2810
after downgrade: client-1 2.6.0
client-2 lustre-master build #2810
server: lustre-master build #2810


Severity: 3
Rank (Obsolete): 17126

 Description   

sanity test_160 just keeps running for several hours and cannot finish

client log shows:

== sanity test 160a: changelog sanity == 12:53:31 (1421787211)
Registered as changelog user cl6
945564 01CREAT 20:53:32.152975358 2015.01.20 0x0 t=[0x20000bf81:0x48:0x0] j=cp.0 p=[0x20000bf81:0x46:0x0] pic1.jpg
945565 08RENME 20:53:32.162975688 2015.01.20 0x0 t=[0:0x0:0x0] j=mv.0 p=[0x20000bf81:0x41:0x0] zach s=[0x20000bf81:0x46:0x0] sp=[0x20000bf81:0x42:0x0] zachy
945566 03HLINK 20:53:32.166975820 2015.01.20 0x0 t=[0x20000bf81:0x48:0x0] j=ln.0 p=[0x20000bf81:0x42:0x0] portland.jpg
945567 04SLINK 20:53:32.169975919 2015.01.20 0x0 t=[0x20000bf81:0x49:0x0] j=ln.0 p=[0x20000bf81:0x41:0x0] desktop.jpg
945568 06UNLNK 20:53:32.172976018 2015.01.20 0x1 t=[0x20000bf81:0x49:0x0] j=rm.0 p=[0x20000bf81:0x41:0x0] desktop.jpg
verifying changelog mask
mdd.lustre-MDT0000.changelog_mask=-MKDIR
mdd.lustre-MDT0000.changelog_mask=+CLOSE
mdd.lustre-MDT0000.changelog_mask=+MKDIR
mdd.lustre-MDT0000.changelog_mask=-CLOSE
35 02MKDIR 05:32:13.88218195 2015.01.20 0x0 t=[0x2000090a1:0x20e1:0x0] j=test_id.205.29582 p=[0x200000007:0x1:0x0] f205.sanity
36 07RMDIR 05:32:14.57249262 2015.01.20 0x0 t=[0x2000090a1:0x20e1:0x0] j=test_id.205.11894 p=[0x200000007:0x1:0x0] f205.sanity
37 05MKNOD 05:32:15.51282063 2015.01.20 0x0 t=[0x2000090a1:0x20e2:0x0] j=test_id.205.26781 p=[0x200000007:0x1:0x0] f205.sanity
38 06UNLNK 05:32:16.46314899 2015.01.20 0x1 t=[0x2000090a1:0x20e2:0x0] j=test_id.205.14763 p=[0x200000007:0x1:0x0] f205.sanity
39 01CREAT 05:32:17.41347732 2015.01.20 0x0 t=[0x2000090a1:0x20e3:0x0] j=test_id.205.8582 p=[0x200000007:0x1:0x0] f205.sanity
40 12LYOUT 05:32:17.43347798 2015.01.20 0x0 t=[0x2000090a1:0x20e3:0x0] j=test_id.205.8582
41 13TRUNC 05:32:19.204418125 2015.01.20 0xe t=[0x2000090a1:0x20e3:0x0] j=test_id.205.9105
42 13TRUNC 05:32:21.790503464 2015.01.20 0xe t=[0x2000090a1:0x20e3:0x0] j=test_id.205.30748
43 08RENME 05:32:22.974542541 2015.01.20 0x0 t=[0:0x0:0x0] j=test_id.205.11518 p=[0x200000007:0x1:0x0] jobstats_test_rename s=[0x2000090a1:0x20e3:0x0] sp=[0x200000007:0x1:0x0] f205.sanity
44 01CREAT 05:32:33.660892782 2015.01.20 0x0 t=[0x2000090a1:0x20e6:0x0] p=[0x200000007:0x1:0x0] f205.sanity
45 06UNLNK 05:32:33.684894002 2015.01.20 0x1 t=[0x2000090a1:0x20e3:0x0] p=[0x200000007:0x1:0x0] jobstats_test_rename
46 02MKDIR 05:32:44.995265190 2015.01.20 0x0 t=[0x2000090a1:0x20e7:0x0] j=mkdir.0 p=[0x200000007:0x1:0x0] d206.sanity
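
For reference, the changelog user registration and the changelog_mask changes in the output above correspond roughly to the manual sequence below. This is only a sketch, assuming the fsname "lustre", a single MDT, and a client mount at /mnt/lustre; the exact steps and checks in sanity.sh test_160a differ in detail:

# On the MDS: register a changelog consumer and toggle the record mask
lctl --device lustre-MDT0000 changelog_register           # prints the new user id, e.g. cl6
lctl set_param mdd.lustre-MDT0000.changelog_mask=-MKDIR   # stop recording MKDIR
lctl set_param mdd.lustre-MDT0000.changelog_mask=+CLOSE   # start recording CLOSE

# On a client: generate some metadata activity, then read the records back
mkdir /mnt/lustre/d160.sanity
touch /mnt/lustre/d160.sanity/file1
lfs changelog lustre-MDT0000

# Cleanup: acknowledge the consumed records and drop the registration
lfs changelog_clear lustre-MDT0000 cl6 0                  # 0 = up to the current last record
lctl --device lustre-MDT0000 changelog_deregister cl6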


 Comments   
Comment by Andreas Dilger [ 21/Jan/15 ]

Is this failure reproducible? Are there any logs available (console on client/agent and MDS)?

I'm assuming that "rolling downgrade" actually means "upgrade then downgrade"? We do not support formatting at a newer version and then downgrading to an older version. It is never allowed to downgrade below the version used for formatting, and downgrading across multiple versions is also not allowed.

Comment by Sarah Liu [ 21/Jan/15 ]

Hi Andreas,

Yes, as you said, before the rolling downgrade I first did a rolling upgrade from 2.6.0 to master (it passed), so the system was formatted under 2.6.0, then downgraded back to 2.6.0 (the per-node downgrade steps are sketched below). I also hit this issue after downgrading the MDS from master to 2.6.0.

I will try to get more logs.
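
For context, the rolling downgrade of a single client node along these lines looks roughly like the sketch below. This is only an illustration assuming RPM-based nodes, a client mount at /mnt/lustre, and placeholder package/MGS names; it is not the exact harness procedure:

# Downgrade one client node at a time while the rest of the filesystem stays up
umount /mnt/lustre                                 # stop the Lustre client on this node
rpm -Uvh --oldpackage lustre-client-2.6.0-*.rpm    # placeholder package name for the 2.6.0 client
reboot                                             # pick up the older modules cleanly
mount -t lustre mgsnode@tcp:/lustre /mnt/lustre    # remount, then move on to the next node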

Comment by Sarah Liu [ 18/Feb/15 ]

I can still hit this error in tag-2.6.94 testing, same symptom: test_160a keeps running with no other errors, and I can only see the messages shown in the description.

Comment by Jodi Levi (Inactive) [ 18/Feb/15 ]

Mike,
Would you be able to look into this one? What additional information would you need to assess this issue?

Comment by Isaac Huang (Inactive) [ 04/Mar/15 ]

Two instances of sanity test_160a timed out (logs are all there):
https://testing.hpdd.intel.com/test_sessions/997b7ac0-c284-11e4-87b9-5254006e85c2

It seems the MDS rebooted during the test; I'm not sure whether that was intended.

Comment by Sarah Liu [ 09/Mar/15 ]

No, that was not intended. The test should be done in 30 seconds.

Comment by Mikhail Pershin [ 16/Jan/22 ]

Outdated

Comment by Andreas Dilger [ 16/Jan/22 ]

Mike, for future reference, we typically mark Jira tickets Resolved instead of Closed, since Resolved still allows tags to be added/removed (e.g. always_except or similar) and a few other useful maintenance actions that Closed does not.

Comment by Mikhail Pershin [ 16/Jan/22 ]

I see, I will use 'Resolved'. As far as I can see, 'Closed' tickets can be reopened for maintenance, though that is inconvenient compared with 'Resolved'.
