[LU-16146] after dropping mgs/mdt0 for test: mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.15.0
Labels:
- llnl
Environment:
TOSS 4.4-4.1
4.18.0-372.19.1.1toss.t4.x86_64
lustre 2.15.0_3.llnl

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

After doing a test that involved turning off the mgs/mdt0 node for the test cluster garter, the mdts fail to come back up. They are stuck "WAITING".

When the mgs node was powered off, an ior test was in progress.

Failover seemed to work, but the ior job was unable to finish. The cluster was eventually rebooted.

After the reboot, the logs show that all 4 mdts are failing an lfsck_start with a line such as

2022-09-07 14:35:12 [ 5791.188613] Lustre: 32474:0:(mdt_handler.c:7522:mdt_postrecov()) lflood-MDT0000: auto trigger paused LFSCK failed: rc = -6

I believe that -6 is coming from lfsck_start()

        lfsck = lfsck_instance_find(key, true, false);
        if (unlikely(lfsck == NULL))
                RETURN(-ENXIO);

        if (unlikely(lfsck->li_stopping))
                GOTO(put, rc = -ENXIO);

For our reference, our local ticket is TOSS5875

Attachments


2022-09-07_console.garter1.gz
204 kB
13/Sep/22 12:08 AM
2022-09-07_console.garter2.gz
119 kB
13/Sep/22 12:09 AM

Activity

Gian-Carlo Defazio made changes - 26/Jan/23 12:28 AM

Resolution		New: Cannot Reproduce [ 5 ]
Status	Original: Open [ 1 ]	New: Closed [ 6 ]

Gian-Carlo Defazio added a comment - 26/Jan/23 12:28 AM

We were unable to reproduce this when bringing up our file system both manually and using pacemaker.

Gian-Carlo Defazio made changes - 26/Jan/23 12:27 AM

Labels

Original: llnl topllnl

New: llnl

Olaf Faaland made changes - 29/Dec/22 8:24 PM

Description

Original: The description as given under Description above, without the closing local-ticket line.

New: The same text, with this line appended:

For our reference, our local ticket is TOSS5875

Olaf Faaland made changes - 29/Dec/22 8:21 PM

Description

Peter Jones made changes - 19/Nov/22 4:23 PM

Link

Original: This issue is related to JFC-21 [ JFC-21 ]

Gian-Carlo Defazio added a comment - 24/Oct/22 4:13 PM

pjones
Sorry for the delay.
The system I'd need to generate those logs on, garter, is being used for higher priority stuff right now. It's using new MDTs and OSTs. I have the old MDTs that failed on lfsck_start saved, but I can't switch back to them right now.

So yes I can get logs for trying to start up the MDTs, but I'm not sure when.

Peter Jones added a comment - 21/Oct/22 11:29 PM

defazio, will you be able to gather this additional debug info?

Lai Siyao added a comment - 13/Oct/22 2:20 PM

I'll review the related code. It would help if you could enable tracing with "lctl set_param debug=+trace" and dump the MDS debug log.

Gian-Carlo Defazio added a comment - 13/Sep/22 12:09 AM

I've added the console logs for garter[1,2]

People

Assignee: Lai Siyao

Reporter: Gian-Carlo Defazio

Votes: 0

Watchers: 5

Dates

Created: 08/Sep/22 9:26 PM

Updated: 26/Jan/23 12:28 AM

Resolved: 26/Jan/23 12:28 AM