Linkvalidator: When to do the link checking?

Currently, linkvalidator does link checking like this:

  1. Either in the scheduler or in the “Check links” tab, you can start link checking by defining a start page and a depth
  2. Linkvalidator first collects all pages (it walks the tree from the start page up to the given depth; hidden pages can be ignored)
  3. It then deletes all broken links already detected for these pages from the list of broken links (tx_linkvalidator_link)
  4. It then checks the records and fills tx_linkvalidator_link with the newly found broken links
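The four steps above can be modelled roughly like this. This is a minimal Python sketch of the flow, not the real implementation (which is PHP); the page tree, records and link statuses are simulated with plain dicts, and all names are illustrative.

```python
# Toy model of the current linkvalidator flow described above.
PAGES = {1: {"children": [2, 3], "hidden": False},
         2: {"children": [], "hidden": False},
         3: {"children": [], "hidden": True}}
RECORDS = {1: ["https://ok.example", "https://dead.example"],
           2: ["https://dead.example"],
           3: ["https://dead.example"]}
LINK_STATUS = {"https://ok.example": True, "https://dead.example": False}

broken_links = []  # stands in for the tx_linkvalidator_link table


def collect_pages(start, depth, ignore_hidden=True):
    """Step 2: walk the tree from the start page up to the given depth."""
    pages, frontier = [], [(start, 0)]
    while frontier:
        uid, level = frontier.pop()
        if ignore_hidden and PAGES[uid]["hidden"]:
            continue
        pages.append(uid)
        if level < depth:
            frontier.extend((c, level + 1) for c in PAGES[uid]["children"])
    return pages


def check_links(start, depth):
    pages = collect_pages(start, depth)
    # Step 3: delete previously found broken links for these pages.
    broken_links[:] = [b for b in broken_links if b["page"] not in pages]
    # Step 4: re-check every record and store the broken links found.
    for page in pages:
        for url in RECORDS[page]:
            if not LINK_STATUS[url]:
                broken_links.append({"page": page, "url": url})


check_links(1, depth=1)  # hidden page 3 is skipped; pages 1 and 2 are checked
```

Note how step 3 wipes the list before step 4 slowly refills it, which is exactly the window in which other users see an empty or incomplete list.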

This has several disadvantages:

  • All broken links are deleted at the beginning - if someone else works on broken links while they are being checked, the list is suddenly emptied, and crawling again may take hours
  • We check again and again, even if no records have changed
  • Even though we may check a lot, there may still be undetected broken links, or broken links shown that no longer exist

The current mechanism is not well suited for large sites and for sites where several people may work on broken links simultaneously.

The current mechanism may check a lot, but the information still won’t be up to date.

You can start checking at any level (either in the scheduler or in the “Check links” tab), which may result in the entire page tree being checked 7 days ago and a small subtree checked 2 hours ago. This means the “last check time” may differ between pages. There is no way for the user to know whether they are looking at an up-to-date list of broken links.

What would be the best solution?

  1. Add incremental checking - only changed records are checked. This will check less, but some problems still remain
  2. Directly check when records are changed, e.g. via add, delete and edit events - external links should not get checked synchronously, as this may take long!
  3. When records are changed, use a link check queue and add a link check request to the queue. A scheduler task can then work on the queue
  4. … ?
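Option 3 can be sketched in a few lines. This is a hypothetical Python model (function and field names are made up, not linkvalidator API): record changes only enqueue check requests, and a periodic scheduler task drains the queue, so saving a record never blocks on slow external HTTP checks.

```python
import queue

# In-process stand-in for a persistent link-check queue table.
check_queue = queue.Queue()


def on_record_changed(table, uid, urls):
    """Called from an add/edit/delete hook: enqueue instead of checking."""
    for url in urls:
        check_queue.put({"table": table, "uid": uid, "url": url})


def scheduler_task(check_url, max_items=100):
    """Run periodically; processes at most max_items requests per run."""
    broken = []
    for _ in range(max_items):
        try:
            item = check_queue.get_nowait()
        except queue.Empty:
            break  # queue drained, stop early
        if not check_url(item["url"]):
            broken.append(item)
    return broken
```

In a real setup the queue would of course be a database table rather than an in-memory `queue.Queue`, so requests survive restarts and multiple workers can share them.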

The main problem with the link checker is, that we don’t rely on the state of our system only, but also the states of the external sites.

What we could do is the following:

  1. Put all links in a table (during editing / crawling?) with at least the following properties:

    • URI
    • number of failed checks
    • first failed check
    • last failed check
    • status or status code
  2. Check the links less often if they have many failed attempts or if the first failed check is far in the past

  3. Add a timeout bound to the dead links that avoids rescans; maybe provide an option to force a scan of all entries

  4. Set failed attempts to 0 and the first and last failed check to null if the request was successful.

Maybe we can have all known links in that table. That way we can also add statistics easily.

And if the links are bound to a page, we can at least make the audit easier.

With additional fields, this process could also be executed in parallel or even on a different machine …

I like your proposal very much.

Some clarifications:

Do you really mean links there and not URLs? - With “URL” I mean plain URL information (i.e. without the context where the URL is used); with “link”, I mean additional information such as anchor text, record uid, page uid etc. This is what the tx_linkvalidator_link table stores for the broken links.

Can you clarify what you mean here by “audit” and by “links are bound to a page”?

Maybe we can have all known links in that table.

I think these are 2 different things:

a) The list of (external) URLs and status information such as status code and last check (what you wrote in 1). I see that as a great idea and definitely helpful. (I already thought about putting the URL and status information into the Caching Framework; this goes even further.)

b) And (possibly) a list of all (?) links. Currently we already have a list of broken links: the tx_linkvalidator_link table. A list of all links might get huge.

I definitely think a) is a good way to go for crawling the external links. With b), I am not sure if I understood you correctly or if you are really only talking about a).

I mean the URI.

And yes, I mean all links.

Assume you have 2 pages, maybe in different parts of the page tree, and on these pages the same URI is used in different links.

The current approach scans that URI twice, no matter if the link is broken or not.

Having all the links in the table, we could scan that URI only once.

Later on we could also optimize this: if URIs with the same domain fail in large numbers, there is the option to skip all entries for that domain.
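The domain-level optimization could start as simply as grouping failed URIs by hostname. A minimal Python sketch, assuming some threshold of failures per domain (the threshold value is an assumption for illustration):

```python
from collections import Counter
from urllib.parse import urlsplit


def domains_to_skip(failed_uris, threshold=10):
    """Return hostnames with at least `threshold` failed URIs.

    The crawler could then skip all remaining queue entries whose
    URI belongs to one of these domains.
    """
    counts = Counter(urlsplit(uri).hostname for uri in failed_uris)
    return {host for host, n in counts.items() if n >= threshold}
```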

Regarding the audit: we could utilize the reference index to bind the link to the page.

The current approach scans that URI twice, no matter if the link is broken or not.

This is just a side note, but to be correct: this is not entirely true. In the ExternalLinkType there is an array of the URLs, and the last status is reused. So if the crawling happens in one batch, the same (external) URL is not queried twice.

But, the next time the crawling is done, we start over.

The downside is that this cache lives in RAM and may bloat if an entire large site is scanned.

This is a lightweight version of what you were proposing.
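The in-RAM de-duplication described here boils down to memoizing the check function for the duration of one batch. A Python sketch (illustrative names, not the actual ExternalLinkType code):

```python
def make_batch_checker(check_url):
    """Wrap a real check function with a per-batch status cache.

    Within one crawl batch each distinct URL is checked once and the
    status is reused; the cache is discarded when the batch ends, so
    the next run starts over - and it grows with the number of
    distinct URLs, which is the RAM concern mentioned above.
    """
    cache = {}

    def check(url):
        if url not in cache:
            cache[url] = check_url(url)
        return cache[url]

    return check, cache
```

Moving `cache` into a database table is essentially what the link-target-table proposal above does: the status survives across batches and across processes.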

Further things I would like to add:

  • The result should be updated if you repair a link. It is hard to work on a long list
  • There could be a status / remark field where an editor could add notes to broken links
  • It should be possible to mark a “broken” link (maybe via status) as “ignore”. E.g. I want to mark links to pages which are restricted via a stop date

That has been partially done in version 10: if you click on the pencil from the linkvalidator list and fix the link, it will get removed from the list (when you return to it). BUT this will ONLY happen if you go in via the linkvalidator list. So it is better than before, but not sufficient.

Fact is, I had the same feeling as you and it was really bugging me.

  • It should be possible to mark a “broken” link (maybe via status) as “ignore”. E.g. I want to mark links to pages which are restricted via a stop date

That is also something I am working on / thinking about, see my comment and the 3 different approaches in

The question is also, how do you ignore: entire page, entire record, only the link or by link target (URL)?

I am thinking about setting entire records or pages to “exclude from link checking”, and also about specifying external URLs to exclude, as in

If we have a table of all links, why not complete it and invent LAL = Link Abstraction Layer? Judging by the work on FAL it might be a big effort, but a further step in the unification of links.

As our servers are behind a firewall, it is very difficult to run the link validator, as each external link domain needs to be declared before it can be checked. With a list / table it would be possible to add an interface where an “external” process checks all links and returns a status.

A patch is ready for reviewing and testing:

A new “LinkTargetCache” is introduced. It is currently used for external links only, but is done in a way that is general and can be used by the other link types as well. There are still some things to do (see the todos in the commit message). See also the changelog.

The patch already works, reviews are welcome and would be helpful for me. If possible, please focus on general, conceptual things first. Some things will most likely still undergo changes.