Currently, linkvalidator does link checking like this:
Either in the scheduler or in the link check tab, you can start link checking by defining a start page and a depth
Linkvalidator first collects all pages (it walks the page tree from the start page down to the given depth; hidden pages can be ignored)
It then deletes all broken links already detected for these pages from the broken link table tx_linkvalidator_link
It then starts checking the records and fills tx_linkvalidator_link with the newly found broken links
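The flow above can be sketched roughly like this. This is a simplified, hypothetical Python sketch, not the actual TYPO3 PHP code; the in-memory dict stands in for the tx_linkvalidator_link table and all names are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    uid: int
    hidden: bool = False
    urls: list = field(default_factory=list)      # URLs found in records on this page
    children: list = field(default_factory=list)

def collect_pages(page, depth, include_hidden=False):
    """Walk the page tree from the start page down to the given depth."""
    pages = [page]
    if depth > 0:
        for child in page.children:
            if include_hidden or not child.hidden:
                pages += collect_pages(child, depth - 1, include_hidden)
    return pages

def check_links(start_page, depth, is_reachable, broken_link_table):
    pages = collect_pages(start_page, depth)
    # Step 1: delete all broken links already detected for these pages.
    for page in pages:
        broken_link_table.pop(page.uid, None)
    # Step 2: re-check every URL and re-fill the table with the new findings.
    for page in pages:
        for url in page.urls:
            if not is_reachable(url):
                broken_link_table.setdefault(page.uid, []).append(url)
    return broken_link_table
```

Note how step 1 wipes all existing entries before step 2 has produced any replacements, which is exactly the first disadvantage listed below.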
This has several disadvantages:
All broken links are deleted at the beginning: if someone else is working on the broken links while a check is running, the entries suddenly all disappear, and crawling again may take hours
We check again and again, even if no records have changed
Even though we may be checking a lot, some broken links may remain undetected, while others shown in the list may no longer exist
The current mechanism is not well suited for large sites and for sites where several people may work on broken links simultaneously.
The current mechanism may check a lot, but the information still won’t be up to date.
You can start checking at any level (either in the scheduler or in the “Check links” tab), which may result in the entire page tree having been checked 7 days ago while a small subtree was checked 2 hours ago. This means the “last check time” may differ across the tree, and there is no way for users to know whether they are looking at an up-to-date list of broken links.
Check directly when records are changed, e.g. via add, delete, and edit events. External links should not be checked synchronously, as this may take a long time!
When records are changed, use a link check queue and add a link check request to the queue. A scheduler task can then work through the queue
4 … ?
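The queue idea from point 2 could be sketched as follows. This is only an illustrative Python sketch under invented names; in TYPO3 the enqueue side would presumably be a DataHandler hook and the drain side a scheduler task:

```python
from collections import deque

# Hypothetical link check queue: record changes enqueue a check request,
# and a scheduler task drains the queue asynchronously.
check_queue = deque()

def on_record_changed(table, uid):
    """Called from add/delete/edit events; does NOT check external links itself."""
    check_queue.append((table, uid))

def scheduler_task(check_record, batch_size=100):
    """Run periodically; works through at most batch_size queued requests."""
    processed = 0
    while check_queue and processed < batch_size:
        table, uid = check_queue.popleft()
        check_record(table, uid)   # may perform slow external HTTP checks
        processed += 1
    return processed
```

This keeps the editing request fast (enqueue only) while the potentially slow external checks happen in the background.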
Do you really mean links there and not URLs? With “URL” I mean the plain URL information (e.g. https://some.example.org), without the context where the URL is used. With “link”, I mean the URL plus additional information such as anchor text, record uid, page uid, etc. That is what the tx_linkvalidator_link table stores for the broken links.
Can you clarify what you mean here by “audit” and by “links are bound to a page”?
Maybe we can have all known links in that table.
I think these are 2 different things:
a) the list of (external) URLs and their status information, like status code and last check time (what you wrote in 1). I see that as a great idea and definitely helpful. (I had already thought about putting the URL and status information into the Caching Framework; this goes even further.)
b) And (possibly) a list of all (?) links. Currently we already have a list of broken links: that is the tx_linkvalidator_link table. A list of all links might get huge.
I definitely think a) is a good way to go for crawling the external links. With b), I am not sure if I understood you correctly or if you are really only talking about a).
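Idea a) could look roughly like this: a status store keyed by the plain URL, independent of where the link is used, so that a fresh result can be reused instead of re-checking. A hedged sketch; all names and the max-age policy are invented for illustration:

```python
import time

# Hypothetical URL status store: one entry per plain URL,
# holding the last HTTP status code and check timestamp.
url_status = {}  # url -> {"status_code": int, "last_checked": float}

def get_status(url, fetch_status, max_age=3600):
    """Return a cached status if it is fresh enough, otherwise re-check the URL."""
    entry = url_status.get(url)
    now = time.time()
    if entry and now - entry["last_checked"] < max_age:
        return entry["status_code"]
    code = fetch_status(url)   # e.g. an HTTP HEAD request in a real checker
    url_status[url] = {"status_code": code, "last_checked": now}
    return code
```

Because the store is persistent across runs (unlike a per-batch array in RAM), a URL that occurs on many pages is checked once per freshness window, not once per crawl.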
The current approach scans that URI twice, whether or not the link is broken.
This is just a side note, but to be correct: this is not entirely true. The ExternalLinkType keeps an array of the URLs, and the last status is reused. So if the crawling is done in one batch, the same (external) URL is not queried twice.
But the next time the crawling is done, we start over.
The downside is also that this cache lives in RAM and may bloat if an entire large site is scanned.
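The existing per-batch behavior described above amounts to simple in-memory memoization that is discarded when the batch ends. Illustrative Python only, not the actual PHP code in ExternalLinkType:

```python
def run_batch(urls, fetch_status):
    """Check a batch of URLs; each distinct URL is fetched only once per batch."""
    seen = {}  # in-RAM cache: thrown away after the batch, so the next run starts over
    results = []
    for url in urls:
        if url not in seen:
            seen[url] = fetch_status(url)
        results.append((url, seen[url]))
    return results
```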
This is a lightweight version of what you were proposing.
That has been partially done in version 10: if you click on the pencil icon in the linkvalidator list and fix the link, it will be removed from the list (when you return to it). BUT this will ONLY happen if you go in via the linkvalidator list. So it is better than before, but not sufficient.
Fact is, I had the same feeling as you and it was really bugging me.
A “broken” link should be markable (maybe via a status field) as “ignore”. For example, I want to mark links to pages which are restricted via a stop date.
If we have a table of all links, why not go all the way and invent a LAL (Link Abstraction Layer)? Judging from the work on FAL, it might be a big effort, but it would be a further step towards the unification of links.
As our servers are behind a firewall, it is very difficult to run the link validator, because each external link domain needs to be declared before it can be checked. With a list/table, it would be possible to add an interface so that an “external” process could check all links and return a status.
A new “LinkTargetCache” is introduced. It is currently used for external links only, but it is implemented in a general way so that it can be used by the other link types as well. There are still some things to do (see the todos in the commit message). See also the changelog.
The patch already works, reviews are welcome and would be helpful for me. If possible, please focus on general, conceptual things first. Some things will most likely still undergo changes.