Link rot (also called link extinction or link decay) is the process by which hyperlinks to particular websites, or to the Internet in general, come to point to web pages , servers or other resources that have become permanently unavailable [1] . There is no reliable figure for the half-life of a random page: estimates vary widely between studies, and between the collections of links on which those studies were conducted (see section # Prevalence ).
Terminology
Link rot is also called “link death” or “link breaking”. A link that no longer works is called a broken link, dead link, or dangling link. Formally, such a link is a kind of dangling pointer: the object it refers to no longer exists.
Reasons
One of the most common reasons for a broken link is that the web page it points to no longer exists. This typically produces a 404 error, which indicates that the web server is responding but cannot find the requested page. Another kind of dead link occurs when the server hosting the page stops working or the site moves to a different domain name . The browser may then return a DNS error or display a site unrelated to the page that was sought; the latter can happen when the domain name is transferred to a new owner (a short sketch distinguishing these two failure modes follows the list below). Other causes of broken links include:
- The website has been rebuilt or redesigned, or its underlying technology has been changed, causing large numbers of inbound and outbound links to change or become inaccessible.
- Many news sites keep articles freely available only for a short time and then move them to a paid archive. This leads to significant link loss in news discussion groups that rely on such portals for links.
- Content may automatically become inaccessible after a certain period.
- Content may be intentionally deleted by the owner.
- The server may be updated, and as a result server-side code (for example, PHP) may no longer work correctly.
- Links may be removed as the result of a lawsuit.
- Search results from social networks such as Facebook and Tumblr are prone to broken links because of frequent changes to user privacy settings, account deletions, result links that point to dynamic pages whose content differs from the cached version, or the removal of a link or photo.
- Links may contain short-lived, user-specific information, such as a session identifier or login date. Since such information does not remain valid, the result can be a broken link.
- The link may be broken by some form of blocking, such as content filters or firewalls .
- A website can be shut down, which breaks every link that points to it.
- A website may change its domain name; links to the old domain name then break.
- Dead links can also appear on the server side when content is aggregated from Internet sources without proper link checking.
- Entire top-level domains can disappear: private brand gTLDs such as .mcdonalds or .xperia have been discontinued, breaking any links that used them [2] .
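The difference between these failure modes can be observed directly with an HTTP client. The following sketch, which uses only the Python standard library and placeholder URLs, distinguishes a 404 response (the server answers but the page is gone) from a DNS resolution failure (the domain no longer resolves) and from other unreachable-server errors.

```python
import socket
import urllib.error
import urllib.request


def classify_link(url: str) -> str:
    """Return a rough classification of why a link might be dead."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return f"alive (HTTP {response.status})"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 410.
        return f"page missing (HTTP {err.code})"
    except urllib.error.URLError as err:
        # No HTTP response at all: DNS failure, refused connection, timeout...
        if isinstance(err.reason, socket.gaierror):
            return "domain does not resolve (DNS error)"
        return f"server unreachable ({err.reason})"


if __name__ == "__main__":
    # Placeholder URLs; replace with links you actually want to check.
    for url in ("https://example.com/", "https://example.com/no-such-page"):
        print(url, "->", classify_link(url))
```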
Prevalence
The 404 “Not Found” response is familiar even to casual users of the web. A large number of studies have examined the prevalence of broken links on the Internet, in the scientific literature, and in digital libraries [3] . In a 2003 experiment, Fetterly et al. found that approximately one link in every 200 disappeared from the Internet each week. McCown et al. (2005) found that half of the URLs cited in D-Lib Magazine articles were no longer available 10 years after publication, while other studies have reported even worse link rot in the scientific literature [4] [5] . Nelson and Allen [6] studied link rot in digital libraries and found that about 3% of objects were no longer accessible after one year. In 2014, Maciej Cegłowski, owner of the bookmarking site Pinboard, reported a “fairly stable” link-rot rate of about 5% per year [7] . An examination of Yahoo! Directory links in 2016–2017 (shortly after Yahoo! stopped publishing the directory) put the half-life of a random page at about two years [8] .
Some studies from the early days of the web (the late 1990s and early 2000s) found that half-lives differed by more than an order of magnitude between different collections of links [9] .
In 2014, Harvard Law School researchers Jonathan Zittrain, Kendra Albert, and Lawrence Lessig found that approximately 50% of the URLs cited in US Supreme Court opinions no longer pointed to the original information [1] . They also found that in a set of legal journals published between 1999 and 2011, more than 70% of links no longer worked as intended. A 2013 study in BMC Bioinformatics analyzed about 15,000 links in abstracts indexed by Thomson Reuters and found that the average lifespan of a web page was 9.3 years and that 62% were archived [10] . In August 2015, the Weblock service analyzed more than 180,000 links from the full texts of three major open-access publishers and found that more than 24.5% of the cited links were inaccessible [11] .
Detection
Broken links can be detected manually or automatically. Automated methods include plugins for WordPress , Drupal and other content management systems , as well as dedicated link checkers such as Xenu's Link Sleuth. However, even when a URL returns HTTP code 200 (OK) and the page is accessible, its contents may have changed and may no longer be relevant, so manual checking apparently remains necessary. Some servers return a soft 404 , telling the requesting client that the link works even though it does not. Bar-Yossef et al. (2004) [12] developed a heuristic algorithm that automatically detects pages returning soft 404s.
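As an illustration of automated checking, the sketch below (plain standard-library Python, not one of the tools named above) reports the HTTP status of each link and applies a crude soft-404 heuristic loosely inspired by Bar-Yossef et al.: it requests a deliberately random path on the same host and flags the link if the server still answers 200 with a page of roughly the same size. The URLs and the 10% size threshold are illustrative assumptions.

```python
import urllib.error
import urllib.parse
import urllib.request
import uuid


def fetch_length(url: str) -> tuple[int, int]:
    """Return (status_code, body_length); raises URLError if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as err:
        return err.code, len(err.read())


def check_link(url: str) -> str:
    try:
        status, length = fetch_length(url)
    except urllib.error.URLError as err:
        return f"dead (no response: {err.reason})"
    if status != 200:
        return f"broken (HTTP {status})"

    # Soft-404 heuristic: a random path on the same host "should" fail.
    parts = urllib.parse.urlsplit(url)
    junk_url = parts._replace(path="/" + uuid.uuid4().hex, query="").geturl()
    try:
        junk_status, junk_length = fetch_length(junk_url)
    except urllib.error.URLError:
        return "ok (could not probe for soft 404)"
    if junk_status == 200 and abs(junk_length - length) < 0.1 * max(length, 1):
        return "suspected soft 404 (random path returns a similar page)"
    return "ok"


if __name__ == "__main__":
    # Placeholder URLs; replace with the links you want to verify.
    for url in ("https://example.com/", "https://example.com/old-article"):
        print(url, "->", check_link(url))
```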
Mitigation
There are many ways to deal with broken links. Some methods try to prevent them from appearing at all, while others try to work around them once a broken link is detected. There are also many tools for combating link rot.
By the authors
- Choose hyperlinks carefully and check them regularly after publication. Best practice is to link to primary sources rather than secondary ones and to prefer stable sites. McCown et al. (2005) suggested avoiding citations of URLs that point to researchers' personal pages.
- Always look for the most compact and direct URL and make sure that it is a semantic URL without irrelevant information after the base URL [13] . This process is often called URL normalization or canonicalization (a small sketch follows this list).
- Wherever possible, use persistent identifiers such as ARK ( Archival Resource Key ), DOIs , permalinks, and PURLs .
- Avoid linking to PDF documents where possible: PDFs are documents rather than web pages, their contents may change without notice, and their names often contain characters, such as spaces, that must be encoded in the URL. Large PDF documents can also load slowly and cause timeout errors [13] .
- Where possible, avoid linking to pages buried deep within a site (so-called deep linking).
- Use website archiving services (for example, WebCite ) to permanently archive and retrieve cited Internet references [14] .
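As a concrete illustration of the normalization advice above, the following sketch (plain Python; the list of tracking parameters is an assumption, not a standard) reduces a URL to a more stable form by lowercasing the scheme and host, dropping the fragment, and removing common tracking parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to be irrelevant to the page's identity.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a more stable, canonical-looking form of `url` for citation."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith("utm_")
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment: it is resolved client-side only
    ))


print(normalize_url(
    "HTTPS://Example.com/articles/link-rot?utm_source=feed&id=42#comments"
))
# -> https://example.com/articles/link-rot?id=42
```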
Server side
- Never change URLs and never delete pages. If a page is genuinely no longer needed, for example when a news item has been edited, replace it with a page explaining the reason for the removal.
- If a URL must change, use a redirect mechanism such as HTTP “301 Moved Permanently” to automatically inform browsers and search engines of the new location (a minimal sketch follows this list).
- A web content management system can provide built-in link management, automatically updating links when content changes or moves within the site.
- WordPress guards against link rot by replacing non-canonical URLs with their canonical versions [15] .
- Some server-side tools try to fix broken links automatically.
- Permalinks prevent broken links by guaranteeing that the content is never moved. A related approach is to link to a permalink that redirects to the actual content; even if the real content is moved elsewhere, links pointing to the permalink remain valid.
- Design URLs, for example semantic URLs, so that they do not need to change when a different person begins to serve the document or when different software is used on the server [16] .
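To make the “301 Moved Permanently” advice concrete, here is a minimal standard-library sketch of a server-side redirect map. The old and new paths are invented for illustration; in practice such redirects are usually configured in the web server or CMS rather than written by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from retired URLs to their new locations.
MOVED = {
    "/2010/old-article.html": "/articles/old-article",
    "/about.php": "/about",
}


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        new_path = MOVED.get(self.path)
        if new_path is not None:
            # Tell browsers and search engines that the move is permanent.
            self.send_response(301, "Moved Permanently")
            self.send_header("Location", new_path)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b"Not Found")


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), RedirectHandler).serve_forever()
```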
User side
- The Linkgraph widget determines the URL of the correct page from the old, broken URL by using historical location information.
- The Google 404 widget attempts to “guess” the correct URL and offers the user a search box to find the correct page.
- When the user receives a 404 code, the Google Toolbar tries to help the user find the missing page [17] .
Website archiving
To counter link rot, website archiving is actively used to save web pages or parts of the web and to ensure that collections of pages are preserved in archives , such as the Internet Archive , for future researchers, historians, and the public. The Internet Archive aims to build an archive of the entire web by periodically taking snapshots of pages, which can then be freely accessed through the Wayback Machine . In January 2013, the organization announced that it had reached a milestone of 240 billion archived URLs [18] . National libraries , national archives, and other organizations are also involved in archiving culturally important web content.
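When the original page has already disappeared, the Internet Archive offers a public availability API that reports the closest archived snapshot of a URL. The sketch below (standard-library Python, with a placeholder URL) queries that endpoint and prints the Wayback Machine address of the snapshot, if one exists.

```python
import json
import urllib.parse
import urllib.request


def closest_snapshot(url: str) -> str | None:
    """Return the Wayback Machine URL of the closest snapshot, if any."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as resp:
        data = json.load(resp)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None


if __name__ == "__main__":
    # Placeholder link; substitute the dead URL you are trying to recover.
    print(closest_snapshot("http://example.com/"))
```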
Individual citizens can use many tools that allow them to archive web resources that may become unavailable in the future:
- The Wayback Machine, run by the non-profit Internet Archive [19] , is a free service that archives old web pages. It does not archive websites whose owners indicate that they do not want their site to be archived.
- WebCite is a tool designed specifically for scientific authors, journal editors, and publishers to archive and retrieve cited Internet references on demand [14] .
- The archiving site Archive.is saves snapshots of web pages. It retrieves one page per request, but, unlike WebCite, it can also handle Web 2.0 sites such as Google Maps and Twitter.
- The Perma.cc service, supported by Harvard Law School together with a broad coalition of university libraries, takes a snapshot of the content at a URL and returns a permanent link [1] .
- The Hiberlink project, created by the University of Edinburgh in collaboration with the Los Alamos National Laboratory and other organizations, works to measure link rot in online scholarly articles and to determine where the referenced web content has been archived [20] . The related Memento project has defined a technical standard for accessing online content as it existed in the past [21] .
- Some social bookmarking websites allow users to make an online clone of any web page, creating a copy with an independent URL that remains accessible even if the original page ceases to exist.
- The Amber tool, created at Harvard's Berkman Center, combats link rot through archiving plugins for WordPress and Drupal, countering censorship of the network and supporting archiving [22] .
However, such preservation services can themselves be switched off and on, so that the saved URLs periodically become unavailable [23] .
See also
- Electronic archiving
- Internet Archive
- Permalink
- Slashdot effect
- Website Archiving
- WebCite
- Deletionism and inclusionism
Notes
- ↑ 1 2 3 Zittrain, Albert, Lessig, 2014, pp. 88–99.
- ↑ The death of a TLD . blog.benjojo.co.uk . Date accessed July 27, 2018. Archived July 26, 2018.
- ↑ Habibzadeh, 2013, pp. 455–464.
- ↑ Spinellis, 2003, pp. 71–77.
- ↑ Lawrence, Pennock, Flake et al., 2001, pp. 26–31.
- ↑ Nelson, Allen, 2002 .
- ↑ Cegłowski, 2014 .
- ↑ Van der Graaf, 2017 .
- ↑ Koehler, 2004 .
- ↑ Hennessey, Xijin Ge, 2013 , p. S5.
- ↑ All-Time Weblock Report (August 2015). Date accessed January 12, 2016. Archived March 4, 2016.
- ↑ Bar-Yossef, Broder, Kumar, Tomkins, 2004 , p. 328.
- ↑ 1 2 Kille, 2014 .
- ↑ 1 2 Eysenbach, Trudel, 2005 , p. e60.
- ↑ Rønn-Jensen, 2007 .
- ↑ Berners-Lee, 1998 .
- ↑ Mueller, 2007 .
- ↑ Wayback Machine: Now with 240,000,000,000 URLs | Internet Archive Blogs (January 9, 2013). Date accessed April 16, 2014. Archived September 12, 2017.
- ↑ Internet Archive: Digital Library of Free Books, Movies, Music & Wayback Machine (March 10, 2001). Date accessed October 7, 2013. Archived January 26, 1997.
- ↑ Hiberlink . Date accessed January 15, 2015. Archived January 29, 2015.
- ↑ Memento: Time Travel for the Web . Date accessed January 15, 2015. Archived January 7, 2015.
- ↑ Harvard University's Berkman Center Releases Amber, a "Mutual Aid" Tool for Bloggers & Website Owners to Help Keep the Web Available | Berkman Center . cyber.law.harvard.edu . Date accessed January 28, 2016. Archived February 2, 2016.
- ↑ Habibzadeh, 2015, p. 1.
Literature
- Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, Andrew Tomkins. Sic transit gloria telae: towards an understanding of the web's decay // Proceedings of the 13th conference on the World Wide Web — WWW '04. — 2004. — pp. 328–337. — ISBN 978-1581138443. — doi:10.1145/988672.988716.
- Jesper Rønn-Jensen. Software Eliminates User Errors And Linkrot. — Justaddwater.dk, 2007. Archived October 11, 2007.
- Tim Berners-Lee. Cool URIs don't change. — 1998. Archived September 27, 2013.
- Hans Van der Graaf. The half-life of a link is two year // ZOMDir's blog. — 2017. Archived October 17, 2017.
- Maciej Cegłowski. Web Design: The First 100 Years. — 2014. — September. Archived July 22, 2015.
- Jonathan Zittrain, Kendra Albert, Lawrence Lessig. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations // Legal Information Management. — 2014. — June (vol. 14, issue 2). — pp. 88–99. — doi:10.1017/S1472669614000255.
- Jason Hennessey, Steven Xijin Ge. A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques // BMC Bioinformatics. — 2013. — Vol. 14. — p. S5. — doi:10.1186/1471-2105-14-S14-S5. — PMID 24266891. Archived January 21, 2015.
- John Mueller. FYI on Google Toolbar's Latest Features. — Google Webmaster Central Blog, 2007. Archived September 13, 2008.
- Parham Habibzadeh. Are current archiving systems reliable enough? // International Urogynecology Journal. — 2015. — Vol. 26, issue 10. — ISSN 0937-3462. — doi:10.1007/s00192-015-2805-7. — PMID 26224384.
Literature for further reading
Link rot on the Internet
- Leighton Walter Kille. The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers. — Journalist's Resource, Harvard Kennedy School, 2014. — November. Archived January 12, 2015.
- Gunther Eysenbach, Mathieu Trudel. Going, going, still there: Using the WebCite service to permanently archive cited web pages // Journal of Medical Internet Research. — 2005. — Vol. 7, issue 5. — p. e60. — doi:10.2196/jmir.7.5.e60. — PMID 16403724.
- Wallace Koehler. A longitudinal study of web pages continued: a consideration of document persistence // Information Research. — 2004. — Vol. 9, issue 2.
- Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener. A large-scale study of the evolution of web pages // Proceedings of the 12th international conference on World Wide Web. — 2003.
- John Markwell, David W. Brooks. Broken Links: The Ephemeral Nature of Educational WWW Hyperlinks // Journal of Science Education and Technology. — 2002. — Vol. 11, issue 2. — pp. 105–108. — doi:10.1023/A:1014627511641.
- Tim Berners-Lee. Cool URIs Don't Change. — 1998.
In the academic literature
- Parham Habibzadeh. Decay of References to Web sites in Articles Published in General Medical Journals: Mainstream vs Small Journals // Applied Clinical Informatics. — Schattauer GmbH – Publishers for Medicine and Natural Sciences, 2013. — Vol. 4, issue 4. — pp. 455–464. — doi:10.4338/aci-2013-07-ra-0055. — PMID 24454575.
- Daniel Gomes, Mário J. Silva. Modelling Information Persistence on the Web // Proceedings of the 6th International Conference on Web Engineering. — 2006. Archived copy of July 16, 2011 at the Wayback Machine.
- Robert P. Dellavalle, Eric J. Hester, Lauren F. Heilig, Amanda L. Drake, Jeff W. Kuntzman, Marla Graber, Lisa M. Schilling. Going, Going, Gone: Lost Internet References // Science. — 2003. — Vol. 302, issue 5646. — pp. 787–788. — doi:10.1126/science.1088234. — PMID 14593153.
- Steve Lawrence, David M. Pennock, Gary William Flake, Robert Krovetz, Frans M. Coetzee, Eric Glover, Finn Arup Nielsen, Andries Kruger, C. Lee Giles. Persistence of Web References in Scientific Research // Computer. — 2001. — Vol. 34, issue 2. — pp. 26–31. — doi:10.1109/2.901164.
- Wallace Koehler. An Analysis of Web Page and Web Site Constancy and Permanence // Journal of the American Society for Information Science. — 1999. — Vol. 50, issue 2. — pp. 162–180. — doi:10.1002/(SICI)1097-4571(1999)50:2<162::AID-ASI7>3.0.CO;2-B.
- Frank McCown, Sheffan Chan, Michael L. Nelson, Johan Bollen. The Availability and Persistence of Web References in D-Lib Magazine // Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05). — 2005.
- Carmine Sellitto. The impact of impermanent Web-located citations: A study of 123 scholarly conference publications // Journal of the American Society for Information Science and Technology. — 2005. — Vol. 56, issue 7. — pp. 695–703. — doi:10.1002/asi.20159.
- Diomidis Spinellis. The Decay and Failures of Web References // Communications of the ACM. — 2003. — Vol. 46, issue 1. — pp. 71–77. — doi:10.1145/602421.602422.
In digital libraries
- Michael L. Nelson, B. Danette Allen. Object Persistence and Availability in Digital Libraries // D-Lib Magazine. — 2002. — Vol. 8, no. 1. — doi:10.1045/january2002-nelson.
External links
- Future-Proofing Your URIs
- Jakob Nielsen, «Fighting Linkrot», Jakob Nielsen's Alertbox, June 14, 1998.