HITS Algorithm

A tightly coupled set of authoritative and hub documents

The HITS algorithm ( Eng. Hyperlink Induced Topic Search ), proposed in 1999 by John Kleinberg , allows you to find web pages that match the user’s request, based on the information contained in the hyperlinks^[1] .

The HITS metric is often used to answer a broad topic of queries and finding document communities ( Tightly-Knit Community ) on the Internet . The idea of the algorithm is based on the assumption that hyperlinks encode a significant number of hidden authority pages ^[2] .

An authoritative document (authoritative page, author) is a document corresponding to a user’s request, having a greater share among documents of this subject, that is, a larger number of documents refer to this document ^[1] .

A hub document (hub page, intermediary) is a document containing many links to authoritative documents.

The page referenced by many other points should be a good "author." In turn, a page that points to many others should be a good intermediary. Based on this, in the HITS algorithm for each web page, two ratings are calculated: a credibility score and an intermediary score. That is, for each page, its significance as the “author” and “intermediary” ^[3] ^{[4] is} recursively calculated ^. .

Content

1 Algorithm
2 Detailing
- 2.1 Rule of authority update
- 2.2 Hub update rule
- 2.3 Normalization
3 HITS and PageRank Algorithm
4 Disadvantages of HITS
5 notes
6 Literature

Algorithm

Extending the root set of relevant pages in the base set

The first step in the HITS algorithm is to get the most relevant pages in the search query . This set is called the root set and can be obtained by accepting the most popular pages n returned by the text search algorithm. The basic set is formed by increasing the root set with all the web pages that are associated with it and with some pages that link to it. Web pages in the base set and all hyperlinks between these pages form a concentrated subgraph. HITS calculations are performed only on this subgraph.

Estimates of an authoritative document and an intermediary are defined in terms of each other in mutual recursion . The authority rating of a page is calculated as the sum of the mediation page rating values that point to this page. The intermediary’s rating value is calculated as the sum of the ratings of the authoritative pages to which it points.

The algorithm performs a series of iterations , each of which consists of two main stages:

Renewal of authority . An update of the authoritative assessment of each vertex of the subgraph, equivalent to the sum of the intermediary ratings of each of the vertices pointing to them.
Hub update . Updating the intermediary assessment of each vertex of a subgraph by summing up the authoritative assessments of each of the vertices to which they point.

Authority assessment and intermediary assessment for the top is calculated according to the following algorithm:

Start with the peaks with an authority score and mediation score of 1.
Compliance with the rule of authority update.
Implementation of the hub update rule.
Normalization of values by dividing each intermediary score by the square root of the sum of squares of all intermediary ratings, and dividing each authority score by the square root of the sum of squares of all authority ratings.
Repeat from the second step as needed.

Detail

To start ranking, $\forall p$ ${\ displaystyle \ forall p}$ , $\mathrm {auth} (p)=1$ ${\ displaystyle \ mathrm {auth} (p) = 1}$ and $\mathrm {hub} (p)=1$ ${\ displaystyle \ mathrm {hub} (p) = 1}$ . Consider two types of updates: the authority update rule and the hub update. In order to calculate authority / intermediary ratings, repeated iterations of authority update rules and hub updates are applied. The K-step of applying the algorithm implies the application of the k-time of the first rule of updating authority and then the rule of hub updating.

Basic operations in the HITS algorithm: authority update rule and hub update

Authority Renewal Rule

$\forall p$ ${\ displaystyle \ forall p}$ , we get $\mathrm {auth} (p)$ ${\ displaystyle \ mathrm {auth} (p)}$ = $\displaystyle \sum _{i=1}^{n}\mathrm {hub} (i)$ ${\ displaystyle \ displaystyle \ sum _ {i = 1} ^ {n} \ mathrm {hub} (i)}$ where n is the total number of pages associated with p and i is the page associated with p. Thus, the page credibility score is calculated as the sum of the mediation page score values that point to this page.

Hub Update Rule

$\forall p$ ${\ displaystyle \ forall p}$ , we get $\mathrm {hub} (p)$ ${\ displaystyle \ mathrm {hub} (p)}$ = $\displaystyle \sum _{i=1}^{n}\mathrm {auth} (i)$ ${\ displaystyle \ displaystyle \ sum _ {i = 1} ^ {n} \ mathrm {auth} (i)}$ where n is the total number of pages that p points to and i is the page that p points to. Thus, the intermediary assessment of the page is calculated as the sum of the values of the authority ratings of the pages to which it refers.

Depending on these values, the importance of the web pages for a particular request is calculated and then displayed to the user. The rating of the HITS module calculates the rank of a web page offline after it has been downloaded and stored in a local database. ^[5]

Normalization

Final vertex estimates are determined after an endless repetition of the algorithm. The direct and consistent application of the rules of hub updating and updating authority leads to diverging values that must be normalized by the matrix after each iteration. Thus, the values obtained as a result of this process ultimately converge.

HITS and PageRank Algorithm

The HITS algorithm has several important differences from the PageRank algorithm . ^[6]

The HITS algorithm calculates not only the rank of each node, but also gives an intermediary score.
The PageRank algorithm contains a free parameter α, which is usually not included in the HITS algorithm.
The priority, as a result of the PageRank algorithm, is, as a rule, to use older resources, while the HITS algorithm has a smaller bias in this regard.
The PageRank algorithm can find the only unique solution.

Despite the differences between HITS and PageRank, the common thing in these algorithms is that the authority (weight) of the node depends on the weight of other nodes, and the level of the “intermediary” depends on how authoritative the nodes it refers to.

Calculation of the authority of individual documents today is widely used in applications such as determining the order of scanning documents in the network by the IPS robot, ranking search results, forming thematic reviews, etc.

Currently, the technology of artificially raising the ranks of individual web documents or their groups of websites by establishing hyperlinks that are not related to their content has become widespread. These technologies, which are an unreliable variety of Search Engine Optimization ( SEO) methods, called “black” SEO, are based on the adaptation to existing algorithms for ranking web documents by the most popular ( search engines ).

In turn, such technologies lead to the need for continuous improvement of ranking algorithms in search engines, orientation to the content component of web documents when determining their ranks. ^[four]

HITS Weaknesses

When evaluating the HITS algorithm, a lot of research was done and it was shown that while the algorithm works well for most queries, it does not work for some others. There are several reasons ^[7] :

Intermediaries and authors.

The inevitability of a clear distinction between “intermediaries” and “authors”, since many intermediary pages are also copyrighted.

Topic shift ( Eng. Topic drift ).

The dominant arrangement of some thematically closely related documents as a result of the HITS algorithm. In some cases, these documents may not be relevant to the request . It was recorded that in one case, when the search query element was “Jaguar”, the HITS algorithm converged to a football team called Jaguars.

To solve this problem, the PHITS algorithm was proposed ^[4] , as some extension of the standard HITS algorithm. As part of this algorithm, it is assumed: $D$ ${\ displaystyle D}$ - a lot of citing documents, $C$ ${\ displaystyle C}$ - many links $Z$ ${\ displaystyle Z}$ - many classes (factors). It is also assumed that the event $d\in {D}$ ${\ displaystyle d \ in {D}}$ happens with probability $P(d)$ ${\ displaystyle P (d)}$ . Conditional Probabilities $P(c|z)$ ${\ displaystyle P (c | z)}$ and $P(z|d)$ ${\ displaystyle P (z | d)}$ used to describe dependencies between links $c\in {C}$ ${\ displaystyle c \ in {C}}$ latent factor $z\in {Z}$ ${\ displaystyle z \ in {Z}}$ and document $d\in {D}$ ${\ displaystyle d \ in {D}}$ .

The likelihood function is evaluated:

L(C|D)=\prod _{c\in {C},d\in {D}}^{\ }P(d,c)=\prod _{c\in {C},d\in {D}}^{\ }P(d)P(c|d)

{\ displaystyle L (C | D) = \ prod _ {c \ in {C}, d \ in {D}} ^ {\} P (d, c) = \ prod _ {c \ in {C}, d \ in {D}} ^ {\} P (d) P (c | d)}

,

P(c|d)=\sum _{z\in {Z}}P(c|z)P(z|d)

{\ displaystyle P (c | d) = \ sum _ {z \ in {Z}} P (c | z) P (z | d)}

The purpose of the PHITS algorithm is to match $P(z)$ ${\ displaystyle P (z)}$ , $P(c|z)$ ${\ displaystyle P (c | z)}$ , $P(z|d)$ ${\ displaystyle P (z | d)}$ to maximize $L(C|D)$ ${\ displaystyle L (C | D)}$ .

After that:

 $P(c|z)$           $P$       $($       $c$         $|$         $z$       $)$           ${\ displaystyle P (c | z)}$       - ranks of "authors";

 $P(z|d)$           $P$       $($       $z$         $|$         $d$       $)$           ${\ displaystyle P (z | d)}$       - ranks of "intermediaries".

To calculate the ranks, it is necessary to set the number of factors in the set

Z

{\ displaystyle Z}

, and then

P(c|z)

{\ displaystyle P (c | z)}

will characterize the quality of the page as an “author” in the context of the subject. The disadvantages of the method include the fact that the iterative process most often stops not at the absolute, but at the local maximum of the likelihood function

L

{\ displaystyle L}

. At the same time, in situations where the set of found web pages does not clearly dominate the subject of the request, PHITS surpasses the HITS algorithm.

Automatically generated links.

Some of the links are computer generated, but the HITS algorithm still gives them equal values.

Irrelevant documents.

Some queries may return to a high place in the ranking irrelevant documents, which leads to erroneous results of the HITS algorithm.

Notes

↑ ¹ ² Krizhanovsky, 2008 , p. 27.
↑ The metric of HITS, 2005 , p. 55.
↑ Kleinberg, 1999 .
↑ ¹ ² ³ Algorithm HITS, 2009 .
↑ Hubs and authorities, 2010 , p. 5.
↑ PageRank and HITS, 2010 , p. 257.
↑ Problems with the HITS algorithm, 2011 , p. 255.

Literature

Lande D.V., Snarsky A.A., Bezsudnov I.V. Internet. Navigation in complex networks: models and algorithms (Russian) . - Librocom, 2009 .-- 264 p. - ISBN 978-5-397-00497-8 .
Cronin B. Annual Review of Information Science and Technology . - 2004 .-- 674 p. - ISBN 1573872091 .
Kleinberg J. Authoritative sources in a hyperlinked environment . - 1999.
Kleinberg J. HITS algorithm: authoritative sources in a hyperlinked environment (Russian) / translation by S. Neilenko. - 1999.
Gupta GK Introduction to Data Mining with Case Studies: 2nd Edition . - PHI Learning Pvt. Ltd., 2011 .-- 491 p. - ISBN 978-81-203-4326-9 .
Leo JG, Jonathan RP Discrete Calculus. Applied Analysis on Graphs for Computational Science . - Springer, 2010 .-- 366 p. - ISBN 978-1-84996-289-6 . (inaccessible link)
Scime A. Web Mining: Applications and Techniques . - Idea Group Inc., 2005 .-- 433 p. - ISBN 1591404150 .
Krizhanovsky A.A. PhD thesis. Mathematical and software for constructing lists of semantically related words based on the rating of wiki texts . - SPb. , 2008 .-- S. 27-30. - 188 p.
Chandranna AK An Online Versions of Hyperlinked-Induced Topics Search (HITS) Algorithm . - 2010.

[_2accd3fe65e2de83-1] ¹ ² Krizhanovsky, 2008 , p. 27.

[_25ad3e7b93d72430-2] The metric of HITS, 2005 , p. 55.

[_b085bc805d5a2f10-3] Kleinberg, 1999 .

[_72e7e698dcf932df-4] ¹ ² ³ Algorithm HITS, 2009 .

[_b4655ac6327dbf2d-5] Hubs and authorities, 2010 , p. 5.

[_cb4bdc07062c9f1e-6] PageRank and HITS, 2010 , p. 257.

[_b017a1400afec86d-7] Problems with the HITS algorithm, 2011 , p. 255.