The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for object detection. The technique counts occurrences of gradient orientations in localized regions of an image. It is similar to edge orientation histograms, SIFT descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.
Navneet Dalal and Bill Triggs, researchers at INRIA, first described histograms of oriented gradients in their June 2005 CVPR paper. In that work they used the algorithm for pedestrian detection in static images, although they later extended it to detecting humans in video, as well as to various animals and vehicles in static images.
Theory
The essential idea behind the algorithm is that the appearance and shape of a local object in an image can be described by the distribution of intensity gradients or edge directions. The image is divided into small connected regions called cells, and for the pixels within each cell a histogram of gradient directions or edge orientations is compiled. The descriptor is the concatenation of these histograms. For improved accuracy, the local histograms are contrast-normalized: an intensity measure is computed over a larger region of the image, called a block, and this value is used to normalize all cells within the block. Normalized descriptors are more invariant to changes in illumination.
The HOG descriptor has several advantages over other descriptors. Since it operates on local cells, the method is invariant to geometric and photometric transformations, except for object orientation; such changes would appear only in larger regions of the image. Moreover, as Dalal and Triggs found, coarse spatial sampling, fine orientation sampling, and strong local photometric normalization permit the individual body movements of pedestrians to be ignored, so long as they maintain a roughly upright posture. The HOG descriptor is thus particularly well suited for detecting people in images. [1]
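As a concrete illustration of how cells, blocks, and the final descriptor relate, the following sketch computes the descriptor length for the configuration Dalal and Triggs report for pedestrian detection (a 64×128 detection window, 8×8-pixel cells, 16×16-pixel blocks with an 8-pixel stride, and 9 histogram channels):

```python
# Descriptor length for the standard Dalal-Triggs pedestrian window.
win_w, win_h = 64, 128   # detection window, pixels
cell = 8                 # cell side, pixels
block = 16               # block side, pixels (2x2 cells)
stride = 8               # block step, pixels (blocks overlap by one cell)
bins = 9                 # orientation channels per cell histogram

blocks_x = (win_w - block) // stride + 1   # 7 block positions horizontally
blocks_y = (win_h - block) // stride + 1   # 15 block positions vertically
cells_per_block = (block // cell) ** 2     # 4 cells in each block
descriptor_len = blocks_x * blocks_y * cells_per_block * bins

print(descriptor_len)  # 3780 values per detection window
```

The overlap means each cell histogram contributes to up to four blocks, each time under a different normalization.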
Algorithm implementation
Gradient Calculation
The first step in many feature detectors is color and gamma normalization. Dalal and Triggs found that this step can be omitted for the HOG descriptor, since the subsequent normalization achieves the same result. The first step of the computation is therefore the calculation of gradient values. The most common method is to apply a one-dimensional centered derivative mask in the horizontal and/or vertical direction. This requires filtering the color or intensity data of the image with the following filter kernels:
- [-1, 0, 1] and [-1, 0, 1]^T
Dalal and Triggs tested more complex masks, such as the 3×3 Sobel operator or diagonal masks, but these generally performed worse for this task. They also experimented with Gaussian smoothing before applying the derivative mask, but found that omitting the smoothing step increases speed with no noticeable loss of quality. [2]
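A minimal NumPy sketch of this step (an illustration, not the authors' code): the centered [-1, 0, 1] mask reduces to a difference of shifted pixels, and the unsigned orientation is folded into the range [0, 180).

```python
import numpy as np

def gradients(img):
    """Centered [-1, 0, 1] derivative masks; returns gradient magnitude
    and unsigned orientation in degrees (borders are left at zero)."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal mask [-1, 0, 1]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical mask [-1, 0, 1]^T
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # "unsigned" gradient
    return mag, ang

# On a horizontal intensity ramp I(y, x) = x, the interior gradient is
# purely horizontal with magnitude 2 and orientation 0 degrees.
ramp = np.tile(np.arange(8.0), (8, 1))
mag, ang = gradients(ramp)
print(mag[4, 4], ang[4, 4])  # 2.0 0.0
```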
Orientation binning
In the next step, cell histograms are created. Each pixel within a cell casts a weighted vote for an orientation histogram channel, based on the values found in the gradient computation. The cells can be rectangular or radial, and the histogram channels are spread evenly over 0 to 180 degrees or 0 to 360 degrees, depending on whether the "unsigned" or "signed" gradient is used. Dalal and Triggs found that the unsigned gradient with nine histogram channels gives the best recognition results. As for the vote weight, the pixel contribution can be the absolute value of the gradient or some function of it; in their tests, the absolute gradient value itself produced the best results. Other options include the square root, the square, or a clipped version of the gradient magnitude. [3]
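The voting step can be sketched as follows (an illustrative NumPy implementation, assuming nine unsigned channels and magnitude weighting; splitting each vote bilinearly between the two nearest channel centers is a common refinement not spelled out above):

```python
import numpy as np

def cell_histogram(mag, ang, nbins=9):
    """Magnitude-weighted orientation histogram for one cell.
    ang: unsigned orientations in [0, 180); mag: gradient magnitudes."""
    bin_width = 180.0 / nbins          # 20 degrees per channel
    hist = np.zeros(nbins)
    # Fractional channel position; centers sit at 10, 30, ..., 170 degrees.
    pos = ang / bin_width - 0.5
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    # Each pixel's vote is split between the two nearest channels,
    # wrapping around so 0 and 180 degrees are treated as the same edge.
    np.add.at(hist, lo % nbins, mag * (1.0 - frac))
    np.add.at(hist, (lo + 1) % nbins, mag * frac)
    return hist

# A unit-magnitude pixel at 20 degrees splits its vote equally between
# the channels centered at 10 and 30 degrees.
h = cell_histogram(np.array([1.0]), np.array([20.0]))
print(h[0], h[1])
```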
Descriptor blocks
To account for changes in illumination and contrast, the gradient strengths must be locally normalized, which requires grouping the cells into larger, spatially connected blocks. The HOG descriptor is then the vector of the components of the normalized cell histograms from all of the blocks. As a rule, the blocks overlap, meaning that each cell contributes to more than one block of the final descriptor. Two main block geometries are used: rectangular R-HOG blocks and circular C-HOG blocks. R-HOG blocks are generally square grids characterized by three parameters: the number of cells per block, the number of pixels per cell, and the number of channels per cell histogram. In the Dalal and Triggs experiment, the optimal parameters were 16×16-pixel blocks of four 8×8-pixel cells with 9 channels per histogram. Moreover, they found that performance could be slightly improved by applying a Gaussian spatial window within each block before the voting procedure, which down-weights pixels near the block boundaries. R-HOG blocks appear quite similar to SIFT descriptors; however, despite the similar structure, R-HOG blocks are computed on dense grids at a single scale without orientation alignment, whereas SIFT descriptors are computed at sparse, scale-invariant key image points and are rotated to align orientation. In addition, R-HOG blocks are used in conjunction to encode information about object shape, while SIFT descriptors are used individually.
C-HOG blocks come in two variants: those with a single, solid central cell and those with a central cell divided into sectors. They can be described by four parameters: the numbers of angular and radial bins, the radius of the central bin, and the expansion factor for the radii of the remaining bins. Dalal and Triggs found that both variants performed equally well, and that two radial bins and four angular bins with a central radius of 4 pixels and an expansion factor of 2 gave the best results in their experiment. In addition, Gaussian weighting gave no benefit when used with C-HOG blocks. C-HOG blocks are similar to shape contexts, with one important difference: C-HOG blocks contain cells with several orientation channels, whereas shape contexts record only the presence of an edge. [4]
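Assembling R-HOG blocks from the grid of cell histograms can be sketched like this (illustrative NumPy code, assuming 2×2 cells per block, a one-cell block stride, and plain L2 normalization within each block):

```python
import numpy as np

def build_descriptor(cell_hists, cells_per_block=2, eps=1e-5):
    """Concatenate overlapping, L2-normalized blocks of cell histograms.
    cell_hists: array of shape (cells_y, cells_x, nbins)."""
    ny, nx, _ = cell_hists.shape
    blocks = []
    for y in range(ny - cells_per_block + 1):       # one-cell stride:
        for x in range(nx - cells_per_block + 1):   # blocks overlap
            v = cell_hists[y:y + cells_per_block,
                           x:x + cells_per_block].ravel()
            v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # L2 block norm
            blocks.append(v)
    return np.concatenate(blocks)

# A 4x4 grid of 9-channel cells yields 3*3 overlapping blocks of
# 36 values each, i.e. a descriptor of 324 components.
desc = build_descriptor(np.ones((4, 4, 9)))
print(desc.shape)  # (324,)
```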
Block normalization
Dalal and Triggs investigated four methods of block normalization. Let v be the non-normalized vector containing all the histograms of a given block, ||v||_k its k-norm for k = 1, 2, and e some small constant (its exact value is unimportant). The normalization factor can then be obtained in one of the following ways:
- L2-norm: f = v / sqrt(||v||_2^2 + e^2)
- L2-Hys: the L2-norm, clipped from above (values of v greater than 0.2 are set to 0.2) and then renormalized, as in [5]
- L1-norm: f = v / (||v||_1 + e)
- L1-sqrt: f = sqrt(v / (||v||_1 + e))
Dalal and Triggs found that the L1-norm gives noticeably less reliable results than the other three, which perform about equally well, while all four methods give a marked improvement over the non-normalized version. [4]
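The four schemes can be written out directly (a sketch; the clipping threshold 0.2 for L2-Hys is the value quoted above, and eps plays the role of the small constant e):

```python
import numpy as np

def block_normalize(v, method="L2-Hys", eps=1e-5):
    """Normalize a block vector v with one of the four schemes."""
    if method == "L2":
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if method == "L2-Hys":
        f = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
        f = np.minimum(f, 0.2)                         # clip at 0.2 ...
        return f / np.sqrt(np.sum(f ** 2) + eps ** 2)  # ... renormalize
    if method == "L1":
        return v / (np.sum(np.abs(v)) + eps)
    if method == "L1-sqrt":
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError(method)

v = np.array([3.0, 4.0])
print(np.round(block_normalize(v, "L2"), 3))  # close to [0.6 0.8]
```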
SVM Classifier
The final step in object recognition using HOG descriptors is to feed the descriptors into a supervised learning system. Dalal and Triggs used the support vector machine (SVM) method.
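Dalal and Triggs trained a linear SVM on HOG vectors of positive and negative windows. As an illustration only (not their implementation), a linear SVM can be trained with a simple hinge-loss subgradient descent; in practice one would use an off-the-shelf library such as scikit-learn or LIBLINEAR.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Tiny hinge-loss SVM: minimize lam*|w|^2/2 + mean hinge loss.
    X: (n, d) feature vectors (e.g. HOG descriptors);
    y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) < 1:          # margin violated
                w += lr * (y[i] * X[i] - lam * w)  # hinge + regularizer
                b += lr * y[i]
            else:
                w -= lr * lam * w                  # regularizer only
    return w, b

# Toy 2-D data standing in for HOG vectors of person / non-person windows.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # matches y on this separable toy set
```

At detection time, the learned (w, b) is evaluated over HOG descriptors of sliding windows, and windows scoring above a threshold are reported.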
Testing
In the original human detection experiment, Dalal and Triggs compared their R-HOG and C-HOG descriptor blocks against generalized Haar wavelets, PCA-SIFT descriptors, and shape contexts. Generalized Haar wavelets are oriented Haar wavelets, and were used in 2001 by Mohan, Papageorgiou and Poggio in their object detection experiments. PCA-SIFT descriptors are similar to SIFT descriptors, but differ in that principal component analysis is applied to the normalized gradients. PCA-SIFT descriptors were first used in 2004 in the work of Ke and Sukthankar, who claimed that they outperform regular SIFT descriptors. Finally, shape contexts, like C-HOG, use circular bins, but tally votes only on the basis of edge presence, disregarding orientation. Shape contexts appeared in 2001 in the work of Belongie, Malik and Puzicha.
Testing was conducted on two different data sets. The MIT pedestrian database contains a training set of 509 images and a test set of 200 images. The set contains images of people seen only from the front or the back, with little pose variation. This database is widely known and used in other studies; it can be found at https://web.archive.org/web/20080130190339/http://cbcl.mit.edu/cbcl/software-datasets/PedestrianData.html . The second data set was created by Dalal and Triggs specifically for their experiment, since HOG descriptors performed near-perfectly on the MIT set. Their data set, known as INRIA, contains 1805 images of people. It contains people in a wide variety of poses, includes images with difficult backgrounds (for example, crowds), and is considerably harder than the MIT set. The INRIA database is currently available at http://lear.inrialpes.fr/data .
According to the results, the C-HOG and R-HOG descriptors perform comparably, with the C-HOG blocks giving a slightly lower miss rate at a fixed false positive rate on both data sets.
| Descriptor | Data set | Miss rate | False positive rate (per window) |
|---|---|---|---|
| HOG | MIT | ≈0 | 10⁻⁴ |
| HOG | INRIA | 0.1 | 10⁻⁴ |
| Generalized Haar wavelets | MIT | 0.01 | 10⁻⁴ |
| Generalized Haar wavelets | INRIA | 0.3 | 10⁻⁴ |
| PCA-SIFT, shape contexts | MIT | 0.1 | 10⁻⁴ |
| PCA-SIFT, shape contexts | INRIA | 0.5 | 10⁻⁴ |
The corresponding DET curves can be found via the link in the reference. [2]
Further development
As part of the 2006 Pascal Visual Object Classes workshop, Dalal and Triggs presented results of applying HOG descriptors to detecting not only people in images, but also cars, buses, bicycles, dogs, cats, and cows, along with the optimal block formation and normalization parameters for each case. The linked slides include examples of motorbike detection. [6]
Then, at the 2006 European Conference on Computer Vision, Dalal and Triggs, together with Cordelia Schmid, applied HOG descriptors to detecting humans in video. Their method combines ordinary HOG descriptors computed on individual frames with internal motion histograms (IMH) computed on pairs of consecutive frames. IMH descriptors use gradient magnitudes derived from the optical flow between two consecutive frames.
See also
- Corner detection
- Pedestrian detection
- Interest point (computer vision)
- Interest point detection (computer vision)
- Feature extraction
- Object recognition
- SIFT
Notes
- [1] Histograms of Oriented Gradients for Human Detection, p. 2. Archived January 25, 2013.
- [2] Histograms of Oriented Gradients for Human Detection, p. 4. Archived January 25, 2013.
- [3] Histograms of Oriented Gradients for Human Detection, p. 5. Archived January 25, 2013.
- [4] Histograms of Oriented Gradients for Human Detection, p. 6. Archived January 25, 2013.
- [5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2): 91–110, 2004.
- [6] Object Detection using Histograms of Oriented Gradients. Archived January 25, 2013.
Links
- http://www.mathworks.com/matlabcentral/fileexchange/33863 Implementation for Matlab (mex file)
- http://www.cs.cmu.edu/~yke/pcasift/ - Code for finding objects using the PCA-SIFT method
- http://lear.inrialpes.fr/software/ - A set of software for finding objects using HOG (home page of the research group)
- https://web.archive.org/web/20100502032344/http://www.navneetdalal.com/software - Software for object detection using HOG (Navneet Dalal's home page)
- http://pascal.inrialpes.fr/data/human/ - Set of INRIA images with people
- http://cbcl.mit.edu/software-datasets/PedestrianData.html - MIT Pedestrian Image Set