Hierarchical image representation models based on Latent Dirichlet Allocation
1. Introduction
Local feature descriptors such as SIFT[1] are currently among the most successful and widely used techniques in computer vision[2], and they have been applied with success to scene and object recognition. The core idea of these methods is to encode the gradient orientations within many small image regions in a way that is both compact and discriminative. The local features are then quantized against a set of visual words (cluster centers in descriptor space). On top of this quantization, researchers have proposed various global image representations, such as the bag-of-words model and the spatial pyramid model[3], and have replaced hard clustering with soft clustering[4]; that is, each descriptor is represented as a mixture of several features. This improves robustness and is also consistent with what is observed in biological systems[5]. Theoretical analysis and biological experiments indicate that such decompositions perform well on object recognition data sets, including newer data sets[6,7].
Recent results suggest that multilayer visual representations can further improve object recognition performance by increasing the robustness of the representation[8], which agrees with current hierarchical theoretical frameworks and with studies of the mammalian visual cortex[9]. These theories emphasize the importance of feedback, both for improving classification learning and for disambiguating local information during inference.
However, most existing hierarchical representation models[11,12] process information in a strictly feed-forward manner: the input of each layer is simply the output of the previous layer. Such models are not robust to local ambiguity in the visual input; in other words, they cannot resolve locally ambiguous measurements. Disambiguating local information requires more context than a local image patch alone provides. A recursive Bayesian probabilistic model for complex visual features is therefore needed.
Based on this analysis, we propose a probabilistic model in which all layers of the hierarchy are learned and inferred jointly. It builds on Latent Dirichlet Allocation (LDA)[14], which has been applied successfully to visual word modeling[13] and to object detection tasks[7]. By treating the recursive decomposition itself probabilistically, we derive a pyramid LDA model with a multilayer structure. The model has two important properties: 1) adding representation layers improves performance; 2) the Bayesian probabilistic framework provides a natural way to integrate top-down information into the model, and is superior to a feed-forward implementation. Results on standard recognition data sets show that the proposed model outperforms existing hierarchical representation methods.
2. The multilayer pyramid LDA model
This paper presents a multilayer pyramid LDA model: a probabilistic model in which all layers of the hierarchy are learned and inferred jointly, so that hierarchical representations of an image can be expressed effectively. Note that the proposed model differs from the traditional hierarchical LDA model[15]. Hierarchical LDA forms a hierarchy of topics over the same vocabulary; by contrast, the multilayer pyramid LDA model is defined over recursively formed words, and it generalizes the single latent topic variable of LDA into a Bayesian network of variables. We briefly review hierarchical LDA here. Hierarchical LDA extends LDA by arranging the topics in a hierarchy. Given a tree of depth L in which each node is assigned a topic, a document is generated as follows: 1) choose a path from the root of the tree to a leaf node; 2) draw a vector θ of topic proportions from an L-dimensional Dirichlet distribution; 3) generate the words of the document from the topics along this path, mixed according to θ. Finally, the Chinese restaurant process (CRP) is used to relax the assumption of a fixed tree structure. The CRP is a distribution over partitions of the integers, and can be pictured as M customers being seated at a Chinese restaurant with an infinite number of tables: the first customer sits at the first table, and the m-th customer chooses a seat according to the following probability distribution.
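In the standard CRP with concentration parameter γ (a symbol assumed here, not taken from the original text), the seating probabilities for the m-th customer take the form

$$
P(\text{sit at occupied table } i) = \frac{m_i}{\gamma + m - 1}, \qquad
P(\text{sit at the next unoccupied table}) = \frac{\gamma}{\gamma + m - 1}.
$$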
Here OTi denotes an already occupied table i; NUT denotes the next unoccupied table; PC denotes the customers who are already seated; and mi is the number of customers at table i. After all M customers are seated, the seating arrangement realizes a partition of the M customers; this distribution has the same partition structure as the Dirichlet process[16].
Below, we present the proposed LDA-based hierarchical representation of images. Compared with previous latent factor models, this method builds a hierarchical spatial distribution on top of the latent topics. For clarity, the model is first described with two layers, L0 and L1, as shown in Figure 1(a). In this example, the L0 layer uses a 4 × 4 spatial grid, and each grid cell carries a distribution over V = 8 gradient orientation bins, which constitute the word vocabulary. Because the words of the vocabulary correspond to orientation bins at grid positions, their occurrence counts correspond to histogram energies, i.e., to the bins of a SIFT descriptor.
In Figure 1(a), the T0-component mixture model is parameterized by Φ0 ∈ R^(T0×X0×V); in this example, Φ0 ∈ R^(T0×(4×4)×8). The L1 layer integrates the mixing proportions of the L0 layer: the spatial grid of L0 features is aggregated over the L1 regions into X1, and each L1 component is a spatial distribution over L0 components. The T1-component mixture model at the L1 layer is parameterized analogously by Φ1 ∈ R^(T1×X1×T0).
The spatial grid at each level, together with each observed word's position variable x, determines where a word occurrence falls. The words, however, are not distributed uniformly over the grid, and the distribution may differ between components. This requires introducing, at each level, a spatial (position) distribution X into the mixture distribution Φ; such spatial distributions exist at every level. We therefore define a complete generative model. Note that a single-layer model with a 1 × 1 grid is equivalent to LDA, which is thus a special case of the proposed method.
Fig. 1(a) Concept of the pyramid multilayer LDA model
2.1 Generative modeling process
As shown in Figure 1(b), the graphical model of the two-layer pyramid LDA contains the following variables:
α, β0, β1 are symmetric Dirichlet priors (hyperparameters);
T0 and T1 are the numbers of mixture components at the L0 and L1 layers, respectively;
Φ0 ∈ R^(T0×X0×V) and Φ1 ∈ R^(T1×X1×T0) are the mixture parameters, which incorporate the spatial (position) distributions X introduced at each level;
x0, x1, z0, z1 and w are defined by the generative process described below.
Fig. 1(b) Graphical model of the two-layer case
Given the hyperparameters α, β0, β1, the joint distribution over the model variables can be factored as follows:
(1)
In the formula above, a superscript index in parentheses is specific to each variable, and "·" denotes the whole range of a variable. For example, Φ1(t1, ·, ·) denotes the multinomial parameters over grid positions and L0 topics for the L1 topic t1.
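As a sketch only, a two-layer factorization consistent with the variable definitions above would look like the following; the exact form of Eq. (1) in the original derivation may differ:

$$
p(w, z_0, z_1, \theta, \Phi_0, \Phi_1 \mid x_0, x_1, \alpha, \beta_0, \beta_1)
= p(\Phi_0 \mid \beta_0)\, p(\Phi_1 \mid \beta_1)\, \prod_{d} p(\theta^{(d)} \mid \alpha)
\prod_{n} p\big(z_1^{(d,n)} \mid \theta^{(d)}\big)\,
p\big(z_0^{(d,n)} \mid \Phi_1(z_1^{(d,n)}, x_1^{(d,n)}, \cdot)\big)\,
p\big(w^{(d,n)} \mid \Phi_0(z_0^{(d,n)}, x_0^{(d,n)}, \cdot)\big).
$$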
2.2 Learning and inference
To learn the model parameters, inference draws, for each word occurrence w(d, n), samples of the latent assignments z0(d, n) and z1(d, n) at the L0 and L1 layers, from which the mixture distributions are obtained. In addition, as in Figure 1(a), the observed position variables x0(d, n) and x1(d, n) track each word occurrence through the grids X0 and X1.
We briefly review Gibbs sampling[14,17]. As one realization of MCMC (Markov chain Monte Carlo), its purpose is to construct a Markov chain that converges to a target probability distribution, so that samples drawn from the chain approximate samples from that distribution; the target distribution's conditionals are all that Gibbs sampling requires. At each step, every variable is resampled from its distribution conditioned on the current values of all other variables and the document collection, and the chain then moves to the next state. The Gibbs algorithm is as follows:
1) Assign each yi, i ∈ [1, 2, ..., N], a topic drawn at random from [1, T], where T is the number of topics and N is the total number of word occurrences in the corpus (each occurrence indexed by its vocabulary entry and its position). This is the initial state of the Markov chain;
2) Obtain the next state of the Markov chain by resampling each yi according to the formula below, and iterate until the Markov chain approaches the target distribution; record yi as its current value.
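The standard collapsed Gibbs update for LDA[14,17], consistent with the count variables defined next, is

$$
P(y_i = j \mid \mathbf{y}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta}\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
$$

where the subscript −i indicates that the counts exclude the current assignment of occurrence i, and V is the vocabulary size.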
Here, y denotes the topic assignment of the word occurrence w; nj(w) is the number of times word w is assigned to topic j; and nj(·) is the total number of words assigned to topic j.
3) For each sample yi, estimate Φ and θ.
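The standard LDA point estimates consistent with these counts (the hat notation is ours) are

$$
\hat{\Phi}^{(w)}_{j} = \frac{n^{(w)}_{j} + \beta}{n^{(\cdot)}_{j} + V\beta}, \qquad
\hat{\theta}^{(d)}_{j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)} + T\alpha}.
$$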
Here nj(d) is the number of words in document d assigned to topic j, and n(d) is the total number of words in document d.
Using Gibbs sampling for inference, and starting from the joint-distribution factorization in Eq. (1) together with the standard LDA model[14], the multinomial model parameters are integrated out, which leads to Eq. (2). In Eq. (2), since all position variables x are observed and the mapping between Φ and the grids X is deterministic, all terms other than those involving x can be eliminated. Equation (2) then follows from the basic LDA formula below.
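The basic LDA formula referred to here is, in the standard form of [14], the marginal probability of a document:

$$
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha)
\left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta .
$$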
Here N is the number of words in a document, and θ can be viewed as a k-dimensional random vector whose Dirichlet parameters satisfy αi > 0. Spatial grouping between the different layers is carried out through the variables x. Formula (2) generalizes readily to an L-layer model, as shown in Eq. (3). Note that the "evidence" term in Eq. (3) is the usual Bayesian evidence.
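For concreteness, the sketch below implements only the single-layer base case of this inference scheme, i.e. plain collapsed Gibbs sampling for LDA (the 1 × 1-grid special case noted above); the function name, the list-of-documents input format, and the default settings are our assumptions, not the authors' implementation.

```python
import numpy as np

def lda_gibbs(docs, V, T=100, alpha=0.5, beta=0.01, n_iter=1000, seed=0):
    """Collapsed Gibbs sampling for plain LDA (the single-layer, 1x1-grid case).

    docs : list of lists of word ids in [0, V); V : vocabulary size;
    T : number of topics (alpha = 50/T, beta = 0.01 as in Section 3.2).
    Returns point estimates (theta, phi).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dj = np.zeros((D, T))            # words in document d assigned to topic j
    n_jw = np.zeros((T, V))            # times word w is assigned to topic j
    n_j = np.zeros(T)                  # total words assigned to topic j
    z = []                             # current topic assignment of every token
    for d, doc in enumerate(docs):     # step 1: random initialization
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, j in zip(doc, zd):
            n_dj[d, j] += 1; n_jw[j, w] += 1; n_j[j] += 1
    for _ in range(n_iter):            # step 2: Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]            # remove the current assignment from the counts
                n_dj[d, j] -= 1; n_jw[j, w] -= 1; n_j[j] -= 1
                # standard collapsed Gibbs conditional (document term up to a constant)
                p = (n_jw[:, w] + beta) / (n_j + V * beta) * (n_dj[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j
                n_dj[d, j] += 1; n_jw[j, w] += 1; n_j[j] += 1
    # step 3: point estimates of theta (per document) and phi (per topic)
    theta = (n_dj + alpha) / (n_dj.sum(axis=1, keepdims=True) + T * alpha)
    phi = (n_jw + beta) / (n_j[:, None] + V * beta)
    return theta, phi
```

With documents built from quantized descriptors as in Section 3.1, `lda_gibbs(docs, V)` would return per-document topic proportions of the kind used as classifier features in Section 3.2.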
3. Performance Evaluation
To evaluate the performance of the proposed multilayer pyramid probabilistic model, this section reports experiments on the Caltech-101 data set[18]. The set contains 101 object classes, and different classes contain different numbers of images. The experiments compare performance for different numbers of components and examine a single-layer model, the feed-forward LDA model (FLDA), and the proposed full generative probability model (RLDA). The results show that:
1) adding a layer to the single-layer model improves classification performance; 2) the RLDA model improves on the performance of the feed-forward FLDA model.
3.1 Implementation
The image features used in this paper are 16 × 16-pixel SIFT descriptors extracted densely over the image with a stride of 6 pixels. Each descriptor is processed by the proposed probabilistic model. Since LDA requires discrete count data while the SIFT dimensions are continuous, each SIFT descriptor is normalized so that its maximum value becomes 100 and is then treated as counts; this quantization level retains enough information in the descriptors.
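A minimal sketch of this quantization step, treating each of the 128 SIFT dimensions (4 × 4 grid cells × 8 orientations) as one vocabulary word whose count is its rescaled, rounded value; the function and constant names are our own assumptions:

```python
import numpy as np

def quantize_sift(descriptor, max_count=100):
    """Convert one continuous SIFT descriptor into discrete LDA word counts.

    descriptor : array of shape (128,), i.e. a 4x4 spatial grid x 8 orientations.
    Each dimension is treated as one vocabulary word; its count is the
    descriptor value rescaled so that the maximum becomes `max_count`.
    """
    d = np.asarray(descriptor, dtype=float)
    peak = d.max()
    if peak == 0:                       # empty patch: no word occurrences
        return np.zeros_like(d, dtype=int)
    return np.rint(d * (max_count / peak)).astype(int)

# Word ids are (grid cell, orientation bin) pairs flattened to 0..127;
# the resulting counts form the document-term input of the LDA models.
```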
The experiments compare the following three models:
1) LDA: an LDA model trained on 20K extracted SIFT feature regions, with different numbers of components (128, 1024 and 2048). In addition, an LDA model is also trained on the "parent regions" formed from 4 × 4 blocks of SIFT feature regions.
2) FLDA: the feed-forward model first trains an LDA model on the SIFT feature regions; the outputs of this model over 4 × 4 blocks of regions are then aggregated into "parent regions" that form the input of a second LDA model at the top level. The experiments test 128 components at the bottom level and both 128 and 1024 components at the top level.
3) RLDA: the proposed full generative model, in which the bottom and top layers are learned jointly within the recursive Bayesian framework of Section 2, using the same component configurations as FLDA.
3.2 Evaluation
The classification test follows the spatial pyramid matching protocol, which is the standard evaluation method on Caltech-101[3]. A three-level spatial pyramid with 4 × 4, 2 × 2 and 1 × 1 grids is built on top of the features; the pyramid cells are aggregated by max pooling[6] and classified with a linear SVM. LIBSVM is used for SVM training. For each object class, 30 images are used for training and the remaining images for testing. Each experiment is run 10 times to obtain the average accuracy and its standard deviation. For Gibbs sampling, the number of topics is T = 100, with hyperparameters α = 50/T and β0,1 = 0.01. The proposed hierarchical image representation model is fitted to the entire document set with 1000 iterations; each image (document) is then represented by its multinomial distribution over the 100 topics, and these latent-topic representations of the document set are used to train the linear SVM classifier.
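A minimal sketch of the spatial-pyramid max pooling used before classification, assuming each image patch carries a topic-activation vector and normalized (x, y) patch coordinates; the helper name and input format are ours, not the original implementation:

```python
import numpy as np

def spatial_pyramid_max_pool(patch_feats, patch_xy, levels=(4, 2, 1)):
    """Max-pool per-patch topic vectors over a 4x4 / 2x2 / 1x1 spatial pyramid.

    patch_feats : (P, K) topic activations for P patches and K topics.
    patch_xy    : (P, 2) patch centers with coordinates normalized to [0, 1).
    Returns a single (4*4 + 2*2 + 1*1) * K vector for a linear SVM.
    """
    patch_feats = np.asarray(patch_feats)
    patch_xy = np.asarray(patch_xy)
    K = patch_feats.shape[1]
    pooled = []
    for g in levels:
        cells = np.zeros((g, g, K))
        # which grid cell each patch falls into at this pyramid level
        ij = np.minimum((patch_xy * g).astype(int), g - 1)
        for (i, j), f in zip(ij, patch_feats):
            cells[i, j] = np.maximum(cells[i, j], f)   # max pooling
        pooled.append(cells.ravel())
    return np.concatenate(pooled)
```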
Regarding computational complexity: the LDA model has complexity O(N²k), where N is the number of words in a document and k is the dimension of the latent variable. The core of the FLDA model is still LDA, applied in a feed-forward manner, so its complexity is the same as that of LDA. The hierarchical LDA model is constructed as a hierarchical tree and therefore also has complexity O(N²k). For the RLDA model, because of its multilayer pyramid structure, each additional layer corresponds to an extra level of the tree, so its computational complexity matches that of hierarchical LDA. On top of each model, a linear SVM classifier predicts the class of the test data; since the classifier is linear in the model outputs, its computational cost also grows only linearly.
First, the LDA model is tested on single SIFT features: with 128 components the classification accuracy is 57.9%; increasing the number of components to 1024 raises the accuracy to 67.8%; and with 2048 components it reaches 70.2%.
Here the focus is on evaluating the effect of adding a representation layer on classification performance. In this test, the number of components per topic model is fixed at 128, and the three models described in Section 3.1 are trained and compared: the single-layer LDA model and the two-layer FLDA and RLDA models. The single-layer and two-layer models produce feature vectors of the same dimension for the classifier. The two-layer models are trained with 128 components at both the bottom and the top layer, denoted 128b and 128t, with the bottom layer corresponding to the SIFT descriptors. Table 1 shows the experimental results for models with 128 components at the bottom layer and 128 components at the top layer. Here, "bottom" refers to models trained on the SIFT feature regions, "top" refers to models trained on the 4 × 4 "parent regions" of SIFT regions, and "both" refers to training on the concatenation of the two feature types.
Table 1 Comparison of experimental results of three models
As is evident from Table 1, the RLDA model trained on the SIFT feature regions achieves a classification accuracy of 62.6%, which is 4.7 percentage points higher than the 57.9% of the single-layer LDA model. This indicates that joint learning outperforms the traditional single-layer LDA model. The two-layer models also improve on the single-layer models: FLDA reaches a correct classification rate of 61.3%. Using both SIFT feature types greatly improves the accuracy, to 66.0%. For comparison, the single-layer model tested on the SIFT regions achieves 57.9%, and the single-layer model tested on the 4 × 4 SIFT "parent regions" achieves 60.5%.
Figure 2(a) shows the average accuracy for different numbers of training images. The RLDA model consistently outperforms the FLDA model, which in turn outperforms the single-layer model. In addition, the experiments test the classification performance when the L0 and L1 features are concatenated into a single feature vector (the "both" case in Table 1); this feature contains more information, since it captures the larger-scale spatial characteristics as well as the smaller-scale features. Figure 2(b) shows this comparison: with the "both" features, FLDA improves by nearly 3% in accuracy and RLDA by about 2%.
Figure 2 Comparison of classification accuracies with different numbers of training samples on the Caltech-101 dataset: (a) the "bottom" and "top" cases; (b) the "both" case
Next, two-layer models with 1024 top components and 128 bottom components are learned. The differences in classification performance between these models are smaller than before, but all are higher than in the 128b/128t case. Specifically, the RLDA model evaluated at the top layer reaches 72.6%, while the FLDA model reaches 72.5% and the single-layer model 68.8%. At the bottom layer, RLDA reaches 62.7% and FLDA 62.6%. When both layers are used, RLDA reaches 73.7% and FLDA 72.9%. In addition to the above experiments, this section also compares the proposed model with related hierarchical models (Hierarchical-LDA[15], CNN[19], CNN + Transfer[2], CDBN[10] and Hierarchy-of-parts[20]) on the Caltech-101 data set, using 15 and 30 training images per class with a linear SVM classifier, as shown in Tables 2 and 3. The results in Table 2 show that most existing hierarchical models outperform the 57.9% accuracy of the single-layer LDA model tested on SIFT feature regions, while the proposed RLDA hierarchical model is about seven percentage points more accurate than the existing hierarchical models.
Table 2 Comparison of classification accuracies between the RLDA model and other hierarchical models using 30 training images
Table 3 Comparison of classification accuracies between the RLDA model and other hierarchical models using 15 training images
Most existing hierarchical models report their best performance from single-layer learning, and the experimental results in the literature [11, 12, 20] show that adding a recognition layer actually decreased accuracy. In the proposed Bayesian model, by contrast, the experimental results show that adding a representation layer improves classification performance. Most importantly, the proposed full Bayesian model outperforms earlier feed-forward methods in image classification. This indicates that Bayesian inference yields far more stable estimates of the regional characteristics and more robust recognition performance, and it highlights the importance of feedback in hierarchical recognition.
In addition, Figure 3 visualizes and compares the components learned for the feature regions by the feed-forward model and the full generative model. FLDA learns only local edge orientations, whereas RLDA learns more complex spatial structure. As can be seen in Figure 3, in the feed-forward FLDA model the bottom-layer topics are essentially single orientations, and the second layer does not impose any additional structure on them; as a result, the top-layer topics show the same orientations, merely supported over a larger spatial extent. In the RLDA model, by contrast, the top-layer components exhibit more discriminative latent spatial structure. We also find that in RLDA the bottom-layer topics of neighboring regions show strongly related activity, indicating that the model's inference captures spatially continued structure through the connections between the bottom-layer sub-spaces.
Figure 3 Comparison of the (a) feed-forward FLDA and (b) full generative models in terms of components learned, visualized as average image patches.
4. Conclusion
For visual features of higher complexity, we proposed a probabilistic model in which all representation layers are trained jointly. This model achieves better classification performance than existing hierarchical models. Moreover, the results show that adding layers significantly improves classification performance and that the approach outperforms the feed-forward alternative, which demonstrates the importance of feedback in hierarchical models for visual recognition. The proposed probabilistic model is robust and lends itself to modular combination; as layers are added, it can evolve from representing low-level features, through intermediate representations, to object recognition.
References
[1] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision (IJCV), 2004, 60(2): 91-110.
[2] Ahmed A, Yu K, Xu W, Gong Y, Xing E P. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks[C]// Proceedings of the 10th European Conference on Computer Vision (ECCV): Part III, Marseille, France. Berlin: Springer-Verlag, 2008: 69-82.
[3] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, USA: IEEE Computer Society, 2006: 2169-2178.
[4] Yang J, Yu K, Gong Y, Huang T. Linear spatial pyramid matching using sparse coding for image classification[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Florida, USA: IEEE Computer Society, 2009: 1794-1801.
[5] Olshausen B A, Field D J. Sparse coding with an overcomplete basis set: a strategy employed by V1?[J]. Vision Research, 1997, 37(23): 3311-3325.
[6] Boureau Y L, Bach F, LeCun Y, Ponce J. Learning mid-level features for recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA: IEEE Computer Society, 2010: 2559-2566.
[7] Fritz M, Black M J, Bradski G R, Karayev S, Darrell T. An additive latent feature model for transparent object recognition[C]// Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada: MIT Press, 2009: 558-566.
[8] Mutch J, Lowe D G. Object class recognition and localization using sparse features with limited receptive fields[J]. International Journal of Computer Vision (IJCV), 2008, 80(1): 45-57.
[9] Rolls E T, Deco G. Computational Neuroscience of Vision[M]. Oxford: Oxford University Press, 2002.
[10] Lee H, Grosse R, Ranganath R, Ng A Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations[C]// Proceedings of the 26th Annual International Conference on Machine Learning (ICML), New York: ACM, 2009: 609-616.
[11] Ranzato M A, Huang F J, Boureau Y L, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2008: 1-8.
[12] Serre T, Wolf L, Bileschi S, Poggio T. Object recognition with cortex-like mechanisms[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(3): 411-426.
[13] Sivic J, Russell B C, Efros A A, Zisserman A, Freeman W T. Discovering objects and their locations in images[C]// Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV), Beijing, China: IEEE Computer Society, 2005: 370-377.
[14] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[15] Blei D M, Griffiths T L, Jordan M I, Tenenbaum J B. Hierarchical topic models and the nested Chinese restaurant process[C]// Advances in Neural Information Processing Systems (NIPS) 16, British Columbia, Canada: MIT Press, 2004.
[16] Ferguson T. A Bayesian analysis of some nonparametric problems[J]. The Annals of Statistics, 1973, 1(2): 209-230.
[17] Heinrich G. Parameter estimation for text analysis[R]. Technical Report, Darmstadt, 2008.
[18] Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories[J]. Computer Vision and Image Understanding, 2007, 106(1): 59-70.
[19] Kavukcuoglu K, Sermanet P, Boureau Y L, Gregor K, Mathieu M, LeCun Y. Learning convolutional feature hierarchies for visual recognition[C]// Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada: MIT Press, 2010: 1090-1098.
[20] Fidler S, Boben M, Leonardis A. Similarity-based cross-layered hierarchical representation for object categorization[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Alaska, USA: IEEE Computer Society, 2008: 1-8.