Vertical Search Engine System for Computer Education Resources

Abstract: This paper describes in detail the architecture of a vertical search engine for computer education resources, focusing on the design of the focused crawler's crawling strategy, the topic relevance algorithm, and the thesaurus. Experimental results show that the maximum response time of Heritrix in the software system is 0.563 seconds, and that both the query precision and the accuracy of the topic relevance identification algorithm exceed 60%, so the system can be applied on the Web.

Key words: computer; educational resources; vertical search engine; vector space model

Search engines for educational resources already exist, such as 12sou education resources search, search education, EDUGO, and 100 million library education, but no vertical search engine dedicated to computer education resources has yet been put into operation in China. Among the existing vertical search engines for education resources, EDUGO, search education, and others offer only a simple classification of computer resources. When keywords such as "computer", "operating system", "network", and other computer-related terms are entered, they return no results, and 12sou returns only a small number of results that hardly represent the knowledge computer professionals need. These engines are still at a primary stage. Skynet is a forerunner of search engine technology, but it was not designed for educational resources. The Web educational resources mining laboratory of Nanjing Normal University is also researching educational resource search, but its work focuses on the breadth of mining. The vertical search engine for computer education resources presented in this paper considers both the breadth and the depth of mining, and aims to provide a fast and convenient resource search application for professional computer learners.

The open source web crawler Heritrix is powerful and extensible, so it can be transformed into a specialized crawler that meets the needs of this system. Lucene, an open source text retrieval framework, provides a very good indexing and retrieval mechanism.

In addition, a variety of Chinese word segmentation and web page analysis tools are available as good open source projects with acceptable performance. Building on open source components therefore speeds up development and helps ensure that the system is completed smoothly. The vertical search engine system is thus mainly an integration of several open source frameworks into a vertical search system that meets these requirements.

1 Architecture design of the vertical search engine

The vertical search engine for computer education resources retrieves topic-specific education resources directly from the query conditions and returns valid URLs for the resources, which users can open or download directly to their local machines. In addition, the system should collect usage statistics for each resource and, following certain ranking rules, form a list of recommended resources, so that users can quickly obtain the best computer education resources in a good human-computer interaction environment.

Because hardware was limited, the author designed the software system with a centralized strategy, placing every processing stage on a single PC, while applying distributed ideas in the implementation to simulate distributed operation as far as possible. To reduce coupling between modules and improve cohesion within them [1], the modules operate in parallel, which improves the throughput of the system. The author refined the traditional module partition of vertical search engines and divided the system into five parts: focused crawler, preprocessor, indexer, retriever, and user interface. The overall structure of the system is shown in Figure 1.

2 Key algorithm design of the vertical search engine

2.1 Design of the crawling strategy for the focused crawler

The focused crawler is the core of the vertical search engine. It has all the basic functions of an ordinary crawler, such as URL extraction, web page analysis, and page downloading, and in addition it grabs only the pages whose relevance to the vertical search topic is high enough [2]. Heritrix is a generic web crawler without any topic discrimination capability, so it has to be extended. The extended crawler works as follows: it first obtains a URL, then reads the page at that URL [3], then applies the vector space model (VSM) algorithm [4-5] with a preset relevance threshold to decide whether the current page is related to the topic, and finally saves topic-related pages to the local disk.

When grabbing a page, the system therefore first segments the text into words and obtains the term frequencies of the keywords, then performs the VSM calculation against the standard document in the thesaurus, and finally uses the relevance threshold to decide whether the page is related to the topic.
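The decision step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the regular-expression tokenizer stands in for a real Chinese word segmenter, and the function names, the sample texts, and the threshold value 0.3 are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens (stand-in for a real word segmenter)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_topic_relevant(page_text, topic_tf, threshold=0.3):
    """Keep a page only if its similarity to the topic reaches the threshold.

    The threshold 0.3 is a hypothetical value chosen for this sketch.
    """
    return cosine_similarity(term_frequencies(page_text), topic_tf) >= threshold


# Hypothetical "standard document" for the topic and two sample pages.
topic_tf = term_frequencies(
    "operating system process thread memory scheduling kernel"
)
on_topic = "The kernel handles process scheduling and memory for the operating system"
off_topic = "Fresh tomato soup recipe with basil and garlic"

print(is_topic_relevant(on_topic, topic_tf))   # page would be saved
print(is_topic_relevant(off_topic, topic_tf))  # page would be discarded
```

In a real crawler the threshold would be tuned on sample pages, since a higher value trades coverage for precision.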

2.3 Design strategy of the thesaurus

Building the thesaurus is preparatory work for the focused crawler's data collection. The thesaurus has to gather many entries, and this involves a contradiction: gathering the entries itself requires word segmentation based on a lexicon that must already exist.

Given the importance of the thesaurus, its entries are established by a combination of manual work and computer support. First, the author studied the categories of Interactive Encyclopedia and of the wiki and excerpted the computer-related ones. The crawling process of the focused crawler is divided into two stages, link extraction and data processing. The crawler starts from manually supplied seed URLs. According to the author's survey of topic resources, data is collected mainly from several portal resource websites, and the crawler combines depth-first and breadth-first search. Crawling is restricted by domain name to a few key sites: a depth-first strategy is used between topic sites and non-topic sites, and a breadth-first strategy is used within the topic sites, so that the widest possible range of the required computer education resources is covered.
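As a rough illustration of the domain-restricted, breadth-first part of this strategy, the sketch below walks a small in-memory link graph. It does not show the depth-first handling between sites. The function name, the `link_graph` dictionary (a stand-in for fetching a page and extracting its links), and the example URLs are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(seeds, link_graph, allowed_domains, max_pages=100):
    """Return the breadth-first visit order, skipping disallowed hosts.

    link_graph maps a URL to the links found on that page; it replaces
    real downloading and link extraction in this sketch.
    """
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if urlparse(url).hostname not in allowed_domains:
            continue  # outside the key sites: do not fetch or expand
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site structure.
link_graph = {
    "http://edu.example.com/": [
        "http://edu.example.com/os",
        "http://ads.example.net/x",
        "http://edu.example.com/net",
    ],
    "http://edu.example.com/os": ["http://edu.example.com/os/ch1"],
}

visited = crawl_order(
    ["http://edu.example.com/"], link_graph, {"edu.example.com"}
)
print(visited)
```

Pages one link away from the seed are visited before deeper pages, and the link to the host outside the allowed set is never expanded.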

2.2 Topic relevance algorithm design

The vector space model maps a document d to a feature vector $V(d) = (t_1, x_1(d); \ldots; t_n, x_n(d))$, where $t_i$ ($i = 1, 2, \ldots, n$) are mutually distinct terms and $x_i(d)$ is the weight of $t_i$ in d, that is, the corresponding vector component. The terms and their weights span an n-dimensional space, and the relevance of two documents is measured by how close their vectors are in that space.

Suppose there are 10 terms W1, W2, ..., W10 and two articles D1 and D2. The term frequencies counted in the two articles are shown in Table 1.

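There are many VSM similarity formulas in common use. This system uses the most common one, the cosine function:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} x_k(D_1)\, x_k(D_2)}{\sqrt{\sum_{k=1}^{n} x_k(D_1)^2}\,\sqrt{\sum_{k=1}^{n} x_k(D_2)^2}}$$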

As a result, the similarity between the two documents D1 and D2 in Table 1 is as follows:
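To make the calculation concrete, here is a worked example of the cosine measure with ten terms. The frequencies below are hypothetical values chosen for illustration; they are not the values of Table 1.

```latex
\begin{align*}
D_1 &= (1, 2, 0, 3, 0, 1, 0, 0, 2, 1), \qquad D_2 = (2, 1, 1, 3, 0, 0, 1, 0, 2, 0) \\
D_1 \cdot D_2 &= 2 + 2 + 0 + 9 + 0 + 0 + 0 + 0 + 4 + 0 = 17 \\
\lVert D_1 \rVert &= \sqrt{1 + 4 + 0 + 9 + 0 + 1 + 0 + 0 + 4 + 1} = \sqrt{20} \\
\lVert D_2 \rVert &= \sqrt{4 + 1 + 1 + 9 + 0 + 0 + 1 + 0 + 4 + 0} = \sqrt{20} \\
\cos\theta &= \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}
            = \frac{17}{\sqrt{20}\,\sqrt{20}} = \frac{17}{20} = 0.85
\end{align*}
```

A value close to 1 means the two articles use the terms in similar proportions, and a value of 0 means they share no terms.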

To build the thesaurus, pages are then captured from these websites and their data is read into a file stream. The Chinese strings are extracted from the stream and segmented using the Sogou network lexicon. Term frequency (TF) is computed for the segmented words, the words are sorted by frequency from high to low, and the topic-related entries are finally selected manually to form the thesaurus.

Figure 2 shows the basic computer topic words that have already been established. Figure 3 shows the more detailed classification and thesaurus directory obtained by extracting the Chinese text from page content, segmenting it, removing stop words, computing TF, and sorting by TF. Because the thesaurus is organized by category, the search engine can compute topic relevance per category, store the collected web pages by category, and also operate by category in the indexing and retrieval services.
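The frequency-ranking step of this process can be sketched as follows. The sketch assumes segmentation has already been done and starts from a list of tokens; the function name, the stop-word set, and the sample tokens are hypothetical, and the final manual selection of entries is outside the code.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in"}


def candidate_entries(tokens, stop_words=STOP_WORDS, top_n=None):
    """Rank non-stop-word tokens by term frequency, highest first."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(top_n)


# Tokens as they might come out of a segmenter (English used for readability).
tokens = "the kernel of the operating system and the kernel scheduler".split()
ranked = candidate_entries(tokens)

for term, freq in ranked:
    print(term, freq)
```

A human editor would then go down the ranked list and keep only the entries that belong to the topic.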

3 System implementation and experiment

The experimental data in this paper were obtained in the experimental environment shown in Table 2.

3.1 System implementation

The implementation of the vertical search engine for computer education resources is shown in Figure 4. In Figure 4 the topic "operating system" is selected, and the search covers multiple resource formats, including Word documents (DOC), slides (PPT), spreadsheets (XLS), and PDF documents. All of the resources returned in the search results contain "operating system". This shows that the architecture of the vertical search engine and the design of the related algorithms are feasible.

3.2 System response time test

The author also used LoadRunner to test the performance of the extended Heritrix. The results are shown in Figure 5.

The results show that in the early stage of crawling the response time of Heritrix rises slowly. After about 1.5 h it fluctuates within a range. At around 6 h it rises sharply and then gradually decreases, and from about 7 h onward it stabilizes. Its operation is clearly constrained by the network environment, the characteristics of the web pages, and the efficiency of the algorithm, so the response time is not ideal: it keeps fluctuating within a range before gradually stabilizing.

The maximum response time of the system is shown in Figure 6. The value is 0.563 s. Because the test software and the server ran on the same PC, network transmission delay had little effect on the result, so the value obtained should be regarded as the ideal value for a centralized environment.

3.3 Query accuracy test

Query accuracy = number of correctly returned pages / total number of pages returned. In this study, keywords were typed in and the number of correct pages returned was counted manually to roughly judge the query accuracy of the search engine. Because query precision depends on the size of the collected data set, the figures here are only rough statistics. Table 3 gives the experimental results for query precision. Table 3 shows that the accuracy of the system is related to the query data set: the smaller the data set, the larger the error. The average query precision in Table 3 is 0.612.
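The accuracy measure defined above, and its average over several queries, can be computed as in the short sketch below. The function names and the counts are hypothetical and are not the values of Table 3.

```python
def query_precision(correct_pages, returned_pages):
    """Precision for one query: correct pages / pages returned."""
    if returned_pages == 0:
        return 0.0
    return correct_pages / returned_pages


def mean_query_precision(counts):
    """Average the per-query precision over (correct, returned) pairs."""
    values = [query_precision(c, r) for c, r in counts]
    return sum(values) / len(values) if values else 0.0


# Hypothetical manual counts: (correct pages, pages returned) per query.
sample_counts = [(6, 10), (13, 20), (31, 50)]
print(round(mean_query_precision(sample_counts), 3))
```

With only 10 returned pages, a single misjudged page changes the precision of that query by 0.1, which is why small data sets give larger errors.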

3.4 Topic relevance accuracy test

Topic relevance accuracy = number of pages related to the topic / total number of test pages. Because the author's time was limited, the data sets are not large: sets of 10, 50, and 100 web pages were collected and passed through the topic relevance analysis module. Statistical analysis of these data shows that the accuracy of the algorithm is about 0.62. The experimental data depend on the collected text set and on the number of entries in the thesaurus, so different input sets will affect the experimental results.

4 Concluding remarks

Heritrix is an open source web crawler written in Java with good robustness, portability, and extensibility. The vertical search engine for computer education resources was therefore implemented on Heritrix and Lucene, with a VSM-based topic relevance identification algorithm added. The successful operation of the software indicates that the system architecture and the key algorithms are reasonable. The system provides a ranked list of retrieved resources, and both query accuracy and topic relevance accuracy exceed 60%, so the software can already provide users with a specialized Web search service. The next step is to consider deploying the system in a distributed environment, to address the performance loss introduced by the topic relevance algorithm, and to further optimize the current relevance judgment algorithm in order to obtain higher query and topic accuracy. The author hopes that this research can lead to further breakthroughs.
