Design of a new text data similarity determination technique
Text similarity determination is one of the core technologies in natural language processing. Its accuracy directly affects the accuracy of semantic similarity analysis and of image, graphic and audio similarity determination. In-depth research on text similarity and the development of efficient algorithms are therefore of great significance. The goal of this paper is to improve both the accuracy and the efficiency of text similarity determination. The study summarizes five traditional text similarity determination methods, namely Euclidean distance, cosine similarity, Manhattan distance, Jaccard similarity and the Pearson correlation coefficient, and analyzes their shortcomings in terms of computing resource consumption and limited applicability. To address these shortcomings, the paper proposes a new text similarity determination technique based on feature identification, storage and comparison. Specific rules (Rule1 and Rule2) are designed to extract feature values from text, and these values are then compared to determine similarity efficiently. The feature library is stored in an in-memory database, and a process for initializing and storing the feature values is designed. Experimental results show that, on similarity determination tasks involving millions of texts, the new technique increases computing efficiency by about 50% and accuracy by about 15%. The innovations of this paper are the flexible formulation of rules, the generation of the feature library in an in-memory database, and the design of the feature value initialization and storage process. Together, these innovations provide a more efficient and accurate solution for text similarity determination and offer a useful reference for research and application in related fields.
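
For orientation, the five traditional measures named above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the abstract does not say how texts are represented, so the code assumes whitespace tokenization and raw term-frequency vectors, and every function name (`tokenize`, `term_vectors`, `euclidean`, `manhattan`, `cosine`, `jaccard`, `pearson`) is hypothetical. It does not implement the proposed rule-based technique, since Rule1 and Rule2 are not defined here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def term_vectors(a, b):
    """Build aligned term-frequency vectors over the joint vocabulary."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = sorted(set(ca) | set(cb))
    return [ca[t] for t in vocab], [cb[t] for t in vocab]


def euclidean(x, y):
    """Euclidean (L2) distance; 0 means identical vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def manhattan(x, y):
    """Manhattan (L1) distance; 0 means identical vectors."""
    return sum(abs(p - q) for p, q in zip(x, y))


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of the two token sets (not the count vectors)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat lay on the rug"
    x, y = term_vectors(a, b)
    print(euclidean(x, y), manhattan(x, y), cosine(x, y))
    print(jaccard(a, b), pearson(x, y))
```

In this sketch every pairwise comparison walks the full joint vocabulary of both texts, which gives a rough sense of the computing cost that grows with vocabulary size and corpus size when such measures are applied to millions of texts.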