Affiliation: [1] Department of Computer Science and Engineering, Shanghai Jiao Tong University
Source: Computer Engineering and Applications (《计算机工程与应用》), 2008, No. 21, pp. 25-29, 88 (6 pages)
Funding: National Natural Science Foundation of China (Grant No. 60496326); Science and Technology Program of the Jiangxi Provincial Department of Education (No. [2006]178)
Abstract: With the rapid growth of Internet resources, processing massive amounts of unstructured text has become a major challenge. Text segmentation is an important preprocessing step in text processing, and its performance directly affects downstream tasks such as information retrieval, text summarization, and question answering. Text segmentation has to solve two fundamental problems: how to measure topical relevance, and how to decide where segment boundaries fall. To address them, this paper proposes a method for measuring the relevance between sentences based on Quantified Conceptual Relations (QCR) between words, extracted from the Modern Chinese Standard Dictionary (MCSD), and builds a mathematical model that computes the segmentation value of the gap point between sentences, so that Chinese text can be segmented at the sentence level rather than the paragraph level. Experiments on three test sets drawn from the National Chinese Corpus show that both the average and the lowest values of the boundary-identification error probability Pk obtained by this method are better than those of existing Chinese text segmentation methods.
Keywords: text segmentation; quantified word relations; sentence relevance measure; gap point; segmentation value
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
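The abstract reports its results as an error probability Pk but does not say how the paper computes it. The sketch below assumes the windowed definition commonly used in text segmentation evaluation: slide a window of width k over the sentences and count how often the reference and the hypothesis disagree on whether the two ends of the window lie in the same segment. The function names `segment_ids` and `pk`, and the default choice of k (half the average reference segment length), are illustrative assumptions, not details taken from the paper.

```python
def segment_ids(lengths):
    """Expand segment lengths (in sentences) into one segment id per sentence.

    Hypothetical helper: [4, 4] -> [0, 0, 0, 0, 1, 1, 1, 1].
    """
    ids = []
    for seg, length in enumerate(lengths):
        ids.extend([seg] * length)
    return ids


def pk(reference, hypothesis, k=None):
    """Windowed error probability between two segmentations.

    reference, hypothesis: lists of segment ids, one per sentence.
    k: window width; assumed default is half the average reference
       segment length (integer division, at least 1).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same sentences")
    n = len(reference)
    if k is None:
        num_segments = 1 + sum(
            1 for i in range(1, n) if reference[i] != reference[i - 1]
        )
        k = max(1, n // (2 * num_segments))
    if n <= k:
        raise ValueError("text is too short for the chosen window width")

    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1  # the two segmentations disagree on this window
    return errors / (n - k)


if __name__ == "__main__":
    ref = segment_ids([4, 4])  # true boundary after the 4th sentence
    hyp = segment_ids([8])     # hypothesis with no boundary at all
    print(pk(ref, hyp, k=2))
    print(pk(ref, ref, k=2))
```

Lower values are better: a hypothesis identical to the reference scores 0, and every window that straddles a missed or spurious boundary adds to the error count.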