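Authors: Yin Meijuan (尹美娟)[1], Chen Shumin (陈庶民)[1], Liu Xiaonan (刘晓楠)[1], Lu Lin (路林)[1]
Source: Computer Science (《计算机科学》), 2011, No. 12, pp. 182-186, 199 (6 pages)
Funding: Supported by a national defense fund (not named in the record)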
Abstract: Mining user identity information from emails is an active research topic in data mining. Most existing approaches extract users' aliases only from email headers and miss the aliases in email bodies, which often better represent the identities of the sender and recipient. This paper addresses the extraction of user aliases from the bodies of plain-text emails. First, a salutation and signature block locating algorithm based on statistics and rule-based filtering is proposed; it efficiently and accurately extracts from the email body the salutation and signature text fragments that contain user aliases. Second, an alias extraction method is proposed that uses a name boundary word template, built on the characteristics of the words neighboring an alias, to verify and amend the aliases identified by named entity recognition or part-of-speech tagging tools, improving accuracy over using those tools alone. Results on the Enron corpus indicate that the proposed approaches can effectively and automatically extract users' aliases from email bodies.
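Keywords: entity resolution; email body; alias extraction; salutation and signature block locating; alias boundary word template
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]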
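The two stages named in the abstract (locating salutation and signature blocks, then extracting aliases with boundary words) can be illustrated with a small sketch. This is not the authors' algorithm: the record gives no rules, statistics, or templates, so the word lists `GREETINGS` and `CLOSINGS`, the `NAME` pattern, the `tail` window, and both function names are assumptions made purely for illustration.

```python
import re

# Hypothetical boundary words, chosen for illustration only.
GREETINGS = {"hi", "hello", "dear", "hey"}
CLOSINGS = {"regards", "thanks", "best", "sincerely", "cheers"}
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # one or more capitalized words


def first_word(line):
    """Lowercased first token of a non-empty line, with non-letters stripped."""
    return re.sub(r"[^a-z]", "", line.split()[0].lower())


def locate_blocks(body, tail=4):
    """Return (salutation_line or None, signature_lines) for a plain-text body."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    # Rule: the salutation is the first non-empty line if it opens with a greeting.
    salutation = lines[0] if lines and first_word(lines[0]) in GREETINGS else None
    signature = []
    # Rule: the signature starts at the first closing word within the last `tail`
    # lines; index 0 is skipped so it can never overlap the salutation line.
    for i in range(max(len(lines) - tail, 1), len(lines)):
        if first_word(lines[i]) in CLOSINGS:
            signature = lines[i:]
            break
    return salutation, signature


def extract_aliases(salutation, signature):
    """Extract (role, alias) pairs using the greeting and closing as boundaries."""
    aliases = []
    if salutation:
        greeting = "|".join(sorted(GREETINGS))
        # Left boundary: greeting word. Right boundary: punctuation or line end.
        m = re.match(rf"(?i:{greeting})\s+({NAME})\s*[,:!.]?$", salutation)
        if m:
            aliases.append(("recipient", m.group(1)))
    # The first name-like line after the closing line is taken as the sender.
    for line in signature[1:]:
        if re.fullmatch(NAME, line):
            aliases.append(("sender", line))
            break
    return aliases
```

Calling `locate_blocks` on a message body and passing its two results to `extract_aliases` yields the recipient alias from the salutation and the sender alias from the signature. A real system would replace the fixed word lists with statistics learned from a corpus and would validate candidates against a named entity recognizer, as the abstract describes.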