Extracting Name Aliases of Mailbox Users from Email Bodies  Cited by: 2


Authors: Yin Meijuan (尹美娟)[1], Chen Shumin (陈庶民)[1], Liu Xiaonan (刘晓楠)[1], Lu Lin (路林)[1]

Affiliation: [1] School of Information Engineering, Information Engineering University, Zhengzhou 450002

Source: Computer Science (《计算机科学》), 2011, No. 12, pp. 182-186, 199 (6 pages)

Funding: Supported by a national defense fund (not named)

Abstract: Mining user identity information from emails is an active research topic in data mining. Most existing approaches extract mailbox users' aliases only from email headers and miss the aliases in email bodies, which often better represent the identities of the sender and recipient. This paper addresses the extraction of users' name aliases from the bodies of plain-text emails. First, a salutation and signature block locating algorithm based on statistics and rule-based filtering is proposed, which efficiently and accurately extracts from the email body the salutation and signature text fragments that contain users' aliases. Second, an alias extraction method is proposed that uses alias boundary word templates, built on the characteristics of the words neighboring an alias, to verify and amend the aliases identified by named entity recognition or part-of-speech tagging tools, improving accuracy over using those tools alone. Results on the Enron corpus indicate that the proposed methods can effectively and automatically extract users' aliases from email bodies.
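To make the two stages described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not list the paper's rules or templates, so the boundary word lists, the `NAME` pattern, the `window` sizes, and all function names below are assumptions chosen for illustration. The statistical component and the named entity recognition or part-of-speech tagging step are not reproduced; a simple capitalized-word pattern stands in for them.

```python
import re

# Hypothetical name pattern: one or two capitalized words.
# This stands in for the NER / POS tagging step mentioned in the abstract.
NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"

# Hypothetical boundary words; the paper's actual templates are not given here.
SALUTATION_RE = re.compile(
    r"^(?i:dear|hi|hello|hey)\s+(" + NAME + r")\s*[,:!]?\s*$"
)
CLOSING_RE = re.compile(
    r"^(?i:best regards|regards|thanks|thank you|sincerely|cheers|best)\s*[,.!]?\s*$"
)
NAME_LINE_RE = re.compile(r"^(" + NAME + r")$")


def nonblank_lines(body):
    """Return the stripped, non-empty lines of a plain-text email body."""
    return [ln.strip() for ln in body.splitlines() if ln.strip()]


def extract_salutation_alias(lines, window=3):
    """Look for a salutation line among the first `window` lines."""
    for ln in lines[:window]:
        m = SALUTATION_RE.match(ln)
        if m:
            return m.group(1)
    return None


def extract_signature_alias(lines, window=4):
    """Look for a closing phrase near the end, followed by a name-only line."""
    tail = lines[-window:]
    for i, ln in enumerate(tail[:-1]):
        if CLOSING_RE.match(ln):
            m = NAME_LINE_RE.match(tail[i + 1])
            if m:
                return m.group(1)
    return None


def extract_aliases(body):
    """Return the recipient alias (salutation) and sender alias (signature)."""
    lines = nonblank_lines(body)
    return {
        "recipient": extract_salutation_alias(lines),
        "sender": extract_signature_alias(lines),
    }


if __name__ == "__main__":
    email = "Hi John,\n\nPlease send the report by Friday.\n\nThanks,\nMary Smith\n"
    print(extract_aliases(email))
```

In this sketch the boundary words ("Hi", "Thanks") play the role of the alias boundary template: a candidate name is accepted only when it sits next to one of them, which is one simple way to filter out capitalized words that are not aliases.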

Keywords: entity resolution; email body; alias extraction; salutation and signature block locating; alias boundary word template

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
