Duplicate bug reporting is a critical problem in the area of mining software repositories. Duplicate bug reports can lead to redundant effort, wasted resources, and delayed software releases. Thus, their accurate identifica...
Stack Overflow is a popular on-line question and answer site for software developers to share their experience and expertise. Among the numerous questions posted in Stack Overflow, two or more of them may express the ...
Matching dependencies (MDs) are used to declaratively specify the identification (or matching) of certain attribute values in pairs of database tuples when some similarity conditions on other values are satisfie...
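To make the idea of a matching dependency concrete, here is a minimal sketch in Python. It is not taken from the abstract above: the similarity measure (`difflib.SequenceMatcher` with a 0.8 threshold), the helper names `similar` and `apply_md`, the sample records, and the "keep the longer value" matching function are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity predicate based on a character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def apply_md(t1: dict, t2: dict, compare: list, identify: str) -> bool:
    """Apply one matching dependency to a pair of tuples.

    If t1 and t2 are similar on every attribute in `compare`, their values
    for `identify` are made equal. Returns True if the rule fired.
    """
    if all(similar(t1[attr], t2[attr]) for attr in compare):
        # Assumed matching function: keep the longer (more informative) value.
        merged = max(t1[identify], t2[identify], key=len)
        t1[identify] = merged
        t2[identify] = merged
        return True
    return False


r1 = {"name": "John Smith", "phone": "555-0100", "address": "12 Main St"}
r2 = {"name": "Jon Smith", "phone": "555-0100", "address": "12 Main Street, Apt 4"}
r3 = {"name": "Alice Jones", "phone": "555-0199", "address": "7 Oak Ave"}

fired_12 = apply_md(r1, r2, compare=["name", "phone"], identify="address")
fired_13 = apply_md(r1, r3, compare=["name", "phone"], identify="address")
```

In this sketch the rule reads as "if two tuples have similar names and phone numbers, identify their addresses". It fires for `r1` and `r2` and leaves `r3` untouched.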
Supported by the National Natural Science Foundation of China (No. 60673001); the State Key Development Program of Basic Research of China (No. 2004CB318203).
Based on variable-sized chunking, this paper proposes a content-aware chunking scheme, called CAC, that does not assume fully random file contents, but considers the characteristics of the file types. CAC uses a candi...
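For readers unfamiliar with the variable-sized chunking that the entry above builds on, the following is a minimal Python sketch of the generic baseline only. It is not the CAC scheme, whose details are cut off in this listing. The function name `chunk_boundaries`, the rolling-sum fingerprint, and every parameter value are assumptions made for illustration.

```python
def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F,
                     min_size: int = 32, max_size: int = 1024) -> list:
    """Return the end offsets of variable-sized chunks of `data`.

    A boundary is declared where a fingerprint of the last `window` bytes
    has all bits selected by `mask` equal to zero, subject to minimum and
    maximum chunk sizes. The fingerprint here is a plain rolling sum
    (an assumption; real systems typically use a stronger rolling hash).
    """
    boundaries = []
    start = 0      # offset where the current chunk begins
    rolling = 0    # sum of the most recent `window` bytes
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        size = i + 1 - start
        if size >= max_size or (size >= min_size and (rolling & mask) == 0):
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))  # final, possibly short, chunk
    return boundaries


sample = bytes((i * 37 + 11) % 251 for i in range(5000))
cuts = chunk_boundaries(sample)
sizes = [end - begin for begin, end in zip([0] + cuts[:-1], cuts)]
```

Because the fingerprint depends only on the bytes inside the sliding window, boundaries are determined by local content rather than by absolute file offsets, which is the property that variable-sized chunking relies on.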
Supported by the National Natural Science Foundation of China (Grant Nos. 31000561 and 30900825); the Knowledge Innovation Program of the Chinese Academy of Sciences (Grant No. KSCX2-EW-R-01-04).
The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make the ...
The 454 Genome Sequencer (GS) FLX System is a next-generation sequencing system characterized by long reads, high accuracy, and ultra-high throughput. Based on the mechanism of emulsion PCR, a unique DNA temp...
The National Natural Science Foundation of China (No. 60673139)
A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the extracted data is used to generate the entity records in the d...