Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.11851/6633
Full metadata record
DC Field | Value | Language
dc.contributor.author | Ashraf, Fatima | -
dc.contributor.author | Özyer, Tansel | -
dc.contributor.author | Alhajj, Reda | -
dc.date.accessioned | 2021-09-11T15:43:01Z | -
dc.date.available | 2021-09-11T15:43:01Z | -
dc.date.issued | 2008 | en_US
dc.identifier.issn | 1094-6977 | -
dc.identifier.issn | 1558-2442 | -
dc.identifier.uri | https://doi.org/10.1109/TSMCC.2008.923882 | -
dc.identifier.uri | https://hdl.handle.net/20.500.11851/6633 | -
dc.description.abstract | In the past few years, there has been an exponential increase in the amount of information available on the World Wide Web. This plethora of information can be extremely beneficial for users. However, the amount of human intervention that is currently required for this is inconvenient. Information extraction (IE) systems try to solve this problem by making the task as automatic as possible. Most of the existing approaches, however, require user feedback in one form or another during the extraction. This paper proposes a system that employs clustering techniques for automatic IE from HTML documents containing semistructured data. Using domain-specific information provided by the user, the proposed system parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine clusters, and finally, the output is reported. We employed a multiobjective genetic-algorithm-based clustering approach in the process; it is capable of finding the number of clusters and the most natural clustering. The proposed approach is tested by conducting experiments on a number of Web sites from different domains. To demonstrate the effectiveness of this approach, the results of the experiments are tested against those reported in the literature, and prove comparable. | en_US
dc.language.iso | en | en_US
dc.publisher | IEEE-Inst Electrical Electronics Engineers Inc | en_US
dc.relation.ispartof | IEEE Transactions On Systems Man And Cybernetics Part C-Applications And Reviews | en_US
dc.rights | info:eu-repo/semantics/closedAccess | en_US
dc.subject | clustering | en_US
dc.subject | Hypertext Markup Language (HTML) documents | en_US
dc.subject | information extraction (IE) | en_US
dc.subject | Web pages | en_US
dc.title | Employing clustering techniques for automatic information extraction from HTML documents | en_US
dc.type | Article | en_US
dc.department | Faculties, Faculty of Engineering, Department of Computer Engineering | en_US
dc.department | Fakülteler, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü | tr_TR
dc.identifier.volume | 38 | en_US
dc.identifier.issue | 5 | en_US
dc.identifier.startpage | 660 | en_US
dc.identifier.endpage | 673 | en_US
dc.identifier.wos | WOS:000259192000004 | en_US
dc.identifier.scopus | 2-s2.0-50649094223 | en_US
dc.institutionauthor | Özyer, Tansel | -
dc.identifier.doi | 10.1109/TSMCC.2008.923882 | -
dc.relation.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | en_US
item.cerifentitytype | Publications | -
item.languageiso639-1 | en | -
item.openairecristype | http://purl.org/coar/resource_type/c_18cf | -
item.openairetype | Article | -
item.fulltext | No Fulltext | -
item.grantfulltext | none | -
crisitem.author.dept | 02.1. Department of Artificial Intelligence Engineering | -
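Illustrative sketch (not part of the record): the abstract describes a pipeline that tokenizes an HTML document, partitions the tokens into clusters of similar elements, and estimates an extraction rule from the pattern in which tokens occur. The Python below is a minimal toy version of those three steps under loud assumptions. It is not the authors' system: grouping by tag path stands in for the multiobjective genetic-algorithm-based clustering, the "rule" is just the shortest repeating sequence of tag paths, and every name (PathTokenizer, cluster_by_path, repeating_pattern, SAMPLE) is hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are ignored when tracking the path.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTokenizer(HTMLParser):
    """Collect (tag_path, text) tokens from an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def cluster_by_path(tokens):
    """Group token texts by tag path (a simple stand-in for real clustering)."""
    clusters = defaultdict(list)
    for path, text in tokens:
        clusters[path].append(text)
    return dict(clusters)


def repeating_pattern(tokens):
    """Return the shortest path sequence whose repetition gives the token stream, or None."""
    paths = [path for path, _ in tokens]
    n = len(paths)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and paths == paths[:size] * (n // size):
            return paths[:size]
    return None


# Hypothetical two-record page used only to exercise the sketch.
SAMPLE = """
<table>
  <tr><td><b>Dune</b></td><td><i>9.99</i></td></tr>
  <tr><td><b>Emma</b></td><td><i>4.50</i></td></tr>
</table>
"""

parser = PathTokenizer()
parser.feed(SAMPLE)
clusters = cluster_by_path(parser.tokens)
rule = repeating_pattern(parser.tokens)
```

On the sample page, each cluster collects one field across records (titles under one tag path, prices under another), and the repeating pattern gives the field order within a record. Real pages are noisier, which is presumably why the paper relies on a clustering method that can also determine the number of clusters.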
Appears in Collections:
Bilgisayar Mühendisliği Bölümü / Department of Computer Engineering
Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection
Scopus citations: 26 (checked on Apr 13, 2024)
Web of Science citations: 13 (checked on Apr 13, 2024)
Page views: 54 (checked on Apr 15, 2024)
Items in GCRIS Repository are protected by copyright, with all rights reserved, unless otherwise indicated.