Abstract
Studying, understanding and exploiting the content of a document collection requires automatic techniques that can effectively support users in extracting useful information from it and in reasoning with this information. Concept networks (e.g., taxonomies) may play a relevant role in this perspective, but they are seldom available and cannot be manually built and maintained cheaply and reliably. On the other hand, automated learning of these resources from text needs to be robust with respect to missing or partial knowledge, because often only sparse fragments of the target network can be extracted. This work presents ConNeKTion, a tool that is able to learn concept networks from plain text and to structure and enrich them by finding concept generalizations. The proposed methodologies are general and applicable to any language. ConNeKTion also provides functionalities for exploiting the learned knowledge, and a control panel that allows the user to comfortably carry out these activities. Several experiments and applications are reported, showing the usefulness and flexibility of ConNeKTion.




Notes
1 A vertex-induced sub-graph is a subset of the vertices of a graph together with all edges in the graph whose endpoints are both in the subset.
2 A weak component is defined as a maximal sub-graph in which there exists a path between all pairs of vertices (considering edges as undirected).
3 First the synsets of each word are extracted from WordNet; then, for each synset, all the associated domains in WordNet Domains are selected; finally, each domain is weighted according to the density function presented in [1], which depends on the number of domains to which each synset belongs, on the number of synsets associated with each word, and on the number of words that make up the sentence. Each synset of a word is weighted based on the weights of the associated domains, and the one with the highest weight is selected.
4 Note that this is different from the spreading activation algorithm [4], in that (1) graph traversal is not affected by weights on edges or by thresholds, (2) we focus on paths rather than nodes, and specifically we are interested in the path(s) between two particular nodes rather than in the whole graph activation, hence (3) in our approach setting the initial activation weight of start nodes makes no sense, and (4) this allows us to exploit a bi-directional partial search rather than a mono-directional complete graph traversal.
5 Again, this is not spreading activation, even if weights on edges are exploited.
6 A technique to semi-automatically extract a domain-specific ontology from free text without using external resources but focusing on Hub Words. After building the ontology, the ‘Hub Weight’ of a word $t$ is computed as:
$$W(t) = \alpha w_{0} + \beta n + \gamma \sum_{i=1}^{n} w(t_{i})$$
where $w_{0}$ is a given initial weight, $n$ is the number of relationships in which $t$ is involved, $w(t_{i})$ is the $tf \ast idf$ weight of the $i$-th word related to $t$, and $\alpha + \beta + \gamma = 1$. These elements, with some modifications, appear in the first three terms of our formula.
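The Hub Weight formula in note 6 can be illustrated with a short Python sketch. The function name `hub_weight` and its parameter names are hypothetical, chosen here only for illustration; the tf*idf weights of the related words are assumed to be computed elsewhere and passed in as plain numbers, one per relationship in which the word is involved.

```python
from typing import Sequence


def hub_weight(
    related_weights: Sequence[float],
    w0: float,
    alpha: float,
    beta: float,
    gamma: float,
) -> float:
    """Compute W(t) = alpha * w0 + beta * n + gamma * sum_i w(t_i).

    related_weights holds w(t_1), ..., w(t_n): one (assumed precomputed)
    tf*idf weight per word related to t, so n is simply its length.
    """
    # The note requires the three coefficients to sum to 1.
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    n = len(related_weights)
    return alpha * w0 + beta * n + gamma * sum(related_weights)
```

For instance, with `w0 = 1.0`, coefficients `(0.2, 0.3, 0.5)` and two related words weighted `0.4` and `0.6`, the three terms are `0.2`, `0.6` and `0.5`.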
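The two graph notions defined in notes 1 and 2 can be made concrete with a small Python sketch. It assumes a directed graph stored as a dictionary that maps each vertex to the set of its successors; this representation and the function names `induced_subgraph` and `weak_components` are illustrative assumptions, not the data structures used by ConNeKTion.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def induced_subgraph(graph: Graph, keep: Iterable[Hashable]) -> Graph:
    """Keep the chosen vertices and every edge whose endpoints are both kept."""
    kept = set(keep)
    return {v: {w for w in graph.get(v, ()) if w in kept} for v in kept}


def weak_components(graph: Graph) -> List[Set[Hashable]]:
    """Return the maximal vertex sets connected when edge direction is ignored."""
    # Build a symmetric adjacency so that every edge can be walked both ways.
    undirected: Graph = {}
    for v, successors in graph.items():
        undirected.setdefault(v, set())
        for w in successors:
            undirected[v].add(w)
            undirected.setdefault(w, set()).add(v)

    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in undirected:
        if start in seen:
            continue
        # Breadth-first visit collecting everything reachable from start.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in undirected[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components
```

With edges `dog -> animal`, `cat -> animal` and `car -> vehicle`, the sketch yields two weak components, one per group of connected concepts, which matches the situation described in the abstract where only sparse fragments of the target network are extracted.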
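The bi-directional partial search mentioned in note 4 can be sketched as a generic bidirectional breadth-first search between two given nodes. This is a textbook version written under stated assumptions, not the authors' implementation: edges are unweighted, the adjacency dictionary is symmetric (each edge listed in both directions), and the name `bidirectional_path` is hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Hashable, List, Optional, Set

Adjacency = Dict[Hashable, Set[Hashable]]
Parents = Dict[Hashable, Optional[Hashable]]


def bidirectional_path(
    adj: Adjacency, source: Hashable, target: Hashable
) -> Optional[List[Hashable]]:
    """Return a path from source to target with the fewest edges, or None."""
    if source not in adj or target not in adj:
        return None
    if source == target:
        return [source]

    parents_fwd: Parents = {source: None}
    parents_bwd: Parents = {target: None}
    frontier_fwd: Deque[Hashable] = deque([source])
    frontier_bwd: Deque[Hashable] = deque([target])

    def expand(frontier: Deque[Hashable], parents: Parents, other: Parents):
        # Expand one whole BFS level; stop as soon as the two searches meet.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbour in adj.get(node, ()):
                if neighbour not in parents:
                    parents[neighbour] = node
                    if neighbour in other:
                        return neighbour
                    frontier.append(neighbour)
        return None

    while frontier_fwd and frontier_bwd:
        # Grow the smaller frontier, so only part of the graph is visited.
        if len(frontier_fwd) <= len(frontier_bwd):
            meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        else:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            # Walk back to source, then forward from the meeting node to target.
            path: List[Hashable] = []
            node: Optional[Hashable] = meet
            while node is not None:
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:
                path.append(node)
                node = parents_bwd[node]
            return path
    return None
```

As in the note, no edge weights, thresholds or initial activation values are involved: the search only looks for a path between the two given nodes and stops once the two frontiers touch.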
References
Angioni M, Demontis R, Tuveri F (2008) A semantic approach for resource cataloguing and query resolution. Commun SIWN Spec Issue Distrib Agent-based Retr Tools 5:62–66
Argamon S, Whitelaw C, Chase P, Hota SR, Garg N, Levitan S (2007) Stylistic text classification using functional lexical features: research articles. J Am Soc Inf Sci Technol 58(6):802–822
Cimiano P, Hotho A, Staab S (2005) Learning concept hierarchies from text corpora using formal concept analysis. J Artif Int Res 24(1):305–339
Crestani F (1997) Application of spreading activation techniques in information retrieval. Artif Intell Rev 11:453–482
Deerwester S (1988) Improving information retrieval with latent semantic indexing. In: Borgman CL, Pai EYH (eds) Proceedings of the 51st ASIS annual meeting (ASIS 88), vol 25. American Society for Information Science, Atlanta
Defays D (1977) An efficient algorithm for a complete link method. Comput J 20(4):364–366
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. In: Machine learning, pp 143–175
Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
Ferilli S (2011) Automatic digital document processing and management, problems, algorithms and techniques, 1st edn. Springer Publishing Company, Incorporated
Ferilli S, Basile TMA, Di Mauro N, Esposito F (2011) Plugging numeric similarity in first-order logic horn clauses comparison. In: Pirrone R, Sorbello F (eds) 7th international conference of the Italian association for artificial intelligence, vol 6934. Springer, LNCS, pp 33–44
Ferilli S, Biba M, Basile TMA, Esposito F (2009) Combining qualitative and quantitative keyword extraction methods with document layout analysis. In: Post-proceedings of the 5th Italian research conference on digital libraries - IRCDL 2009, Padova Italy, 29–30 January 2009, pp 22–33
Ferilli S, Biba M, Di Mauro N, Basile TMA, Esposito F (2009) Plugging taxonomic similarity in first-order logic horn clauses comparison. In: Emergent perspectives in artificial intelligence, lecture notes in artificial intelligence. Springer, pp 131–140
Ferilli S, Leuzzi F, Rotella F (2011) Cooperating techniques for extracting conceptual taxonomies from text. In: Proceedings of the workshop on mining complex patterns at AI*IA 7th conference
Gale WA, Church KW, Yarowsky D (1992) One sense per discourse. In: DARPA speech and natural language workshop
Gupta V, Lehal G (2009) A survey of text mining techniques and applications. J Emerg Tech Web Intell 1(1):60–76
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160
Hasegawa R, Kitamura M, Kaiya H, Saeki M (2009) Extracting conceptual graphs from Japanese documents for software requirements modeling. In: Proceedings of the 6th APCCM, APCCM 09, vol 96. Australian Computer Society, Inc., Darlinghurst, Australia, pp 87–96
Hensman S (2004) Construction of conceptual graph representation of texts. In: Proceedings of the student research workshop at HLT-NAACL 2004, HLT-SRWS 04. Association for Computational Linguistics, Stroudsburg, pp 49–54
Jones WP, Furnas GW (1987) Pictures of relevance: a geometric analysis of similarity measures. J Amer Soc Inf Sci 38(6):420–442
Karypis G, Han E-H (2000) Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical report tr-00-0016, University of Minnesota
Karypis G, (Sam) Han E-H (2000) Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical report, IN CIKM00
Kimmig A, Costa VS, Rocha R, Demoen B, De Raedt L (2008) On the efficient execution of ProbLog programs. In: Garcia de la Banda M, Pontelli E (eds) ICLP, Lecture notes in computer science, vol 5366. Springer, pp 175–189
Kimmig A, De Raedt L, Toivonen H (2007) Probabilistic explanation based learning. In: ECML, pp 176–187
Kipper K, Dang HT, Palmer M (2000) Class-based construction of a verb lexicon. In: Proceedings of the 17th NCAI and 12th IAAI conference. AAAI Press, pp 691–696
Klein D, Manning CD (2003) Fast exact inference with a factored model for natural language parsing. In: Advances in neural information processing systems, vol 15. MIT Press
Koo S-O, Lim S-Y, Lee S-J (2003) Constructing an ontology based on hub words. In: ISMIS03, pp 93–97
Leuzzi F, Ferilli S, Rotella F (2013) ConNeKTion: a tool for handling conceptual graphs automatically extracted from text. In: Catarci T, Ferro N, Poggi A (eds) Bridging between cultural heritage institutions, Proceedings of the 9th Italian research conference on digital libraries (IRCDL 2013), CCIS, vol 385. Springer
Leuzzi F, Ferilli S, Rotella F (2013) Improving robustness and flexibility of concept taxonomy learning from text. In: Appice A, Ceci M, Loglisci C, Manco G, Masciari E, Ras ZW (eds) New frontiers in mining complex patterns - first International Workshop, NFMCP 2012, Held in Conjunction with ECML/PKDD 2012, Bristol, UK, September 24, 2012 Revised Selected Papers, CCIS, vol 7765. Springer, pp 232–244
Leuzzi F, Ferilli S, Taranto C, Rotella F (2013) A relational unsupervised approach to author identification. In: Workshop new frontiers in mining complex patterns 2013 held at ECML-PKDD 2013
Maedche A, Staab S (2000) Mining ontologies from text. In: EKAW, pp 189–202
Maedche A, Staab S (2000) The text-to-onto ontology learning environment. In: ICCS-2000 — 8th international conference on conceptual structures, software demonstration
Magnini B, Cavaglià G (2000) Integrating subject field codes into WordNet, pp 1413–1418
De Marneffe M-C, MacCartney B, Manning CD (2006) Generating typed dependency parses from phrase structure parses. In: Proceedings international conference on language resources and evaluation (LREC), pp 449–454
Matsuo Y, Ishizuka M (2004) Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools 13:2003
McCarthy PM, Lewis GA, Dufty DF, McNamara DS (2006) Analyzing writing styles with Coh-Metrix. In: Sutcliffe G, Goebel R (eds) Proceedings of the Florida artificial intelligence research society international conference (FLAIRS). AAAI Press, pp 764–769
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Ogata N (2001) A formal ontology discovery from web documents. In: Web intelligence: research and development, 1st Asia-Pacific conference (WI 2001), lecture notes on artificial intelligence, no 2198. Springer, pp 514–519
O’Madadhain J, Fisher D, White S, Boey Y (2003) The JUNG (Java Universal Network/Graph) framework. Technical report, UCI-ICS
Qiu L, Kan M-Y, Chua T-S (2004) A public reference implementation of the RAP anaphora resolution algorithm. In: Proceedings of the 4th international conference on language resources and evaluation, LREC 2004, May 26–28, 2004. European Language Resources Association, Lisbon, pp 291–294
De Raedt L, Kimmig A, Toivonen H (2007) ProbLog: a probabilistic Prolog and its application in link discovery. In: Proceedings of 20th IJCAI. AAAI Press, pp 2468–2473
Raghavan S, Kovashka A, Mooney R (2010) Authorship attribution using probabilistic context-free grammars. In: Proceedings of the ACL 2010 conference short papers, ACLShort 10. Association for Computational Linguistics, Stroudsburg, pp 38–42
Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1996) Okapi at TREC-3, pp 109–126
Rotella F, Ferilli S, Leuzzi F (2013) An approach to automated learning of conceptual graphs from text. In: Ali M, Bosse T, Hindriks KV, Hoogendoorn M, Jonker CM, Treur J (eds) Recent trends in applied artificial intelligence, 26th international conference on industrial, engineering and other applications of applied intelligent systems, IEA/AIE 2013, Amsterdam, The Netherlands, 17–21 June 2013, Proceedings, Lecture notes in computer science, vol 7906. Springer, pp 341–350
Rotella F, Ferilli S, Leuzzi F (2013) A domain based approach to information retrieval in digital libraries. In: Agosti M, Esposito F, Ferilli S, Ferro N (eds) Digital libraries and archives - 8th Italian research conference, IRCDL 2012, Bari, Italy, 9–10 Feb 2012. Revised selected papers, CCIS, vol 354. Springer-Verlag, Berlin Heidelberg, pp 129–140
Salton G (1971) The SMART retrieval system experiments in automatic document processing. Prentice-Hall, Upper Saddle River
Salton G (1980) Automatic term class construction using relevance–a summary of work in automatic pseudoclassification. Inf Process Manage 16(1):1–15
Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill Book Company
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18:613–620
Sato T (1995) A statistical learning method for logic programs with distribution semantics. In: Proceedings of the 12th ICLP 1995. MIT Press, pp 715–729
Semeraro G, Esposito F, Malerba D, Fanizzi N, Ferilli S (1997) A logic framework for the incremental inductive synthesis of Datalog theories. In: Fuchs NE (ed) LOPSTR, Lecture notes in computer science, vol 1463. Springer, pp 300–321
Shamsfard M, Barforoush AA (2004) Learning ontologies from natural language texts. Int J Hum-Comput Stud 60(1):17–63
Singhal A, Buckley C, Mitra M, Mitra A (1996) Pivoted document length normalization. ACM Press, pp 21–29
Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556
van Halteren H (2004) Linguistic profiling for author recognition and verification. In: Proceedings of the 42nd annual meeting on association for computational linguistics, ACL 04. Association for Computational Linguistics, Stroudsburg
Velardi P, Navigli R, Cucchiarelli A, Neri F (2006) Evaluation of OntoLearn, a methodology for automatic population of domain ontologies. In: Ontology learning from text: methods, applications and evaluation. IOS Press
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics. Association for Computational Linguistics, Morristown, pp 133–138
Zesch T, Müller C, Gurevych I (2008) Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In: Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco, Electronic proceedings
Zheng R, Li J, Chen H, Huang Z (2006) A framework for authorship identification of online messages: writing-style features and classification techniques. J Am Soc Inf Sci Technol 57(3):378–393
Acknowledgments
This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 “Puglia@Service”.
Cite this article
Rotella, F., Leuzzi, F. & Ferilli, S. Learning and exploiting concept networks with ConNeKTion. Appl Intell 42, 87–111 (2015). https://doi.org/10.1007/s10489-014-0543-z
DOI: https://doi.org/10.1007/s10489-014-0543-z