Abstract
This paper describes a system that uses entity and topic coherence to improve Text Segmentation (TS) accuracy. First, the Latent Dirichlet Allocation (LDA) algorithm was used to obtain topics for the sentences in a document. We then performed entity mapping across a window of sentences to trace how entities transition from sentence to sentence. We used this information to support our LDA-based boundary detection and to adjust the detected boundaries. We report the significance of the entity coherence approach as well as the superiority of our algorithm over existing works.
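The topic-based half of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sentence already has a topic vector (for example, from an LDA model), scores each gap between sentences by the cosine similarity of the summed topic vectors in the windows on either side, and treats local minima of that score as candidate boundaries. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def window_sum(vectors, start, end):
    """Element-wise sum of vectors[start:end]."""
    return [sum(column) for column in zip(*vectors[start:end])]

def coherence_scores(topic_vectors, window=3):
    """scores[g - 1] compares the windows left and right of gap g,
    where gap g lies between sentence index g - 1 and sentence index g."""
    n = len(topic_vectors)
    scores = []
    for gap in range(1, n):
        left = window_sum(topic_vectors, max(0, gap - window), gap)
        right = window_sum(topic_vectors, gap, min(n, gap + window))
        scores.append(cosine(left, right))
    return scores

def candidate_boundaries(scores):
    """Return the 0-based index of the first sentence of each new segment,
    taken at strict local minima of the coherence scores."""
    return [
        i + 1
        for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]

# Hypothetical toy input: four sentences on topic A, then four on topic B.
toy_vectors = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
toy_boundaries = candidate_boundaries(coherence_scores(toy_vectors))
```

On the toy input the score drops to zero at the gap before sentence index 4, so that index is the only candidate boundary. Real topic vectors are noisier, which is where a second signal such as entity coherence can help decide between neighbouring candidates.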
Notes
- 1.
- 2.
Otherwise called coherence score.
- 3.
We used the Stanford POS tagger. It is available at http://nlp.stanford.edu/software/tagger.shtml.
- 4.
Following our previous parameter \( w_{n} \), we use a window of 3 sentences as default.
- 5.
We use index here to mean the unique ID of a sentence, e.g., sentence 1 will have index 0, sentence 2 will have index 1, etc.
- 6.
i.e., the vector index which corresponds to the index of each sentence in the local minima.
- 7.
The Wikipedia dump was downloaded on July 30, 2015. It is accessible at https://dumps.wikimedia.org/enwiki/.
- 8.
It is available at https://radimrehurek.com/gensim/.
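Notes 4 to 6 mention a default window of 3 sentences, 0-based sentence indices, and local minima. The following is a hedged sketch of how entity overlap across such a window could be used to adjust a candidate boundary. It is an illustration under stated assumptions, not the method of the paper: entities are assumed to be already extracted per sentence (for instance, nouns found with a POS tagger), overlap is measured with the Jaccard coefficient, and the `shift` parameter and both function names are hypothetical.

```python
def entity_overlap(entities, boundary, window=3):
    """Jaccard overlap between the entities in the `window` sentences
    before `boundary` and the `window` sentences starting at `boundary`.
    `entities` is a list of sets, one per sentence, indexed from 0."""
    before = set().union(*entities[max(0, boundary - window):boundary])
    after = set().union(*entities[boundary:boundary + window])
    union = before | after
    return len(before & after) / len(union) if union else 0.0

def adjust_boundary(entities, boundary, window=3, shift=1):
    """Move a candidate boundary by at most `shift` sentences to the
    position with the lowest entity overlap; ties keep the position
    closest to the original candidate."""
    candidates = [
        b
        for b in range(boundary - shift, boundary + shift + 1)
        if 0 < b < len(entities)
    ]
    if not candidates:
        return boundary
    return min(
        candidates,
        key=lambda b: (entity_overlap(entities, b, window), abs(b - boundary)),
    )

# Hypothetical toy input: entity sets for seven sentences.
toy_entities = [
    {"court"}, {"court", "judge"}, {"judge"}, {"judge", "appeal"},
    {"tax"}, {"tax", "income"}, {"income"},
]
adjusted = adjust_boundary(toy_entities, boundary=3)
```

In the toy input, a candidate boundary at sentence index 3 still shares the entity "judge" with the preceding window, while index 4 shares none, so the candidate moves to index 4.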
Acknowledgments
Kolawole J. Adebayo has received funding from the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. Luigi Di Caro and Guido Boella have received funding from the European Union’s H2020 research and innovation programme under the grant agreement No 690974 for the project “MIREL: MIning and REasoning with Legal texts”.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
John, A.K., Di Caro, L., Boella, G. (2017). Text Segmentation with Topic Modeling and Entity Coherence. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_18
DOI: https://doi.org/10.1007/978-3-319-52941-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52940-0
Online ISBN: 978-3-319-52941-7
eBook Packages: Engineering, Engineering (R0)