• Open access
  • Published: 29 June 2017

Text mining and semantics: a systematic mapping study

  • Roberta Akemi Sinoara   ORCID: orcid.org/0000-0001-8572-2747 1 ,
  • João Antunes 1 &
  • Solange Oliveira Rezende 1  

Journal of the Brazilian Computer Society volume 23, Article number: 9 (2017)

As text semantics has an important role in text meaning, the term semantics has been seen in a wide variety of text mining studies. However, there is a lack of studies that integrate the different research branches and summarize the developed works. This paper reports a systematic mapping of semantics-concerned text mining studies. This systematic mapping study followed a well-defined protocol. Its results were based on 1693 studies, selected among 3984 studies identified in five digital libraries. The produced mapping gives a general summary of the subject, points out some areas that lack the development of primary or secondary studies, and can be a guide for researchers working with semantics-concerned text mining. It demonstrates that, although several studies have been developed, the processing of semantic aspects in text mining remains an open research problem.

Introduction

Text mining techniques have become essential for supporting knowledge discovery as the volume and variety of digital text documents have increased, either in social networks and the Web or inside organizations. Text sources, as well as text mining applications, are varied. Although there is not a consensual definition established among the different research communities [ 1 ], text mining can be seen as a set of methods used to analyze unstructured data and discover patterns that were unknown beforehand [ 2 ].

A general text mining process can be seen as a five-step process, as illustrated in Fig. 1 . The process starts with the specification of its objectives in the problem identification step. The text mining analyst, preferably working along with a domain expert, must delimit the text mining application scope, including the text collection that will be mined and how the result will be used. The specifications stated in the problem identification step will guide the next steps of the text mining process, which can be executed in cycles of data preparation (pre-processing step), knowledge discovery (pattern extraction step), and knowledge evaluation (post-processing step).

Fig. 1. A general text mining process

The pre-processing step is about preparing data for pattern extraction. In this step, raw text is transformed into some data representation format that can be used as input for the knowledge extraction algorithms. The activities performed in the pre-processing step are crucial for the success of the whole text mining process. The data representation must preserve the patterns hidden in the documents in a way that they can be discovered in the next step. In the pattern extraction step, the analyst applies a suitable algorithm to extract the hidden patterns. The algorithm is chosen based on the data available and the type of pattern that is expected. The extracted knowledge is evaluated in the post-processing step. If this knowledge meets the process objectives, it can be made available to the users, starting the final step of the process, the knowledge usage. Otherwise, another cycle must be performed, making changes in the data preparation activities and/or in the pattern extraction parameters. If any changes in the stated objectives or selected text collection must be made, the text mining process should be restarted at the problem identification step.
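The cycle described above can be sketched as a simple loop. This is only a minimal illustration under stated assumptions, not the process as implemented in any particular tool: the function names (`preprocess`, `extract_patterns`, `meets_objectives`, `run_cycles`), the stop-word settings, and the choice of "most frequent terms" as the extracted pattern are all hypothetical.

```python
from collections import Counter

# Hypothetical pre-processing settings; each cycle tries the next, stricter one.
STOPWORD_SETTINGS = [set(), {"the", "a", "of", "by"}]


def preprocess(docs, stopwords):
    """Pre-processing: turn raw text into token lists (a very simple representation)."""
    return [[t for t in d.lower().split() if t not in stopwords] for d in docs]


def extract_patterns(token_lists, k=2):
    """Pattern extraction: here, simply the k most frequent terms."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [term for term, _ in counts.most_common(k)]


def meets_objectives(patterns, required_term):
    """Post-processing: evaluate the extracted knowledge against the stated objective."""
    return required_term in patterns


def run_cycles(docs, required_term):
    """Repeat preparation, extraction, and evaluation until the objective is met."""
    for cycle, stopwords in enumerate(STOPWORD_SETTINGS, start=1):
        patterns = extract_patterns(preprocess(docs, stopwords))
        if meets_objectives(patterns, required_term):
            return cycle, patterns  # the knowledge usage step would start here
    return None, []  # objectives not met: go back to problem identification


docs = ["the company acquired a company", "the company acquired the firm"]
cycle, patterns = run_cycles(docs, "acquired")
```

In this toy run the first cycle is dominated by the word "the", so the evaluation fails and a second cycle with a changed data preparation setting is needed.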

Text data are not naturally in a format that is suitable for pattern extraction, which brings additional challenges to an automatic knowledge discovery process. The meaning of natural language texts basically depends on lexical, syntactic, and semantic levels of linguistic knowledge. Each level is more complex and requires more sophisticated processing than the previous level. This is a common trade-off when dealing with natural language processing: expressiveness versus processing cost. Thus, lexical and syntactic components have been more broadly explored in text mining than the semantic component [ 2 ]. Recently, text mining researchers have become more interested in text semantics, looking for improvements in the text mining results. This increasing interest can be attributed both to the progress of computing capacity, which is constantly reducing the processing time, and to developments in the natural language processing field, which allow deeper processing of raw texts.

In order to compare the expressiveness of each level of text interpretation (lexical, syntactic, and semantic), consider two simple sentences:

1. Company A acquired Company B.

2. Company B acquired Company A.

Sentences 1 and 2 have opposite meanings, but they have the same terms (“Company”, “A”, “B”, “acquired”). Thus, if we analyze these sentences only at the lexical level, it is not possible to differentiate them. However, considering the sentence syntax, we can see that they are opposite. They have the same verb, and the subject of one sentence is the object of the other sentence and vice versa. If we analyze a little deeper, now considering the sentence semantics, we find that in sentence 1, “Company A” has the semantic role of agent regarding the verb “acquire” and “Company B” has the semantic role of theme . The same can be said of a third sentence:

3. Company B was acquired by Company A.

Despite the fact that syntactically sentences 1 and 3 have opposite subjects and objects, they have the same semantic roles. Thus, at the semantic level, they have the same meaning. If we go deeper and consider semantic relations among words (such as synonymy), we can find that sentence 4 also expresses the same event:

4. Company A purchased Company B.

Besides, going even deeper in the interpretation of the sentences, we can understand their meaning—they are related to some takeover—and we can, for example, infer that there will be some impacts on the business environment.
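The semantic-level comparison of the four sentences can be made concrete with a small sketch. The role frames below are annotated by hand (no parser or semantic role labeler is run), and the `Frame` type, the `SYNONYMS` table, and the helper names are hypothetical, introduced only for this illustration.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    predicate: str
    agent: str
    theme: str


# Hand-annotated frames for sentences 1-4 (an assumption: nothing is parsed here).
FRAMES = {
    1: Frame("acquire", "Company A", "Company B"),   # Company A acquired Company B.
    2: Frame("acquire", "Company B", "Company A"),   # Company B acquired Company A.
    3: Frame("acquire", "Company A", "Company B"),   # Company B was acquired by Company A.
    4: Frame("purchase", "Company A", "Company B"),  # Company A purchased Company B.
}

# Toy synonym table standing in for a lexical resource.
SYNONYMS = {"purchase": "acquire"}


def normalize(frame: Frame) -> Frame:
    """Map the predicate to a canonical synonym so related verbs compare equal."""
    return frame._replace(predicate=SYNONYMS.get(frame.predicate, frame.predicate))


def same_event(i: int, j: int) -> bool:
    """Two sentences describe the same event if their normalized frames match."""
    return normalize(FRAMES[i]) == normalize(FRAMES[j])
```

With these frames, sentences 1 and 3 match directly, sentence 4 matches only after the synonym mapping is applied, and sentence 2 never matches sentence 1.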

Traditionally, text mining techniques are based on both a bag-of-words representation and the application of data mining techniques. In this approach, only the lexical component of the texts is considered. In order to get a more complete analysis of text collections and get better text mining results, several researchers have directed their attention to text semantics.
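To see why a bag-of-words representation captures only the lexical component, consider a minimal sketch applied to sentences 1 and 2. The tokenization used here (lowercasing, dropping the final period, splitting on whitespace) is an assumption chosen for brevity.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase, strip the final period, split on whitespace, and count terms."""
    return Counter(sentence.lower().rstrip(".").split())


s1 = bag_of_words("Company A acquired Company B.")
s2 = bag_of_words("Company B acquired Company A.")

# The two bags are equal although the sentences have opposite meanings.
identical = s1 == s2
```

Because word order is discarded, any method that works only on these counts cannot tell which company is the acquirer.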

Text semantics can be considered in the three main steps of the text mining process: pre-processing, pattern extraction, and post-processing. In the pre-processing step, data representation can be based on some sort of semantic aspect of the text collection. In the pattern extraction step, semantic information can be used to guide the model generation or to refine it. In the post-processing step, the extracted patterns can be evaluated based on semantic aspects. In any case, text mining based on text semantics can go further than text mining based only on lexicon or syntax. A proper treatment of text semantics can lead to more appropriate results for certain applications [ 2 ]. For example, semantic information has an important impact on document content and can be crucial to differentiate documents which, despite the use of the same vocabulary, present different ideas about the same subject.

The term semantics has been seen in a wide variety of text mining studies. However, there is a lack of studies that integrate the different branches of research performed to incorporate text semantics in the text mining process. Secondary studies, such as surveys and reviews, can integrate and organize the studies that were already developed and guide future works.

Thus, this paper reports a systematic mapping study to overview the development of semantics-concerned studies and fill a literature review gap in this broad research field through a well-defined review process. Semantics can be related to a vast number of subjects, and most of them are studied in the natural language processing field. As examples of semantics-related subjects, we can mention representation of meaning, semantic parsing and interpretation, word sense disambiguation, and coreference resolution. Nevertheless, the focus of this paper is not on semantics but on semantics-concerned text mining studies. As the term semantics appears in text mining studies in different contexts, this systematic mapping aims to present a general overview and point out some areas that lack the development of primary studies and those areas in which secondary studies would be of great help. This paper aims to point out some directions to the reader who is interested in semantics-concerned text mining research.

As it covers a wide research field, this systematic mapping study started with a space of 3984 studies, identified in five digital libraries. Due to time and resource limitations, except for survey papers, the mapping was done primarily through information found in paper abstracts. Therefore, our intention is to present an overview of semantics-concerned text mining, presenting a map of studies that have been developed by the research community, and not to present deep details of the studies. The papers were analyzed in relation to their application domains, performed tasks, applied methods and resources, and level of user’s interaction. The contribution of this paper is threefold: (i) it presents an overview of semantics-concerned text mining studies from a text mining viewpoint, organizing the studies according to seven aspects (application domains, languages, external knowledge sources, tasks, methods and algorithms, representation models, and user’s interaction); (ii) it quantifies and confirms some prior impressions that we had about our study subject; and (iii) it provides a starting point for researchers or practitioners who are starting work on semantics-concerned text mining.

The remainder of this paper is organized as follows. The “ Method applied for systematic mapping ” section presents an overview of the systematic mapping method, since this is the type of literature review selected to develop this study and it is not widespread in the text mining community. In this section, we also present the protocol applied to conduct the systematic mapping study, including the research questions that guided this study and how it was conducted. The results of the systematic mapping, as well as identified future trends, are presented in the “ Results and discussion ” section. The “ Conclusion ” section concludes this work.

Method applied for systematic mapping

The review reported in this paper is the result of a systematic mapping study, which is a particular type of systematic literature review [ 3 , 4 ]. A systematic literature review is a formal literature review adopted to identify, evaluate, and synthesize evidence of empirical results in order to answer a research question. It is extensively applied in medicine, as part of evidence-based medicine [ 5 ]. This type of literature review is not as disseminated in the computer science field as it is in the medicine and health care fields 1 , although computer science research can also take advantage of this type of review. We can find important reports on the use of systematic reviews especially in the software engineering community [ 3 , 4 , 6 , 7 ]. Other sparse initiatives can also be found in other computer science areas, such as cloud-based environments [ 8 ], image pattern recognition [ 9 ], biometric authentication [ 10 ], recommender systems [ 11 ], and opinion mining [ 12 ].

A systematic review is performed in order to answer a research question and must follow a defined protocol. The protocol is developed when planning the systematic review, and it is mainly composed of the research questions, the strategies and criteria for searching for primary studies, study selection, and data extraction. The protocol is a documentation of the review process and must have all the information needed to perform the literature review in a systematic way. The analysis of selected studies, which is performed in the data extraction phase, will provide the answers to the research questions that motivated the literature review. Kitchenham and Charters [ 3 ] present a very useful guideline for planning and conducting systematic literature reviews. As systematic reviews follow a formal, well-defined, and documented protocol, they tend to be less biased and more reproducible than a regular literature review.

When the field of interest is broad and the objective is to have an overview of what is being developed in the research field, it is recommended to apply a particular type of systematic review named systematic mapping study [ 3 , 4 ]. Systematic mapping studies follow a well-defined protocol as in any systematic review. The main differences between a traditional systematic review and a systematic mapping are their breadth and depth. While a systematic review deeply analyzes a small number of primary studies, in a systematic mapping a larger number of studies is analyzed, but in less detail. Thus, the search terms of a systematic mapping are broader and the results are usually presented through graphs. Systematic mapping studies can be used to get a mapping of the publications about some subject or field and identify areas that require the development of more primary studies and areas in which a narrower systematic literature review would be of great help to the research community.

This paper reports a systematic mapping study conducted to get a general overview of how text semantics is being treated in text mining studies. It fills a literature review gap in this broad research field through a well-defined review process. As a systematic mapping, our study follows the principles of a systematic mapping/review. However, as our goal was to develop a general mapping of a broad field, our study differs from the procedure suggested by Kitchenham and Charters [ 3 ] in two ways. Firstly, Kitchenham and Charters [ 3 ] state that the systematic review should be performed by two or more researchers. Although our mapping study was planned by two researchers, the study selection and the information extraction phases were conducted by only one due to resource constraints. In this process, the other researchers reviewed the execution of each systematic mapping phase and their results. Secondly, systematic reviews are usually based on primary studies only; nevertheless, we also accepted secondary studies (reviews or surveys), as we wanted an overview of all publications related to the theme.

In the following subsections, we describe our systematic mapping protocol and how this study was conducted.

Systematic mapping planning

The first step of a systematic review or systematic mapping study is its planning. The researchers conducting the study must define its protocol, i.e., its research questions and the strategies for identification, selection of studies, and information extraction, as well as how the study results will be reported. The main parts of the protocol that guided the systematic mapping study reported in this paper are presented in the following.

Research question: the main research question that guided this study was “How is semantics considered in text mining studies?” The main question was detailed in seven secondary questions, all of them related to text mining studies that consider text semantics in some way:

What are the application domains that focus on text semantics?

What are the natural languages being considered when working with text semantics?

Which external sources are frequently used in text mining studies when text semantics is considered?

In which text mining tasks is text semantics most often considered?

What methods and algorithms are commonly used?

How can texts be represented?

Do users or domain experts take part in the text mining process?

Study identification: studies were identified through searches conducted in five digital libraries: ACM Digital Library, IEEE Xplore, Science Direct, Web of Science, and Scopus. The following general search expression was applied in both Title and Keywords fields, when allowed by the digital library search engine: semantic* AND text* AND (mining OR representation OR clustering OR classification OR association rules) .
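The logic of the search expression can be emulated locally, for instance to check whether a given title would have been retrieved. The sketch below is an approximation under stated assumptions: each digital library has its own query syntax and matching rules, the title and keywords are simply concatenated into one field, and the helper names are hypothetical.

```python
import re

# Terms of the OR clause of the search expression.
TASK_TERMS = ["mining", "representation", "clustering", "classification", "association rules"]


def has_prefix(text: str, prefix: str) -> bool:
    """True if some word in text starts with prefix (emulates the * wildcard)."""
    return re.search(r"\b" + re.escape(prefix) + r"\w*", text, re.IGNORECASE) is not None


def has_phrase(text: str, phrase: str) -> bool:
    """True if phrase occurs in text as whole words."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None


def matches_search_expression(title: str, keywords: str = "") -> bool:
    """semantic* AND text* AND (mining OR representation OR clustering OR ...)."""
    field = f"{title} {keywords}"  # assumption: both fields are searched together
    return (
        has_prefix(field, "semantic")
        and has_prefix(field, "text")
        and any(has_phrase(field, term) for term in TASK_TERMS)
    )
```

For example, `matches_search_expression("Text mining and semantics: a systematic mapping study")` is true, while a title such as "Semantic image clustering" is not matched because no word starts with "text".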

Study selection: every study returned in the search phase went to the selection phase. Studies were selected based on title, abstract, and paper information (such as the number of pages). Through this analysis, duplicated studies (most of them were studies found in more than one database) were identified. In addition, studies that matched at least one of the following exclusion criteria were rejected: (i) one-page papers, posters, presentations, abstracts, and editorials; (ii) papers hosted in services with restricted access and not accessible; (iii) papers written in languages other than English or Portuguese; and (iv) studies that do not deal with text mining and text semantics.
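The selection step can be sketched as a filter over the retrieved records. This is a hedged illustration only: the `Study` record, the title-based duplicate key, and the use of a page count to approximate criterion (i) are assumptions, and criterion (iv) is a human judgment that is represented here by a manually set flag.

```python
from dataclasses import dataclass


@dataclass
class Study:
    title: str
    pages: int
    language: str
    accessible: bool = True
    on_topic: bool = True  # criterion (iv) is judged by a person; stored as a flag


def normalized_title(study: Study) -> str:
    """Case- and punctuation-insensitive key used to spot duplicates (an assumption)."""
    return "".join(ch for ch in study.title.lower() if ch.isalnum())


def exclusion_reason(study: Study):
    """Return the first exclusion criterion the study matches, or None."""
    if study.pages <= 1:
        return "i"    # approximates one-page papers, posters, abstracts, editorials
    if not study.accessible:
        return "ii"   # full text not accessible
    if study.language not in {"English", "Portuguese"}:
        return "iii"  # language other than English or Portuguese
    if not study.on_topic:
        return "iv"   # does not deal with text mining and text semantics
    return None


def select(studies):
    """Split studies into accepted, duplicated, and rejected lists."""
    seen, accepted, duplicated, rejected = set(), [], [], []
    for study in studies:
        key = normalized_title(study)
        if key in seen:
            duplicated.append(study)
            continue
        seen.add(key)
        if exclusion_reason(study) is not None:
            rejected.append(study)
        else:
            accepted.append(study)
    return accepted, duplicated, rejected
```

Duplicates are checked before the exclusion criteria so that a study retrieved from several libraries is counted only once.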

Information extraction: the information extraction phase was performed with papers accepted in the selection phase (papers that were not identified as duplicated or rejected). The abstracts were read in order to extract the information presented in Fig. 2 .

Fig. 2. Information extraction form

Like any literature review, this study has some bias. The advantage of a systematic literature review is that the protocol clearly specifies its bias, since the review process is well-defined. There are biases related to (i) study identification, i.e., only papers matching the search expression and returned by the searched digital libraries were selected; (ii) selection criteria, i.e., papers that match the exclusion criteria were rejected; and (iii) information extraction, i.e., the information was mainly extracted considering only titles and abstracts. It is not feasible to conduct a literature review free of bias. However, it is possible to conduct it in a controlled and well-defined way through a systematic process.

Systematic mapping conduction

The conduction of this systematic mapping followed the protocol presented in the previous subsection and is illustrated in Fig. 3 . The selection and the information extraction phases were performed with support of the Start tool [ 13 ].

Fig. 3. Systematic mapping conduction phases. The numbers in the shaded areas indicate the quantity of studies involved

This paper reports the results obtained after the execution of two cycles of the systematic mapping phases. The first cycle was executed based on searches performed in January 2014. The second cycle was an update of the first cycle, with searches performed in February 2016 2 . A total of 3984 papers were found using the search expression in the five digital libraries. In the selection phase, 725 duplicated studies were identified and 1566 papers were rejected according to the exclusion criteria, mainly based on their title and abstract. Most of the rejected papers match the last exclusion criterion ( Studies that do not deal with text mining and text semantics ). Among them, we can find studies that deal with multimedia data (images, videos, or audio) and with construction, description, or annotation of corpora.

After the selection phase, 1693 studies were accepted for the information extraction phase. In this phase, information about each study was extracted mainly based on the abstracts, although some information was extracted from the full text. The results of mapping the accepted papers are presented in the next section.
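The reported counts are mutually consistent, as a quick check shows. The variable names are only labels for the figures stated in the text.

```python
identified = 3984  # papers returned by the searches in the five digital libraries
duplicated = 725   # duplicated studies identified in the selection phase
rejected = 1566    # papers rejected according to the exclusion criteria

# Studies remaining for the information extraction phase.
accepted = identified - duplicated - rejected
print(accepted)  # 1693
```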

Results and discussion

The mapping reported in this paper was conducted with the general goal of providing an overview of the research developed by the text mining community that is concerned with text semantics. This mapping is based on 1693 studies selected as described in the previous section. The distribution of these studies by publication year is presented in Fig. 4 . We can note that text semantics has been addressed more frequently in recent years, when a higher number of text mining studies showed some interest in text semantics. The peak was in 2011, with 223 identified studies. The lower number of studies in the year 2016 can be attributed to the fact that the last searches were conducted in February 2016.

Fig. 4. Distribution of the 1693 accepted studies by publication year. Searches for study identification were executed in January 2014 and February 2016

The results of the systematic mapping study are presented in the following subsections. We start our report by presenting, in the “ Surveys ” section, a discussion of the eighteen secondary studies (surveys and reviews) that were identified in the systematic mapping. Then, each following section from “ Application domains ” to “ User’s interaction ” is related to a secondary research question that guided our study, i.e., application domains, languages, external knowledge sources, text mining tasks, methods and algorithms, representation model, and user’s interaction. In the “ Systematic mapping summary and future trends ” section, we present a consolidation of our results and point out some gaps in both primary and secondary studies.

Some studies accepted in this systematic mapping are cited throughout the presentation of our mapping. To keep the reporting of the results clear, we do not present the reference of every accepted paper.

Surveys

In this systematic mapping, we identified 18 survey papers associated with the theme of text mining and semantics [ 14 – 31 ]. Each paper explores some particularity of this broad theme. In the following, we present a short overview of these papers, which is based on the full text of the papers.

Grobelnik [ 14 ] presents, briefly but very clearly, an interesting discussion of text processing in his three-page paper. The author organizes the field in three main dimensions, which can be used to classify text processing approaches: representation, technique, and task. The task dimension concerns the kinds of problems we solve through text processing. Document search, clustering, classification, summarization, trend detection, and monitoring are examples of tasks. Considering how text representations are manipulated (technique dimension), we have the methods and algorithms that can be used, including machine learning algorithms, statistical analysis, part-of-speech tagging, semantic annotation, and semantic disambiguation. In the representation dimension, we can find different options for text representation, such as words, phrases, bag-of-words, part-of-speech, subject-predicate-object triples, and semantically annotated triples.

Grobelnik [ 14 ] also presents the levels of text representation, which differ from each other in processing complexity and expressiveness. The simplest level is the lexical level, which includes the common bag-of-words and n-grams representations. The next level is the syntactic level, which includes representations based on word co-location or part-of-speech tags. The most complete representation level is the semantic level, which includes the representations based on word relationships, such as ontologies. Several different research fields deal with text, such as text mining, computational linguistics, machine learning, information retrieval, semantic web, and crowdsourcing. Grobelnik [ 14 ] states the importance of an integration of these research areas in order to reach a complete solution to the problem of text understanding.
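As a small illustration of the lexical level, the sketch below builds word n-grams for sentences 1 and 2 of the Introduction. It is not taken from Grobelnik [ 14 ]; the `ngrams` helper and its tokenization are assumptions made for the example.

```python
def ngrams(sentence: str, n: int) -> set:
    """Return the set of word n-grams of a sentence (lowercased, final period dropped)."""
    tokens = sentence.lower().rstrip(".").split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


s1 = "Company A acquired Company B."
s2 = "Company B acquired Company A."

unigrams_equal = ngrams(s1, 1) == ngrams(s2, 1)  # same vocabulary in both sentences
bigram_diff = ngrams(s1, 2) ^ ngrams(s2, 2)      # bigrams found in only one sentence
```

The unigram sets are identical, whereas the bigram sets differ in the pair that precedes the verb, which shows that n-grams keep a little local word order while remaining a purely lexical representation.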

Stavrianou et al. [ 15 ] present a survey of semantic issues of text mining, which originate from natural language particularities. This is a good survey focused on a linguistic point of view, rather than focusing only on statistics. The authors discuss a series of questions concerning natural language issues that should be considered when applying the text mining process. Most of the questions are related to text pre-processing, and the authors present the impacts of performing or not performing some pre-processing activities, such as stopword removal, stemming, word sense disambiguation, and tagging. The authors also discuss some existing text representation approaches in terms of features, representation model, and application task. A set of different approaches to measure the similarity between documents is also presented, categorizing the similarity measures by type (statistical or semantic) and by unit (words, phrases, vectors, or hierarchies).
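As an example of a statistical similarity measure whose unit is a vector, the sketch below computes the cosine similarity between term-frequency vectors. Cosine similarity is chosen here only as a familiar representative; it is an assumption of this illustration, not a measure singled out by the survey.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two documents."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


same_words = cosine_similarity("company a acquired company b", "company b acquired company a")
synonym = cosine_similarity("company a acquired company b", "company a purchased company b")
```

The first pair scores the maximum similarity despite having opposite meanings, while the second pair scores lower despite describing the same event, which is the kind of limitation that motivates semantic similarity measures.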

Stavrianou et al. [ 15 ] also present the relation between ontologies and text mining. Ontologies can be used as background knowledge in a text mining process, and the text mining techniques can be used to generate and update ontologies. The authors conclude the survey stating that text mining is an open research area and that the objectives of the text mining process must be clarified before starting the data analysis, since the approaches must be chosen according to the requirements of the task being performed.

Methods that deal with latent semantics are reviewed in the study of Daud et al. [ 16 ]. The authors present a chronological analysis from 1999 to 2009 of directed probabilistic topic models, such as probabilistic latent semantic analysis, latent Dirichlet allocation, and their extensions. The models are classified according to their main functionality. They describe their advantages, disadvantages, and applications.

Wimalasuriya and Dou [ 17 ], Bharathi and Venkatesan [ 18 ], and Reshadat and Feizi-Derakhshi [ 19 ] consider the use of external knowledge sources (e.g., ontology or thesaurus) in the text mining process, each one dealing with a specific task. Wimalasuriya and Dou [ 17 ] present a detailed literature review of ontology-based information extraction. The authors define the recent information extraction subfield, named ontology-based information extraction (OBIE), identifying key characteristics of the OBIE systems that differentiate them from general information extraction systems. Besides, they identify a common architecture of the OBIE systems and classify existing systems along different dimensions, such as the information extraction method applied, whether the system constructs and updates the ontology, the components of the ontology, and the type of documents the system deals with. Bharathi and Venkatesan [ 18 ] present a brief description of several studies that use external knowledge sources as background knowledge for document clustering. Reshadat and Feizi-Derakhshi [ 19 ] present several semantic similarity measures based on external knowledge sources (especially WordNet and MeSH) and a review of comparison results from previous studies.

Schiessl and Bräscher [ 20 ] and Cimiano et al. [ 21 ] review the automatic construction of ontologies. Schiessl and Bräscher [ 20 ], the only identified review written in Portuguese, formally define the term ontology and discuss the automatic building of ontologies from texts. The authors state that automatic ontology building from texts is the way to the timely production of ontologies for current applications and that many questions are still open in this field. Also on the theme of automatic building of ontologies from texts, Cimiano et al. [ 21 ] argue that automatically learned ontologies might not meet the demands of many possible applications, although they can already benefit several text mining tasks. The authors divide the ontology learning problem into seven tasks and discuss their developments. They state that the ontology population task seems to be easier than the tasks of learning the ontology schema.

Jovanovic et al. [ 22 ] discuss the task of semantic tagging in their paper directed at IT practitioners. Semantic tagging can be seen as an expansion of the named entity recognition task, in which the entities are identified, disambiguated, and linked to a real-world entity, normally using an ontology or knowledge base. The authors compare 12 semantic tagging tools and present some characteristics that should be considered when choosing such tools.

Specifically for the task of irony detection, Wallace [ 23 ] presents both philosophical formalisms and machine learning approaches. The author argues that a model of the speaker is necessary to improve current machine learning methods and enable their application in a general problem, independently of domain. He discusses the gaps of current methods and proposes a pragmatic context model for irony detection.

The application of text mining methods in information extraction of biomedical literature is reviewed by Winnenburg et al. [ 24 ]. The paper describes the state-of-the-art text mining approaches for supporting manual text annotation, such as ontology learning, named entity and concept identification. They also describe and compare biomedical search engines, in the context of information retrieval, literature retrieval, result processing, knowledge retrieval, semantic processing, and integration of external tools. The authors argue that search engines must also be able to find results that are indirectly related to the user’s keywords, considering the semantics and relationships between possible search results. They point out that a good source for synonyms is WordNet.

Leser and Hakenberg [ 25 ] present a survey of biomedical named entity recognition. The authors present the difficulties of both identifying entities (like genes, proteins, and diseases) and evaluating named entity recognition systems. They describe some annotated corpora and named entity recognition tools and state that the lack of corpora is an important bottleneck in the field.

Dagan et al. [ 26 ] introduce a special issue of the Journal of Natural Language Engineering on textual entailment recognition, which is a natural language task that aims to identify if a piece of text can be inferred from another. The authors present an overview of relevant aspects in textual entailment, discussing four PASCAL Recognising Textual Entailment (RTE) Challenges. They state that the systems submitted to those challenges use cross-pair similarity measures, machine learning, and logical inference. The authors also describe tools, resources, and approaches commonly used in textual entailment tasks and conclude with the perspective that, in the future, the constructed entailment “engines” will be used as a basic module by text-understanding applications.

Irfan et al. [ 27 ] present a survey on the application of text mining methods in social network data. They present an overview of pre-processing, classification and clustering techniques to discover patterns from social networking sites. They point out that the application of text mining techniques can reveal patterns related to people’s interaction behaviors. The authors present two basic pre-processing activities: feature extraction and feature selection. The authors also review classification and clustering approaches. They present different machine learning algorithms and discuss the importance of ontology usage to introduce explicit concepts, descriptions, and the semantic relationships among concepts. Irfan et al. [ 27 ] identify the main challenges related to the manipulation of social network texts (such as large data, data with impurities, dynamic data, emotions interpretations, privacy, and data confidence) and to text mining infrastructure (such as usage of cloud computing and improvement of the usability of text mining methods).

In the context of the semantic web, Sheth et al. [ 28 ] define three types of semantics: implicit semantics, formal semantics, and powerful (or soft) semantics. Implicit semantics are those implicitly present in data patterns and not explicitly represented in any machine-processable syntax. Machine learning methods exploit this type of semantics. Formal semantics are those represented in some well-formed syntactic form and are machine-processable. Powerful semantics are the sort of semantics that allow uncertainty (that is, the representation of degree of membership and degree of certainty) and, therefore, allow abductive or inductive reasoning. The authors also correlate the types of semantics with some core capabilities required by a practical semantic web application. The authors conclude their review asserting the importance of focusing research efforts on representation mechanisms for powerful semantics in order to move towards the development of semantic applications.

The formal semantics defined by Sheth et al. [ 28 ] is commonly represented by description logics, a formalism for knowledge representation. The application of description logics in natural language processing is the theme of the brief review presented by Cheng et al. [ 29 ].

The broad field of computational linguistics is presented by Martinez and Martinez [ 30 ]. Considering areas of computational linguistics that can be interesting to statisticians, the authors describe three main aspects of computational linguistics: formal language, information retrieval, and machine learning. The authors present common models for knowledge representation, addressing their statistical characteristics and providing an overview of information retrieval and machine learning methods related to computational linguistics. They describe some of the major statistical contributions to the areas of machine learning and computational linguistics, from the point of view of classification and clustering algorithms. Martinez and Martinez [ 30 ] emphasize that machine translation, part-of-speech tagging, word sense disambiguation, and text summarization are some of the identified applications to which statisticians can contribute.

Bos [ 31 ] presents an extensive survey of computational semantics, a research area focused on computationally understanding human language in written or spoken form. He discusses how to represent semantics in order to capture the meaning of human language, how to construct these representations from natural language expressions, and how to draw inferences from the semantic representations. The author also discusses the generation of background knowledge, which can support reasoning tasks. Bos [ 31 ] indicates machine learning, knowledge resources, and scaling inference as topics that can have a big impact on computational semantics in the future.

As presented in this section, the reviewed secondary studies explore specific issues of semantics-concerned text mining research. In contrast to them, this paper reviews a broader range of text mining studies that deal with semantic aspects. To the best of our knowledge, this is the first report of a mapping of this field. We present the results of our systematic mapping study in the following sections, organized in seven dimensions of the text mining studies derived from our secondary research questions: application domains, languages, external knowledge usage, tasks, methods and algorithms, representation model, and user’s interaction.

Application domains

Research question:

Figure 5 presents the domains where text semantics is most present in text mining applications. Health care and life sciences is the domain that stands out when talking about text semantics in text mining applications. This fact is not unexpected, since the life sciences have long been concerned with the standardization of vocabularies and taxonomies. The building of taxonomies and ontologies is such a common practice in health care and life sciences that the World Wide Web Consortium (W3C) has an interest group specifically for developing, evaluating, and supporting semantic web technologies for this field [ 32 ]. Among the most common problems treated through the use of text mining in health care and life sciences is information retrieval from publications of the field. The search engine PubMed [ 33 ] and the MEDLINE database are the main text sources among these studies. There are also studies related to the extraction of events, genes, proteins and their associations [ 34 – 36 ], detection of adverse drug reactions [ 37 ], and the extraction of cause-effect and disease-treatment relations [ 38 – 40 ].

Application domains identified in the literature mapping accepted studies

The second most frequently identified application domain is the mining of web texts, comprising web pages, blogs, reviews, web forums, social media, and email filtering [ 41 – 46 ]. The high interest in extracting knowledge from web texts can be justified by the large amount and diversity of text available and by the difficulty of manual analysis. Nowadays, anyone can create content on the web, either to share an opinion about some product or service or to report something that is taking place in his/her neighborhood. Companies, organizations, and researchers are aware of this fact, so they are increasingly interested in using this information in their favor. Some competitive advantages that businesses can gain from the analysis of social media texts are presented in [ 47 – 49 ]. The authors developed case studies demonstrating how text mining can be applied in social media intelligence. From our systematic mapping data, we found that Twitter is the most popular source of web texts and its posts are commonly used for sentiment analysis or event extraction.

Besides the top two application domains, other domains that show up in our mapping refer to the mining of specific types of texts. We found research studies on mining news, scientific paper corpora, patents, and texts with economic and financial content.

Languages

Whether using machine learning or statistical techniques, text mining approaches are usually language independent. However, especially in the natural language processing field, annotated corpora are often required to train models to solve a given task for each specific language (the semantic role labeling problem is an example). Besides, linguistic resources such as semantic networks or lexical databases, which are language-specific, can be used to enrich textual data. Most of the available resources are for English. Thus, the scarcity of annotated data or linguistic resources can be a bottleneck when working with another language. There are important initiatives to foster research on other languages; an example is the ACM Transactions on Asian and Low-Resource Language Information Processing [ 50 ], an ACM journal dedicated to that subject.

In this study, we identified the languages that were mentioned in paper abstracts. The collected data are summarized in Fig. 6 . We must note that English can be seen as a standard language in scientific publications; thus, papers whose results were tested only on English datasets may not mention the language, as in [ 51 – 56 ]. Besides, we can find some studies that do not use any linguistic resource and thus are language independent, as in [ 57 – 61 ]. These facts can explain why English was mentioned in only 45.0% of the considered studies.

Languages identified in the literature mapping accepted studies

Chinese is the second most mentioned language (26.4% of the studies reference the Chinese language). Wu et al. [ 62 ] point out two differences between English and Chinese: in Chinese, there are no white spaces between words in a sentence, and there is a higher number of frequent words (the number of frequent words in Chinese is more than twice the number of English frequent words). These characteristics motivate the development of methods and experimental evaluations specifically for Chinese.
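
To make the first difference concrete, the sketch below contrasts whitespace tokenization of English with a dictionary-based segmentation of Chinese. It uses forward maximum matching, one classic segmentation heuristic chosen here purely for illustration; it is not claimed to be the method of Wu et al. [ 62 ] or of any mapped study. The function name `forward_max_match` and the tiny dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# English: white space already marks word boundaries.
english_tokens = "I like natural language processing".split()

# Chinese: boundaries must be inferred. Toy dictionary, an assumption for this sketch.
toy_dictionary = {"我", "喜欢", "自然", "语言", "处理", "自然语言"}
chinese_tokens = forward_max_match("我喜欢自然语言处理", toy_dictionary)

print(english_tokens)
print(chinese_tokens)
```

The result depends entirely on the dictionary, which is one reason language-specific resources matter for this kind of pre-processing.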

This mapping shows that there is a lack of studies considering languages other than English or Chinese. The low number of studies considering other languages suggests that there is a need for construction or expansion of language-specific resources (as discussed in the “ External knowledge sources ” section). These resources can be used for enrichment of texts and for the development of language-specific methods, based on natural language processing.

External knowledge sources

Text mining initiatives can benefit from using external sources of knowledge. Thesauruses, taxonomies, ontologies, and semantic networks are knowledge sources that are commonly used by the text mining community. A semantic network is a network whose nodes are concepts that are linked by semantic relations. The most popular example is WordNet [ 63 ], an electronic lexical database developed at Princeton University. Depending on its usage, WordNet can also be seen as a thesaurus or a dictionary [ 64 ].

There is no complete definition for the terms thesaurus, taxonomy, and ontology that is unanimously accepted by all research areas. Weller [ 65 ] presents an interesting discussion about the term ontology , including its origin and proposed definitions. She concluded the discussion stating that: “Ontologies should unambiguously represent shared background knowledge that helps people within a community of interest to understand each other. And they should make computer-readable indexing of information possible on the Web” [ 65 ]. The same can be said about thesauruses and taxonomies. In a general way, thesauruses, taxonomies, and ontologies are normally specialized in a specific domain and they usually differ from each other by their degree of expressiveness and complexity in their relational constructions [ 66 ]. Ontology would be the most expressive type of knowledge representation, having the most complex relations and formalized construction.

When looking at the external knowledge sources used in semantics-concerned text mining studies (Fig. 7 ), WordNet is the most used source. This lexical resource is cited by 29.9% of the studies that use information beyond the text data. WordNet can be used to create or expand the current set of features for subsequent text classification or clustering. The use of features based on WordNet has produced mixed results, improving performance in some studies but not in others [ 55 , 67 – 69 ]. Besides, WordNet can support the computation of semantic similarity [ 70 , 71 ] and the evaluation of the discovered knowledge [ 72 ].
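
As a rough illustration of these two uses, the sketch below expands a token list with hypernyms and computes a simple path-based similarity. The `HYPERNYM` table is a hypothetical toy hierarchy written for this example; it is not real WordNet data, and the function names are assumptions rather than the API of any WordNet library.

```python
# Hypothetical toy is-a hierarchy standing in for a lexical database (NOT real WordNet data).
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def expand_features(tokens):
    """Feature expansion: append the direct hypernym of every known token."""
    return tokens + [HYPERNYM[t] for t in tokens if t in HYPERNYM]


def path_similarity(a, b):
    """1 / (1 + number of is-a edges on the shortest path between a and b)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + steps_a + chain_b.index(node))
    return 0.0  # no common ancestor in the toy hierarchy


print(expand_features(["dog", "barks"]))
print(path_similarity("dog", "wolf"), path_similarity("dog", "cat"))
```

In this toy hierarchy "dog" is closer to "wolf" than to "cat", which a plain word-matching representation cannot express.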

External sources identified in the literature mapping accepted studies

The second most used source is Wikipedia [ 73 ], which covers a wide range of subjects and has the advantage of presenting the same concept in different languages. Wikipedia concepts, as well as their links and categories, are also useful for enriching text representation [ 74 – 77 ] or classifying documents [ 78 – 80 ]. Medelyan et al. [ 81 ] present the value of Wikipedia and discuss how the research community is making use of it in natural language processing tasks (especially word sense disambiguation), information retrieval, information extraction, and ontology building.

The use of Wikipedia is followed by the use of the Chinese-English knowledge database HowNet [ 82 ]. Finding HowNet among the most used external knowledge sources is not surprising, since Chinese is one of the most cited languages in the studies selected in this mapping (see the “ Languages ” section). Like WordNet, HowNet is usually used for feature expansion [ 83 – 85 ] and computing semantic similarity [ 86 – 88 ].

Web pages are also used as external sources [ 89 – 91 ]. Normally, web search results are used to measure similarity between terms. We also found some studies that use SentiWordNet [ 92 ], which is a lexical resource for sentiment analysis and opinion mining [ 93 , 94 ]. Among other external sources, we can find knowledge sources related to Medicine, like the UMLS Metathesaurus [ 95 – 98 ], MeSH thesaurus [ 99 – 102 ], and the Gene Ontology [ 103 – 105 ].

Text mining tasks

The distribution of text mining tasks identified in this literature mapping is presented in Fig. 8 . Classification and clustering are the most frequent tasks. Classification corresponds to the task of finding a model from examples with known classes (labeled instances) in order to predict the classes of new examples. On the other hand, clustering is the task of grouping examples (whose classes are unknown) based on their similarities. Classification was identified in 27.4% and clustering in 17.0% of the studies. As these are basic text mining tasks, they are often the basis of other more specific text mining tasks, such as sentiment analysis and automatic ontology building. Therefore, it was expected that classification and clustering would be the most frequently applied tasks.
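
The difference between the two tasks can be sketched in a few lines. The example below is only illustrative: the classifier is a 1-nearest-neighbour rule over labeled examples, and the clustering is a greedy single-pass grouping by cosine similarity, a deliberately simple stand-in rather than K-means or any algorithm from the mapped studies. All texts, labels, and the `threshold` parameter are made up for the sketch.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words representation as a word -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Classification: classes of the training examples are known.
LABELED = [
    ("the gene encodes a protein", "biology"),
    ("protein binding in the cell", "biology"),
    ("stock prices fell sharply", "finance"),
    ("the bank raised interest rates", "finance"),
]


def classify(text):
    """Predict the class of a new text from its most similar labeled example."""
    vec = bow(text)
    best_text, best_label = max(LABELED, key=lambda ex: cosine(vec, bow(ex[0])))
    return best_label


# Clustering: no classes are given; texts are grouped by similarity alone.
def cluster(texts, threshold=0.2):
    """Put each text in the first group whose first member is similar enough."""
    groups = []
    for text in texts:
        for group in groups:
            if cosine(bow(text), bow(group[0])) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


print(classify("a new protein was found"))
print(cluster(["gene protein cell", "protein cell membrane",
               "stock market prices", "market prices rise"]))
```

Note that the classifier needs labels to exist beforehand, whereas the clustering output carries no class names at all; interpreting the groups is left to the analyst.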

Text mining tasks identified in the literature mapping accepted studies

Besides classification and clustering, we can note that semantic concerns are present in tasks such as information extraction [ 106 – 108 ], information retrieval [ 109 – 111 ], sentiment analysis [ 112 – 115 ], and automatic ontology building [ 116 , 117 ], as well as the pre-processing step itself [ 118 , 119 ].

Methods and algorithms

A word cloud 3 of methods and algorithms identified in this literature mapping is presented in Fig. 9 , in which the font size reflects the frequency of the methods and algorithms among the accepted papers. We can note that the most common approach deals with latent semantics through Latent Semantic Indexing (LSI) [ 2 , 120 ], a method that can be used for data dimension reduction and that is also known as latent semantic analysis. The low-dimensional space produced by LSI is also called semantic space. In this semantic space, alternative forms expressing the same concept are projected to a common representation. This reduces the noise caused by synonymy and polysemy; thus, LSI deals with text semantics latently. Another technique in this direction that is commonly used for topic modeling is latent Dirichlet allocation (LDA) [ 121 ]. The topic model obtained by LDA has been used for representing text collections as in [ 58 , 122 , 123 ].
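
For readers unfamiliar with LSI, its usual formulation is a truncated singular value decomposition of the term-document matrix. The notation below is ours, introduced only for this sketch: X is the m × n term-document matrix, k is the chosen number of latent dimensions, U_k and V_k hold the first k left and right singular vectors, Σ_k is the diagonal matrix of the k largest singular values, d is the m-dimensional term vector of a document, and d̂ is its k-dimensional representation in the semantic space.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

Because terms that tend to occur in the same documents load on the same singular vectors, documents using different words for the same concept can end up close to each other in the k-dimensional space.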

Word cloud of methods and algorithms identified in the literature mapping studies. To enable a better reading of the word cloud, the frequency of the methods and algorithms higher than one was rounded up to the nearest ten (for example, a method applied in 75 studies is represented in the word cloud in a word size corresponding to the frequency 80)

Beyond latent semantics, the use of concepts or topics found in the documents is also a common approach. The concept-based semantic exploitation is normally based on external knowledge sources (as discussed in the “ External knowledge sources ” section) [ 74 , 124 – 128 ]. As an example, explicit semantic analysis [ 129 ] relies on Wikipedia to represent the documents by a concept vector. In a similar way, Spanakis et al. [ 125 ] improved hierarchical clustering quality by using a text representation based on concepts and other Wikipedia features, such as links and categories.

The issue of text ambiguity has also been the focus of studies. Word sense disambiguation can contribute to a better document representation. It is normally based on external knowledge sources and can also be based on machine learning methods [ 36 , 130 – 133 ].

Other approaches include analysis of verbs in order to identify relations on textual data [ 134 – 138 ]. However, the proposed solutions are normally developed for a specific domain or are language dependent.

In Fig. 9 , we can observe the predominance of traditional machine learning algorithms, such as Support Vector Machines (SVM), Naive Bayes, K-means, and k-Nearest Neighbors (KNN), in addition to artificial neural networks and genetic algorithms. The application of natural language processing (NLP) methods is also frequent. Among these methods, we can find named entity recognition (NER) and semantic role labeling. This shows that there is a concern about developing richer text representations to be input for traditional machine learning algorithms, as we can see in the studies of [ 55 , 139 – 142 ].

Text representation models

The most popular text representation model is the vector space model. In this model, each document is represented by a vector whose dimensions correspond to features found in the corpus. When features are single words, the text representation is called bag-of-words. Despite the good results achieved with a bag-of-words, this representation, based on independent words, cannot express word relationships, text syntax, or semantics. Therefore, it is not a proper representation for all possible text mining applications.
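
A minimal sketch of the bag-of-words case makes the limitation visible. The three toy sentences and the helper names are assumptions made for this example; each document becomes a vector of word counts over the corpus vocabulary.

```python
def build_vocabulary(docs):
    """One dimension per distinct word found in the corpus, in sorted order."""
    return sorted({word for doc in docs for word in doc.lower().split()})


def to_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    words = doc.lower().split()
    return [words.count(term) for term in vocabulary]


docs = ["dog bites man", "man bites dog", "dog chases cat"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

The first two sentences describe opposite events but receive identical vectors, because word order and the relations between words are discarded by this representation.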

The use of richer text representations is the focus of several studies [ 62 , 79 , 97 , 143 – 148 ]. Most of the studies concentrate on proposing more elaborate features to represent documents in the vector space model, including the use of topic model techniques, such as LSI and LDA, to obtain latent semantic features. Deep learning [ 149 ] is currently applied to represent independent terms through their associated concepts, in an attempt to narrow the relationships between the terms [ 150 , 151 ]. The use of distributed word representations (word embeddings) can be seen in several works of this area in tasks such as classification [ 88 , 152 , 153 ], summarization [ 154 ], and information retrieval [ 155 ].

Besides the vector space model, there are text representations based on networks (or graphs), which can make use of some text semantic features. Network-based representations, such as bipartite networks and co-occurrence networks, can represent relationships between terms or between documents, which is not possible through the vector space model [ 147 , 156 – 158 ].
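
As one possible illustration, the sketch below builds a term co-occurrence network in which two terms are linked whenever they appear in the same document, and the edge weight counts the documents they share. The choice of the whole document as the co-occurrence window, the function name, and the toy corpus are assumptions for this example; the mapped studies use a variety of network constructions.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(docs):
    """Return {(term_a, term_b): weight}, where weight is the number of
    documents in which both terms appear (term_a < term_b alphabetically)."""
    edges = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        for term_a, term_b in combinations(terms, 2):
            edges[(term_a, term_b)] += 1
    return dict(edges)


docs = ["gene protein expression", "protein expression level", "gene mutation"]
network = cooccurrence_network(docs)

for (term_a, term_b), weight in sorted(network.items()):
    print(term_a, "--", term_b, "weight", weight)
```

Unlike a document vector, the edge list states explicitly which terms are related and how strongly, and it can be handed to standard graph algorithms.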

In addition to the text representation model, text semantics can also be incorporated into the text mining process through the use of external knowledge sources, like semantic networks and ontologies, as discussed in the “ External knowledge sources ” section.

User’s interaction

Text mining is a process to automatically discover knowledge from unstructured data. Nevertheless, it is also an interactive process, and there are some points where a user, normally a domain expert, can contribute to the process by providing his/her previous knowledge and interests. As an example, in the pre-processing step, the user can provide additional information to define a stoplist and support feature selection. In the pattern extraction step, user’s participation can be required when applying a semi-supervised approach. In the post-processing step, the user can evaluate the results according to the expected knowledge usage.

Despite the fact that the user would have an important role in a real application of text mining methods, there is not much investment in user’s interaction in text mining research studies. A probable reason is the difficulty inherent to an evaluation based on the user’s needs. In empirical research, researchers usually run several experiments in order to evaluate proposed methods and algorithms, which would require the involvement of several users, therefore making the evaluation impractical.

Less than 1% of the studies that were accepted in the first mapping cycle presented information about requiring some sort of user’s interaction in their abstract. To better analyze this question, in the mapping update performed in 2016, the full text of the studies was also considered. Figure 10 presents types of user’s participation identified in the literature mapping studies. The most common user’s interactions are the revision or refinement of text mining results [ 159 – 161 ] and the development of a standard reference, also called gold standard or ground truth, which is used to evaluate text mining results [ 162 – 165 ]. Besides that, users are also requested to manually annotate or provide a few labeled data [ 166 , 167 ] or to generate hand-crafted rules [ 168 , 169 ].

Types of user participation identified in the literature mapping accepted studies

Systematic mapping summary and future trends

How is semantics considered in text mining studies?

Semantics is an important component in natural language texts. Consequently, in order to improve text mining results, many text mining studies claim that their solutions treat or consider text semantics in some way. However, text mining is a wide research field and there is a lack of secondary studies that summarize and integrate the different approaches. How is semantics considered in text mining studies? Looking for the answer to this question, we conducted this systematic mapping based on 1693 studies, accepted among the 3984 studies identified in five digital libraries. In the previous subsections, we presented the mapping regarding each secondary research question. In this subsection, we present a consolidation of our results and point out some future trends of semantics-concerned text mining.

As previously stated, the objective of this systematic mapping is to provide a general overview of semantics-concerned text mining studies. The papers considered in this systematic mapping study, as well as the mapping results, are limited by the applied search expression and the research questions. It is not feasible to cover all published papers in this broad field. Therefore, readers may find that some studies they already know are missing from this systematic mapping report. It is not our objective to present a detailed survey of every specific topic, method, or text mining task. This systematic mapping is a starting point, and surveys with a narrower focus should be conducted for reviewing the literature of specific subjects, according to one’s interests.

The quantitative analysis of the scientific production by each text mining dimension (presented from the “ Application domains ” section to the “ User’s interaction ” section) confirmed some previous impressions that we had about our study subject and highlighted other interesting characteristics of the field. Text semantics is closely related to ontologies and other similar types of knowledge representation. We also know that health care and life sciences is traditionally concerned with the standardization of its concepts and concept relationships. Thus, as we already expected, health care and life sciences was the most cited application domain among the literature accepted studies. This application domain is followed by the Web domain, which can be explained by the constant growth, in both quantity and coverage, of Web content.

It was surprising to find the high presence of the Chinese language among the studies. Chinese is the second most cited language, and HowNet, a Chinese-English knowledge database, is the third most applied external source in semantics-concerned text mining studies. Looking at the languages addressed in the studies, we found that there is a lack of studies specific to languages other than English or Chinese. We also found substantial use of WordNet as an external knowledge source, followed by Wikipedia, HowNet, Web pages, SentiWordNet, and other knowledge sources related to Medicine.

Text classification and text clustering, as basic text mining tasks, are frequently applied in semantics-concerned text mining research. Among other more specific tasks, sentiment analysis is a recent research field that is almost as frequently addressed as information retrieval and information extraction, which are more consolidated research areas. SentiWordNet, a lexical resource for sentiment analysis and opinion mining, is already among the most used external knowledge sources.

The treatment of latent semantics, through the application of LSI, stands out when looking at methods and algorithms. Besides that, traditional text mining methods and algorithms, like SVM, KNN, and K-means, are frequently applied, and researchers tend to enhance the text representation by applying NLP methods or using external knowledge sources. Thus, text semantics can be incorporated into the text mining process mainly through two approaches: the construction of richer terms in the vector space representation model or the use of networks or graphs to represent semantic relations between terms or documents.

In real applications of the text mining process, the participation of domain experts can be crucial to its success. However, the participation of users (domain experts) is seldom explored in scientific papers. The difficulty inherent to the evaluation of a method based on user’s interaction is a probable reason for the lack of studies considering this approach.

The mapping indicates that there is space for secondary studies in areas that have a high number of primary studies, such as studies of feature enrichment for a better text representation in the vector space model; use of classification methods; use of clustering methods; and the use of latent semantics in text mining. A detailed literature review, such as the review of Wimalasuriya and Dou [ 17 ] (described in the “ Surveys ” section), would be worthwhile for organizing and summarizing these specific research subjects.

Considering the development of primary studies, we identified three main future trends: user’s interaction, non-English text processing, and graph-based representation. We expect an increase in the number of studies that have some level of user’s interaction to bring his/her needs and interests to the process. This is particularly valuable for the clustering task, because what is considered a good clustering solution can vary from user to user [ 170 ]. We also expect a rise in resources (linguistic resources and annotated corpora) for non-English languages. These resources are very important to the development of semantics-concerned text mining techniques. Higher availability of non-English resources will allow a higher number of studies dealing with these languages. Another future trend is the development and use of graph-based text representation. Nowadays, there is already important research in this direction, and we expect that it will increase as graph-based representations are more expressive than traditional representations in the vector space model.

As an alternative summary of this systematic mapping, additional visualizations of both the selected studies and systematic mapping results can be found online at http://sites.labic.icmc.usp.br/pinda_sm . For this purpose, the prototype of the Pinda tool was adapted for hierarchical visualization of the textual data, using the K-means algorithm to group the results. The tool allows the analysis of data (title + abstract of selected studies or information extracted from them) through multiple visualization techniques (Thumbnail, Snippets, Directories, Scatterplot, Treemap, and Sunburst), coordinating the user’s interactions for a better understanding of existing relationships. Figure 11 illustrates the Scatterplot visualization of studies accepted in this systematic mapping. Some of the possible visualizations of the systematic mapping results are presented in Fig. 12 .

Scatterplot visualization of accepted studies of the systematic mapping

Directories and Treemap visualizations of the systematic mapping results

Text semantics is frequently addressed in text mining studies, since it has an important influence on text meaning. However, there is a lack of secondary studies that consolidate this research. This paper reported a systematic mapping study conducted to overview semantics-concerned text mining literature. The scope of this mapping is wide (3984 papers matched the search expression). Thus, due to limitations of time and resources, the mapping was mainly performed based on abstracts of papers. Nevertheless, we believe that our limitations do not have a crucial impact on the results, since our study has a broad coverage.

The main contributions of this work are (i) it presents a quantitative analysis of the research field; (ii) it followed a well-defined literature review protocol; (iii) it discusses the area regarding seven important text mining dimensions: application domain, language, external knowledge source, text mining task, method and algorithm, representation model, and user’s interaction; and (iv) the produced mapping can give a general summary of the subject and can be of great help for researchers working with semantics and text mining. Thus, this work filled a gap in the literature as, to the best of our knowledge, this is the first general literature review of this wide subject.

Although much research has been carried out in the text mining field, the processing of text semantics remains an open research problem. The field lacks secondary studies in areas that have a high number of primary studies, such as feature enrichment for a better text representation in the vector space model. Another highlight concerns a language-related issue. We found considerable differences in the number of studies among different languages, since 71.4% of the identified studies deal with English and Chinese. Thus, there is a lack of studies dealing with texts written in other languages. When considering semantics-concerned text mining, we believe that this gap can be filled with the development of good knowledge bases and natural language processing methods specific to these languages. Besides, the analysis of the impact of languages in semantics-concerned text mining is also an interesting open research question. A comparison among semantic aspects of different languages and their impact on the results of text mining techniques would also be interesting.

1 A simple search for “systematic review” on the Scopus database in June 2016 returned, by subject area, 130,546 Health Sciences documents (125,254 of them for Medicine) and only 5,539 Physical Sciences documents (1,328 of them for Computer Science). The coverage of Scopus publications is balanced between Health Sciences (32% of total Scopus publications) and Physical Sciences (29% of total Scopus publications).

2 It was not possible to perform the second cycle of searches in the ACM Digital Library because of a change in the interface of this search engine. However, it must be noted that only eight studies that were found only in this database were accepted in the first cycle. All other studies were also retrieved by other search engines (especially Scopus, which retrieved more than 89% of accepted studies).

3 Word cloud created with support of Wordle [ 171 ].

Miner G, Elder J, Hill T, Nisbet R, Delen D, Fast A (2012) Practical text mining and statistical analysis for non-structured text data applications. 1st edn. Academic Press, Boston.

Aggarwal CC, Zhai C (eds) (2012) Mining text data. Springer, Durham.

Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. EBSE Technical Report EBSE-2007-01. Keele University and Durham University Joint Report, Durham, UK.

Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: EASE 2008: Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering. EASE’08, 68–77. British Computer Society, Swinton, UK.

Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4): 571–583.

Kitchenham B, Pretorius R, Budgen D, Brereton OP, Turner M, Niazi M, et al (2010) Systematic literature reviews in software engineering – a tertiary study. Inf Softw Technol 52(8): 792–805.

Felizardo KR, Nakagawa EY, MacDonell SG, Maldonado JC (2014) A visual analysis approach to update systematic reviews. In: EASE’14: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, 4:1–4:10. ACM, New York.

Moghaddam FA, Lago P, Grosso P (2015) Energy-efficient networking solutions in cloud-based environments: a systematic literature review. ACM Comput Surv 47(4): 64:1–64:32.

Pedro RWD, Nunes FLS, Machado-Lima A (2013) Using grammars for pattern recognition in images: a systematic review. ACM Comput Surv 46(2): 26:1–26:34.

Pisani PH, Lorena AC (2013) A systematic review on keystroke dynamics. J Braz Comput Soc 19(4): 573–587.

Park DH, Kim HK, Choi IY, Kim JK (2012) A literature review and classification of recommender systems research. Expert Syst Appl 39(11): 10059–10072.

Khan K, Baharudin BB, Khan A, et al (2009) Mining opinion from text documents: a survey. In: DEST’09: Proceedings of the 3rd IEEE International Conference on Digital Ecosystems and Technologies, 217–222. IEEE.

Laboratory of Research on Software Engineering (LaPES) - StArt Tool. http://lapes.dc.ufscar.br/tools/start_tool . Accessed 8 June 2016.

Grobelnik M (2011) Many faces of text processing In: WIMS’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, 5. ACM.

Stavrianou A, Andritsos P, Nicoloyannis N (2007) Overview and semantic issues of text mining. SIGMOD Rec 36(3): 23–34.

Daud A, Li J, Zhou L, Muhammad F (2010) Knowledge discovery through directed probabilistic topic models: a survey. Front Comput Sci China 4(2): 280–301.

Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: an introduction and a survey of current approaches. J Inf Sci 36(3): 306–323.

Bharathi G, Venkatesan D (2012) Study of ontology or thesaurus based document clustering and information retrieval. J Eng Appl Sci 7(4): 342–347.

Reshadat V, Feizi-Derakhshi MR (2012) Studying of semantic similarity methods in ontology. Res J Appl Sci Eng Technol 4(12): 1815–1821.

Schiessl M, Bräscher M (2012) Do texto às ontologias: uma perspectiva para a ciência da informação. Ciência da Informação 40(2): 301–311.

Cimiano P, Völker J, Studer R (2006) Ontologies on demand?—a description of the state-of-the-art, applications, challenges and trends for ontology learning from text. Inf Wiss Prax 57(6-7): 315–320.

Jovanovic J, Bagheri E, Cuzzola J, Gasevic D, Jeremic Z, Bashash R (2014) Automated semantic tagging of textual content. IT Prof 16(6): 38–46.

Wallace BC (2015) Computational irony: a survey and new perspectives. Artif Intell Rev 43(4): 467–483.

Winnenburg R, Wächter T, Plake C, Doms A, Schroeder M (2008) Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform 9(6): 466–478.

Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 6(4): 357–369.

Dagan I, Dolan B, Magnini B, Roth D (2009) Recognizing textual entailment: rational, evaluation and approaches. Nat Lang Eng 15(04): i–xvii.

Irfan R, King CK, Grages D, Ewen S, Khan SU, Madani SA, et al. (2015) A survey on text mining in social networks. Knowl Eng Rev 30(02): 157–170.

Sheth A, Ramakrishnan C, Thomas C (2005) Semantics for the semantic web: the implicit, the formal and the powerful. Int J Semant Web Inf Syst 1(1): 1–18.

Cheng XY, Cheng C, Zhu Q (2011) The applications of description logics in natural language processing. Adv Mater Res 204: 381–386.

Martinez A, Martinez W (2015) At the interface of computational linguistics and statistics. Wiley Interdiscip Rev Comput Stat 7(4): 258–274.

Bos J (2011) A survey of computational semantics: representation, inference and knowledge in wide-coverage text understanding. Lang Linguist Compass 5(6): 336–366.

W3C - Semantic Web Health Care and Life Sciences Interest Group. https://www.w3.org/blog/hcls/ . Accessed 8 June 2016.

National Center for Biotechnology Information - PubMed. http://www.ncbi.nlm.nih.gov/pubmed/ . Accessed 8 June 2016.

Miwa M, Thompson P, McNaught J, Kell DB, Ananiadou S (2012) Extracting semantically enriched events from biomedical literature. BMC Bioinforma 13(1): 1–24.

Ravikumar KE, Liu H, Cohn JD, Wall ME, Verspoor K (2011) Pattern learning through distant supervision for extraction of protein-residue associations in the biomedical literature, vol. 2. pp 59–65. IEEE, Honolulu. http://ieeexplore.ieee.org/document/6147049/ .

Xia N, Lin H, Yang Z, Li Y (2011) Combining multiple disambiguation methods for gene mention normalization. Expert Syst Appl 38(7): 7994–7999.

Sarker A, Gonzalez G (2015) Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform 53: 196–207.

Wu JL, Yu LC, Chang PC (2012) Detecting causality from online psychiatric texts using inter-sentential language patterns. BMC Med Inform Dec Making 12(1): 1–10.

Abacha AB, Zweigenbaum P (2011) A hybrid approach for the extraction of semantic relations from MEDLINE abstracts. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 6609 LNCS(PART 2): 139–150.

Yu LC, Wu CH, Jang FL (2007) Psychiatric consultation record retrieval using scenario-based representation and multilevel mixture model. IEEE Trans Inf Technol Biomed 11(4): 415–427.

Musto C, Semeraro G, Lops P, Gemmis MD (2015) CrowdPulse: a framework for real-time semantic analysis of social streams. Inf Syst 54: 127–146.

García-Moya L, Kudama S, Aramburu MJ, Berlanga R (2013) Storing and analysing voice of the market data in the corporate data warehouse. Inf Syst Front 15(3): 331–349.

Eugenio BD, Green N, Subba R (2013) Detecting life events in feeds from Twitter In: ICSC 2013: Proceedings of the IEEE Seventh International Conference on Semantic Computing, 274–277. IEEE, Irvine. http://ieeexplore.ieee.org/document/6693529/ .

Torunoglu D, Telseren G, Sagturk O, Ganiz MC (2013) Wikipedia based semantic smoothing for twitter sentiment classification In: INISTA 2013: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications, 1–5. IEEE, Albena.

Cao Q, Duan W, Gan Q (2011) Exploring determinants of voting for the “helpfulness” of online user reviews: a text mining approach. Decis Support Syst 50(2): 511–521.

Levi A, Mokryn O, Diot C, Taft N (2012) Finding a needle in a haystack of reviews: cold start context-based hotel recommender system In: RecSys’12: Proceedings of the sixth ACM Conference on Recommender Systems, 115–122. ACM, New York.

He W, Shen J, Tian X, Li Y, Akula V, Yan G, et al (2015) Gaining competitive intelligence from social media data: evidence from two largest retail chains in the world. Ind Manag Data Syst 115(9): 1622–1636.

He W, Tian X, Chen Y, Chong D (2016) Actionable social media competitive analytics for understanding customer experiences. J Comput Inf Syst 56(2): 145–155.

Tian X, He W, Tao R, Akula V (2016) Mining online hotel reviews: a case study from hotels in China In: AMCIS 2016: Proceedings of the 22nd Americas Conference on Information Systems, 1–8.

ACM - Asian and Low-Resource Language Information Processing (TALLIP). http://tallip.acm.org/ . Accessed 8 June 2016.

Chen CL, Liu CL, Chang YC, Tsai HP (2011) Mining opinion holders and opinion patterns in US financial statements In: TAAI 2011: Proceedings of the International Conference on Technologies and Applications of Artificial Intelligence, 62–68. IEEE, Chung-Li.

Chen J, Liu J, Yu W, Wu P (2009) Combining lexical stability and improved lexical chain for unsupervised word sense disambiguation In: KAM’09: Proceedings of the Second International Symposium on Knowledge Acquisition and Modeling, 430–433. IEEE, Wuhan. http://ieeexplore.ieee.org/document/5362135/ .

Rusu D, Fortuna B, Grobelnik M, Mladenic D (2009) Semantic graphs derived from triplets with application in document summarization. Informatica (Slovenia) 33(3): 357–362.

Krachina O, Raskin V, Triezenberg K (2007) Reconciling privacy policies and regulations: ontological semantics perspective In: Human Interface and the Management of Information. Interacting in Information Environments, 730–739. Springer, Berlin.

Mansuy T, Hilderman RJ (2006) A characterization of WordNet features in Boolean models for text classification In: AusDM 2006: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, 103–109. Australian Computer Society, Inc, Darlinghurst.

Ciaramita M, Gangemi A, Ratsch E, Šaric J, Rojas I (2005) Unsupervised learning of semantic relations between concepts of a molecular biology ontology In: IJCAI’05: Proceedings of the 19th International Joint Conference on Artificial Intelligence, 659–664. Morgan Kaufmann Publishers Inc., San Francisco, CA.

Kim K, Chung BS, Choi Y, Lee S, Jung JY, Park J (2014) Language independent semantic kernels for short-text classification. Expert Syst Appl 41(2): 735–743.

Gujraniya D, Murty MN (2012) Efficient classification using phrases generated by topic models In: ICPR 2012: Proceedings of the 21st International Conference on Pattern Recognition, 2331–2334. IEEE, Tsukuba.

Du C, Zhuang F, He Q, Shi Z (2012) Multi-task semi-supervised semantic feature learning for classification In: ICDM 2012: Proceedings of the IEEE 12th International Conference on Data Mining, 191–200. IEEE, Brussels. http://ieeexplore.ieee.org/document/6413903/ .

Wu Q, Zhang C, Deng X, Jiang C (2011) LDA-based model for topic evolution mining on text In: ICCSE 2011: Proceedings of the 6th International Conference on Computer Science & Education, 946–949. IEEE, Singapore.

Lu X, Zheng B, Velivelli A, Zhai C (2006) Enhancing text categorization with semantic-enriched representation and training data augmentation. J Am Med Inform Assoc 13(5): 526–535.

Wu J, Dang Y, Pan D, Xuan Z, Liu Q (2010) Textual knowledge representation through the semantic-based graph structure in clustering applications In: HICSS 2010: Proceedings of the 43rd Hawaii International Conference on System Sciences, 1–8. IEEE, Washington.

Princeton University - WordNet. http://wordnet.princeton.edu/ . Accessed 8 June 2016.

Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press, Cambridge.

Weller K (2010) Knowledge representation in the social semantic web. Walter de Gruyter.

Weller K, et al (2007) Folksonomies and ontologies: two new players in indexing and knowledge representation In: Proceedings of the Online Information Conference, 108–115.

Wei TA, Lu YC, Chang HB, Zhou QA, Bao XD (2015) A semantic approach for text clustering using WordNet and lexical chains. Expert Syst Appl 42(4): 2264–2275.

Li J, Zhao Y, Liu B (2009) Fully automatic text categorization by exploiting WordNet In: Information Retrieval Technology, 1–12. Springer, Berlin.

Mansuy TN, Hilderman RJ (2006) Evaluating WordNet features in text classification models In: FLAIRS Conference 2006: Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, 568–573. AAAI PRESS, Florida.

Shin Y, Ahn Y, Kim H, Lee SG (2015) Exploiting synonymy to measure semantic similarity of sentences In: IMCOM ’15: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, 40:1–40:4. ACM, New York.

Batet M, Valls A, Gibert K (2010) Performance of ontology-based semantic similarities in clustering In: Artificial Intelligence and Soft Computing, 281–288. Springer, Berlin.

Basu S, Mooney RJ, Pasupuleti KV, Ghosh J (2001) Evaluating the novelty of text-mined rules using lexical knowledge In: KDD’01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 233–238. ACM, San Francisco.

Wikipedia. https://www.wikipedia.org/ . Accessed 8 June 2016.

Kim HJA, Hong KJA, Chang JYb (2015) Semantically enriching text representation model for document clustering In: Proceedings of the ACM Symposium on Applied Computing, 922–925. ACM, New York. http://dl.acm.org.ez67.periodicos.capes.gov.br/citation.cfm?id=2696055 .

Yun J, Jing L, Yu J, Huang H (2011) Unsupervised feature weighting based on local feature relatedness In: Advances in Knowledge Discovery and Data Mining, 38–49. Springer, Berlin.

Gabrilovich E, Markovitch S (2009) Wikipedia-based semantic interpretation for natural language processing. J Artif Intell Res 34: 443–498.

Hu X, Zhang X, Lu C, Park EK, Zhou X (2009) Exploiting Wikipedia as external knowledge for document clustering In: KDD’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 389–396. ACM, New York.

Mizzaro S, Pavan M, Scagnetto I, Valenti M (2014) Short text categorization exploiting contextual enrichment and external knowledge In: Proceedings of the First International Workshop on Social Media Retrieval and Analysis, 57–62. ACM, New York.

Janik M, Kochut KJ (2008) Wikipedia in action: ontological knowledge in text categorization In: ICSC 2008: Proceedings of the International Conference on Semantic Computing, 268–275. IEEE, Santa Monica.

Chang MW, Ratinov LA, Roth D, Srikumar V (2008) Importance of semantic representation: dataless classification In: AAAI-08: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 830–835.

Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Human-Computer Stud 67(9): 716–754.

HowNet Knowledge Database. http://www.keenage.com/ . Accessed 8 June 2016.

Jin CX, Zhou HY, Bai QC (2012) Short text clustering algorithm with feature keyword expansion. Adv Mater Res 532: 1716–1720.

Liu Z, Yu W, Chen W, Wang S, Wu F (2010) Short text feature selection for micro-blog mining In: CiSE 2010: Proceedings of the International Conference on Computational Intelligence and Software Engineering, 1–4. IEEE, Wuhan.

Hu P, He T, Ji D, Wang M (2004) A study of Chinese text summarization using adaptive clustering of paragraphs In: CIT’04: Proceedings of the Fourth International Conference on Computer and Information Technology, 1159–1164. IEEE, Wuhan.

Zhu ZY, Dong SJ, Yu CL, He J (2011) A text hybrid clustering algorithm based on HowNet semantics. Key Eng Mater 474: 2071–2078.

Zheng D, Liu H, Zhao T (2011) Search results clustering based on a linear weighting method of similarity In: IALP 2011: Proceedings of the International Conference on Asian Language Processing, 123–126. IEEE, Penang.

Wang R (2010) Cognitive-based emotion classifier of Chinese vocabulary design In: ISISE 2010: Proceedings of the International Symposium on Information Science and Engineering, 582–585. IEEE.

Thorleuchter D, Van den Poel D (2014) Semantic compared cross impact analysis. Expert Syst Appl 41(7): 3477–3483.

Roussinov D, Turetken O (2009) Exploring models for semantic category verification. Inf Syst 34(8): 753–765.

Zelikovitz S, Kogan M (2006) Using Web searches on important words to create background sets for LSI classification In: FLAIRS Conference 2006: Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, 598–603.

SentiWordNet. http://sentiwordnet.isti.cnr.it/ . Accessed 8 June 2016.

Al Nasseri A, Tucker A, de Cesare S (2015) Quantifying StockTwits semantic terms’ trading behavior in financial markets: an effective application of decision tree algorithms. Expert Syst Appl 42(23): 9192–9210.

Kumar V, Minz S (2013) Mood classifiaction of lyrics using SentiWordNet In: ICCCI 2013: Proceedings of the International Conference on Computer Communication and Informatics, 1–5. IEEE, Coimbatore.

Unified Medical Language System (UMLS) Metathesaurus. https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/ . Accessed 8 June 2016.

Garla VN, Brandt C (2012) Ontology-guided feature engineering for clinical text classification. J Biomed Inform 45(5): 992–998.

Plaza L, Díaz A, Gervás P (2011) A semantic graph-based approach to biomedical summarisation. Artif Intell Med 53(1): 1–14.

Aljaber B, Martinez D, Stokes N, Bailey J (2011) Improving MeSH classification of biomedical articles using citation contexts. J Biomed Inform 44(5): 881–896.

Medical Subject Headings (MeSH). https://www.nlm.nih.gov/mesh/ . Accessed 8 June 2016.

Logeswari S, Premalatha K (2013) Biomedical document clustering using ontology based concept weight In: ICCCI 2013: Proceedings of the International Conference on Computer Communication and Informatics, 1–4. IEEE, Coimbatore.

Nguyen SH, Jaśkiewicz G, Świeboda W, Nguyen HS (2012) Enhancing search result clustering with semantic indexing In: SoICT’12: Proceedings of the Third Symposium on Information and Communication Technology, 71–80. ACM, New York.

Ginter F, Pyysalo S, Boberg J, Järvinen J, Salakoski T (2004) Ontology-based feature transformations: a data-driven approach In: Advances in Natural Language Processing, 279–290. Springer, Berlin.

Kanavos A, Makris C, Theodoridis E (2012) On topic categorization of PubMed query results In: Artificial Intelligence Applications and Innovations, 556–565. Springer.

Zheng HT, Borchert C, Kim HG (2008) Exploiting gene ontology to conceptualize biomedical document collections In: The Semantic Web, 375–389. Springer, Berlin.

Jin B, Muller B, Zhai C, Lu X (2008) Multi-label literature classification based on the Gene Ontology graph. BMC Bioinforma 9(1): 525.

Mannai M, Ben Abdessalem Karaa W (2013) Bayesian information extraction network for Medline abstract. In: 2013 World Congress on Computer and Information Technology (WCCIT), 1–3. IEEE, Sousse.

Jiana B, Tingyu L, Tianfang Y (2012) Event information extraction approach based on complex Chinese texts In: IALP 2012: Proceedings of the International Conference on Asian Language Processing, 61–64.

Hengliang W, Weiwei Z (2012) A web information extraction method based on ontology. Adv Inf Sci Serv Sci 4(8): 199–206.

Aghassi H, Sheykhlar Z (2012) Extending information retrieval by adjusting text feature vectors. Commun Comput Inform Sci 295 CCIS: 133–142.

Bharathi G, Venkatesan D (2012) Improving information retrieval using document clusters and semantic synonym extraction. J Theor Appl Inf Technol 36(2): 167–173.

Egozi O, Markovitch S, Gabrilovich E (2011) Concept-based information retrieval using explicit semantic analysis. ACM Trans Inf Syst 29(2): 8:1–8:34.

Nassirtoussi AK, Aghabozorgi S, Wah TY, Ngo DCL (2015) Text mining of news-headlines for FOREX market prediction: a multi-layer dimension reduction algorithm with semantics and sentiment. Expert Syst Appl 42(1): 306–324.

Batool R, Khattak AM, Maqbool J, Lee S (2013) Precise tweet classification and sentiment analysis In: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS), 461–466. IEEE, Niigata.

Veselovská K (2012) Sentence-level sentiment analysis in Czech In: WIMS’12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 65:1–65:4. ACM, New York.

Petersen MK, Hansen LK (2012) On an emotional node: modeling sentiment in graphs of action verbs In: 2012 International Conference on Audio, Language and Image Processing, 308–313. IEEE, Shanghai.

Domínguez García R, Schmidt S, Rensing C, Steinmetz R (2012) Automatic taxonomy extraction in different languages using Wikipedia and minimal language-specific information. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 7181 LNCS(PART 1): 42–53.

Punuru J, Chen J (2012) Learning non-taxonomical semantic relations from domain texts. J Intell Inf Syst 38(1): 191–207.

Stenetorp P, Soyer H, Pyysalo S, Ananiadou S, Chikayama T (2012) Size (and domain) matters: evaluating semantic word space representations for biomedical text In: SMBM 2012: Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine, 42–49.

Froud H, Lachkar A, Ouatik SA (2012) Stemming versus light stemming for measuring the simitilarity between Arabic words with latent semantic analysis model In: 2012 Colloquium in Information Science and Technology, 69–73. IEEE, Fez.

Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Inf Softw Technol 49: 230–243.

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan): 993–1022.

Zrigui M, Ayadi R, Mars M, Maraoui M (2012) Arabic text classification framework based on latent Dirichlet allocation. J Comput Inf Technol 20(2): 125–140.

Liu Z, Li M, Liu Y, Ponraj M (2011) Performance evaluation of latent Dirichlet allocation in text mining In: FSKD 2011: Proceedings of the Eighth International Conference on Fuzzy Systems and Knowledge Discovery, 2695–2698. IEEE, Shanghai.

Xiang W, Yan J, Ruhua C, Hua F (2013) Improving text categorization with semantic knowledge in Wikipedia. IEICE Trans Inf Syst 96(12): 2786–2794.

Spanakis G, Siolas G, Stafylopatis A (2012) Exploiting Wikipedia knowledge for conceptual hierarchical clustering of documents. Comput J 55(3): 299–312.

Andreasen T, Bulskov H, Jensen PA, Lassen T (2011) Extracting conceptual feature structures from text In: ISMIS 2011: Proceedings 19th International Symposium on Methodologies for Intelligent Systems, 396–406. Springer, Berlin.

Goossen F, IJntema W, Frasincar F, Hogenboom F, Kaymak U (2011) News personalization using the CF-IDF semantic recommender In: WIMS’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, 10. ACM, New York.

Huang A, Milne D, Frank E, Witten IH (2008) Clustering documents with active learning using Wikipedia In: ICDM’08: Eighth IEEE International Conference on Data Mining, 839–844. IEEE, Pisa.

Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis In: IJCAI-07: Proceedings of the 20th International Joint Conference on Artifical Intelligence, 1606–1611. Morgan Kaufmann Publishers Inc, San Francisco. http://dl.acm.org.ez67.periodicos.capes.gov.br/citation.cfm?id=1625535 .

Navigli R, Faralli S, Soroa A, de Lacalle O, Agirre E (2011) Two birds with one stone: learning semantic models for text categorization and word sense disambiguation In: CIKM’11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2317–2320. ACM, Glasgow.

Mostafa MS, Haggag MH, Gomaa WH (2008) Document clustering using word sense disambiguation In: SEDE 2008: Proceedings of 17th International Conference on Software Engineering and Data Engineering, 19–24.

Andreopoulos B, Alexopoulou D, Schroeder M (2008) Word sense disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering. Int J Data Min Bioinforma 2(3): 193–215.

Koeling R, McCarthy D, Carroll J (2007) Text categorization for improved priors of word meaning In: Computational Linguistics and Intelligent Text Processing, 241–252. Springer, Berlin.

Sharma A, Swaminathan R, Yang H (2010) A verb-centric approach for relationship extraction in biomedical text In: ICSC 2010: Proceedings of the IEEE Fourth International Conference on Semantic Computing, 377–385. IEEE, Pittsburgh.

Wang W, Zhao D, Zou L, Wang D, Zheng W (2010) Extracting 5W1H event semantic elements from Chinese online news In: WAIM 2010: Proceedings of the Workshops of the 11th International Conference on Web-Age Information Management, 644–655. Springer, Berlin.

Rebholz-Schuhmann D, Jimeno-Yepes A, Arregui M, Kirsch H (2010) Measuring prediction capacity of individual verbs for the identification of protein interactions. J Biomed Inform 43(2): 200–207.

Van Der Horn P, Bakker B, Geleijnse G, Korst J, Kurkin S (2008) Classifying verbs in biomedical text using subject-verb-object relationships In: SMBM 2008: Proceedings of the 3rd International Symposium on Semantic Mining in Biomedicine, 137–140.

Kontos J, Malagardi I, Alexandris C, Bouligaraki M (2000) Greek verb semantic processing for stock market text mining In: NLP’00: Proceedings of the Second International Conference on Natural Language Processing, 395–405. Springer-Verlag, London.

Stankov I, Todorov D, Setchi R (2013) Enhanced cross-domain document clustering with a semantically enhanced text stemmer (SETS). Int J Knowl-Based Intell Eng Syst 17(2): 113–126.

Huang CH, Yin J, Hou F (2011) A text similarity measurement combining word semantic information with TF-IDF method. Jisuanji Xuebao (Chin J Comput) 34(5): 856–864.

Doan S, Kawazoe A, Conway M, Collier N (2009) Towards role-based filtering of disease outbreak reports. J Biomed Inform 42(5): 773–780.

Meng X, Chen Q, Wang X (2008) Semantic feature reduction in Chinese document clustering In: SMC 2008: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 3721–3726. IEEE, Singapore.

Freitas A, O’Riain S, Curry E, da Silva JCP, Carvalho DS (2013) Representing texts as contextualized entity-centric linked data graphs In: DEXA 2013: Proceedings of the 24th International Workshop on Database and Expert Systems Applications, 133–137. IEEE, Los Alamitos.

Fathy I, Fadl D, Aref M (2012) Rich semantic representation based approach for text generation In: INFOS 2012: Proceedings of the 8th International Conference on Informatics and Systems, NLP–20. IEEE, Cairo.

Wu J, Xuan Z, Pan D (2011) Enhancing text representation for classification tasks with semantic graph structures. Int J Innov Comput Inf Control (ICIC) 7(5): 2689–2698.

Alencar ROD, Davis Jr CA, Gonçalves MA (2010) Geographical classification of documents using evidence from Wikipedia In: GIR ’10: Proceedings of the 6th Workshop on Geographic Information Retrieval, 12. ACM, New York.

Smirnov I, Tikhomirov I (2009) Heterogeneous semantic networks for text representation in intelligent search engine EXACTUS In: SENSE’09: Proceedings of the Workshop on Conceptual Structures for Extracting Natural Language Semantics, 1–9.

Chau R, Tsoi AC, Hagenbuchner M, Lee V (2009) A conceptlink graph for text structure mining In: ACSC’09: Proceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 91, 141–150. Australian Computer Society, Inc., Darlinghurst.

Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61: 85–117.

Lebret R, Collobert R (2015) Rehabilitation of count-based models for word vector representations. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9041: 417–429.

Li R, Shindo H (2015) Distributed document representation for document classification. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9077: 212–225.

Sohrab MG, Miwa M, Sasaki Y (2015) Centroid-means-embedding: an approach to infusing word embeddings into features for text classification. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9077: 289–300.

Wang P, Xu B, Xu J, Tian G, Liu CL, Hao H (2016) Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 174: 806–814.

Zhang C, Zhang L, Wang CJ, Xie JY (2014) Text summarization based on sentence selection with semantic representation In: Proceedings of the International Conference on Tools with Artificial Intelligence, vol. 2014-December, 584–590. IEEE, Limassol.

Vulić I, Moens MF (2015) Monolingual and cross-lingual information retrieval models based on (Bilingual) word embeddings In: SIGIR’15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 363–372. ACM, New York.

Kamal A, Abulaish M, Anwar T (2012) Mining feature-opinion pairs and their reliability scores from web opinion sources In: WIMS’12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 15. ACM, New York.

Kong L, Yan R, He Y, Zhang Y, Zhang Z, Fu L (2011) DVD: a model for event diversified versions discovery In: Web Technologies and Applications, 168–180. Springer, Berlin.

Jing L, Yun J, Yu J, Huang J (2011) High-order co-clustering text data on semantics-based representation model In: Advances in Knowledge Discovery and Data Mining, 171–182. Springer, Berlin.

Krajewski R, Rybinski H, Kozlowski M (2016) A novel method for dictionary translation. J Intell Inf Syst 47(3): 491–514.

Luo Z, Miotto R, Weng C (2013) A human–computer collaborative approach to identifying common data elements in clinical trial eligibility criteria. J Biomed Inform 46(1): 33–39.

Kayed A (2005) Building e-laws ontology: new approach In: Proceedings of the On the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops, 826–835. Springer, Berlin.

Sevenster M, van Ommering R, Qian Y (2012) Automatically correlating clinical findings and body locations in radiology reports using MedLEE. J Digit Imaging 25(2): 240–249.

Volkova S, Caragea D, Hsu WH, Drouhard J, Fowles L (2010) Boosting biomedical entity extraction by using syntactic patterns for semantic relation discovery In: WI-IAT 2010: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 272–278. IEEE, Toronto.

Waltinger U, Mehler A (2009) Social semantics and its evaluation by means of semantic relatedness and open topic models In: WI-IAT’09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01, 42–49. IEEE Computer Society, Milan.

Kass A, Cowell-Shah C (2006) Using lightweight NLP and semantic modeling to realize the internet’s potential as a corporate radar In: AAAI Fall Symposium. AAAI PRESS.

Blake C (2010) Beyond genes, proteins, and abstracts: identifying scientific claims from full-text biomedical articles. J Biomed Inform 43(2): 173–189.

Hu J, Fang L, Cao Y, Zeng HJ, Li H, Yang Q, et al (2008) Enhancing text clustering by leveraging Wikipedia semantics In: SIGIR’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 179–186. ACM, New York.

Lu CY, Lin SH, Liu JC, Cruz-Lara S, Hong JS (2010) Automatic event-level textual emotion sensing using mutual action histogram between entities. Expert Syst Appl 37(2): 1643–1653.

Ahmed ST, Nair R, Patel C, Davulcu H (2009) BioEve: bio-molecular event extraction from text using semantic classification and dependency parsing In: BioNLP’09: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, 99–102. Association for Computational Linguistics.

Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8): 651–666.

Wordle. http://www.wordle.net/ . Accessed 15 June 2016.

Acknowledgements

The authors gratefully acknowledge the financial support of grant #132666/2016-2, National Council for Scientific and Technological Development (CNPq); grants #2013/14757-6, #2014/08996-0, and #2016/07620-2, São Paulo Research Foundation (FAPESP); and the Coordination for the Improvement of Higher Education Personnel (CAPES).

Authors’ contributions

RAS and SOR planned this systematic mapping study. RAS conducted its first cycle (searches performed in January 2014). JA and RAS conducted its second cycle (searches performed in February 2016). RAS and SOR analyzed the results and drafted the manuscript after the first cycle and updated it after the second cycle. JA was involved in updating the manuscript with the second cycle results. All authors revised and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations.

Laboratório de Inteligência Computacional (LABIC), Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP), São Carlos, P.O. Box 668, 13561-970, SP, Brazil

Roberta Akemi Sinoara, João Antunes & Solange Oliveira Rezende

Corresponding author

Correspondence to Roberta Akemi Sinoara.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article.

Sinoara, R., Antunes, J. & Rezende, S. Text mining and semantics: a systematic mapping study. J Braz Comput Soc 23, 9 (2017). https://doi.org/10.1186/s13173-017-0058-7

Received: 24 March 2017

Accepted: 01 June 2017

Published: 29 June 2017

DOI: https://doi.org/10.1186/s13173-017-0058-7

  • Systematic review
  • Text mining
  • Text semantics


Applications of text mining within systematic reviews

Affiliations.

  • 1 Institute of Education EPPI-Centre, SSRU 18 Woburn Square, London WC1H 0NR, U.K.
  • 2 University of Manchester, National Centre for Text Mining, Manchester, U.K.
  • PMID: 26061596
  • DOI: 10.1002/jrsm.27

Systematic reviews are a widely accepted research method. However, it is increasingly difficult to conduct them to fit with policy and practice timescales, particularly in areas which do not have well indexed, comprehensive bibliographic databases. Text mining technologies offer one possible way forward in reducing the amount of time systematic reviews take to conduct. They can facilitate the identification of relevant literature, its rapid description or categorization, and its summarization. In this paper, we describe the application of four text mining technologies, namely, automatic term recognition, document clustering, classification and summarization, which support the identification of relevant studies in systematic reviews. The contributions of text mining technologies to improve reviewing efficiency are considered and their strengths and weaknesses explored. We conclude that these technologies do have the potential to assist at various stages of the review process. However, they are relatively unknown in the systematic reviewing community, and substantial evaluation and methods development are required before their possible impact can be fully assessed. Copyright © 2011 John Wiley & Sons, Ltd.

Keywords: automatic summarization; document classification; document clustering; research synthesis; screening; searching; systematic review; term recognition; text mining.

Grants and funding

  • MR/J005037/1/MRC_/Medical Research Council/United Kingdom

  • Open access
  • Published: 02 November 2020

Comprehensive review of text-mining applications in finance

  • Aaryan Gupta 1 ,
  • Vinya Dengre 1 ,
  • Hamza Abubakar Kheruwala 1 &
  • Manan Shah 2  

Financial Innovation volume 6, Article number: 39 (2020)

Text-mining technologies have substantially affected financial industries. As the data in every sector of finance have grown immensely, text mining has emerged as an important field of research in the domain of finance. Therefore, reviewing the recent literature on text-mining applications in finance can be useful for identifying areas for further research. This paper focuses on the text-mining literature related to financial forecasting, banking, and corporate finance. It also analyses the existing literature on text mining in financial applications and provides a summary of some recent studies. Finally, the paper briefly discusses various text-mining methods being applied in the financial domain, the challenges faced in these applications, and the future scope of text mining in finance.

Introduction

Today, technology is deeply integrated with everyone’s lives. Nearly every activity in modern life, from phone calls to satellites sent into space, has evolved exponentially with technology (Patel et al. 2020a , b , c ; Panchiwala and Shah 2020 ). The increasing ability to create and manage information has been an influential factor in the development of technology. According to the National Security Agency of the United States, 1826 petabytes on average are handled daily over the Internet (Hariri et al. 2019 ; Jaseena and David 2014 ). With the rapid increase in data and information communicated over the Internet, it has become necessary to regulate and ease the flow of the same (Ahir et al. 2020 ; Gandhi et al. 2020 ). A number of commercial and social applications have been introduced for these purposes. Aspects of data and information, such as security, research, and sentiment analysis, can be of great help to organisations, governments, and the public (Jani et al. 2019 ; Jha et al. 2019 ). There are various optimized techniques that aid us in tasks such as classification, summarisation, and ease of access and management of data, among others (Shah et al. 2020a , b ; Talaviya et al. 2020 ). Algorithms related to machine learning and deep learning (DL) are just some of the many algorithms that can be used to process the available information (Kakkad et al. 2019 ; Kundalia et al. 2020 ). Even though there is a massive amount of available information, the use of computational techniques can help us process information from top to bottom and analyse entire documents as well as individual words (Pandya et al. 2019 ; Parekh et al. 2020 ).

Human-generated ‘natural’ data in the form of text, audio, video, and so on are rapidly increasing (Shah et al. 2020a , b ). This has led to a rise in interest in methods and tools that can help extract useful information automatically from enormous amounts of unstructured data (Jaseena and David 2014 ; David and Balakrishnan 2011 ). One crucial method is text mining, which is a combined derivative of techniques such as data mining, machine learning, and computational linguistics, among others. Text mining aims to extract information and patterns from textual data (Talib et al. 2016b ; Fan et al. 2006 ). The trivial approach to text mining is manual, in which a human reads the text and searches for useful information in it. A more logical approach is automatic, which mines text in an efficient way in terms of speed and cost (Herranz et al. 2018 ; Sukhadia et al. 2020 ; Pathan et al. 2020 ).

According to the India Brand Equity Foundation (IBEF 2019 ), the Indian financial industry alone had US $340.48 billion in assets under management as of February 2019. This value only provides us with a limited indication of the actual size and reach of the global finance industry. Technology has paved the way for digitalisation in this rapidly growing behemoth. ‘FinTech’ is a developing domain in the finance industry, which has been defined as a union of finance and information technology (Zavolokina et al. 2016 ). Marrara et al. ( 2019 ) examined how FinTech relates to Italian small and medium-sized enterprises (SMEs), where FinTech has witnessed huge growth in terms of investment and development, and how it has proved fruitful for the SME market in a short amount of time. FinTech has popularised the use of data in the financial industry. This data is substantially in the form of structured or unstructured text. Therefore, traditionally and technically, textual data can be regarded as always having been a prevailing and essential element in the finance sector.

Unstructured textual data have been increasing rapidly in the finance industry (Lewis and Young 2019 ). This is where text mining has a lot of potential. Kumar and Ravi ( 2016 ) explored various applications in the financial domain in which text mining could play a significant role. They concluded that it had numerous applications in this industry, such as various kinds of predictions, customer relationship management, and cybersecurity issues, among others. Many novel methods have been proposed for analysing financial results in recent years, and artificial intelligence has made it possible to analyse and even predict financial outcomes based on historical data.

Finance has been an important force in human life since the earliest civilisations. It is noteworthy that from barter systems to cryptocurrencies, finance has always been associated with data, such as transactions, accounts, prices, and reports. Manual approaches to processing data have been reduced in use and significance over time. Researchers and practitioners have come to prefer digitised and automated approaches for studying and analysing financial data. Financial data contain a significant amount of latent information. If the latent information were to be extracted manually from a huge corpus of data, it might take years. Advancements in text mining have made it possible to efficiently examine textual data pertaining to finance. Bach et al. ( 2019 ) published a literature review on text mining for big-data analysis in finance. They structured the review in terms of three critical questions. These questions pertained to the intellectual core of finance, the text-mining techniques used in finance, and the data sources of financial sectors. Kumar and Ravi ( 2016 ) discussed the model presented by Vu et al. ( 2012 ) that implemented text mining on Twitter messages to perform sentiment analysis for the prediction of stock prices. They also mentioned the model of Lavrenko et al. ( 2000 ), which could classify news stories in a way that could help identify which of them affected trends in finance and to what degree. We will further discuss text-mining applications in finance in subsequent sections.

Apart from finance, we present a brief overview of text mining in other industries. On social media, people generate text data in the form of posts, blogs, and web forum activity, among many others (Agichtein et al. 2008 ). Despite the vast quantity of data available, the relatively low proportion of content of significant quality is still a problem (Kinsella et al. 2011 ), which is an issue that can be solved by text mining (Salloum et al. 2017 ). In the biomedical field too, there is a need for effective text-mining and classification methods (Krallinger et al. 2011 ). On e-commerce websites, text mining is used to prevent the repetition of information to the same audience (Da-sheng et al. 2009 ) and improve product listings through reviews (Kang and Park 2016 ; Ur-Rahman and Harding 2012 ). In healthcare, researchers have worked on applications such as the identification of healthcare topics directly from personal messages over the Internet (Lu 2013 ), classification of online data (Srivastava et al. 2018 ), and analysis of patient feedback (James et al. 2017 ). The agriculture industry has also used text mining in, for example, the classification of agricultural regulations (Espejo-Garcia et al. 2018 ), ontology-based agricultural text clustering (Su et al. 2012 ), and analysis of agricultural network public opinions (Lee 2019 ). Text mining has also been utilised in the detection of malicious web URLs which evolve over time and have complex features (Li et al. 2020a ; b , c ).

This paper discusses the use of text mining in the financial domain in detail, taking into consideration three major areas of application: financial forecasting, banking, and corporate finance. We also discuss the widely used methodologies and techniques for text mining in finance, the challenges faced by researchers, and the future scope for text-mining methods in finance.

Overview of text-mining methodologies

Text mining is a process through which the user derives high-quality information from a given piece of text. Text mining has seen a significant increase in demand over the last few years. Coupled with big data analytics, the field of text mining is evolving continuously. Finance is one major sector that can benefit from these techniques; the analysis of large volumes of financial data is both a need and an advantage for corporates, government, and the general public. This section discusses some important and widely used techniques in the analysis of textual data in the context of finance.

Sentiment analysis (SA)

One of the most important techniques in the field is SA. It has applications in numerous sectors. This technique extracts the underlying opinions within textual data and is therefore also referred to as opinion mining (Akaichi et al. 2013 ). It is of prime use in a number of domains, such as e-commerce platforms, blogs, online social media, and microblogs. The motives behind sentiment analysis can be broadly divided into emotion recognition and polarity detection. Emotion detection is focused on the extraction of a set of emotion labels, and polarity detection is more of a classifier-oriented approach with discrete outputs (e.g., positive and negative) (Cambria 2016 ).

There are two main approaches for SA, namely lexicon-based (dictionary-based) and machine learning (ML). The latter is further classified into supervised and unsupervised learning approaches (Xu et al. 2019 ; Pradhan et al. 2016 ). Lexicon-based approaches use SentiWordNet word maps, whereas ML considers SA as a classification problem and uses established techniques for it. In lexicon-based approaches, the overall score for sentiment is calculated by dividing the sentiment frequency by the sum of positive and negative sentiments. In ML approaches, the major techniques that are used are the Naïve Bayes (NB) classifier and support vector machines (SVMs), which use labelled data for classification. SA using ML has an edge over the lexicon approach, as it does not require word dictionaries, which are costly to build. However, ML requires domain-specific datasets, which can be considered a limitation (Al-Natour and Turetken 2020 ). After data preprocessing, feature selection is performed as required, and the final results are then obtained by analysing the data with the adopted approach (Hassonah et al. 2019 ).
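The lexicon-based scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited study: the word lists are a hypothetical toy lexicon standing in for a resource such as SentiWordNet, and the normalisation (net count of sentiment words divided by the total number of positive and negative words) is one plausible reading of the scoring rule.

```python
import re

# Hypothetical toy lexicon; a real system would load a resource such as SentiWordNet.
POSITIVE = {"gain", "growth", "profit", "strong", "upgrade"}
NEGATIVE = {"loss", "decline", "weak", "downgrade", "risk"}


def lexicon_score(text):
    """Return a polarity score in [-1, 1]; 0.0 when no lexicon word occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(1 for token in tokens if token in POSITIVE)
    neg = sum(1 for token in tokens if token in NEGATIVE)
    total = pos + neg
    if total == 0:
        return 0.0
    # Assumed normalisation: net sentiment count over all sentiment-bearing words.
    return (pos - neg) / total


if __name__ == "__main__":
    print(lexicon_score("Strong profit growth despite currency risk"))
```

A score near 1 indicates mostly positive lexicon hits and a score near -1 mostly negative ones. Text without any lexicon word is treated as neutral here, which is a design choice rather than a standard.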

In the financial domain, stock market prediction is one of the applications in which SA has been used to predict future stock market trends and prices from the analysis of financial news articles. Joshi et al. ( 2016 ) compared three ML algorithms and observed that random forest (RF) and SVMs performed better than NB. Renault ( 2019 ) used StockTwits (a platform where people share ideas about the stock market) as a data source and applied five algorithms, namely NB, a maximum entropy method, a linear SVM, an RF, and a multilayer perceptron and concluded that the maximum entropy and linear SVM methods gave the best results. Over the years, researchers have combined deep learning methods with traditional machine learning techniques (e.g., construction of sentiment lexicon), thus obtaining more promising results (Yang et al. 2020 ).

Information extraction

Information extraction (IE) is used to extract predefined data types from a text document. IE systems mainly aim for object identification by extracting relevant information from the fragments and then putting all the extracted pieces in a framework. Post extraction, DiscoTEX (Discovery from TextEXtraction) is one of the core methods used to convert the structured data into meaningful data to discover knowledge from it (Salloum et al. 2018 ).

In finance, named-entity recognition (NER) is used for extracting predefined types of data from a document. In banking, transaction order documents of customers may come via fax, which results in very diverse documents because of the lack of a fixed template and creates the need for proper feature extraction to obtain a structured document (Emekligil et al. 2016 ).

Natural language processing (NLP)

NLP is a part of the artificial intelligence domain and attempts to help transform imprecise and ambiguous messages into unambiguous and precise messages. In the financial sector, it has been used to assess a firm’s current and future performance, domain standards, and regulations. It is often used to mine documents to obtain insights for developing conclusions (Fisher et al. 2016 ). NLP can help perform various analyses, such as NER, which further helps in identifying the relationships and other information to identify the key concept. However, NLP lacks a dictionary list for all the named entities used for identification (Talib et al. 2016a ; b ).

As NLP is a pragmatic research approach to analyse the huge amount of available data, Xing et al. ( 2017 ) applied it to bridge the gap between NLP and financial forecasting by considering topics that would interest both the research fields. Figure  1 provides an intuitive grasp of natural language-based financial forecasting (NLFF).

Figure 1: An intersection of NLP and financial forecasting to illustrate the concept of NLFF (Xing et al. 2017 )

Chen et al. ( 2020 ) discussed the role of NLP in FinTech in the past, present, and future. They reviewed three aspects, namely know your customer (KYC), know your product (KYP), and satisfy your customer (SYC). In KYC, a lot of textual data is generated in the process of acquiring information about customers (corporate sector and retail). With respect to KYP, salespersons are required to know all the attributes of their product, which again requires data in order to know the prospects, risks, and opportunities of the product. In SYC, salespersons/traders and researchers try to make the financial activities more efficient to satisfy the customers in the business-to-customer as well as customer-to-customer business models. Herranz et al. ( 2018 ) discussed the role of NLP in teaching finance and reported that it enhanced the transfer of knowledge within an environment overloaded with information.

Text classification

Text classification is a four-step process comprising feature extraction, dimension reduction, classifier selection, and evaluation. Feature extraction can be done with common techniques such as term frequency and Word2Vec; then, dimensionality reduction is performed using techniques such as principal component analysis and linear discriminant analysis. Choosing a classifier is an important step, and it has been observed that deep learning approaches have surpassed the results of other machine learning algorithms. The evaluation step helps in understanding the performance of the model; it is conducted using various metrics, such as the Matthews correlation coefficient (MCC), area under the ROC curve (AUC), and accuracy. Accuracy is the simplest of these to evaluate. Figure  2 shows an overview of the text classification process (Kowsari et al. 2019 ).

Figure 2: A general overview of the text classification process (Kowsari et al. 2019 )

Brindha et al. ( 2016 ) compared the performance of various text classification techniques, namely NB, k-nearest neighbour (KNN), SVM, decision tree, and regression, and found that based on the precision, recall, and F1 measures, SVM provided better results than the others.
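To make the evaluation step concrete, the following sketch computes accuracy, precision, recall, F1, and the MCC for a binary classifier from its confusion counts. The function names and the example labels are hypothetical and chosen only for illustration; AUC is omitted because it needs ranked scores rather than hard predictions.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return common binary classification metrics as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical labels: 1 = positive news, 0 = negative news.
    print(evaluate([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```

Unlike accuracy, the MCC uses all four confusion counts, so it tends to stay informative when the classes are imbalanced.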

Deep learning

Deep learning is a part of machine learning, which trains a data model to make predictions about new data. Deep learning has a layered architecture, where the input data goes into the lowest level and the output data is generated at the highest level. The input is transformed at the various middle levels by applying algorithms to extract features, transform features into factors, and then input the factors into the deeper layer again to obtain transformed features (Heaton et al. 2016 ). Widiastuti ( 2018 ) focused on the input data, as it plays an important role in the performance of any algorithm. The author concluded that modification of the network architecture with deep learning algorithms can markedly affect performance and provide good results.
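The layered flow described above, in which each level transforms the output of the level below it, can be illustrated with a tiny feed-forward pass. This is a sketch only: the weights, layer sizes, and activation choices are hypothetical values picked for the example, and no training is shown.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . x + b) for each output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input through each layer in order, lowest level first."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]

if __name__ == "__main__":
    print(forward([1.0, 2.0, 3.0], LAYERS))
```

In a trained model the weights would be learned from data; here they are fixed so that the flow from the input through the hidden layer to the output is easy to trace by hand.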

In finance, deep learning helps address the complexity and ambiguity of natural language. Kraus and Feuerriegel ( 2017 ) used a corpus of 13,135 German ad hoc announcements in English to predict stock market movements and concluded that deep learning was better than the traditional bag-of-words approach. The results also showed that the long short-term memory models outperformed all the existing machine learning algorithms when transfer learning was performed to pre-train word embeddings.

Review of text-mining applications in finance

As mentioned in earlier sections, this paper focuses on the applications of text mining in three sectors of finance, namely financial predictions, banking, and corporate finance. In the subsections, we review various studies. Some literature has been summarised in detail, and in the end, a tabular summary of some more studies is included. Figure  3 shows a summarised link between the text-mining techniques and their corresponding applications in the respective domains. Although the following subsections discuss the studies pertaining to each sector individually, there has also been research on techniques that can be applied to multiple financial sectors. One such system was proposed by Li et al. ( 2020a ), which was a classifier based on adaptive hyper-spheres. It could be helpful in tasks such as credit scoring, stock price prediction, and anti-fraud analysis.

Figure 3: An overview of how text mining can be used in the financial domain. This paper follows a systematic approach for reviewing text-mining applications, as depicted by the flowchart in the figure. The two independent entities, namely finance and text mining, are linked together to show the possible applications of various text-mining techniques in various financial domains

Prediction of financial trends

Using the ever-expanding pool of textual data to improve the dynamics of the market has long been a practice in the financial industry. The increasing volume of press releases, financial data, and related news articles have been motivating continued and sophisticated analysis, dating back to the 1980s, in order to derive a competitive advantage (Xing et al. 2017 ). Abundant data investigated with text mining can deliver an advantage in a variety of scenarios. As per Tkáč and Verner ( 2016 ) and Schneider and Gupta ( 2016 ), among the many ideas covered in financial forecasting, from credit scoring to inflation rate prediction, a large proportion of focus is on stock market and forex prediction. Wen et al. ( 2019 ) proposed an idea regarding how retail investor attention can be used for evaluation of the stock price crash risk.

Wu et al. ( 2012 ) proposed a model that combined the features of technical analysis of stocks with sentiment analysis, as stock prices also depend on the decisions of investors who read stock news articles. They focused on obtaining the overall sentiment behind each news article and assigned it the respective sentiment based on the weight it carried. Next, using different indicators, such as price, direction, and volume, technical analysis was performed and the learning prediction model was generated. The model was used to predict Taiwan’s stock market, and the results proved to be more promising than models that employed either of the two. This indicates an efficient system that can be integrated with even better features in the future.

Al-Rubaiee et al. ( 2015 ) analysed the relationship between Saudi Twitter posts and the country’s stock market (Tadawul). They used several algorithms, such as SVM, KNN, and NB, to classify Arabic text for the purpose of stock trading. Their major focus was on properly preprocessing data before the analysis. By comparing the results, they found that SVM had the best recall, and KNN had the best precision. The one-to-one model that they built showcased the positive and negative sentiments as well as the closing values of the Tadawul All Share Index (TASI). The relationship between a rise in the TASI and an increase in positive sentiments was found to be stronger than that between a decline in the index and negative sentiments. The researchers mentioned that in future work they would incorporate the Saudi stock market closing values and sentiment features on tweets to explore the patterns between the Saudi stock index and public opinion on Twitter.

Vijayan and Potey ( 2016 ) proposed a model based on recent news headlines that predicted the forex trends based on the given market situations. The information about the past forex currency pair trends was analysed along with the news headlines corresponding to that timeline, and it was assumed that the market would behave in the future as it had done in the past. The researchers focused on the elimination of redundancy, and their model focused on news headlines rather than on entire articles. Multilayer dimension reduction algorithms were used for text mining, the Synchronous Targeted Label Prediction algorithm was used for optimal feature reduction, and the J48 algorithm was used for the generation of decision trees. The main focus was on fundamental analysis that targeted unstructured textual data in addition to technical analysis to make predictions based on historical data. The J48 algorithm resulted in an improvement in the accuracy and performance of the overall system, better efficiency, and less runtime. In fact, the researchers reported that the algorithm could be applied to diverse subjects, such as movie reviews.

Nassirtoussi et al. ( 2015 ) proposed an approach for forex prediction wherein the major focus was on strengthening text-mining aspects that had not been focused upon in previous studies. Dimensionality reduction, semantic integration, and sentiment analysis enabled efficient results. The system predicted the directional movement of a currency pair based on news headlines in the sector from a few hours before. Again, headlines were taken into consideration for the analysis, and a multilayer algorithm was used to address semantics, sentiments, and dimensionality reduction. This model’s process was highly accurate, with results of up to 83%. The strong results obtained in that study demonstrate that the studied relationships exist. The models can be applied to other contexts as well.

Nikfarjam et al. ( 2010 ) discussed the components that constitute a forecasting model in this sector and the prototypes that had been recently introduced. The main components were compared with each other. Feature selection and feature weighting were used to select news items and assign weights to them; the selection methods were used either individually or in combination. Next, feature weighting was used to calculate the weights for the given terms. The feature weighting methodology was based on the study by Fung et al. ( 2002 ), who had assigned more weights to enhance the term frequency-inverse document frequency (TF-IDF) weighting. For text classification, most researchers have applied SVMs to classify the input text into either good or bad news. Some researchers have used Bayesian classifiers, and some others have used a combination of binary classifiers to achieve the final classification decision. Many authors have focused on news features but have not equally addressed the available market data. The focus of most studies has been on the analysis of news and indicator values separately, which has proved to be less efficient. Combining market news with the status of market trends at the same time is expected to provide stronger results.
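As a point of reference for the feature weighting step, the sketch below computes plain TF-IDF weights for a small set of tokenised headlines. It shows only the standard baseline (relative term frequency multiplied by the logarithm of the inverse document frequency), not the enhanced weighting attributed to Fung et al. ( 2002 ); the headlines and the function name are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Hypothetical tokenised news headlines.
    headlines = [
        ["stocks", "rise", "on", "earnings"],
        ["stocks", "fall", "on", "inflation"],
        ["central", "bank", "holds", "rates"],
    ]
    for doc_weights in tf_idf(headlines):
        print(doc_weights)
```

Terms that appear in every document receive a weight of zero, while terms concentrated in a few documents are weighted up, which is why TF-IDF is a common starting point for separating informative news terms from generic ones.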

Gupta et al. ( 2019 ) proposed a combination of two models: the primary model obtained the dataset for prediction, preprocessed the dataset using logistic regression to remove redundancy, and employed a genetic algorithm, KNN, and support vector regression (SVR). In a comparison of all three, KNN was the basis for their predictions, with an efficiency of more than 50%. The genetic algorithm was used next in search for better accuracy. In an attempt to further support the genetic algorithm, SVR was used, which gave the opening price for any day in the future. For sentiment analysis, Twitter was used, as it was considered the most popular source for related news. The model divided the tweets into two categories, and the rise or fall of the market was predicted taking into consideration the huge pool of keywords. In the end, the model had an accuracy of about 70–75%, which seems reasonable for a dynamic environment.

Nguyen et al. ( 2015 ) focused on sentiment analysis of social media. They obtained the sentiments behind specific topics of discussion of the company on social media and achieved promising results in comparison with the accuracy of stocks in the preceding year. Sentiments annotated by humans on social media with regards to stock prediction were analysed, and the percentage of desired sentiments was calculated for each class. For a remaining lot of messages without explicit sentiments, a classification model was trained using the annotated sentiments on the dataset. For both of these tasks, an SVM was used as the classification model. In another study, after lemmatisation by CoreNLP, latent Dirichlet allocation (LDA) (Blei et al. 2003 ) was used as the generative probabilistic model. The authors also implemented the JST model (Lin and He 2009 ) and Aspect-based Sentiment Analysis for analysing topic sentiments for stock prediction. The study’s limitation was that the topics and models were selected beforehand. The accuracy was around 54%; however, the overall prediction in the model passed only if the stock went up or down. As the model just focused on sentiments and historical prices, the authors intended to add more factors to build a more accurate model.

Li et al. ( 2009 ) approached financial risk analysis through the available financial data on sentiments and used machine learning and sentiment analysis. The uniqueness of their study was the volume of data and the information sentiments. A generalised autoregressive conditional heteroskedasticity modelling (GARCH)-based artificial neural network and a GARCH-based SVM were used. A special training process, named the ‘dynamic training technique’, was applied because the data was non-stationary and noisy and could have resulted in overfitting. For analysing news, the semantic orientation-based approach was adopted, mainly because of the number of articles that were analysed in the study. The future work on this model was expected to include more input data and better sentiment analysis algorithms to obtain better results.

The use of sentiment analysis as a tool to facilitate investment and risk decisions by stock investors was demonstrated by Wu et al. ( 2014 ). Sina Finance, an experimental platform, was the basis for the collection of financial data for this model. The method incorporated machine learning based on SVM and GARCH with sentiment analysis. At the specific opening and closing times for each day, the GARCH-based SVM was used to identify the relations between the obtained information’s sentiment and stock price volatility. This model showed better results when predicting individual stocks rather than at the industry level. The machine learning approach was about 6% more accurate than the lexicon-based semantic approach, and it performed better with bigger datasets. The model performed better on datasets relating to small companies, as small companies were observed to be more sensitive to online reviews. The authors mentioned their future scope as expanding their dataset and attempting to create a more efficient sentiment calculation algorithm to increase the overall accuracy, similar to the one made by Li et al. ( 2009 ).

A slightly different approach was used by Ahmad et al. ( 2006 ), who focused on sentiment analysis of financial news streams in multiple languages. The automatic sentiment analysis was replicated in three widely spoken languages, namely Arabic, Chinese, and English. The authors adopted a local grammar approach using a local archive of texts in the three languages. A statistical criterion applied to the training collection of texts helped identify keywords. The most widely available corpus was for English, followed by Chinese and Arabic. Words were ranked by frequency, and the most frequently used ones were selected. In a manual evaluation, the accuracy of extraction ranged from 60 to 75%. A more robust evaluation of this model would be necessary for use in real-time markets, with the inclusion of more than one news vendor at a time.
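The frequency-based ranking step can be sketched as follows. This is an illustrative simplification, not the statistical criterion of Ahmad et al., which the text does not specify. The function name `rank_keywords` and the sample headlines are hypothetical, and the regular-expression tokeniser only handles Latin-script text; Arabic and Chinese would need language-specific tokenisation or word segmentation.

```python
import re
from collections import Counter


def rank_keywords(documents, top_k=5):
    """Rank words by their total frequency across a collection of texts."""
    counts = Counter()
    for text in documents:
        # Lower-case and keep alphabetic runs only (Latin script assumed).
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(top_k)


# Hypothetical news headlines, for illustration only.
headlines = ["Shares rose as profits rose", "Profits fell"]
top_words = rank_keywords(headlines, top_k=2)
```

A real system would usually remove stop words first, since function words otherwise dominate any raw frequency ranking.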

Over the years, deep learning has become acknowledged as a useful machine learning technique that enables state-of-the-art results. It uses multiple layers to create representations and features from the input data. Text-mining analysis has also continuously evolved. The early basic model used lexicon-based analysis to account for a particular entity (sentiment analysis). Considering the complexity of language, a complete understanding of what any piece of text aims to convey requires a more complex analysis to identify and target relevant entities and related aspects (Dohaiha et al. 2018 ). The most important aspect is the relationship between the words in the text and how strongly that relationship determines the meaning of the content. Several language elements, such as implications (Ray and Chakrabarti 2019 ) and sarcasm, require high-level methods to handle. This problem calls for deep learning models that can help capture the full meaning of a given piece of text. Deep learning may incorporate time series analysis and aspect-based sentiment analysis, which enhances data mining, feature selection, and fast information retrieval. Deep learning models learn features during training. They create abstract representations of the given data and are therefore invariant to local changes in the input data (Sohangir et al. 2018 ). Word embeddings map words that occur in similar contexts to nearby vectors. By measuring similarities between words (e.g., cosine similarity in the case of vectors), one can employ word embeddings in the initial data preprocessing layers for faster and more efficient NLP execution (Young et al. 2018 ).
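The cosine similarity between word vectors can be computed as in the short sketch below. The three-dimensional vectors in `embeddings` are hypothetical values made up for illustration; real embeddings are learned from large corpora and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for zero vectors; return 0 by convention
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not taken from any trained model.
embeddings = {
    "profit": [0.9, 0.1, 0.3],
    "earnings": [0.8, 0.2, 0.4],
    "lawsuit": [-0.5, 0.9, 0.1],
}
close = cosine_similarity(embeddings["profit"], embeddings["earnings"])
far = cosine_similarity(embeddings["profit"], embeddings["lawsuit"])
```

With these toy values, `close` is near 1 while `far` is negative, reflecting that the first pair of words is assumed to occur in similar contexts.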

The huge volume of streaming financial news and articles is impossible for humans to process, interpret, and apply on a daily basis. In a number of uses, such as portfolio construction, forecasting a financial time series is essential. The application of deep learning (DL) techniques to such data for forecasting purposes is of interest to industry professionals. It has been reported that repeated patterns of price movements can be estimated using econometric and statistical models (Souma et al. 2019 ). Even though the market is dynamic, a combination of deep learning models and past market trends is very useful for accurate predictions. In a comparison of real trades with market trades generated using sentiment analysis (SA), Kordonis et al. ( 2016 ) found a considerable effect of sentiments on the predictions. Because of the promising results, the use of artificial intelligence and deep learning has attracted the interest of many researchers and practitioners seeking to improve forecasting.

With the use of deep learning, one has to perform little work by hand, while being able to harness a large amount of computation and data. DL techniques that use distributed representation are considered state-of-the-art methods for a large variety of NLP problems. We expect these models to improve and get better at handling unlabelled data through the development and use of approaches such as reinforcement learning.

Owing to the advancements in technology, there are several factors that can be used in models that aim to predict market movements. Not only the price models but also a number of different related models include macroeconomic variables (e.g., investment). Although macroeconomic indicators are important, they tend to be updated infrequently. Unlike such economic factors, public mood and sentiments (Xing et al. 2018a , b ) are dynamic and can be instantaneously monitored. For instance, behavioural science researchers have found that the stock market is affected by the investors’ psychology (Daniel et al. 2001 ). Depending on their mood states, investors make numerous decisions, a big proportion of which are risky. The impact of sentiment and attention measures on stock market volatility (Audrino et al. 2018 ) can be gauged through news articles, social media, and search engine results. The models that incorporate technical indicators of the market with sentiments obtained from the aforementioned sources outperform those that rely on only one of the two (Li et al. 2009 ). In a study pertaining to optimal portfolio allocation, Malandri et al. ( 2018 ) used historical data of the New York Stock Exchange and combined it with sentiment data to get comparatively better returns for the portfolios taken under consideration.

Empirical studies have shown that current market prices are a reflection of recently published news; this has been clearly shown by the Efficient Market Hypothesis (Fama 1991 ). Rather than being dependent on the existing information, price changes are markedly affected by new information or news. ML and DL methods have allowed data scientists to play a part in financial sector analysis and prediction (Picasso et al. 2019 ). There has been an increasing use of text-mining methods to make trading decisions (Wu et al. 2012 ). Different kinds of models, including neural networks, are used for sentiment embeddings from news, tweets, and financial blogs. Mudinas et al. ( 2019 ) studied the change of Granger-caused stocks based on sentiments alone—although this did not provide promising results, the integration with prediction models gave better results. This is because sentiments cannot be determinant factors alone, but they can be used with prediction models to lead to better and dynamic results.

As discussed above, a plethora of proposals and approaches in relation to financial forecasting have been studied, the two main applications of which have been stock prediction and forex. The main focus of these studies was on obtaining sentiments from news headlines and not from entire articles. Researchers have used a variety of text-mining approaches to integrate the abundant amount of useful information with financial patterns. Table  1 summarises some more research studies that have been conducted in recent years on the subject of text mining in financial predictions.

Banking and related applications

Banking is one of the largest and fastest-growing industries in this era of globalisation. The industry is moving towards adopting the most efficient practices in each of its departments. Total lending increased from US$429.92 billion to US$1347.18 billion at a CAGR of 10.94% over the period ending in the financial year 2017–2018 (Ministry of Commerce and Industry, Government of India, 2019). This huge rise is promoting strong economic growth, increasing incomes, enhancing trouble-free access to bank credit, and increasing consumerism. In the midst of an IT revolution, competitive pressures have led to the rising importance and adoption of banking automation. IT enables the implementation of various techniques for risk control and the smooth flow of transactions over electronic media, and it supports financial product innovation and development.

Gao and Ye ( 2007 ) proposed a framework for preventing money laundering with the help of the transaction histories of customers. They did this by identifying suspicious data from various textual reports from law enforcement agencies. They also mined unstructured databases and text documents for knowledge discovery in order to automatically extract the profiles of the entities that could be involved in money laundering. They employed SVM, decision trees, and Bayesian inference to develop a hierarchical structure of the suspicious reports and regression to identify hidden patterns.

Bholat et al. ( 2015 ) analysed the utility of text mining in central banks (CB), as a wide range of data sources is required for evaluating monetary and financial stability and for achieving policy objectives. Therefore, text-mining techniques are more powerful than manual means. The authors elucidated two major approaches: the use of text as data for research purposes in CB, and the various text-mining techniques for this purpose. For the former, they suggested that textual data in the form of social narratives can be used by central banks as financial indicators for risk and uncertainty management by employing topic clustering on the narratives. The latter aspect involved preprocessing of data to de-duplicate it, convert it into text files, and reduce it into tokens by various tokenisation techniques. Thereafter, text-mining techniques, such as dictionary techniques, vector space models, latent semantic analysis, LDA, and NB algorithm, were applied to the tokenised data. The authors concluded that aggregately, these can be a very useful addition to the efficient functioning of the CB.
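The preprocessing and vector space steps described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the procedure of Bholat et al.: the helper names `preprocess` and `term_document_matrix` are hypothetical, de-duplication is done by exact match after case and whitespace normalisation, and tokenisation keeps alphabetic runs only.

```python
import re
from collections import Counter


def preprocess(documents):
    """De-duplicate documents, then split each one into lower-case tokens."""
    seen, tokenised = set(), []
    for text in documents:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        tokenised.append(re.findall(r"[a-z]+", key))
    return tokenised


def term_document_matrix(tokenised):
    """Build a vector space model: one row of term counts per document."""
    vocabulary = sorted({token for doc in tokenised for token in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical snippets of central bank text, for illustration only.
docs = ["Inflation risk rises", "inflation  risk rises", "Rates rise"]
tokens = preprocess(docs)
vocabulary, matrix = term_document_matrix(tokens)
```

The resulting count matrix is the usual starting point for the techniques listed above, such as latent semantic analysis, LDA, and NB classification.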

Bach et al. ( 2019 ) stated that a huge amount of unstructured data from various sources has created a requirement for the extraction of keywords in the banking sector. They mentioned four different procedures for the extraction of keywords, which were obtained from the study by Bharti and Babu ( 2017 ). Bach et al. further discussed how keyword extraction can be implemented to extract related useful comments and documents and to compare the banking institutions as well. They also reviewed some other text-mining techniques that can be utilised by banks. NER was used on large datasets for the extraction of entities such as a person, location, and organisation. Sentiment analysis was done to analyse customer opinions, which is crucial for a bank’s functioning. Topic extraction was found to be useful mainly in credit banking. Social network analysis, a graph theory-based methodology to study the social media user structure, provided an outlook on how the customers are connected on the social media and how impactful they were in sharing information to the network of interests. This social network analysis could then be coupled with text mining to identify the keywords which correspond to the customers’ common interest.

Yap et al. ( 2011 ) discussed the issue faced by recreational clubs with respect to potential defaulters and non-defaulters. They proposed a credit scoring model that utilised text mining for estimating the financial obligations of credit applicants. A scorecard was built with the help of past performance reports of the borrowers wherein different clubs used different criteria for evaluating the historic data. The data was split into a 70:30 ratio for training and validating, respectively. They used three different models, namely a credit scorecard model, logistic regression model, and decision tree model, with an accuracy rate of 72.0%, 71.9%, and 71.2% respectively. Although the model benefitted the club administration, it also had a few limitations, such as poor quality of the scorecard and biased samples used to evaluate new applicants, as the model was built on historic data.

Xiong et al. ( 2013 ) devised a model for personal bankruptcy prediction using sequence mining techniques. The sequence results showed good prediction ability. This model has potential value in many industries. For clustering categorical sequences, a model-based k-means algorithm was designed. A comparative study of three models, namely SVM, credit scoring, and the one proposed by them, found that the accuracies were 89.3%, 80.54%, and 94.07% respectively. The sequence mining used in the proposed model outperformed the other two models. In terms of loss prediction, the KNN algorithm had the potential to identify bad accounts with promising predictive ability.

Bhattacharyya et al. ( 2011 ) explored the use of text mining in credit card fraud detection by evaluating two predictive models: one based on SVM, and the other based on a combination of random forest with logistic regression. They discussed various challenges and problems in the implementation of the models. They recommended that the models should always be kept updated to account for the growing malpractices. The original dataset used in the study comprised more than 50 million real-time credit card transactions. The dataset was split into multiple datasets as per the requirements of different techniques. Because of imbalanced data, the performance was not solely measured by the overall accuracy but also by sensitivity, specificity, and area under the curve. Although the random forest model showed the highest overall accuracy of 96.2%, the study provided some other noteworthy observations. The accuracy of each model varied according to the proportion of the fraudulent cases, with all of them having more than 99% accuracy for a dataset with 2% fraud rates. The authors concluded with suggestions for future exploration: modifying the models to make them more accurate and devising a more reliable approach to split datasets into training and testing sets.
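The reason overall accuracy is not enough on imbalanced data can be seen in a small worked example. The sketch below uses made-up labels with a 2% fraud rate and a hypothetical model that never flags fraud; the function name `classification_metrics` is an assumption, and the area under the curve is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
    }


# Made-up data: 2 fraudulent and 98 legitimate transactions,
# scored by a model that labels everything as legitimate.
y_true = [1] * 2 + [0] * 98
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
```

Here the accuracy is 98% even though the sensitivity is zero, that is, no fraudulent transaction is caught, which is why very high accuracy figures at low fraud rates need to be read alongside the other measures.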

Kou et al. ( 2014 ) used data regarding credit approval and bankruptcy risk from credit card applications to analyse financial risks using clustering algorithms. They made evaluations based on 11 performance measures using multicriteria decision-making (MCDM) methods. A previous study by Kou et al. ( 2012 ) had proposed these MCDM methods for the evaluation of classification algorithms. In a later study (Kou et al. 2019 ), they employed these methods for assessing the feature selection methods for text classification.

In addition to the above-discussed literature in this section, Table  2 provides a summary of some more studies related to the banking finance industry. As visible in Table  2 , banking has a lot of different text-mining applications. Risk assessment, quality assessment, money laundering detection, and customer relationship management are just a few examples from the wide pool of possible text-mining applications in banking.

Applications in corporate finance

Corporate finance is an important aspect of the financial domain because it integrates a company’s functioning with its financial structure. Various corporate documents such as the annual reports of a company have a lot of hidden financial context. Text-mining techniques can be employed to extract this hidden information and also to predict the company’s future financial sustainability.

Guo et al. ( 2016 ) implemented text-mining algorithms that are widely used in accounting and finance. They merged the Thomson Reuters News Archive database and the News Analytics database. The former provides original news, and the latter provides sentiment scores ranging from − 1 to 1 with positive, negative, and neutral scores. To balance the dataset, 3000 news articles were randomly selected for training and 500 for testing. Three algorithms, namely NB, SVM, and neural network, were run on the dataset. The overall output accuracies were 58.7%, 78.2%, and 79.6%, respectively. With the neural network having the highest accuracy, it was concluded that it can be used for text mining-based finance studies. Another model based on semantic analysis was also implemented, which used LDA. LDA was used to extract document relationships and the most relevant information from the documents. According to the authors, in accounting and finance, this technique has proven to be advantageous for examining analyst reports and financial reporting.

Lewis and Young ( 2019 ) discussed the importance of text mining in financial reports. They preferred NLP methods. They highlighted the exploding growth of unstructured textual data in corporate reporting, which opens numerous possibilities for financial applications. According to the authors, NLP methods for text mining provide solutions for two significant problems. One, they prevent overload through automated procedures to deal with immense amounts of data. Two, unlike human cognition, they are able to identify the underlying important latent features. The authors reviewed the widely used methodologies for financial reporting. These include keyword searches and word counts, attribute dictionaries, NB classification, cosine similarity, and LDA. Some factors, such as limited access to the text data resources and insufficient collaboration between various sectors and disciplines, were identified as challenges that are hindering progress in the application of text mining to finance.

Arguing that corporate sustainability reports (CSR) have increased dramatically, become crucial from the financial reporting perspective, and are not amenable to manual analysis processes, Shahi et al. ( 2014 ) proposed an automated model based on text-mining approaches for more intelligent scoring of CSR reports. After preprocessing of the dataset, four classification algorithms were implemented, namely NB, random subspace, decision table, and neural networks. Various parameters were evaluated and the training categories and feature selection algorithms were tuned to determine the most effective model. NB with the Correlation-based Feature Selection (CFS) filter was chosen as the preferred model. Based on this model, software was designed for CSR report scoring that lets the user input a CSR report to get its score as an automated output. The software was tested and had an overall effectiveness of 81.10%. The authors concluded that the software could be utilised for other purposes such as the popularity of performance indicators as well.

Holton ( 2009 ) implemented a model for preventing corporate financial fraud with a different and interesting perspective. The author considered employee disgruntlement or employee dissatisfaction as a hidden indicator that is responsible for fraud. A minimal dataset of intra-company communication messages and emails on online discussion groups was prepared. After using document clustering for estimating that the data possess sufficient predictive power, the NB classifier was implemented to classify the messages into disgruntled/non-disgruntled classes, and an accuracy of 89% was achieved. The author proposed the use of the model for fraud risk assessment in corporations and organisations with the motivation that it can be used to prevent huge financial losses. The performance of other models such as neural networks and decision trees was to be compared in future work.
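A naive Bayes text classifier of the kind referred to above can be sketched in a few lines. This is a generic multinomial NB with add-one (Laplace) smoothing, not Holton's implementation; the class name `NaiveBayesText` and the four training messages are hypothetical and far too few for any real assessment.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokens:
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                smoothed = self.word_counts[label][word] + 1
                score += math.log(smoothed / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages, for illustration only.
train_docs = [
    "management ignores us and pay is unfair".split(),
    "unfair workload and no recognition".split(),
    "great team meeting today".split(),
    "thanks for the helpful update".split(),
]
train_labels = ["disgruntled", "disgruntled", "non-disgruntled", "non-disgruntled"]
model = NaiveBayesText().fit(train_docs, train_labels)
prediction = model.predict("pay is unfair".split())
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.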

Chan and Franklin ( 2011 ) developed a new decision-support system to predict the occurrence of an event by analysing patterns and extracting sequences from financial reports. After text preprocessing, textual information generalisation was performed with the help of a shallow parser, which had an F-measure of 85%. The extracted information was stored in a separate database. From this database, the event sequences were identified and extracted. A decision tree model was then applied to these sequences to create an inference engine that could predict the occurrence of new events based on the training sequences. With an 85:15 training-to-testing split, the model achieved an overall accuracy of 89.09%. The authors concluded by highlighting that their model performed better and more robustly than the prevailing models.

Humpherys et al. ( 2011 ) reviewed various text-mining methods and theories that have been proposed for the detection of corporate fraud in financial statements and subsequently devised a methodology of their own. Their dataset comprised the Management’s Discussion and Analysis section of corporate annual financial reports. After basic analysis and reduction, various statistical and machine learning algorithms were implemented on the dataset, among which the NB and C4.5 decision tree models both gave the highest accuracy of 67.3% for classifying 10-K reports into fraudulent and non-fraudulent. The authors suggested that their model can be used by auditors for detecting fraudulent statements in reports with the aid of the Agent99 analyser tool.

Loughran and McDonald ( 2011 ) came up with the argument that the word lists contained in the Harvard Dictionary, which is commonly used for textual analysis, are not suitable for financial text classification because a lot of negative words in the Harvard list are not actually considered a negative in the financial context. Corporate 10-K reports were taken as data sources to create a new dictionary with new word lists for financial purposes. The authors advised the use of term weighting for the word lists. The new word lists were compared with the Harvard word lists on multiple financial data items, such as 10-K filing returns, material weaknesses, and standardised unexpected earnings. Although a significant difference between the word lists was not observed for classification, the authors still suggested the use of their lists in order to be more careful and prevent any erroneous results.
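Dictionary-based tone scoring with term weighting can be sketched as follows. The weighting here is a plain tf-idf scheme chosen for illustration and is not necessarily the scheme the authors advised; the three-word `NEGATIVE_WORDS` set is a hypothetical stand-in for a real financial word list, and the sample documents are made up.

```python
import math
from collections import Counter

# Hypothetical stand-in for a finance-specific negative word list.
NEGATIVE_WORDS = {"loss", "impairment", "litigation"}


def weighted_negative_tone(tokenised_docs, word_list=NEGATIVE_WORDS):
    """Score each document by the tf-idf weighted share of listed words."""
    n_docs = len(tokenised_docs)
    doc_freq = Counter()
    for doc in tokenised_docs:
        doc_freq.update(set(doc) & set(word_list))
    scores = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        score = 0.0
        for word in word_list:
            if counts[word]:
                term_freq = counts[word] / len(doc)
                inverse_doc_freq = math.log(n_docs / doc_freq[word])
                score += term_freq * inverse_doc_freq
        scores.append(score)
    return scores


# Made-up tokenised filings, for illustration only.
filings = [
    ["loss", "loss", "litigation", "revenue"],
    ["loss", "growth"],
    ["revenue", "growth"],
]
tone_scores = weighted_negative_tone(filings)
```

A listed word that appears in every document receives a weight of zero, so the score is driven by words that distinguish one filing from another, which is the purpose of term weighting.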

Whereas other researchers have mostly focused on fraud detection and financial predictions from corporate financial reports, Song et al. ( 2018 ) focused on sentiment analysis of these reports with respect to the CSR score. The sentences in the sample reports were manually labelled as positive or negative in order to create sample data for the machine learning algorithm. An SVM was trained on the dataset with a 3:1 training-to-test split and achieved a precision of 86.83%. Following this, an object library was created, with objects referring to the internal and external environment of the company. Sentiment analysis was conducted on these objects. Then, six regression models were developed to obtain the CSR score, with the model comprising the Political, Economic, Social, Technological, Environmental and Legal (PESTEL), Porter’s Five Forces, and Primary and Support Activities showing the best performance in predicting the CSR score. The authors concluded that CSR plays a vital role in a company’s sustainability, and their research could aid stakeholders in their company-related decision-making.

There have been more studies on CSR reports and sustainability. Liew et al. ( 2014 ) analysed process industries for their sustainability trends with the help of CSR and sustainability reports of a large number of big companies. The RapidMiner tool was used for text preprocessing followed by generating frequency statistics, pruning, and further text refinement, which generated sustainability-related terms for analysis. The most occurring terms were taken into consideration to create a hierarchical tree model. Environment, health and safety, and social were identified as the key concepts for sustainability. Based on term occurrence and involvement, the authors classified the sustainability issues as specific, critical, rare, and general.

Table  3 presents some more studies on the applications of text mining in corporate finance. As evident from the table and the above-mentioned studies, the annual corporate reports are the most commonly used data source for text-mining applications.

Challenges and future scope

The financial sector is a significant driver of broader industry, and the increasing amount of data in this field has given rise to a number of applications that can be used to improve the field and achieve commercial objectives.

Figure 4 shows some common challenges faced by various text-mining techniques in the financial sector. The huge amount of available data is highly unstructured and carries implicit meanings in addition to explicit ones. The data needs to undergo proper preprocessing before it can be used for analysis. Although lexicon lists are available for various domains, the financial sector needs its own dictionary for such approaches, so that proper weights can be assigned to the corresponding aspects of a document. In addition, access to classified information is still restricted, which is a significant obstacle. Lastly, the current techniques focus on obtaining static results that are true only for a given period of time. There is a need for a system that applies text-mining techniques to dynamically obtained data and outputs real-time results to enable even better insights.

Fig. 4 Major challenges to text mining in finance

The combination of text-mining techniques and financial data analytics can produce a model that can potentially be the most efficient model for this problem domain. The results obtained from mining textual data can be integrated with those from financial analysis, thereby providing models that focus on historical data as well as opinions from diverse sources.

This paper conducted an organised qualitative review of recent literature pertaining to three specific sectors of finance. First, this paper analysed the growing importance of text mining in predicting financial trends. While the prior consensus may have been that financial markets are unpredictable, text mining has challenged this notion. The second area of study was banking, which has seen constant growth in technological innovation over the years, especially in digitisation. Text mining has played a key role in supporting these advancements both directly and indirectly through combination with other technologies. Corporate finance was the third study area. We discussed the importance of text mining in enabling the utilisation of corporate reports and financial statements for serving various purposes in addition to supporting corporate sustainability goals. The use of text mining in financial applications is not limited to these sectors. Researchers are increasingly showing interest in text-mining applications and constantly seeking to build more accurate models. There are still many unexplored possibilities in the financial domain, and the related research can help develop more robust and accurate predictive and analytic systems.

Availability of data and materials

All relevant data and material are presented in the main paper.

Agichtein E, Castillo C, Donato D, Gionis A, Mishne G (2008) Finding high-quality content in social media. In: Proceedings of the international conference on web search and web data mining—WSDM ’08. https://doi.org/10.1145/1341531.1341557

Ahir K, Govani K, Gajera R, Shah M (2020) Application on virtual reality for enhanced education learning, military training and sports. Augment Hum Res 5:7

Ahmad K, Cheng D, Almas Y (2006) Multi-lingual sentiment analysis of financial news streams. In: Proceedings of science, pp 1–8

Akaichi J, Dhouioui Z, López-Huertas Pérez MJ (2013) Text mining facebook status updates for sentiment classification. In: 2013 17th international conference on system theory, control and computing (ICSTCC), Sinaia, 2013, pp 640–645. https://doi.org/10.1109/ICSTCC.2013.6689032

Al-Natour S, Turetken O (2020) A comparative assessment of sentiment analysis and star ratings for consumer reviews. Int J Inf Manage. https://doi.org/10.1016/j.ijinfomgt.2020.102132

AL-Rubaiee H, Qiu R, Li D (2015) Analysis of the relationship between Saudi twitter posts and the Saudi stock market. In: 2015 IEEE seventh international conference on intelligent computing and information systems (ICICIS). https://doi.org/10.1109/intelcis.2015.7397193

Audrino F, Sigrist F, Ballinari D (2018) The impact of sentiment and attention measures on stock market volatility. Available at SSRN: https://ssrn.com/abstract=3188941 or https://doi.org/10.2139/ssrn.3188941

Aureli S (2017) A comparison of content analysis usage and text mining in CSR corporate disclosure. Int J Digit Account Res 17:1–32

Bach MP, Krstić Ž, Seljan S, Turulja L (2019) Text mining for big data analysis in financial sector: a literature review. Sustainability 11(5):1277

Bharti SK, Babu KS (2017) Automatic keyword extraction for text summarization: a survey. CoRR. abs/1704.03242.

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decis Support Syst 50(3):602–613

Bholat D, Hansen S, Santos P, Schonhardt-Bailey C (2015) Text mining for central banks: handbook. Centre Cent Bank Stud 33:1–19

Bidulya Y, Brunova E (2016) Sentiment analysis for bank service quality: a rule-based classifier. In: 2016 IEEE 10th international conference on application of information and communication technologies (AICT). https://doi.org/10.1109/icaict.2016.7991688

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Brindha S, Prabha K, Sukumaran S (2016) A survey on classification techniques for text mining. In: 2016 3rd international conference on advanced computing and communication systems (ICACCS), Coimbatore, 2016, pp 1–5. https://doi.org/10.1109/ICACCS.2016.7586371

Bruno G (2016) Text mining and sentiment extraction in central bank documents. In: 2016 IEEE international conference on big data (big data). https://doi.org/10.1109/bigdata.2016.7840784

Cambria E (2016) Affective computing and sentiment analysis. IEEE Intell Syst 31(2):102–107. https://doi.org/10.1109/MIS.2016.31

Chakraborty V, Chiu V, Vasarhelyi M (2014) Automatic classification of accounting literature. Int J Account Inf Syst 15(2):122–148

Chan SWK, Franklin J (2011) A text-based decision support system for financial sequence prediction. Decis Support Syst 52(1):189–198

Chaturvedi D, Chopra S (2014) Customers sentiment on banks. Int J Comput Appl 98(13):8–13

Chen CC, Huang HH, Chen HH (2020) NLP in FinTech applications: past, present and future

Cook A, Herron B (2018) Harvesting unstructured data to reduce anti-money laundering (AML) compliance risk, pp 1–10

Daniel K, Hirshleifer D, Teoh S (2001) Investor psychology in capital markets: evidence and policy implications. J Monet Econ 49:139–209. https://doi.org/10.1016/S0304-3932(01)00091-5

Da-sheng W, Qin-fen Y, Li-juan L (2009) An efficient text classification algorithm in E-commerce application. In: 2009 WRI world congress on computer science and information engineering. https://doi.org/10.1109/csie.2009.346

David JM, Balakrishnan K (2011) Prediction of key symptoms of learning disabilities in school-age children using rough sets. Int J Comput Electr Eng Hong Kong 3(1):163–169

Dohaiha H, Prasad PWC, Maag A, Alsadoon A (2018) Deep learning for aspect-based sentiment analysis: a comparative review. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2018.10.003

Elagamy MN, Stanier C, Sharp B (2018) Stock market random forest-text mining system mining critical indicators of stock market movements. In: 2018 2nd international conference on natural language and speech processing (ICNLSP). https://doi.org/10.1109/icnlsp.2018.8374370

Emekligil E, Arslan S, Agin O (2016) A bank information extraction system based on named entity recognition with CRFs from noisy customer order texts in Turkish. In: Knowledge engineering and semantic web, pp 93–102

Espejo-Garcia B, Martinez-Guanter J, Pérez-Ruiz M, Lopez-Pellicer FJ, Javier Zarazaga-Soria F (2018) Machine learning for automatic rule classification of agricultural regulations: a case study in Spain. Comput Electron Agric 150:343–352

Fama EF (1991) Efficient capital markets: II. J Finance 46(5):1575–1617. https://doi.org/10.2307/2328565

Fan W, Wallace L, Rich S, Zhang Z (2006) Tapping the power of text mining. Commun ACM 49(9):76–82

Feuerriegel S, Gordon J (2018) Long-term stock index forecasting based on text mining of regulatory disclosures. Decis Support Syst 112:88–97

Fisher I, Garnsey M, Hughes M (2016) Natural language processing in accounting, auditing and finance: a synthesis of the literature with a roadmap for future research. Intell Syst Account Finance Manag. https://doi.org/10.1002/isaf.1386

Fritz D, Tows E (2018) Text mining and reporting quality in German banks—a cooccurrence and sentiment analysis. Univers J Account Finance 6(2):54–81

Fung G, Yu J, Lam W (2002) News sensitive stock trend prediction. Adv Knowl Discov Data Min. https://doi.org/10.1007/3-540-47887-6_48

Gandhi M, Kamdar J, Shah M (2020) Preprocessing of non-symmetrical images for edge detection. Augment Hum Res 5:10. https://doi.org/10.1007/s41133-019-0030-5

Gao Z, Ye M (2007) A framework for data mining-based anti-money laundering research. J Money Laund Control 10(2):170–179

Gemar G, Jiménez-Quintero JA (2015) Text mining social media for competitive analysis. Tour Manag Stud 11(1):84–90

Gulaty M (2016) Aspect-based sentiment analysis in bank reviews. https://doi.org/10.13140/RG.2.1.2072.3445

Guo L, Shi F, Tu J (2016) Textual analysis and machine leaning: crack unstructured data in finance and accounting. J Finance Data Sci 2(3):153–170

Gupta R, Gill NS (2012) Financial statement fraud detection using text mining. Int J Adv Comput Sci Appl 3(12):189–191

Gupta A, Simaan M, Zaki MJ (2016) Investigating bank failures using text mining. In: 2016 IEEE symposium series on computational intelligence (SSCI). https://doi.org/10.1109/ssci.2016.7850006

Gupta A, Bhatia P, Dave K, Jain P (2019) Stock market prediction using data mining techniques. In: 2nd international conference on advances in science and technology, pp 1–5

Hagenau M, Liebmann M, Neumann D (2013) Automated news reading: stock price prediction based on financial news using context-capturing features. Decis Support Syst 55(3):685–697

Hájek P, Olej V (2013) Evaluating sentiment in annual reports for financial distress prediction using neural networks and support vector machines. In: Communications in computer and information science, pp 1–10.

Hariri RH, Fredericks EM, Bowers KM (2019) Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. https://doi.org/10.1186/s40537-019-0206-3

Hassonah M, Al-Sayyed R, Rodan A, Al-Zoubi A, Aljarah I, Faris H (2019) An efficient hybrid filter and evolutionary wrapper approach for sentiment analysis of various topics on Twitter. Knowl Based Syst. https://doi.org/10.1016/j.knosys.2019.105353

Heaton JB, Polson NG, Witte JH (2016) Deep learning in finance. arXiv:1602.06561

Heidari M, Felden C (2015) Financial footnote analysis: developing a text mining approach. In: Int'l conf. data mining, pp 10–16

Herranz S, Palomo J, Cruz M (2018) Building an educational platform using NLP: a case study in teaching finance. J Univ Comput Sci 24:1403

Holton C (2009) Identifying disgruntled employee systems fraud risk through text mining: a simple solution for a multi-billion dollar problem. Decis Support Syst 46(4):853–864

Humpherys SL, Moffitt KC, Burns MB, Burgoon JK, Felix WF (2011) Identification of fraudulent financial statements using linguistic credibility analysis. Decis Support Syst 50(3):585–594

IBEF (2019) https://www.ibef.org/download/financial-services-april-2019.pdf

James TL, Calderon EDV, Cook DF (2017) Exploring patient perceptions of healthcare service quality through analysis of unstructured feedback. Expert Syst Appl 71:479–492

Jani K, Chaudhuri M, Patel H, Shah M (2019) Machine learning in films: an approach towards automation in film censoring. J Data Inf Manag. https://doi.org/10.1007/s42488-019-00016-9

Jaseena KU, David JM (2014) Issues, challenges, and solutions: big data mining. In: Natarajan Meghanathan et al. (eds) NeTCoM, CSIT, GRAPH-HOC, SPTM—2014, pp 131–140

Jha K, Doshi A, Patel P, Shah M (2019) A comprehensive review on automation in agriculture using artificial intelligence. Artif Intell Agric 2:1–12

Joshi K, Bharathi N, Jyothi R (2016) Stock trend prediction using news sentiment analysis. Int J Comput Sci Inf Technol 8:67–76. https://doi.org/10.5121/ijcsit.2016.8306

Junqué de Fortuny E, De Smedt T, Martens D, Daelemans W (2014) Evaluating and understanding text-based stock price prediction models. Inf Process Manag 50(2):426–441

Kakkad V, Patel M, Shah M (2019) Biometric authentication and image encryption for image security in cloud framework. Multiscale Multidiscip Model Exp Des. https://doi.org/10.1007/s41939-019-00049-y

Kamaruddin SS, Hamdan AR, Bakar AA (2007) Text mining for deviation detection in financial statements. In: Proceedings of the international conference on electrical engineering and informatics. Institut Teknologi Bandung, Indonesia, 2007, June 17–19

Kang T, Park DH (2016) The effect of expert reviews on consumer product evaluations: a text mining approach. J Intell Inf Syst 22(1):63–82

Kinsella S, Passant A, Breslin JG (2011) Topic classification in social media using metadata from hyperlinked objects. Adv Inf Retr. https://doi.org/10.1007/978-3-642-20161-5_20

Kloptchenko A, Eklund T, Karlsson J, Back B, Vanharanta H, Visa A (2004) Combining data and text mining techniques for analysing financial reports. Intell Syst Account Finance Manag 12(1):29–41

Kordonis J, Symeonidis S, Arampatzis A (2016) Stock price forecasting via sentiment analysis on twitter. https://doi.org/10.1145/3003733.3003787 .

Kou G, Lu Y, Peng Y, Shi Y (2012) Evaluation of classification algorithms using MCDM and rank correlation. Int J Inf Technol Decis Mak. https://doi.org/10.1142/S0219622012500095

Kou G, Peng Yi, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12. https://doi.org/10.1016/j.ins.2014.02.137

Kou G, Yang P, Peng Yi, Xiao F, Chen Y, Alsaadi F (2019) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836. https://doi.org/10.1016/j.asoc.2019.105836

Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10:150

Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M, Castagnoli L, Cesareni G, Tyers M, Schneider G, Rinaldi F, Leaman R, Gonzalez G, Matos S, Kim S, Wilbur WJ, Rocha L, Shatkay H, Tendulkar AV, Agarwal S, Liu F, Wang X, Rak R, Noto K, Elkan C, Lu Z, Dogan RI, Fontaine JF, Andrade-Navarro MA, Valencia A (2011) The protein–protein interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinform 12(Suppl 8):S3. https://doi.org/10.1186/1471-2105-12-s8-s3

Kraus M, Feuerriegel S (2017) Decision support from financial disclosures with deep neural networks and transfer learning. Decis Support Syst. https://doi.org/10.1016/j.dss.2017.10.001

Krstić Ž, Seljan S, Zoroja J (2019) Visualization of big data text analytics in financial industry: a case study of topic extraction for Italian banks (September 12, 2019). In: 2019 ENTRENOVA conference proceedings. https://ssrn.com/abstract=3490108 or https://doi.org/10.2139/ssrn.3490108

Kumar BS, Ravi V (2016) A survey of the applications of text mining in financial domain. Knowl Based Syst 114:128–147

Kundalia K, Patel Y, Shah M (2020) Multi-label movie genre detection from a movie poster using knowledge transfer learning. Augment Hum Res 5:11. https://doi.org/10.1007/s41133-019-0029-y

Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Mining of concurrent text and time series. In: KDD-2000 Workshop on text mining, vol 2000. Citeseer, pp 37–44

Lee CT (2019) Early warning mechanism of agricultural network public opinion based on text mining. Revista De La Facultad De Agronomia De La Universidad Del Zulia, 36

Lee B, Park JH, Kwon L, Moon YH, Shin Y, Kim G, Kim H (2018) About relationship between business text patterns and financial performance in corporate data. J Open Innov Technol Market Complex. https://doi.org/10.1186/s40852-018-0080-9

Lewis C, Young S (2019) Fad or future? Automated analysis of financial text and its implications for corporate reporting. Account Bus Res 49(5):587–615

Li N, Liang X, Li X, Wang C, Wu DD (2009) Network Environment and Financial Risk Using Machine Learning and Sentiment Analysis. Human Ecol Risk Assess Int J 15(2):227–252. https://doi.org/10.1080/10807030902761056

Li T, Kou G, Peng Y, Shi Y (2020a) Classifying with adaptive hyper-spheres: an incremental classifier based on competitive learning. IEEE Trans Syst Man Cybern Syst 50(4):1218–1229. https://doi.org/10.1109/TSMC.2017.2761360

Li X, Wu P, Wang W (2020b) Incorporating stock prices and news sentiments for stock market prediction: a case of Hong Kong. Inf Process Manag. https://doi.org/10.1016/j.ipm.2020.102212

Li T, Kou G, Peng Yi (2020c) Improving malicious URLs detection via feature engineering: linear and nonlinear space transformation methods. Inf Syst 91:101494. https://doi.org/10.1016/j.is.2020.101494

Liew WT, Adhitya A, Srinivasan R (2014) Sustainability trends in the process industries: a text mining-based analysis. Comput Ind 65(3):393–400

Lin C, He Y (2009) Joint sentiment/topic model for sentiment analysis. In: Proceeding of the 18th ACM conference on information and knowledge management—CIKM ’09. https://doi.org/10.1145/1645953.1646003

Loughran T, Mcdonald B (2011) When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J Finance 66(1):35–65

Lu Y (2013) Automatic topic identification of health-related messages in online health community using text classification. SpringerPlus 2(1):309

Malandri L, Xing F, Orsenigo C, Vercellis C, Cambria E (2018) Public mood-driven asset allocation: the importance of financial sentiment in portfolio management. Cogn Comput. https://doi.org/10.1007/s12559-018-9609-2

Marrara S, Pejic Bach M, Seljan S, Topalovic A (2019) FinTech and SMEs—the Italian case. https://doi.org/10.4018/978-1-5225-7805-5.ch002

Matthies B, Coners A (2015) Computer-aided text analysis of corporate disclosures—demonstration and evaluation of two approaches. Int J Digit Account Res 15:69–98

Mudinas A, Zhang D, Levene M (2019) Market trend prediction using sentiment analysis: lessons learned and paths forward. arXiv:1903.05440

Nan L, Xun L, Xinli L, Chao W, Desheng DW (2009) Network environment and financial risk using machine learning and sentiment analysis. Hum Ecol Risk Assess Int J 15(2):227–252

Nassirtoussi AK, Aghabozorgi S, Wah TY, Ngo DC (2015) Text mining of news-headlines for FOREX market prediction: A Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Syst Appl 42(1):306–324. https://doi.org/10.1016/j.eswa.2014.08.004

Nguyen TH, Shirai K, Velcin J (2015) Sentiment analysis on social media for stock movement prediction. Expert Syst Appl 42(24):9603–9611

Nikfarjam A, Emadzadeh E, Muthaiyah S (2010) Text mining approaches for stock market prediction. In: 2010 the 2nd international conference on computer and automation engineering (ICCAE). https://doi.org/10.1109/iccae.2010.5451705

Nopp C, Hanbury A (2015) Detecting risks in the banking system by sentiment analysis. In: Proceedings of the 2015 conference on empirical methods in natural language processing, pp 591–600

Panchiwala S, Shah MA (2020) Comprehensive study on critical security issues and challenges of the IoT world. J Data Inf Manag. https://doi.org/10.1007/s42488-020-00030-2

Pandya R, Nadiadwala S, Shah R, Shah M (2019) Buildout of methodology for meticulous diagnosis of K-complex in EEG for aiding the detection of Alzheimer’s by artificial intelligence. Augment Hum Res. https://doi.org/10.1007/s41133-019-0021-6

Parekh V, Shah D, Shah M (2020) Fatigue detection using artificial intelligence framework. Augment Hum Res 5:5

Patel D, Shah Y, Thakkar N, Shah K, Shah M (2020a) Implementation of artificial intelligence techniques for cancer detection. Augment Hum Res. https://doi.org/10.1007/s41133-019-0024-3

Patel D, Shah D, Shah M (2020b) The intertwine of brain and body: a quantitative analysis on how big data influences the system of sports. Ann Data Sci. https://doi.org/10.1007/s40745-019-00239-y

Patel H, Prajapati D, Mahida D, Shah M (2020c) Transforming petroleum downstream sector through big data: a holistic review. J Petrol Explor Prod Technol. https://doi.org/10.1007/s13202-020-00889-2

Pathan M, Patel N, Yagnik H, Shah M (2020) Artificial cognition for applications in smart agriculture: a comprehensive review. Artif Intell Agric. https://doi.org/10.1016/j.aiia.2020.06.001

Pejic Bach M, Krstić Ž, Seljan S, Turulja L (2019) Text mining for big data analysis in financial sector: a literature review. Sustainability 11:1277. https://doi.org/10.3390/su11051277

Picasso A, Merello S, Ma Y, Oneto L, Cambria E (2019) Technical analysis and sentiment embeddings for market trend prediction. Expert Syst Appl 135:60–70. https://doi.org/10.1016/j.eswa.2019.06.014

Pradhan MV, Vala J, Balani P (2016) A survey on sentiment analysis algorithms for opinion mining. Int J Comput Appl 133:7–11. https://doi.org/10.5120/ijca2016907977

Ray P, Chakrabarti A (2019) A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis. Appl Comput Inform. https://doi.org/10.1016/j.aci.2019.02.002

Renault T (2019) Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance. https://doi.org/10.1007/s42521-019-00014-x

Sabo T (2017) Applying text analytics and machine learning to assess consumer financial complaints. In: Proceedings of the SAS global forum 2017 conference. SAS Institute Inc., Cary NC. https://support.sas.com/resources/papers/proceedings17/SAS0282-2017.pdf

Salloum S, Al-Emran M, Monem A, Shaalan K (2017) A survey of text mining in social media: facebook and twitter perspectives. Adv Sci Technol Eng Syst J 2:127–133. https://doi.org/10.25046/aj020115

Salloum S, Mostafa A, Monem A, Shaalan K (2018) Using text mining techniques for extracting information from research articles. https://doi.org/10.1007/978-3-319-67056-0_18

Schneider MJ, Gupta S (2016) Forecasting sales of new and existing products using consumer reviews: a random projections approach. Int J Forecast 32(2):243–256

Schumaker RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial news. ACM Trans Inf Syst 27(2):1–19

Shah D, Isah H, Zulkernine F (2018a) Predicting the effects of news sentiments on the stock market. In: 2018 IEEE international conference on big data (big data). https://doi.org/10.1109/bigdata.2018.8621884

Shah T, Shaikh I, Patel A (2018b) Comparison of different kernels of support vector machine for predicting stock prices. Int J Eng Technol 9(6):4288–4291

Shah G, Shah A, Shah M (2019) Panacea of challenges in real-world application of big data analytics in healthcare sector. Data Inf Manag. https://doi.org/10.1007/s42488-019-00010-1

Shah D, Dixit R, Shah A, Shah P, Shah M (2020) A Comprehensive analysis regarding several breakthroughs based on computer intelligence targeting various syndromes. Augment Hum Res 5:14. https://doi.org/10.1007/s41133-020-00033-z

Shah K, Patel H, Sanghvi D, Shah M (2020) A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment Hum Res 5:12. https://doi.org/10.1007/s41133-020-00032-0

Shahi AM, Issac B, Modapothala JR (2014) Automatic analysis of corporate sustainability reports and intelligent SCORING. Int J Comput Intell Appl 13(01):1450006. https://doi.org/10.1142/s1469026814500060

Shirata CY, Takeuchi H, Ogino S, Watanabe H (2011) Extracting key phrases as predictors of corporate bankruptcy: empirical analysis of annual reports by text mining. J Emerg Technol Account 8(1):31–44

Sohangir S, Wang D, Pomeranets A et al (2018) Big data: deep learning for financial sentiment analysis. J Big Data 5:3. https://doi.org/10.1186/s40537-017-0111-6

Song Y, Wang H, Zhu M (2018) Sustainable strategy for corporate governance based on the sentiment analysis of financial reports with CSR. Financ Innov. https://doi.org/10.1186/s40854-018-0086-0

Souma W, Vodenska I, Aoyama H (2019) Enhanced news sentiment analysis using deep learning methods. J Comput Soc Sci 2:33–46. https://doi.org/10.1007/s42001-019-00035-x

Srivastava SK, Singh SK, Suri JS (2018) Healthcare text classification system and its performance evaluation: a source of better intelligence by characterizing healthcare text. J Med Syst. https://doi.org/10.1007/s10916-018-0941-6

Su Y, Wang R, Chen P, Wei Y, Li C, Hu Y (2012) Agricultural ontology based feature optimization for agricultural text clustering. J Integr Agric 11(5):752–759

Sukhadia A, Upadhyay K, Gundeti M, Shah S, Shah M (2020) Optimization of smart traffic governance system using artificial intelligence. Augment Hum Res 5:13. https://doi.org/10.1007/s41133-020-00035-x

Sumathi N, Sheela T (2017) Opinion mining analysis in banking system using rough feature selection technique from social media text. Int J Mech Eng Technol 8(12):274–289

Talaviya T, Shah D, Patel N, Yagnik H, Shah M (2020) Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. Artif Intell Agric. https://doi.org/10.1016/j.aiia.2020.04.002

Talib R, Hanif MK, Ayesha S, Fatima F (2016a) Text mining: techniques. Appl Issues 7(11):414–418

Talib R, Kashif M, Ayesha S, Fatima F (2016b) Text mining: techniques, applications and issues. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2016.071153

Tkáč M, Verner R (2016) Artificial neural networks in business: two decades of research. Appl Soft Comput 38:788–804

Ur-Rahman N, Harding JA (2012) Textual data mining for industrial knowledge management and text classification: a business oriented approach. Expert Syst Appl 39(5):4729–4739

Vijayan R, Potey MA (2016) Improved accuracy of FOREX intraday trend prediction through text mining of news headlines using J48. Int J Adv Res Comput Eng Technol 5(6):1862–1866

Vu TT, Chang S, Ha QT, Collier N (2012) An experiment in integrating sentiment features for tech stock prediction in twitter. In: Workshop on information extraction and entity analytics on social media data, COLING, Mumbai, India, pp 23–38

Wang B, Huang H, Wang X (2012) A novel text mining approach to financial time series forecasting. Neurocomputing 83:136–145

Wen F, Xu L, Ouyang G, Kou G (2019) Retail investor attention and stock price crash risk: evidence from China. Int Rev Financ Anal 65:101376. https://doi.org/10.1016/j.irfa.2019.101376

Widiastuti N (2018) Deep learning—now and next in text mining and natural language processing. IOP Conf Ser Mater Sci Eng 407:012114. https://doi.org/10.1088/1757-899X/407/1/012114

Wu JL, Su CC, Yu LC, Chang PC (2012) Stock price predication using combinational features from sentimental analysis of stock news and technical analysis of trading information. Int Proc Econ Dev Res. https://doi.org/10.7763/ipedr

Wu DD, Zheng L, Olson DL (2014) A decision support approach for online stock forum sentiment analysis. IEEE Trans Syst Man Cybern Syst 44(8):1077–1087

Xing FZ, Cambria E, Welsch RE (2017) Natural language based financial forecasting: a survey. Artif Intell Rev 50(1):49–73

Xing FZ, Cambria E, Welsch RE (2018a) Natural language based financial forecasting: a survey. Artif Intell Rev 50:49–73. https://doi.org/10.1007/s10462-017-9588-9

Xing F, Cambria E, Welsch R (2018b) Intelligent asset allocation via market sentiment views. IEEE Comput Intell Mag 13:25–34. https://doi.org/10.1109/MCI.2018.2866727

Xiong T, Wang S, Mayers A, Monga E (2013) Personal bankruptcy prediction by mining credit card data. Expert Syst Appl 40(2):665–676

Xu G, Yu Z, Yao H, Li F, Meng Y, Wu X (2019) Chinese text sentiment analysis based on extended sentiment dictionary. IEEE Access 7:43749–43762. https://doi.org/10.1109/ACCESS.2019.2907772

Yang Li, Li Y, Wang J, Sherratt R (2020) Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 8:1–1. https://doi.org/10.1109/ACCESS.2020.2969854

Yap BW, Ong SH, Husain NHM (2011) Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Syst Appl 38(10):13274–13283

Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language processing [review article]. IEEE Comput Intell Mag 13(3):55–75. https://doi.org/10.1109/MCI.2018.2840738

Yusuuf H, Shihabeldeen A (2019) Using text mining to predicate exchange rates with sentiment indicators. J Bus Theory Pract 7(2):60–75

Zavolokina L, Dolata M, Schwabe G (2016) The FinTech phenomenon: antecedents of financial innovation perceived by the popular press. Financ Innov. https://doi.org/10.1186/s40854-016-0036-7


Acknowledgements

The authors are grateful to Nirma University and Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University for the permission to publish this research.

Not applicable.

Author information

Authors and affiliations.

Department of Computer Science, Nirma University, Ahmedabad, Gujarat, India

Aaryan Gupta, Vinya Dengre & Hamza Abubakar Kheruwala

Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University, Gandhinagar, Gujarat, 382007, India


Contributions

All authors made substantial contributions to this manuscript. AG, VD, HA and MS participated in drafting the manuscript. AG, VD and HA wrote the main manuscript, and all authors discussed the results and their implications at all stages. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Manan Shah.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article.

Gupta, A., Dengre, V., Kheruwala, H.A. et al. Comprehensive review of text-mining applications in finance. Financ Innov 6, 39 (2020). https://doi.org/10.1186/s40854-020-00205-1


Received: 29 January 2020

Accepted: 17 September 2020

Published: 02 November 2020

DOI: https://doi.org/10.1186/s40854-020-00205-1


  • Text mining
  • Machine learning
  • Financial forecasting
  • Sentiment analysis
  • Corporate finance

  • Open access
  • Published: 13 September 2024

Text mining method to unravel long COVID’s clinical condition in hospitalized patients

  • Pilar Tavares Veras Florentino 1 , 2   na1 ,
  • Vinícius de Oliveira Araújo 2 , 3   na1 ,
  • Henrique Zatti 2   na1 ,
  • Caio Vinícius Luis 4 ,
  • Célia Regina Santos Cavalcanti 4 ,
  • Matheus Henrique Citibaldi de Oliveira 4 ,
  • Anderson Henrique França Figueredo Leão   ORCID: orcid.org/0000-0002-0721-0866 4 ,
  • Juracy Bertoldo Junior 2 ,
  • George G. Caique Barbosa 2 ,
  • Ernesto Ravera 4 ,
  • Alberto Cebukin 4 ,
  • Renata Bernardes David 4 ,
  • Danilo Batista Vieira de Melo 5 ,
  • Tales Mota Machado 2 , 6 ,
  • Nancy C. J. Bellei 7 ,
  • Viviane Boaventura 1 , 3 ,
  • Manoel Barral-Netto   ORCID: orcid.org/0000-0002-5823-7903 1 , 3   na2 &
  • Soraya S. Smaili   ORCID: orcid.org/0000-0001-5844-1368 4   na2  

Cell Death & Disease volume 15, Article number: 671 (2024)


  • Epidemiology
  • Viral infection

Long COVID is characterized by persistent symptoms that extend beyond established timeframes. Its varied presentation across different populations and healthcare systems poses significant challenges in understanding its clinical manifestations and implications. In this study, we present a novel application of a text mining technique to automatically extract unstructured data from a long COVID survey conducted at a prominent university hospital in São Paulo, Brazil. Our phonetic text clustering (PTC) method enables the exploration of unstructured Electronic Healthcare Records (EHR) data by unifying different written forms of similar terms into a single phonemic representation. We used n-gram text analysis to detect compound words and negated terms in Portuguese-BR, focusing on medical conditions and symptoms related to long COVID. By leveraging text mining, we aim to contribute to a deeper understanding of this chronic condition and its implications for healthcare systems globally. The model developed in this study has the potential for scalability and applicability in other healthcare settings, thereby supporting broader research efforts and informing clinical decision-making for long COVID patients.


Introduction.

Advances in emerging technologies such as artificial intelligence (AI) and machine learning (ML) hold promise for transforming healthcare in prediction, contact tracing, screening, diagnosis and treatment, significantly improving medical practice [ 1 , 2 , 3 , 4 , 5 , 6 ]. One potential use of AI is to assist in the extraction of information from electronic medical records. Natural language processing (NLP) is a subfield of AI that enables computers to learn from unstructured medical records and adapt to new language patterns over time, which can be useful for administrative and research purposes. Unlike NLP, which seeks to comprehend the overall meaning of text, text mining focuses on addressing a particular problem in a specific domain determined in advance, potentially employing some NLP techniques in the process [ 7 ]. Although efforts have been made to use text mining to extract information from medical records in the English language, studies in languages other than English are still emerging [ 8 , 9 ] and are urgently needed for specific systems.

Historically, clinically relevant information from electronic health records (EHRs) has been extracted via manual review by clinical experts, resulting in scalability and cost limitations [ 10 ]. This is particularly evident for chronic diseases, where clinical notes are more common than structured medical records data [ 11 ]. These unstructured data provide a great opportunity to test the performance of text mining in automatically extracting clinically meaningful information, which may be useful for research and administrative purposes [ 10 ].

Long COVID is a chronic disease, still lacking a definitive clinical characterization, in which symptoms persist for more than 1 month (according to the Centers for Disease Control and Prevention, CDC) or more than 3 months (according to the World Health Organization, WHO). New tools that support meticulous analysis of vast amounts of unstructured EHR data can uncover patterns, symptoms, and outcomes that might otherwise elude traditional research methods, enabling a deeper comprehension of long COVID. Recent descriptions of long COVID are based on studies conducted and compiled in 2021 and more detailed studies conducted in 2022 with the aim of reaching a consensus [ 12 , 13 , 14 ]. Based on these studies, many works have sought to highlight what actually occurs after SARS-CoV-2 infection. It remains unclear why the virus causes so many different symptoms affecting a variety of systems and what defines their frequency and prevalence. Thus, understanding what happens after COVID-19 and identifying which sequelae correspond to the postacute disease are increasingly important. Several symptoms can persist for many months or years, in addition to elevated risks of complications and death [ 15 , 16 , 17 ]. However, there are still many doubts and inconsistencies related to long COVID, especially regarding patients who were hospitalized, as well as the correlation between the length of hospitalization and the severity of the disease. Therefore, it is necessary to investigate and follow patients who were hospitalized with COVID-19 in different hospitals and to clearly identify the risk factors that require attention after COVID-19 and how these factors affect patients’ lives after the disease.

In this study, we sought to classify and automatically extract data from a long COVID survey conducted at a large hospital in the city of São Paulo, one of the Brazilian cities most affected by the pandemic. We analyzed the EHR and created a model that can be applied in other hospitals.

Materials and methods

Study design and data source.

This is a cross-sectional study that uses a national database to train a text mining model and EHR from a referral hospital to test it. The training dataset provided the tokens for text mining. We then performed text mining on the testing dataset and compared the results with those of manual human classification.

The training dataset was built from the information system for severe acute respiratory illness (SIVEP-Gripe), the national registration database for severe acute respiratory syndrome (SARS) in Brazil, in which all COVID-19 hospital admissions and deaths were registered, as required by federal law. A SARS case is defined as an individual who presents with dyspnea/respiratory discomfort, persistent pressure or pain in the chest, oxygen saturation less than 95% without oxygen, or cyanosis of the lips or face ( https://www.gov.br/saude/pt-br/assuntos/coronavirus/artigos/definicao-e-casos-suspeitos ). This database has been widely used as a source for other epidemiological studies [ 18 , 19 , 20 , 21 ]. In the present work, the SIVEP-Gripe was used to create a token dictionary for unstructured text from clinical questionnaires. The dataset is publicly available at https://opendatasus.saude.gov.br/group/dados-sobre-srag .

The testing dataset was built from the EHR of patients who were hospitalized for COVID-19 at the Hospital São Paulo, the University Hospital (UH) of the Federal University of São Paulo (Unifesp), from March 2020 to June 2022 and were followed after discharge at a Post-COVID-19 Disease Unit (PCDU). The PCDU is a multidisciplinary unit where health professionals assist patients and administer a questionnaire to gather information on any prolonged signs or symptoms after the acute phase of COVID-19. This questionnaire includes information on acute COVID-19, evolution of the infection, medical history, and post-acute phase signs and symptoms. These data were also linked to demographic information and SARS-CoV-2 PCR results from patients.

Hospital São Paulo uses multiple information systems, so relevant information is dispersed across different databases. As an initial search strategy to extract data, terminologies related to the infectious disease COVID-19 were applied to the Clinical Notes Database (MongoDB) for the period from March 1, 2020 to September 30, 2022. A total of eight clinical encounter forms were preselected from this data structure, and the clinical records of 16,017 patients were collected. The form data were extracted and grouped into eight JSON (JavaScript Object Notation) files that were converted to the Tidy Data format and saved as CSV (comma-separated values) text files. Following the analysis of the files by the technical-scientific team, the clinical encounter form “Post-COVID Care (Pneumo)”, containing records of 440 patients, was chosen for the analysis of demographic and hospital historical data. The second database of interest was the general patient record, a relational database (Oracle) that was queried using the Structured Query Language (SQL). We extracted data on emergency room visits, outpatient consultations, appointments, exam results, hospitalizations, and surgeries. The data extracted from this database were stored in the *.XLSX (Microsoft Excel 2007) file format. The extracted medical conditions and patient symptoms were validated with the assistance of the outpatient pneumology and infectology services at the UH.
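The JSON-to-CSV step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the record layout, the field names (`patient_id`, `form`, `answers`) and the long "one row per observation" layout are hypothetical, chosen only to show one common reading of the Tidy Data format.

```python
import csv
import io
import json

# Hypothetical export of one clinical encounter form; all field names are illustrative.
raw = json.loads("""
[
  {"patient_id": "P001", "form": "Post-COVID Care (Pneumo)",
   "answers": {"dyspnea": "yes", "notes": "nega tosse"}},
  {"patient_id": "P002", "form": "Post-COVID Care (Pneumo)",
   "answers": {"dyspnea": "no", "notes": "fadiga aos esforcos"}}
]
""")

FIELDS = ["patient_id", "form", "variable", "value"]

def to_tidy_rows(records):
    """Yield one row per (patient, form, variable) observation."""
    for record in records:
        for variable, value in record["answers"].items():
            yield {
                "patient_id": record["patient_id"],
                "form": record["form"],
                "variable": variable,
                "value": value,
            }

def rows_to_csv(rows):
    """Serialize tidy rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

tidy_rows = list(to_tidy_rows(raw))
csv_text = rows_to_csv(tidy_rows)
```

Keeping free-text answers such as `notes` as ordinary rows means the later text mining step can select them by variable name without a separate file format.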

Clinical data collection

The SIVEP-Gripe dataset, which contains ≥2.6 million entries and information on medical conditions and symptoms in the form of unstructured text, was used as the training dataset. Comorbidity and symptom information was extracted from this system via a phoneme approach using the metaphone-ptbr library ( https://github.com/carlosjordao/metaphone-ptbr ). All data processing was run in Python (version 3.10) using Jupyter Notebooks.

Before beginning the text analysis, it was necessary to normalize and clean the text strings. To this end, we used regular expressions to remove special characters and to replace the diverse separators found between words with a single whitespace. The phonetic text clustering (PTC) method groups terms together according to their phonetics, effectively consolidating variations of similar terms into a single phonemic representation and using n-gram text analysis to detect compound words [22]. This method not only captured and grouped terms but also accommodated synonyms, abbreviations, typographical errors, and the different conjunctions found in Brazilian Portuguese.
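The cleaning, n-gram and phonetic grouping steps can be sketched in a few lines. The sketch below is an assumption-laden illustration: `toy_phonetic_key` is a deliberately crude stand-in invented here (it drops "h" and collapses doubled letters) and is not the metaphone-ptbr algorithm used in the study; the function names and sample notes are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text):
    """Lowercase, strip accents, and turn any run of separators or special characters into one space."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()

def ngrams(tokens, max_n=4):
    """Return every n-gram of length 1..max_n as a space-joined string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def toy_phonetic_key(term):
    """Hypothetical phonetic key: drop 'h' and collapse repeated letters (NOT metaphone-ptbr)."""
    key = term.replace("h", "")
    return re.sub(r"(.)\1+", r"\1", key)

def cluster_terms(notes, max_n=2):
    """Group the n-grams found in the notes by their phonetic key."""
    clusters = defaultdict(set)
    for note in notes:
        tokens = normalize(note).split()
        for gram in ngrams(tokens, max_n):
            clusters[toy_phonetic_key(gram)].add(gram)
    return clusters

notes = ["Hipertensão; DIABETES  mellitus", "ipertensao/diabetes melitus", "hipertenssao"]
clusters = cluster_terms(notes)
```

In this toy run the three spellings of "hipertensão" share one key, and the bigrams "diabetes mellitus" and "diabetes melitus" share another, which is the kind of candidate group that would then go to manual curation.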

To ensure the accuracy of these results, a dictionary of similar terms was carefully curated by five specialists in internal medicine, infectology, pharmacology, pathology, otorhinolaryngology and public health and three medical students. This step prevented the grouping of different terms into the same phoneme. This curation process was crucial in enabling the use of the dictionary to identify medical conditions in unstructured text from different and more complex contexts. The scripts for the methodology employed are available at https://github.com/CampusVirtualFiocruz/Text-Mining-Clinical-Data-UNIFESP .

PTC validation with long COVID questionnaires

To validate the PTC method, the clinical information of patients collected from the long COVID questionnaire was organized into structured and unstructured data (Testing dataset). The structured data consisted of yes–no and multiple-choice answers, as well as numerical variables. The unstructured data comprised textual patient reports, including records of symptoms, clinical signs, laboratory tests, previous medical conditions, and lifestyle habits, such as smoking.

Subsequently, an automated approach to process the unstructured variables from the long COVID questionnaires was applied. We searched for all previously defined terms in the curated dictionary and focused on the most frequent comorbidities associated with COVID-19, which included obesity, hypertension, diabetes mellitus, chronic obstructive pulmonary disease (COPD), asthma, hypothyroidism, and hyperthyroidism. Information on patients’ smoking history was also collected to classify patients as smokers or former smokers. Medical records that did not include information on comorbidities or smoking history were classified as having no comorbidities or having a non-smoking history.

In addition, information on long COVID symptoms (cough, fatigue, headache and myalgia) ( https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html ) was collected. For symptoms, the focus was on terms in the questionnaires that described patient information in the present, excluding the text referring to symptoms reported in the past (acute phase).

Importantly, the medical condition terms detected in the unstructured text of the long COVID dataset could appear in a negated context, i.e., the patient denying rather than confirming the medical condition. To address this issue, we used regular expressions ( https://docs.python.org/3/library/re.html ) to look for negative operators such as “deny” ( nega , in Brazilian Portuguese) that appeared before the comorbidity of interest and up to the following sentence. This allowed us to capture instances where patients denied having the specified condition. Figure 1 shows a diagram detailing the method developed in this study using the extracted dataset. The scripts for the methodology employed can be found at https://github.com/CampusVirtualFiocruz/Text-Mining-Clinical-Data-UNIFESP .
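A negation check of this kind can be sketched with Python's `re` module. The sketch is illustrative only: the source names just "nega" as a negative operator, so the second cue ("negou"), the function name, the three output labels as strings and the sentence delimiters are assumptions made here.

```python
import re

# "nega" comes from the text above; "negou" is an added, hypothetical cue.
NEGATION_CUES = ("nega", "negou")

def classify_term(text, term, cues=NEGATION_CUES):
    """Return 'absent', 'negated' or 'present' for a term in a clinical note."""
    lowered = text.lower()
    term_pattern = re.escape(term.lower())
    if not re.search(rf"\b{term_pattern}\b", lowered):
        return "absent"
    cue_pattern = "|".join(re.escape(cue) for cue in cues)
    # A cue negates the term only when no sentence delimiter lies between them.
    negated = re.search(
        rf"\b(?:{cue_pattern})\b[^.;\n]*?\b{term_pattern}\b", lowered
    )
    return "negated" if negated else "present"

note = "Paciente nega diabetes e tabagismo. Refere tosse."
labels = {term: classify_term(note, term) for term in ("diabetes", "tabagismo", "tosse", "asma")}
```

One limitation of this sketch is that a term mentioned twice, once negated and once affirmed, is reported as negated; a production rule set would need to handle each mention separately.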

Fig. 1

The figure shows the process of extracting and processing clinical data using phonetic text clustering (PTC) from the SARS database (2.6 million entries) and UNIFESP unstructured medical data. n-grams (1–4) are extracted using the NLTK library, grouped by phonemes with the metaphone-ptbr library, and used to create a dictionary. A medical specialist validates the dictionary by excluding incorrect terms. UNIFESP unstructured medical data are automatically read, with negated terms recognized using regular expressions (re lib), and integrated into the UNIFESP structured medical data. This process combines automated text processing with manual validation to improve data accuracy and completeness for analysis.

Long COVID study population

The study population included adults (≥18 years) who were hospitalized due to acute COVID-19 at the Hospital São Paulo and discharged. The data were collected between March 2020 and June 2022. We excluded individuals who (1) had no SARS-CoV-2 PCR result or had a negative result; (2) did not have a recorded date of first symptoms; (3) completed questionnaires less than 30 days after the first occurrence of symptoms; (4) were hospitalized for more than 120 days; (5) did not have records of COVID-19 evolution during the acute phase; (6) had date inconsistencies; (7) had encounters after the first questionnaire; or (8) had no severity classification in the acute phase.

Demographic variables from the long COVID dataset, such as sex, age (stratified into 18–39, 40–59, 60–79, and ≥80 years), and race (divided into white, black, mixed-brown, Asian, and Indigenous), were evaluated. Due to the small sample size, Asian and Indigenous individuals were combined for the analysis. Additionally, other variables, such as medical history, length of hospital stay (stratified into 0–14, 15–30, 31–60, and ≥60 days), and severity of the acute phase of COVID-19, were included. The severity categories were defined as moderate (non-ICU ward), severe (intensive care unit; ICU), or critical (ICU with mechanical ventilation).
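The age strata and severity categories defined above translate directly into small lookup functions. This is a sketch with hypothetical function names and ASCII labels; how a patient on mechanical ventilation outside the ICU would be classified is not stated in the text, so the fallback to "moderate" in that case is an assumption.

```python
def age_band(age):
    """Map an age in years to the strata 18-39, 40-59, 60-79 and >=80."""
    if age < 18:
        raise ValueError("study population is restricted to adults (>= 18 years)")
    if age <= 39:
        return "18-39"
    if age <= 59:
        return "40-59"
    if age <= 79:
        return "60-79"
    return ">=80"

def severity(icu, mechanical_ventilation):
    """Classify acute-phase severity from ICU admission and ventilation flags."""
    if icu and mechanical_ventilation:
        return "critical"  # ICU with mechanical ventilation
    if icu:
        return "severe"    # ICU without mechanical ventilation
    return "moderate"      # non-ICU ward (assumed fallback for any non-ICU record)

example = (age_band(67), severity(icu=True, mechanical_ventilation=False))
```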

Statistical analysis

To validate the accuracy of the automated approach, we compared the automated results with manually searched and labeled clinical data. The manual labeling was performed by six of the authors with clinical training, and each record was individually labeled three times. We used Pearson’s chi-square test to compare the automated and manual term counts and thereby assess the accuracy of the text mining. Then, we performed a descriptive analysis of the long COVID findings to validate them against those of previous studies on the topic. We also fitted a generalized linear model (GLM) to assess the relationship between the demographic characteristics and the presence of post-COVID-19 symptoms.
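One way to set up a Pearson chi-square comparison of term counts is a 2 × 3 contingency table, with one row per method (manual, automated) and one column per label (present, absent, negated). The sketch below computes the statistic in plain Python; the counts are invented for illustration, and the table layout is an assumption rather than the authors' documented procedure. The closed-form p-value used here is valid only for two degrees of freedom, which is what a 2 × 3 table gives.

```python
import math

def chi_square_statistic(table):
    """Return Pearson's chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_two_dof(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)

# Hypothetical counts for one term: [present, absent, negated] per method.
manual = [120, 260, 18]
automated = [115, 266, 17]

statistic, dof = chi_square_statistic([manual, automated])
p_value = p_value_two_dof(statistic)
```

For tables with other dimensions, a statistics library would be needed to evaluate the chi-square distribution, since the exponential shortcut no longer applies.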

Ethical approval

The Brazilian National Commission in Research Ethics approved the research protocol (CONEP approval number 4.921.308 and CAAE registration no. 58619822.6.1001.5505).

Automated labeling of the training dataset

First, we investigated the records from the SIVEP-Gripe. A total of 2,490,196 SARS records of patients admitted to hospitals between December 31, 2019, and March 27, 2023 were collected. All records were then analyzed as input for the PTC tokenization of medical conditions and symptoms, which were used to create the dictionary that was used to structure the data and to create the database for long COVID. Overall, 635,921 (25.5%) records reported one or more medical conditions, and 849,976 (34.1%) reported one or more SARS-related symptoms in the unstructured text field (Fig. S1 and Table 1 ). From the unstructured clinical data, a dictionary collecting synonyms, misspelled and derivative words into a unique term (Table S1 ) was produced. Based on this dictionary, 20 of the most frequent medical conditions and 10 of the most frequent symptoms (Table 1 ) were captured for further analyses.

SARS patient records were stratified by medical conditions and symptoms in a “yes/no” format, such as diabetes mellitus, obesity, cardiopathy, loss of smell, loss of taste, fatigue and cough. The results showed that 22,458 of the terms containing medical conditions captured from the unstructured text overlapped with at least one of the binary comorbidities with a “yes” response in the questionnaire. In addition, 1418 terms overlapped with at least one of the binary symptom variables with a “yes” response. Thus, to evaluate the gain of information, records with overlapping medical conditions or symptoms were excluded. The terms that were not included in the binary variables from the questionnaire appeared more frequently in the unstructured text annotation (Fig. 2 ).

Fig. 2

The figure shows the most frequent terms captured using PTC. A displays the most frequent terms related to medical conditions, highlighting conditions such as hypertension, smoker status, and hypothyroidism, among others. B shows the most frequent terms associated with symptoms, including headache, myalgia, and asthenia. The bar colors indicate whether the terms appeared in the structured data before the application of PTC (red for “Yes”, green for “No”).

Among medical conditions, the most frequent term captured by the automated reading was “hypertension”, present in 303,109 entries, representing 11.3% of the total database (Fig. 2A ), followed by “smoker” in 61,110 entries, representing 2.3%; “hypothyroidism” in 39,550 entries, representing 1.5%; and “COPD” in 30,387 entries, representing 1.1%. Additionally, “smoking” was found in 61,110 entries (2.3%). Among symptoms, the most frequent terms were “headache”, present in 215,225 entries, representing 8.0% of the total database; “myalgia”, present in 213,035 entries, representing 7.9%; “asthenia”, present in 124,086 entries, representing 4.6%; and “runny nose”, present in 122,766 entries, representing 4.5% (Fig. 2B ).

Validating text mining on EHRs

Data from patients who were admitted with COVID-19 at the Hospital São Paulo, stayed in the hospital for more than 30 days, and were followed at the PCDU after discharge were evaluated.

To validate the PTC method on these data obtained from Hospital São Paulo, 398 post-COVID patient questionnaires collected from the PCDU (Fig. S2 ) were cross-checked. The dictionary derived from records of SARS-hospitalized patients was applied. Medical conditions and symptoms from these post-COVID-19 patients were extracted and studied by using an automated method. The results obtained were compared with those obtained through manual searches conducted by specialists, which showed a high degree of similarity in present, absent and negated terms. According to this method, the similarity ranged from 93% to 99% for medical condition terms and from 87% to 95% for symptom terms (Table S2 ). The statistical significance of these findings is reflected in the p values for all terms, which were less than 0.01 (Fig. 3 and Table S2 ).
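The text does not define how similarity was computed. A simple reading, assumed here purely for illustration, is the percentage of records on which the manual and automated labels (present, absent or negated) coincide for a given term. The function name and the sample labels below are hypothetical.

```python
def percent_agreement(manual_labels, automated_labels):
    """Percentage of records where the manual and automated labels are identical."""
    if len(manual_labels) != len(automated_labels):
        raise ValueError("both label lists must cover the same records")
    if not manual_labels:
        raise ValueError("at least one record is required")
    matches = sum(m == a for m, a in zip(manual_labels, automated_labels))
    return 100.0 * matches / len(manual_labels)

# Hypothetical labels for one term across five questionnaires.
manual_labels = ["present", "absent", "negated", "absent", "present"]
automated_labels = ["present", "absent", "present", "absent", "present"]

agreement = percent_agreement(manual_labels, automated_labels)
```

Computing the figure per term, as in Table S2, keeps rare terms from being masked by frequent ones.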

Fig. 3

This figure compares the performance of manual versus automated methods in identifying the most frequently reported terms in unstructured text for A symptoms such as cough, myalgia, fatigue, and headache, and B medical conditions including hypertension, diabetes, former smoker status, and obesity. The bar plots indicate the counts of terms identified as present, absent, or negated, with significant differences between methods ( p  < 0.001). The upper bars represent the manual method, and the lower bars represent the automated method.

The study population was divided into individuals who reported no symptoms (29.1%) and those with at least one symptom 30 days after COVID-19 onset (70.9%) (Table 2). Patients with three or more medical conditions made up a larger share of the group with post-COVID-19 symptoms 30 days after hospital discharge than of the group without symptoms (24% versus 17%).

The demographic data revealed that both groups had a similar gender distribution, with slightly more males (58% without symptoms, 56% with symptoms). Ethnicity distribution indicated that self-declared white individuals were the majority in both groups (55% without symptoms, 60% with symptoms), followed by mixed brown (34% without symptoms, 24% with symptoms). Furthermore, the symptomatic group had a higher percentage of older individuals (43% aged 60–79) than the asymptomatic group (32% aged 60–79). Although a slight difference was observed between the groups, the generalized linear model (GLM) used to assess the relationship between the demographic characteristics and the presence of post-COVID-19 symptoms showed no statistically significant differences (Table S3).

For patients who presented with at least one symptom, the most prevalent symptom was dyspnea (77.7%), followed by cough (21.3%) and fatigue (13.5%). Low oxygen saturation (below 27.3%) was the most common continuous variable reported. In terms of lifestyle, 25.9% were former smokers. A total of 48.6% of the population with symptoms after 30 days had hypertension, 26.9% had diabetes, and 15.2% had obesity (Fig. 4 ).

Fig. 4

The most common symptoms and medical conditions reported by patients in the study are represented in the figure. A shows the percentage of patients with symptoms and B displays the percentage of patients with medical conditions. The data were collected through questionnaires and reveal the prevalence of these conditions in the study population.

Three different developments resulted from this study. First, we built a text mining workflow that was able to extract structured medical information from clinical notes in Brazilian Portuguese. Second, this method, in conjunction with the validated text tokens, could be used as a platform for future analyses of long COVID in hospitals that use different systems. Finally, the method was applied back to the training dataset (SIVEP-Gripe), enriching the national database and resulting in more detailed clinical characterizations of SARS in Brazil in the last decade.

The method developed for text mining of clinical data was based on grouping synonyms by phoneme. Our method was able to extract clinical information that was not available previously as variables, with a total informational gain of 32.30% for the 30 categories of comorbidities and symptoms from the records of hospitalized SARS patients. Furthermore, we validated our method against human labeling using electronic records from patients who returned to the post-COVID-19 unit after being discharged for 30 days, which allowed us to describe the clinical findings related to long COVID in those patients.

The initial difficulty was structuring a database from a set of unstructured data that would allow subsequent analysis of a disease such as COVID-19 and its post-acute symptoms, characterized as long COVID. The benchmarks used were previous studies on COVID-19 and vaccine effectiveness that drew on national health system datasets, from which cohorts for studies on the effectiveness of different vaccines administered in Brazil were formed [20, 23]. Thus, it was possible to enrich the same dataset and cross-check the informational gain using data from patients who were admitted to the UH and who, after discharge, were followed up at the PCDU due to various symptoms.

The method developed in this study exhibited robust performance and was subsequently used to investigate the effects of long COVID in patients who were admitted to the UH and were followed for several months after being discharged. Phonemic representation has been used previously to cluster variations in writing and represent these clusters as an n-gram [24, 25], but this is, to the best of our knowledge, the first time that it has been used for clinical notes in Brazilian Portuguese [22]. This approach captured groups of variations in terms, such as close synonyms, abbreviations, and typographical errors typical of the language, which supported the validity and interpretability of the PTC method.

Importantly, the construction of this method allowed for a more accurate analysis of symptoms in patients followed by the PCDU of Hospital São Paulo, which showed that the majority of individuals presented dyspnea as a prevalent symptom, often accompanied by low oxygen saturation. These data are in accordance with other studies that used different methods, including the studies that reported low oxygen saturation during physical exercise [ 26 , 27 ]. Since dyspnea is one of the most frequent and well-documented symptoms of long COVID [ 28 ], it is notable that, besides its detection, our study and method provided further information concerning low oxygen saturation. In addition, other symptoms, such as fatigue and muscle pain, were detected and had been described by other authors [ 29 ], corroborating the quality of the new method to extract symptoms from non-structured data.

Importantly, the curation and maintenance of the dictionary will continue, and we will update it with new information and terms used by other health services. Thus, new qualifiers of clinical conditions, such as different degrees of dyspnea and the evolution of these clinical conditions over time, which may encompass periods of improvement and worsening, will be included in the dictionary. In addition, creating specific platforms to characterize and identify a little-known and difficult-to-diagnose condition, such as long COVID, represents an important advance for data modeling and decision-making after the occurrence of COVID-19. The tool created from the methods used in this study can analyze data in the language in which medical records are written and combines machine and human checking, which can overcome the lack of homogeneity across records and allow more accurate results. These results are important, although it is necessary to emphasize that the risks of death and hospitalization remained statistically high in different phases of the pandemic, particularly in those who were hospitalized during the acute phase of SARS-CoV-2 infection and in countries such as Brazil [30], in which a high number of cases were reported. Therefore, these countries must also consider the substantial number of individuals with COVID-19 sequelae and provide health care to the population. Since there is also evidence of COVID-19 sequelae in individuals who were not hospitalized, it is crucial to emphasize the importance of treating those who were infected and preventing reinfections. Reducing the risk of long-term sequelae therefore remains a need in terms of public health and health policies.

Finally, there are still many gaps and regional disparities in long COVID research. In particular, there are significant geographic gaps in the available research data, with an abundance of studies originating from Northern Hemisphere populations and a paucity of information regarding long COVID in low- and middle-income countries. There is a critical need for more focused research in these regions. Therefore, the use of text mining to evaluate non-structured EHRs provides a great opportunity to improve the knowledge of long COVID in resource-limited settings.

The method and modeling presented in this work and the use of cohorts of data to predict and treat long COVID patients will be crucial, and more studies should be performed not only to increase knowledge but also to develop the necessary care and rehabilitation methods and to plan the primary health care system. In this context, studies such as the present one should be expanded to help understand long COVID and predict its effects. These studies will allow the development of prevention and treatment strategies that will lead to higher quality standards in population health even in the face of a pandemic.

Data availability

Due to ethical and legal reasons, supporting data are not available.

Elpeltagy M, Sallam H. Automatic prediction of COVID–19 from chest images using modified ResNet50. Multimed Tools Appl. 2021;80:26451–63.

Abbar S, Mokbel M. The role of AI in digital contact tracing. In: Gruenwald L, Jain S, Groppe S, editors. Leveraging artificial intelligence in global epidemics. Academic Press, Elsevier; 2021. pp. 203–21.

Chowdhury MEH, Rahman T, Khandakar A, Mazhar R, Kadir MA, Mahbub ZB, et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access. 2020;8:132665–76.

Cau R, Faa G, Nardi V, Balestrieri A, Puig J, Suri JS, et al. Long-COVID diagnosis: from diagnostic to advanced AI-driven models. Eur J Radiol. 2022;148:110164.

Ke Y-Y, Peng T-T, Yeh T-K, Huang W-Z, Chang S-E, Wu S-H, et al. Artificial intelligence approach fighting COVID-19 with repurposing drugs. Biomed J. 2020;43:355–62.

Chang Z, Zhan Z, Zhao Z, You Z, Liu Y, Yan Z, et al. Application of artificial intelligence in COVID-19 medical area: a systematic review. J Thorac Dis. 2021;13:7034–53.

Cohen AM. A survey of current work in biomedical text mining. Brief Bioinforma. 2005;6:57–71.

Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical natural language processing in languages other than English: opportunities and challenges. J Biomed Semant. 2018;9:12.

Rocha HAL, Solha EZM, Furtado V, Justino FL, Barreto LAL, Da Silva RG, et al. COVID-19 outbreaks surveillance through text mining applied to electronic health records. BMC Infect Dis. 2024;24:359.

Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inf. 2019;7:e12239.

Wei W-Q, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J Am Med Inform Assoc. 2016;23:e20–7.

Nurek M, Rayner C, Freyer A, Taylor S, Järte L, MacDermott N, et al. Recommendations for the recognition, diagnosis, and management of long COVID: a Delphi study. Br J Gen Pr. 2021;71:e815–25.

Soriano JB, Murthy S, Marshall JC, Relan P, Diaz JV. A clinical case definition of post-COVID-19 condition by a Delphi consensus. Lancet Infect Dis. 2022;22:e102–7.

McGrath LJ, Scott AM, Surinach A, Chambers R, Benigno M, Malhotra D. Use of the postacute sequelae of COVID-19 diagnosis code in routine clinical practice in the US. JAMA Netw Open. 2022;5:e2235089.

Kingery JR, Safford MM, Martin P, Lau JD, Rajan M, Wehmeyer GT, et al. Health status, persistent symptoms, and effort intolerance one year after acute COVID-19 infection. J Gen Intern Med. 2022;37:1218–25.

Bowe B, Xie Y, Al-Aly Z. Acute and postacute sequelae associated with SARS-CoV-2 reinfection. Nat Med. 2022;28:2398–405.

Bowe B, Xie Y, Al-Aly Z. Postacute sequelae of COVID-19 at 2 years. Nat Med. 2023;29:2347–57.

Ranzani OT, Bastos LSL, Gelli JGM, Marchesi JF, Baião F, Hamacher S, et al. Characterisation of the first 250 000 hospital admissions for COVID-19 in Brazil: a retrospective analysis of nationwide data. Lancet Respir Med. 2021;9:407–18.

Oliveira EA, Colosimo EA, Silva ACSE, Mak RH, Martelli DB, Silva LR, et al. Risk factors for COVID-19 mortality in hospitalised children and adolescents in Brazil – authors’ reply. Lancet Child Adolesc Health. 2021;5:e40–2.

Cerqueira-Silva T, Andrews JR, Boaventura VS, Ranzani OT, De Araújo Oliveira V, Paixão ES, et al. Effectiveness of CoronaVac, ChAdOx1 nCoV-19, BNT162b2, and Ad26.COV2.S among individuals with previous SARS-CoV-2 infection in Brazil: a test-negative, case-control study. Lancet Infect Dis. 2022;22:791–801.

Florentino PTV, Alves FJO, Cerqueira-Silva T, de Araújo Oliveira V, Júnior JBS, Penna GO, et al. Effectiveness of BNT162b2 booster after CoronaVac primary regimen in pregnant people during omicron period in Brazil. Lancet Infect Dis. 2022;22:1669–70.

Bird S, Klein E, Loper E. Natural language processing with Python. 1st ed. Beijing; Cambridge [Mass]: O’Reilly; 2009.

Cerqueira-Silva T, Katikireddi SV, De Araujo Oliveira V, Flores-Ortiz R, Júnior JB, Paixão ES, et al. Vaccine effectiveness of heterologous CoronaVac plus BNT162b2 in Brazil. Nat Med. 2022;28:838–43.

Rahimian M, Warner JL, Jain SK, Davis RB, Zerillo JA, Joyce RM. Significant and distinctive n -Grams in oncology notes: a text-mining method to analyze the effect of opennotes on clinical documentation. JCO Clin Cancer Inform. 2019;3:1–9.

Golz C, Richter D, Sprecher N, Gurtner C. Mental health-related communication in a virtual community: text mining analysis of a digital exchange platform during the Covid-19 pandemic. BMC Psychiatry. 2022;22:430.

Schäfer H, Teschler M, Mooren FC, Schmitz B. Altered tissue oxygenation in patients with post COVID-19 syndrome. Microvasc Res. 2023;148:104551.

Guarnieri G, Lococo S, Bertagna De Marchi L, Cecchetto A, Molena B, Arcaro G, et al. Persistent oxygen desaturation during exercise in patients with long COVID. Eur Respir J. 2022;60:3725.

Domènech-Montoliu S, Puig-Barberà J, Pac-Sa MR, Vidal-Utrillas P, Latorre-Poveda M, Del Rio-González A, et al. Complications post-COVID-19 and risk factors among patients after six months of a SARS-CoV-2 infection: a population-based prospective cohort study. Epidemiologia. 2022;3:49–67.

Global Burden of Disease Long COVID Collaborators, Wulf Hanson S, Abbafati C, Aerts JG, Al-Aly Z, Ashbaugh C, et al. Estimated global proportions of individuals with persistent fatigue, cognitive, and respiratory symptom clusters following symptomatic COVID-19 in 2020 and 2021. JAMA. 2022;328:1604.

Katikireddi SV, Cerqueira-Silva T, Vasileiou E, Robertson C, Amele S, Pan J, et al. Two-dose ChAdOx1 nCoV-19 vaccine protection against COVID-19 hospital admissions and deaths over time: a retrospective, population-based cohort study in Scotland and Brazil. Lancet. 2022;399:25–35.

Acknowledgements

The authors acknowledge Dr. Lucia Pellanda, Dr. Ethel Maciel, Dr. Adhemar Arthur Chioro, and Dr. Nisia Trindade for their support and discussion on this work. We also acknowledge the support of Fiotec-Fiocruz, FAP-Unifesp, CNPq 400504/2023-5 and FAPESP 2019/02821-8. This study is part of the Alert-Early System of Outbreaks with Pandemic Potential with financial support from the Rockefeller Foundation's Health Initiative (grant 2023-PPI-007 awarded to M-BN).

Author information

These authors contributed equally: Pilar Tavares Veras Florentino, Vinícius de Oliveira Araújo, Henrique Zatti.

These authors jointly supervised this work: Manoel Barral-Netto, Soraya S. Smaili.

Authors and Affiliations

Laboratório de Medicina e Saúde Pública de Precisão (MeSP2), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Brazil

Pilar Tavares Veras Florentino, Viviane Boaventura & Manoel Barral-Netto

Centro de Integração de Dados e Conhecimentos para a Saúde (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Brazil

Pilar Tavares Veras Florentino, Vinícius de Oliveira Araújo, Henrique Zatti, Juracy Bertoldo Junior, George G. Caique Barbosa & Tales Mota Machado

Faculdade de Medicina da Bahia, Universidade Federal da Bahia, Salvador, Brazil

Vinícius de Oliveira Araújo, Viviane Boaventura & Manoel Barral-Netto

Departamento de Farmacologia, Escola Paulista de Medicina, Universidade Federal de São Paulo, São Paulo, Brazil

Caio Vinícius Luis, Célia Regina Santos Cavalcanti, Matheus Henrique Citibaldi de Oliveira, Anderson Henrique França Figueredo Leão, Ernesto Ravera, Alberto Cebukin, Renata Bernardes David & Soraya S. Smaili

Faculdade de Medicina, Universidade de São Paulo, São Paulo, Brazil

Danilo Batista Vieira de Melo

Diretoria de Tecnologia da Informação, Universidade Federal de Ouro Preto, Ouro Preto, Brazil

Tales Mota Machado

Disciplina de Moléstias Infecciosas, Escola Paulista de Medicina, Universidade Federal de São Paulo, São Paulo, Brazil

Nancy C. J. Bellei

Contributions

PTVF and VdOA conducted the study; NB, VB, MB-N and SS conceived the idea; NB and VB devised the plan of clinical analyses; PTVF, HZ and AHFFL conducted the data analysis; and JB and GGCB verified the scripts. All authors contributed to the study design. ER and AC were responsible for data acquisition; VdOA, ER, AC, PTVF, TMM, JB and GGCB were responsible for data curation and processing; RBD, VB, CVL, CRSC, MHC and DBVdM were responsible for the manual verification of the automated text mining extraction. PTVF and SS wrote the first draft and further revised the manuscript; all authors critically revised the manuscript and approved the final version for submission. MB-N and SS jointly supervised this work.

Corresponding authors

Correspondence to Manoel Barral-Netto or Soraya S. Smaili.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Edited by Mauro Piacentini

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Veras Florentino, P.T., Araújo, V.d.O., Zatti, H. et al. Text mining method to unravel long COVID’s clinical condition in hospitalized patients. Cell Death Dis 15, 671 (2024). https://doi.org/10.1038/s41419-024-07043-4

Received : 13 April 2024

Revised : 28 August 2024

Accepted : 29 August 2024

Published : 13 September 2024

DOI : https://doi.org/10.1038/s41419-024-07043-4

Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. You can use text mining to analyze vast collections of textual materials to capture key concepts, trends and hidden relationships.

By applying advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms, companies are able to explore and discover hidden relationships within their unstructured data.

Text is one of the most common data types within databases. Depending on the database, this data can be organized as:

  • Structured data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and phone numbers.
  • Unstructured data: This data does not have a predefined data format. It can include text from sources like social media or product reviews, or rich media formats like video and audio files.
  • Semi-structured data: As the name suggests, this data is a blend between structured and unstructured data formats. While it has some organization, it doesn’t have enough structure to meet the requirements of a relational database. Examples of semi-structured data include XML, JSON and HTML files.

Since roughly 80% of data in the world resides in an unstructured format, text mining is an extremely valuable practice within organizations. Text mining tools and natural language processing (NLP) techniques, like information extraction, allow us to transform unstructured documents into a structured format to enable analysis and the generation of high-quality insights. This, in turn, improves the decision-making of organizations, leading to better business outcomes.

The terms text mining and text analytics are largely synonymous in conversation, but they can have more nuanced meanings. Text mining and text analysis identify textual patterns and trends within unstructured data through the use of machine learning, statistics, and linguistics. By transforming the data into a more structured format through text mining and text analysis, more quantitative insights can be found through text analytics. Data visualization techniques can then be harnessed to communicate findings to wider audiences.

The process of text mining comprises several activities that enable you to deduce information from unstructured text data. Before you can apply different text mining techniques, you must start with text preprocessing, which is the practice of cleaning and transforming text data into a usable format. This practice is a core aspect of natural language processing (NLP) and it usually involves the use of techniques such as language identification, tokenization, part-of-speech tagging, chunking, and syntax parsing to format data appropriately for analysis. When text preprocessing is complete, you can apply text mining algorithms to derive insights from the data. Some of these common text mining techniques include:

Information retrieval

Information retrieval (IR) returns relevant information or documents based on a pre-defined set of queries or phrases. IR systems utilize algorithms to track user behaviors and identify relevant data. Information retrieval is commonly used in library catalogue systems and popular search engines, like Google. Some common IR sub-tasks include:

  • Tokenization: This is the process of breaking long-form text into sentences and words called “tokens”. These are then used in models, such as bag-of-words, for text clustering and document matching tasks.
  • Stemming: This refers to the process of separating the prefixes and suffixes from words to derive the root word form and meaning. This technique improves information retrieval by reducing the size of indexing files.
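
The two sub-tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the suffix list and the `naive_stem` helper are assumptions made for the example, and established stemmers (such as the Porter stemmer) apply far more careful rules.

```python
import re
from collections import Counter

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Count stemmed tokens, ignoring word order (a bag-of-words model)."""
    return Counter(naive_stem(t) for t in tokenize(text))

bow = bag_of_words("Miners mined the mines while mining continued.")
print(bow["min"])  # "mined", "mines" and "mining" share one stem -> 3
```

Because the three inflected forms collapse to a single stem, the index needs one entry instead of three, which is the size reduction the stemming bullet refers to.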

Natural language processing (NLP)

Natural language processing , which evolved from computational linguistics, uses methods from various disciplines, such as computer science, artificial intelligence , linguistics, and data science, to enable computers to understand human language in both written and verbal forms. By analyzing sentence structure and grammar, NLP sub-tasks allow computers to “read”. Common sub-tasks include:

  • Summarization: This technique provides a synopsis of long pieces of text to create a concise, coherent summary of a document’s main points.
  • Part-of-Speech (PoS) tagging: This technique assigns a tag to every token in a document based on its part of speech—that is, denoting nouns, verbs, adjectives, and so on. This step enables semantic analysis on unstructured text.
  • Text categorization : This task, which is also known as text classification, is responsible for analyzing text documents and classifying them based on predefined topics or categories. This sub-task is particularly helpful when categorizing synonyms and abbreviations.
  • Sentiment analysis : This task detects positive or negative sentiment from internal or external data sources, allowing you to track changes in customer attitudes over time. It is commonly used to provide information about perceptions of brands, products, and services. These insights can propel businesses to connect with customers and improve processes and user experiences.
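
As a rough illustration of sentiment analysis, the sketch below scores text against a tiny word list. The `POSITIVE` and `NEGATIVE` sets are invented for this example and are not a published lexicon; real systems typically rely on trained models or much larger dictionaries.

```python
import re

# Tiny illustrative lexicon (an assumption, not a published resource).
POSITIVE = {"good", "great", "fast", "helpful", "love"}
NEGATIVE = {"bad", "slow", "broken", "poor", "hate"}

def sentiment_score(text):
    """Return (#positive words - #negative words) found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives

def sentiment_label(text):
    """Map the score to a coarse positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["Great product, fast delivery",
               "The app is slow and broken",
               "It arrived on Tuesday"]:
    print(review, "->", sentiment_label(review))
```

A word-counting approach like this ignores negation and context ("not good" still counts as positive), which is one reason it should be read only as a starting point.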

Information extraction

Information extraction (IE) surfaces the relevant pieces of data when searching various documents. It also focuses on extracting structured information from free text and storing these entities, attributes, and relationship information in a database. Common information extraction sub-tasks include:

  • Feature selection, or attribute selection, is the process of selecting the important features (dimensions) that contribute the most to the output of a predictive analytics model.
  • Feature extraction is the process of deriving a reduced set of informative features from the original data to improve the accuracy of a classification task. This is particularly important for dimensionality reduction.
  • Named-entity recognition (NER), also known as entity identification or entity extraction, aims to find and categorize specific entities in text, such as names or locations. For example, NER identifies “California” as a location and “Mary” as a woman’s name.
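
To make the NER example concrete, here is a minimal dictionary-lookup (gazetteer) sketch. The `GAZETTEER` contents and the `find_entities` helper are hypothetical; practical NER systems usually use statistical or neural models rather than a fixed list.

```python
import re

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {"california": "LOCATION", "mary": "PERSON"}

def find_entities(text):
    """Return (surface form, entity type, character offset) for known names."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", text):
        label = GAZETTEER.get(match.group().lower())
        if label is not None:
            entities.append((match.group(), label, match.start()))
    return entities

print(find_entities("Mary moved to California last year."))
# [('Mary', 'PERSON', 0), ('California', 'LOCATION', 14)]
```

The resulting tuples are the kind of structured record that information extraction stores in a database.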

Data mining

Data mining is the process of identifying patterns and extracting useful insights from big data sets. This practice evaluates both structured and unstructured data to identify new information, and it is commonly utilized to analyze consumer behaviors within marketing and sales. Text mining is essentially a sub-field of data mining as it focuses on bringing structure to unstructured data and analyzing it to generate novel insights. The techniques mentioned above are forms of data mining but fall under the scope of textual data analysis.

Text analytics software has impacted the way that many industries work, allowing them to improve product user experiences as well as make faster and better business decisions. Some use cases include:

Customer service: There are various ways in which we solicit customer feedback from our users. When combined with text analytics tools, feedback systems, such as chatbots , customer surveys, NPS (net-promoter scores), online reviews, support tickets, and social media profiles, enable companies to improve their customer experience with speed. Text mining and sentiment analysis can provide a mechanism for companies to prioritize key pain points for their customers, allowing businesses to respond to urgent issues in real-time and increase customer satisfaction. Learn how Verizon is using text analytics in customer service .

Risk management: Text mining also has applications in risk management, where it can provide insights around industry trends and financial markets by monitoring shifts in sentiment and by extracting information from analyst reports and whitepapers. This is particularly valuable to banking institutions as this data provides more confidence when considering business investments across various sectors. Learn how CIBC and EquBot are using text analytics for risk mitigation .

Maintenance: Text mining provides a rich and complete picture of the operation and functionality of products and machinery. Over time, text mining automates decision making by revealing patterns that correlate with problems and preventive and reactive maintenance procedures. Text analytics helps maintenance professionals unearth the root cause of challenges and failures faster.

Healthcare: Text mining techniques have been increasingly valuable to researchers in the biomedical field, particularly for clustering information. Manual investigation of medical research can be costly and time-consuming; text mining provides an automation method for extracting valuable information from medical literature.

Spam filtering: Spam frequently serves as an entry point for hackers to infect computer systems with malware. Text mining can provide a method to filter and exclude these e-mails from inboxes, improving the overall user experience and minimizing the risk of cyber-attacks to end users.
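
A spam filter of the kind described above is often introduced with a Naïve Bayes classifier, one of the techniques named earlier in this article. The sketch below is a small from-scratch multinomial Naïve Bayes with Laplace smoothing; the class name, training messages and labels are invented for illustration and far too few for real use.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()               # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data.
train_texts = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)

print(model.predict("claim your free prize"))      # spam
print(model.predict("project meeting on monday"))  # ham
```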

Text Mining in Data Mining

In this article, we will learn about text mining, the basic building block of any NLP-related task.

What is Text Mining?

Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use of natural language processing (NLP) techniques to extract useful information and insights from large amounts of unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone process for specific tasks. 

Text Mining in Data Mining

In data mining, text mining is mostly used to transform unstructured text data into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.

Text Mining vs. Text Analytics

Text mining and text analytics are related but distinct processes for extracting insights from textual data. Text mining involves the application of natural language processing and machine learning techniques to discover patterns, trends, and knowledge from large volumes of unstructured text.

However, Text Analytics focuses on extracting meaningful information, sentiments, and context from text, often using statistical and linguistic methods. While text mining emphasizes uncovering hidden patterns, text analytics emphasizes deriving actionable insights for decision-making. Both play crucial roles in transforming unstructured text into valuable knowledge, with text mining exploring patterns and text analytics providing interpretative context.

Why is Text Mining Important?

Text mining is widely used in various fields, such as natural language processing , information retrieval, and social media analysis. It has become an essential tool for organizations to extract insights from unstructured text data and make data-driven decisions.

“Extraction of interesting information or patterns from data in large databases is known as data mining.”

Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text databases. Various strategies and tools exist to mine text and find important data for prediction and decision-making. Selecting the right text mining procedure helps to improve the speed and reduce the time complexity of the analysis. This article briefly discusses and analyzes text mining and its applications in diverse fields.

The volume of information is expanding at exponential rates. Today, institutes, companies, organizations, and business ventures store their information electronically. A huge collection of data is available on the internet and stored in digital libraries, database repositories, and other textual sources like websites, blogs, social media networks, and e-mails. It is a difficult task to determine appropriate patterns and trends to extract knowledge from this large volume of data. Text mining is a part of data mining that extracts valuable information from a text database repository. Text mining is a multi-disciplinary field based on information retrieval, data mining, AI, statistics, machine learning, and computational linguistics.

Text Mining Process

Conventional Process of Text Mining

  • Unstructured information is gathered from various sources, available in various document formats, for example, plain text, web pages, PDF records, etc.
  • Pre-processing and data cleansing tasks are performed to detect and eliminate inconsistencies in the data. The data cleansing process makes sure to capture the genuine text; it also removes stop words, applies stemming (the process of identifying the root of a certain word), and indexes the data.
  • Processing and controlling tasks are applied to review and further clean the data set.
  • Pattern analysis is implemented in a Management Information System.
  • The information processed in the above steps is used to extract important and applicable data for an effective and timely decision-making process and trend analysis.
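
The cleansing and indexing steps listed above can be sketched as follows. The stop-word list, the sample documents and the helper names are assumptions made for this example, and stemming is left out to keep the sketch short.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}

def preprocess(text):
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in preprocess(text):
            index[token].add(doc_id)
    return index

# Invented sample collection.
documents = {
    "d1": "The analysis of customer reviews",
    "d2": "Customer feedback and support tickets",
    "d3": "An analysis of news articles",
}
index = build_index(documents)
print(sorted(index["customer"]))  # ['d1', 'd2']
```

Once the index exists, later pattern-analysis steps can look up which documents mention a term without rescanning the raw text.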

Common Methods for Analyzing Text Mining

  • Text Summarization: To extract its partial content and reflect its whole content automatically.
  • Text Categorization: To assign a category to the text among categories predefined by users.
  • Text Clustering: To segment texts into several clusters, depending on the substantial relevance.

Procedures for Analyzing Text Mining

Text Mining Techniques

Information Retrieval

In information retrieval, the available documents and text data are processed into a structured form so that different pattern recognition and analytical processes can be applied. It is a process of extracting relevant and associated patterns according to a given set of words or text documents.

This relies on processes such as tokenization of the document and stemming, in which the base (root) form of each word is extracted.

Information Extraction

It is a process of extracting meaningful words from documents.

  • Feature Extraction – In this process, we try to develop some new features from existing ones. This objective can be achieved by parsing an existing feature or combining two or more features based on some mathematical operation.
  • Feature Selection – In this process, we try to reduce the dimensionality of the dataset which is generally a common issue while dealing with the text data by selecting a subset of features from the whole dataset.
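
One common way to illustrate both ideas is TF-IDF weighting followed by keeping only the highest-weighted terms. The sketch below uses one simple TF-IDF variant (term frequency times the log of inverse document frequency); the function names and sample documents are assumptions for this example, and libraries often apply slightly different smoothing.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf(documents):
    """Feature extraction: return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def select_top_features(weights, k):
    """Feature selection: keep the k terms with the highest total weight."""
    totals = Counter()
    for doc_weights in weights:
        totals.update(doc_weights)
    return [term for term, _ in totals.most_common(k)]

# Invented sample documents.
docs = ["apple banana", "apple cherry"]
weights = tfidf(docs)
print(weights[0]["apple"])  # 0.0, because "apple" occurs in every document
print(sorted(select_top_features(weights, 2)))  # ['banana', 'cherry']
```

Terms that appear in every document receive zero weight, so selection naturally discards them and the feature space shrinks.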

Natural Language Processing

Natural Language Processing includes tasks that are accomplished by using Machine Learning and Deep Learning methodologies. It concerns the automatic processing and analysis of unstructured text information.

  • Named Entity Recognition (NER) : Identifying and classifying named entities such as people, organizations, and locations in text data.
  • Sentiment Analysis : Identifying and extracting the sentiment (e.g. positive, negative, neutral) of text data.
  • Text Summarization : Creating a condensed version of a text document that captures the main points.

Overview of Text Mining Techniques

Phase | Technique | Question | Description | Example methods
--- | --- | --- | --- | ---
Text preprocessing | Tokenization | How can I transform a text into words or text format? | Transferring strings into a single textual token | White space separation
Text preprocessing | Compound word identification | How can I identify words that have a joint meaning? | Identifying words with a joint meaning that gets lost word |
Text preprocessing | and noise reduction | How can I cope with too many variables in my Document-Term-Matrix? | Reducing the dimensionality of Document-Term-Matrix | Stemming, deletion of stop words, infrequent term
Text preprocessing | Linguistic analysis | How can I identify words with a special meaning or grammatical function? | Tagging of words | Named-entity recognition, part-of-speech tagging
Content analysis | Dictionary-based | How can I identify how latent sociological or psychological traits and states are reflected in natural language? | Measuring contextual, psychological, linguistic, or semantic concepts and constructs | Pre-defined dictionaries and customized dictionaries
Content analysis | Algorithmic techniques | How can I assign texts to predefined classes? | Classifying textual entities into predefined categories | techniques such as binary or multi-class classifiers
Content analysis | Algorithmic techniques | How can I group similar documents? | Clustering of textual entities into formerly undefined and unknown | techniques such as LDA, k-means, or non-negative
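
The Document-Term-Matrix mentioned in the overview, and the idea of shrinking it by dropping infrequent terms, can be sketched as below. The `min_doc_freq` parameter name and the sample documents are assumptions for this illustration.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents, min_doc_freq=1):
    """Return (vocabulary, matrix) with one row per document.

    Terms occurring in fewer than min_doc_freq documents are dropped,
    which reduces the number of columns.
    """
    tokenized = [tokenize(d) for d in documents]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in vocabulary])
    return vocabulary, matrix

# Invented sample documents.
docs = [
    "text mining finds patterns",
    "text mining needs clean text",
    "clustering groups similar text",
]
vocab, dtm = document_term_matrix(docs, min_doc_freq=2)
print(vocab)  # ['mining', 'text']
print(dtm)    # [[1, 1], [1, 2], [0, 1]]
```

With `min_doc_freq=1` the same three documents would produce nine columns; requiring a term to occur in at least two documents leaves two.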

Text Mining Applications

  • Digital Library : Various text mining strategies and tools are used to find patterns and trends in journals and proceedings stored in text database repositories. These information resources support research. Libraries are a good source of text data in digital form, and text mining offers a novel way of retrieving useful data that makes it possible to access millions of records online. The Greenstone international digital library, which supports numerous languages and multilingual interfaces, provides a flexible method for extracting documents in various formats, e.g. Microsoft Word, PDF, PostScript, HTML, scripting languages, and email. It additionally supports the extraction of audiovisual and image formats along with text documents. Text mining processes perform activities such as document collection, selection, enrichment, information extraction, entity handling, and summarization.
  • Academic and Research Field : In education, different text mining tools and strategies are used to examine educational trends in a specific region or research field. The main purpose of text mining in research is to help discover and arrange research papers and relevant material from various fields on one platform. For this, k-means clustering and other strategies help to identify the properties of significant data. Student performance in various subjects can also be assessed, along with how various attributes influence the selection of subjects.
  • Life Science : Life science and healthcare industries produce an enormous volume of textual and numerical data regarding patient records, illnesses, medicines, symptoms, treatments of diseases, etc. Filtering the data and relevant text needed to make decisions from a biological data repository is a major challenge. Clinical records contain variable data that is unpredictable and lengthy. Text mining can help to manage such data. It is used in biomarker discovery, the pharmaceutical industry, clinical trade analysis, clinical studies, and patent competitive intelligence.
  • Social Media : Text mining is available for analyzing social media applications to monitor and investigate online content such as plain text from internet news, web journals, emails, blogs, etc. Text mining tools help to identify and analyze the number of posts, likes, and followers on a social media network. This kind of analysis shows people’s responses to various posts and news and how they spread. It also shows the behavior of people who belong to a specific age group and variations in views about the same post.
  • Business Intelligence : Text mining plays an important role in business intelligence, helping organizations and enterprises analyze their customers and competitors to make better decisions. It gives an accurate understanding of the business and provides data on how to improve consumer satisfaction and gain competitive advantage. Text mining tools such as IBM text analytics support this work. Text mining can be used in the telecom sector, commerce, and customer chain management systems.

Advantages of Text Mining

  • Large Amounts of Data : Text mining allows organizations to extract insights from large amounts of unstructured text data.
  • Variety of Applications : Text mining has a wide range of applications, including sentiment analysis, named entity recognition, and topic modeling.
  • Improved Decision Making
  • Cost-effective : Text mining can be cost-effective, as it eliminates the need for manual data entry.

Disadvantages of Text Mining

  • Complexity : Text mining can be a complex process requiring advanced skills in natural language processing and machine learning.
  • Quality of Data : The quality of text data can vary, affecting the accuracy of the insights extracted from text mining.
  • High Computational Cost : Text mining requires high computational resources, and it may be difficult for smaller organizations to afford the technology.
  • Limited to Text Data : Text mining is limited to extracting insights from unstructured text data and cannot be used with other data types.
  • Noise in text mining results: Text mining of documents may result in mistakes. It’s possible to find false links or to miss others. In most situations, if the noise (error rate) is sufficiently low, the benefits of automation exceed the chance of a larger mistake than that produced by a human reader.
  • Lack of transparency: Text mining is frequently viewed as a mysterious process where large corpora of text documents are input and new information is produced. Text mining is in fact opaque when researchers lack the technical know-how or expertise to comprehend how it operates, or when they lack access to corpora or text mining tools.

Text mining extracts valuable insights from unstructured text, aiding decision-making across diverse fields. Despite challenges, its applications in academia, healthcare, business, and more demonstrate its significance in converting textual data into actionable knowledge.

Text Mining- FAQs

What is text mining, with an example?

Text mining is extracting insights from text. Example: analyzing customer reviews to identify sentiments and preferences.

What is NLP and text mining?

NLP is Natural Language Processing, and text mining is using NLP techniques to analyze unstructured text data for insights.

Who uses text mining?

Industries such as healthcare, business, academia, and social media utilize text mining for data-driven decision-making.

What is text mining in Python?

Text mining in Python involves using libraries like NLTK or spaCy for natural language processing tasks.

Why is text mining used?

Text mining is used to extract insights from unstructured text data, aiding decision-making and providing valuable knowledge across various domains.

BHP sees ‘fly-up’ in copper prices later, but short-term outlook cooler

13th September 2024

By: Mariaan Webb

Creamer Media Senior Deputy Editor Online

The diversified mining company has projected a strong outlook for the copper market, forecasting significant price increases owing to expected supply deficits in the latter part of the decade. However, its short-term outlook is more cautious, with the group lowering its forecast for Chinese demand this year and warning of a marginal surplus until the end of next year.

The Australia-headquartered mining giant bases its long-term outlook on the expectation of emerging deficit conditions in the copper market during the final third of the 2020s.

The company foresees a challenging environment for new copper production, with the marginal tonnes in the long-term market likely to originate from either lower-grade brownfield expansions in established mining regions or higher-grade greenfield projects in higher-risk, emerging areas.

“None of these sources of metal is likely to come cheaply, easily – or, unfortunately, promptly,” BHP states in its latest ‘Economic and Commodity Outlook’, highlighting the difficulties in bringing new copper supplies to market.

The company believes that these constraints could lead to a “fly-up” pricing regime, where copper prices become disconnected from traditional cost curves owing to a persistent excess of demand over supply.

The group anticipates a 70% growth in global copper demand between 2021 and 2050. But as global demand for copper continues to grow – driven by the metal’s critical role in renewable-energy technologies, electric vehicles (EVs) and infrastructure – the supply side faces significant hurdles in keeping pace. This imbalance, BHP warns, could lead to a period of elevated and volatile prices, exacerbated by insufficient inventory levels.

With EVs three times as copper-intensive as internal combustion engine vehicles, BHP expects the transport sector to make up over 20% of global copper demand by 2040, compared with only 11% today.

Data centres will be another source of solid copper demand growth, as delivering AI-enabled services requires vast amounts of power and cooling, both of which need copper. Demand from this sector, currently about 1% of global copper demand, could grow sixfold by 2050.

Meanwhile, in its short-term outlook, BHP notes that Chinese copper demand enjoyed a robust 2023, with a 6% year-on-year increase, driven by healthy growth across end-use sectors and strong demand from energy transition sectors. However, the company expects 2024 to 2025 to be a period of consolidation, with more modest growth of 1% to 2% year-on-year. This is a downgrade from BHP’s prior expectations, reflecting shifts in the Chinese real estate market, particularly the sharp contraction expected in housing completions, a major indicator for copper end-use in housing.

OECD economies are projected to see a modest demand recovery over the next 18 months, with the US expected to bounce back more rapidly than Europe and Japan. India continues to be a bright spot for demand growth, albeit from a relatively small base.

On the supply side, the copper mining sector experienced a tumultuous 2023, marked by unexpected mine closures and guidance downgrades for 2024. While mine supply disruptions in 2024 are expected to be in line with the 5% historical average, the events of 2023 have created a lower starting point, resulting in a projected supply growth of less than 2% year-on-year in 2024, improving to 4% in 2025.

Regional trends are varied, BHP notes, with African supply rising strongly, driven by Chinese investment in the Democratic Republic of Congo, while the Americas remain relatively stagnant, if not in decline.

In 2024, BHP expects the copper market to experience a marginal surplus, followed by a slightly larger, though still modest, surplus in 2025. However, this accumulation of inventory is likely to offer only minimal protection against the anticipated deficits in the latter half of the decade, the major notes.

Edited by Martin Zhuwakinyu Creamer Media Senior Deputy Editor


Deformation-Adapted Spatial Domain Filtering Algorithm for UAV Mining Subsidence Monitoring


  • 1. Introduction
  • 2. Overview of the Area
  • 2.1. Overview of the Experimental Area
  • 2.2. Overview of the Study Area
  • 3. Materials
  • 3.1. Data Sources
  • 3.1.1. UAV Orthophoto Data
  • 3.1.2. Validate Data
  • 4. Principles and Methods
  • 4.1. Data Characteristics
  • 4.2. Error Distribution of Subsidence Basin Established by Conventional Method
  • 4.3. Deformation-Adapted Spatial Domain Filtering Algorithm
  • 5. Results and Discussion
  • 5.1. Simulation Experimental Results
  • 5.2. Case Application Results
  • Accuracy Assessment
  • 6. Conclusions
  • Author Contributions
  • Institutional Review Board Statement
  • Informed Consent Statement
  • Data Availability Statement
  • Conflicts of Interest

  • Li, D.; Deng, K.; Gao, X.; Niu, H. Monitoring and Analysis of Surface Subsidence in Mining Area Based on SBAS-InSAR. Geomat. Inf. Sci. Wuhan Univ. 2018 , 43 , 1531–1537. (In Chinese) [ Google Scholar ]
  • Chen, Y.; Tao, Q.; Liu, G.; Wang, L.; Wang, F.; Wang, K. Detailed mining subsidence monitoring combined with InSAR and probability integral method. Chin. J. Geophys. 2021 , 64 , 3554–3566. (In Chinese) [ Google Scholar ]
  • Yang, J.; Jiang, Y.; Zhou, J.; Huang, L.; Lu, X. Analysis on reliability and accuracy of subsidence measurement with gps technique. J. Geod. Geodyn. 2006 , 26 , 70–75. (In Chinese) [ Google Scholar ]
  • Liu, G.; Zhang, L.; Cheng, S.; Jiang, T. Feasibility Analysis of Monitoring Mining Surface Substance Using InSAR/GPS Data Fusion. Bull. Surv. Mapp. 2005 , 11 , 13–16. (In Chinese) [ Google Scholar ]
  • Wang, Z.; Zhang, J.; Huang, G. Precise monitoring and analysis of the land subsidence in Jining coal mining area based on InSAR technique. J. China Univ. Min. Technol. 2014 , 43 , 169–174. [ Google Scholar ]
  • Deng, J.; Xu, Y. Three-dimensional Dynamie Monitoring and Analysis of Surface Deformation in Mining Area Based on Multi-track SAR Image. Metal Mine 2019 , 10 , 48–54. (In Chinese) [ Google Scholar ]
  • Hu, L.; Navarro-Hernández, M.I.; Liu, X.; Tomás, R.; Tang, X.; Bru, G.; Ezquerro, P.; Zhang, Q. Analysis of regional large-gradient land subsidence in the Alto Guadalentín Basin (Spain) using open-access aerial LiDAR datasets. Remote Sens. Environ. 2022 , 280 , 113218. [ Google Scholar ] [ CrossRef ]
  • Chen, B.; Deng, K.; Fan, H.; Hao, M. Large-scale deformation monitoring in mining area by D-InSAR and 3D laser scanning technology integration. Int. J. Min. Sci. Technol. 2013 , 23 , 555–561. [ Google Scholar ] [ CrossRef ]
  • Chen, B.Q.; Deng, K.Z. Integration of D-InSAR technology and PSO-SVR algorithm for time series monitoring and dynamic prediction of coal mining subsidence. Surv. Rev. 2014 , 46 , 392–400. [ Google Scholar ] [ CrossRef ]
  • Dong, L.; Wang, C.; Tang, Y.; Tang, F.; Zhang, H.; Wang, J.; Duan, W. Time Series InSAR Three-Dimensional Displacement Inversion Model of Coal Mining Areas Based on Symmetrical Features of Mining Subsidence. Remote Sens. 2021 , 13 , 2143. [ Google Scholar ] [ CrossRef ]
  • Liu, G.S.; Lv, J.C. Research on key technologies for monitoring changes in mining areas based on unmanned aerial vehicles. Bull. Surv. Mapp. 2013 , 95–98. (In Chinese) [ Google Scholar ]
  • Sui, L.; Zhang, Y.; Zhang, S.; Chen, W. Filtering of airborne LiDAR point cloud data based on progressive TIN. Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomat. Inf. Sci. Wuhan Univ. 2011 , 36 , 1159–1163. [ Google Scholar ]
  • Polat, N.; Uysal, M. Investigating performance of Airborne LiDAR data filtering algorithms for DTM generation. Measurement 2015 , 63 , 61–68. [ Google Scholar ] [ CrossRef ]
  • Cook, K.L. An evaluation of the effectiveness of low-cost UAVs and structure from motion for geomorphic change detection. Geomorphology 2017 , 278 , 195–208. [ Google Scholar ] [ CrossRef ]
  • Liang, X.; Kankare, V.; Hyyppä, J.; Wang, Y.; Kukko, A.; Haggrén, H.; Yu, X.; Kaartinen, H.; Jaakkola, A.; Guan, F.; et al. Terrestrial laser scanning in forest inventories. ISPRS J. Photogramm. Remote Sens. 2016 , 115 , 63–77. [ Google Scholar ] [ CrossRef ]
  • Mengxia, Z.; Xinqi, Z.; Bo, L. Dynamic Monitoring Method for Mining Area Based on UAV Images. Bull. Surv. Mapp. 2017 , 43–47. (In Chinese) [ Google Scholar ]
  • Jianfeng, Z.; Pengcheng, Z.; Dejun, W.; Chenyang, M.; Chongwu, Z.; Hongwu, Y.; Dong, W. UAV aerial triangulation: Point error distributions and the influencing mechanisms of ground control points on its accuracy. Coal Geol. Explor. 2023 , 51 , 151–161. (In Chinese) [ Google Scholar ]
  • Ćwiąkała, P.; Gruszczyński, W.; Stoch, T.; Puniach, E.; Mrocheń, D.; Matwij, W.; Matwij, K.; Nędzka, M.; Sopata, P.; Wójcik, A. UAV Applications for Determination of Land Deformations Caused by Underground Mining. Remote Sens. 2020 , 12 , 1733. [ Google Scholar ] [ CrossRef ]
  • Cardenal, J.; Fernández, T.; Pérez-García, J.L.; Gómez-López, J.M. Measurement of Road Surface Deformation Using Images Captured from UAVs. Remote Sens. 2019 , 11 , 1507. [ Google Scholar ] [ CrossRef ]
  • Liu, J.; Liu, X.; Lv, X.; Wang, B.; Lian, X. Novel Method for Monitoring Mining Subsidence Featuring Co-Registration of UAV LiDAR Data and Photogrammetry. Appl. Sci. 2022 , 12 , 9374. [ Google Scholar ] [ CrossRef ]
  • Lian, X.-g.; Liu, X.-y.; Ge, L.; Hu, H.F.; Du, Z.; Wu, Y.-r. Time-series unmanned aerial vehicle photogrammetry monitoring method without ground control points to measure mining subsidence. J. Appl. Remote Sens. 2021 , 15 , 024505. [ Google Scholar ] [ CrossRef ]
  • Zhou, D.; Qi, L.; Zhang, D.; Zhou, B.; Guo, L. Unmanned Aerial Vehicle (UAV) Photogrammetry Technology for Dynamic Mining Subsidence Monitoring and Parameter Inversion: A Case Study in China. IEEE Access 2020 , 8 , 16372–16386. [ Google Scholar ]
  • Yang, X.; Yao, W.; Zheng, J.; Ma, B.; Ma, X. UAV terrain following technology application in the mining subsidence monitoring research. Bull. Surv. Mapp. 2021 , 111–115. (In Chinese) [ Google Scholar ]
  • Puniach, E.; Gruszczyński, W.; Stoch, T.; Mrocheń, D.; Ćwiąkała, P.; Sopata, P.; Pastucha, E.; Matwij, W. Determination of the coefficient of proportionality between horizontal displacement and tilt change using UAV photogrammetry. Eng. Geol. 2023 , 312 , 106939. [ Google Scholar ] [ CrossRef ]
  • Liu, X.; Zhu, W.; Lian, X.; Xu, X. Monitoring Mining Surface Subsidence with Multi-Temporal Three-Dimensional Unmanned Aerial Vehicle Point Cloud. Remote Sens. 2023 , 15 , 374. [ Google Scholar ] [ CrossRef ]
  • Siafali, E.; Tsioras, P.A. An Innovative Approach to Surface Deformation Estimation in Forest Road and Trail Networks Using Unmanned Aerial Vehicle Real-Time Kinematic-Derived Data for Monitoring and Maintenance. Forests 2024 , 15 , 212. [ Google Scholar ] [ CrossRef ]

Click here to enlarge figure

Parameter | Value
UAV model | DJI Phantom 4 RTK
Maximum flight time (min) | 30
Camera name | FC6130r
Camera resolution | 20 megapixels
Camera sensor | 1-inch CMOS
Image width (px) | 5472
Image height (px) | 3648

Time | Altitude (m) | GSD (cm) | Number of Images per Issue | Route Overlap | Lateral Overlap
23 June 2023 | 50 | 1.4 | 104 | 85% | 75%

Date | Altitude (m) | GSD (cm) | Overlap Rate | Number of Images | Weather
4 March 2021 | 25 | 0.7 | 80%/70% | 1109 | Sunny
1 April 2021 | 25 | 0.7 | 80%/70% | 1091 | Sunny
18 April 2021 | 25 | 0.7 | 80%/70% | 1100 | Cloudy
21 May 2021 | 25 | 0.7 | 80%/70% | 1109 | Sunny
10 June 2021 | 25 | 0.7 | 80%/70% | 1091 | Cloudy
13 July 2021 | 25 | 0.7 | 80%/70% | 1100 | Sunny
11 January 2022 | 25 | 0.7 | 80%/70% | 1100 | Sunny

Phase | RMSE in UAV Original Data Measurements (mm) | RMSE in UAV Processed Data Measurements (mm) | Precision Improvement
Phase IV | 13 | 8 | 38.5%
Phase V | 18 | 11 | 38.9%
Phase VI | 20 | 12 | 40.0%
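
The precision improvement figures for Phases IV to VI are consistent with a relative reduction in RMSE between the original and the processed UAV measurements. The following is a minimal Python sketch under that assumption; the function name `precision_improvement` is illustrative and does not come from the paper.

```python
def precision_improvement(rmse_original_mm: float, rmse_processed_mm: float) -> float:
    """Relative RMSE reduction in percent (assumed definition of the column)."""
    return (rmse_original_mm - rmse_processed_mm) / rmse_original_mm * 100.0


# RMSE values in mm as reported per phase: (original data, processed data)
phases = {
    "Phase IV": (13, 8),
    "Phase V": (18, 11),
    "Phase VI": (20, 12),
}

# Round to one decimal place, matching the precision used in the table
improvements = {
    name: round(precision_improvement(original, processed), 1)
    for name, (original, processed) in phases.items()
}

for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
```

Under this reading, the reported percentages follow directly from the two RMSE columns, so no additional information is needed to check them.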

Share and Cite

Zha, J.; Miao, P.; Ling, H.; Yu, M.; Sun, B.; Zhong, C.; Hao, G. Deformation-Adapted Spatial Domain Filtering Algorithm for UAV Mining Subsidence Monitoring. Sustainability 2024 , 16 , 8039. https://doi.org/10.3390/su16188039


HDML: hybrid data-driven multi-task learning for China’s stock price forecast

  • Published: 13 September 2024


  • Weiqiang Xu 1 ,
  • Yang Liu 1 ,
  • Wenjie Liu 1 ,
  • Huakang Li 2 &
  • Guozi Sun 1  

Recent years have witnessed the rapid development of China’s stock market, but investment risks have also emerged. Stock prices are unstable and non-linear, affected not only by historical transaction data but also by national policies, news, and other data. Stock price and textual data are beginning to be employed together in the prediction process. However, the challenge lies in effectively integrating the feature information derived from stock prices with that derived from text. To address this problem, this paper proposes a Hybrid Data-driven Multi-task Learning (HDML) framework to predict stock prices. HDML adopts hybrid data as model input, mining the transaction and capital flow data in the stock market and considering the impact of investors’ emotions on the stock market. In addition, we incorporate multi-task learning, which predicts the closing price range of a stock based on structured data and then corrects the prediction results using investors’ comment text data. HDML effectively captures the relationship between data of different modalities through multi-task learning and achieves improvements on both tasks. The experimental results show that, compared with previous work, HDML reduces the RMSE on the evaluation set by 12.14% and improves the F1 score by an average of 13.64%. Moreover, value at risk (VaR), together with the HDML model, can help investors weigh the potential gains against the associated risks.
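
The abstract reports results in terms of RMSE (for the price prediction task) and F1 score (for the classification task). As a reminder of what these two metrics measure, here is a minimal Python sketch on invented toy data. The numbers are not from the paper, the helper names are hypothetical, and the binary F1 shown here is an assumption, since the abstract does not say how the paper averages F1.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def f1_score(labels_true, labels_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels_true, labels_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy closing prices (illustrative only)
prices_true = [10.0, 11.0, 12.0, 13.0]
prices_pred = [10.0, 11.0, 12.0, 15.0]
price_rmse = rmse(prices_true, prices_pred)

# Toy up/down labels (1 = price rises, 0 = price falls; illustrative only)
moves_true = [1, 0, 1, 1]
moves_pred = [1, 0, 0, 1]
move_f1 = f1_score(moves_true, moves_pred)

print(f"RMSE = {price_rmse:.3f}, F1 = {move_f1:.3f}")
```

A lower RMSE and a higher F1 are both improvements, which is why the abstract describes one as a reduction and the other as an increase.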


Data Availability

The datasets are available from the corresponding author on reasonable request.


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. We acknowledge the support received from the National Natural Science Foundation of China (No. 62472234, No. 62372245), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region, China under the grant number of 2024D01A55.

Author information

Authors and affiliations.

School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Nanjing, 210023, Jiangsu, China

Weiqiang Xu, Yang Liu, Wenjie Liu & Guozi Sun

School of Artificial Intelligence and Advanced Computing, Xi’an Jiaotong-Liverpool University, No. 111, Ren’ai Road, Suzhou, 215123, Jiangsu, China

Huakang Li

Contributions

Weiqiang Xu: Conceptualization, Methodology, Writing -original draft; Yang Liu: Data curation, Writing -original draft; Wenjie Liu: Conceptualization, Data curation; Huakang Li: Writing -review and editing; Guozi Sun: Writing -review & editing.

Corresponding author

Correspondence to Guozi Sun .

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article. The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Ethical and Informed Consent for Data Used

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Specific data items and corresponding definition

The historical transaction dataset contains information such as opening price, closing price, highest price, and trading volume. The historical capital flow dataset contains information such as turnover rate, net inflow of major forces, and net inflow. The specific data items and corresponding definitions are as shown in Table 10 .

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Xu, W., Liu, Y., Liu, W. et al. HDML: hybrid data-driven multi-task learning for China’s stock price forecast. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05838-8


Accepted : 31 August 2024

Published : 13 September 2024

DOI : https://doi.org/10.1007/s10489-024-05838-8


  • Stock prediction
  • Data fusion
  • Multi-task learning
  • LSTM neural network
  • Self-attention


COMMENTS

  1. The application of text mining methods in innovation research: current

    Text mining techniques that rely on word frequency counts to measure contextual, psychological, linguistic, or semantic concepts and constructs are among the most widely adopted approaches for computer-aided analysis of textual data in management-related research so far (Duriau et al., 2007; Short et al., 2010).

  2. Text Preprocessing for Text Mining in Organizational Research: Review

    Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. ... Short Papers) (pp. 431-437). Association for Computational Linguistics. ... Spärck Jones K. (1972). A statistical interpretation of term specificity and its ...

  3. (PDF) Using Text Mining Techniques for Extracting Information from

    The primary goals of this research are (1) Using text mining techniques for identifying the topics of a scientific text related to ML research and developing a hierarchical and evolutionary ...

  4. Research trends in text mining: Semantic network and main path analysis

    To answer these questions, we analyzed 1856 papers about text mining stored in the international academic citation databases Scopus and Web of Science. To find current trends in text mining, semantic network analysis and main path analysis are used as text mining methods in this paper.

  5. Text mining in unstructured text: techniques, methods and analysis

    This paper briefly discusses and analyzes the text mining techniques and their applications in diverse fields of life. Moreover, the issues in the field of text mining that affect the accuracy and ...

  6. Text Mining Challenges and Applications, A Comprehensive Review

    Summary. Text mining, which is also known as text analysis, is defined as the process of extracting the proper text patterns from unstructured text data, which are ...

  7. Text Mining

    Text preprocessing strongly affects the success of the outcome of text mining. Tokenization, or splitting the input into words, is an important first step that seems easy but is fraught with small decisions: how to deal with apostrophes and hyphens, capitalization, punctuation, numbers, alphanumeric strings, whether the amount of white space is significant, whether to impose a maximum length ...

  8. Overview of Text Mining

    Abstract. Text mining and data mining are contrasted relative to automated prediction. Models are constructed by training on samples of unstructured documents, and results are projected to new text. A standard data format for input to prediction methods is described. The key objective of data preparation is to transform text into a numerical ...

  9. Text mining on social media data: a systematic literature review

    Text mining is the process of getting meaningful information from unstructured data. In this paper, a systematic literature review (SLR) was conducted to investigate text mining on social media data and to explore social media as a source for text mining. For this purpose, 40 articles were chosen from different notable ...

  10. Text mining methodologies with R: An application to central bank texts

    Higher saliency values indicate that a word is more helpful in identifying a specific topic than a randomly selected term. 7. Conclusion. In this paper, we review some of the most commonly used text mining methodologies. We demonstrate how text sentiment and topics can be extracted from a set of text sources.

  11. PDF The Text Mining Handbook

    Text mining is a new and exciting area of computer science research that tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. Similarly, link detection - a rapidly evolving approach to the analysis of text ...

  12. Text mining and semantics: a systematic mapping study

    As text semantics has an important role in text meaning, the term semantics has been seen in a vast sort of text mining studies. However, there is a lack of studies that integrate the different research branches and summarize the developed works. This paper reports a systematic mapping about semantics-concerned text mining studies. This systematic mapping study followed a well-defined protocol.

  13. Text Mining

    Text categorization with WEKA: A survey. Donatella Merlini, Martina Rossini, in Machine Learning with Applications, 2021. 1 Introduction. Text Mining is a term which generally refers to the automatic extraction of interesting and non-trivial information from text in an unstructured form; generally, its purpose is not to understand all or part of what is said by a particular speaker/writer, but ...

  14. Applications of text mining within systematic reviews

    In this paper, we describe the application of four text mining technologies, namely, automatic term recognition, document clustering, classification and summarization, which support the identification of relevant studies in systematic reviews. The contributions of text mining technologies to improve reviewing efficiency are considered and their ...

  15. Comprehensive review of text-mining applications in finance

    Text-mining technologies have substantially affected financial industries. As the data in every sector of finance have grown immensely, text mining has emerged as an important field of research in the domain of finance. Therefore, reviewing the recent literature on text-mining applications in finance can be useful for identifying areas for further research. This paper focuses on the text ...

  16. Text Mining in Education—A Bibliometrics-Based Systematic Review

    Text mining, also known as text data analytics, is the process of extracting patterns and useful textual details in terms of words and topics from written words (Ahadi et al., 2022; Tavana et al ...

  17. PDF The Text Mining Handbook

    The Text Mining Handbook ... and conference papers in these areas. James Sanger is a venture capitalist, applied technologist, and recognized industry expert ... term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distri-

  18. Text mining at the term level

    Previous work in text mining focused at the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the ...

  19. PDF Text Mining: Overview, Applications and Issues

    What is Text Mining? • It is also referred to as text data mining and is roughly equivalent to text analytics. • Process of extracting interesting and non-trivial knowledge from unstructured text. • Text analysis also involves the following: • information retrieval • lexical analysis to study word frequency distribution

  20. Text mining method to unravel long COVID's clinical ...

    Therefore, the use of text mining to evaluate non-structured EHRs provides a great opportunity to improve the knowledge of long COVID in areas with resource-limited settings.

  21. What Is Text Mining?

    Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. You can use text mining to analyze vast collections of textual materials to capture key concepts, trends and hidden relationships. By applying advanced analytical techniques ...

  22. PDF Text mining methodologies with R: An application to central bank texts

    ... approach to text analysis can be described by several sequential steps. Given the unstructured nature of text data, a consistent and repeatable approach is required to assign a set of meaningful quantitative measures to this type of data. This process can be roughly divided into four steps: data selection, da.

  23. Text Mining in Data Mining

    Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use of natural language processing (NLP) techniques to extract useful information and insights from large amounts of unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone process for ...

  25. (PDF) Text Mining: Use of TF-IDF to Examine the ...

    So, to rectify this issue, the occurrence of any term in a document is divided by the total terms present in that document, to find the term frequency. So, in this case the term frequency of ...
