Mining of Text Data with the Application of Side Information-A Review


Prof. Garima Singh*, Ms. Neha Tiwari

Computer Science and Engineering,   W.C.O.E.M.,  Nagpur

*Corresponding Author Email:



Abstract:
In text mining, side information often accompanies the text documents, even though much of the data in those documents consists of unstructured text. Side information arises in many text mining applications: user-access behavior from web logs, document provenance information, links within the document, and other non-textual attributes embedded in the text. Such attributes can play a significant role in the clustering process. However, some of this information may be noisy, which makes it difficult to estimate its importance. In such a situation it is risky to use the side information blindly, because it may either improve the mining process or add noise to it. Therefore, to maximize the benefit of using side information in mining text data, we need a principled way to perform the mining process.






Introduction:
Text mining generally refers to the process of extracting non-trivial and interesting information and knowledge from unstructured text. It has been defined as the automatic extraction of information from different written resources through the computer-driven discovery of new, previously unknown information. Data mining tools are designed to handle structured data from databases or XML files, whereas text mining can handle semi-structured data sets such as emails, full-text documents, and HTML files, as well as unstructured data. Text mining is therefore a much better solution for companies where large volumes of diverse types of information must be merged and managed. In this sense, text mining is closely related to data mining.

Steps to Text Mining:

·	"Preprocessing" the text to distill the documents into a structured format.

·	Reducing the results to a more practical size.

·	Mining the reduced data with traditional data mining techniques.
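As an illustration, the three steps above can be sketched as a minimal pipeline. The stopword list, the top-k reduction, and the final mining step below are simplified placeholders of our own choosing, not the exact techniques a production system would use:

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "of"}  # tiny illustrative stopword list

def preprocess(document):
    # Step 1: distill raw text into a structured bag-of-words.
    tokens = [w.strip(".,").lower() for w in document.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def reduce_features(bag, top_k=3):
    # Step 2: reduce to a more practical size by keeping the top_k terms.
    return dict(bag.most_common(top_k))

def mine(bags):
    # Step 3 (placeholder): a traditional technique such as clustering
    # would run here; we simply report the most frequent term overall.
    total = Counter()
    for b in bags:
        total.update(b)
    return total.most_common(1)[0][0]

docs = ["Mining of text data is useful.", "Text mining uses side information."]
bags = [reduce_features(preprocess(d)) for d in docs]
print(mine(bags))  # → mining
```

The point of the sketch is that each step consumes the previous step's output, so the expensive mining step only ever sees the reduced, structured representation.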


Feature extraction is used during preprocessing to locate specific information such as organizations, addresses, and customer names. In ordinary searching, the user looks for something that has been written by someone else and is already known; the problem is pushing aside all the material that is not currently relevant in order to find the relevant information. The main objective of text mining, by contrast, is to discover information that no one yet knows and that has not yet been written down.



Our primary goal in this paper is to study the clustering problem, though such an approach can in principle be extended to other data mining problems in which auxiliary information is available with text. Such scenarios are common across a wide variety of data domains. We also propose a method to extend the approach to the classification problem, and we show that this extension provides superior results because of the incorporation of side information. Our goal is to show that the advantages of using side information extend beyond a pure clustering task and can provide competitive advantages for a wider variety of problem scenarios. We will extend our earlier clustering approach to incorporate supervision, creating a model which summarizes the class distribution in the data in terms of the clusters. We will then show how to use this summarized model for effective classification.



The problem of text clustering arises in many application domains such as the web, social networks, and other digital collections. The rapidly increasing amount of text data in these large online collections has led to an interest in creating scalable and effective mining algorithms. A tremendous amount of work has been done in recent years on clustering text collections in the database and information retrieval communities. However, this work is primarily designed for pure text clustering, in the absence of other kinds of attributes. While side information can sometimes improve the quality of the clustering process, relying on it is risky when it is noisy; in such cases it can actually worsen the quality of the mining process. Therefore, we use an approach which carefully ascertains the coherence between the clustering characteristics of the side information and those of the text content. This helps in magnifying the clustering effects of both kinds of data. The core of the approach is to determine a clustering in which the text attributes and the side information provide similar hints about the nature of the underlying clusters, while ignoring those aspects in which conflicting hints are provided.
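As a concrete (hypothetical) illustration of combining the two kinds of evidence, the sketch below scores a pair of documents with a convex combination of text similarity and side-attribute overlap. The weight `lam` stands in for the importance estimate that the actual approach computes from the data, and all document names and values are made up for the example:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    # Overlap between two sets of side attributes (e.g. authors, citations).
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(text_u, text_v, side_a, side_b, lam=0.5):
    # Convex combination: lam weights the text evidence against the
    # side-information evidence. In the actual approach, this weight
    # would reflect how coherent the side information is with the text.
    return lam * cosine(text_u, text_v) + (1 - lam) * jaccard(side_a, side_b)

d1 = {"cluster": 0.8, "mining": 0.6}
d2 = {"cluster": 0.7, "text": 0.4}
s1 = {"author_A", "conf_KDD"}
s2 = {"author_A", "conf_ICDE"}
print(round(combined_similarity(d1, d2, s1, s2), 3))  # → 0.514
```

Setting `lam` close to 1 falls back to pure text clustering; lowering it lets coherent side attributes pull related documents together.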

As baselines, the following algorithms were tested: (a) a naive Bayes classifier [3], (b) an SVM classifier [4], and (c) a supervised k-means method based on both text and side information. In the last case, classification is performed by assigning each test document to the nearest cluster based on text+side.
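For reference, baseline (a) can be sketched as a minimal multinomial naive Bayes over bag-of-words documents. This is a toy implementation with made-up training data, not the tuned classifier used in the actual experiments:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.n = len(labels)
        self.prior = Counter(labels)                      # class frequencies
        self.counts = {c: Counter() for c in self.prior}  # per-class term counts
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
        self.vocab = {w for cnt in self.counts.values() for w in cnt}
        return self

    def predict(self, words):
        def log_posterior(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / self.n)
            for w in words:
                lp += math.log((self.counts[c][w] + 1) / total)  # Laplace smoothing
            return lp
        return max(self.prior, key=log_posterior)

train = [["database", "query", "index"], ["neural", "network", "learning"]]
labels = ["DB", "ML"]
clf = NaiveBayes().fit(train, labels)
print(clf.predict(["query", "index"]))  # → DB
```

Baseline (c) differs in that the "training" step is a clustering over the combined text and side attributes, after which each test document is labeled by its nearest cluster.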



Existing System:

Either feature selection or feature extraction can be used to obtain an appropriate set of features for clustering. Pattern proximity is usually measured by a distance function defined on pairs of patterns. Euclidean, Minkowski, Manhattan, and supremum distances are used to calculate the dissimilarity between data objects, whereas cosine similarity, Pearson correlation, Bregman divergence, and Mahalanobis distance are used as similarity measures. All metrics are chosen carefully based on the feature types. The generated clusters are then assessed for validity: experts in the relevant fields interpret the data partition, and further experiments can be made to guarantee the reliability of the extracted knowledge. Measures adopted to evaluate clustering quality include statistical measures, mean square error, the silhouette coefficient, purity, entropy, and others. Normalized Mutual Information (NMI) is a clustering evaluation measure that is particularly suitable for document clustering. Starting with a collection of documents, a text mining tool retrieves a particular document and preprocesses it by checking its format and character set. It then goes through a text analysis phase, sometimes repeating techniques, until preprocessing has transformed the text into an information-rich term-by-document matrix.
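Of the evaluation measures listed above, NMI is the easiest to state concretely. The sketch below computes it from cluster assignments and ground-truth labels, normalizing by the arithmetic mean of the two entropies (the square-root normalization is an equally common variant):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(clusters, classes):
    # Normalized Mutual Information between a clustering and the
    # ground-truth classes: mutual information divided by the mean
    # of the two entropies, so the score lies in [0, 1].
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    pc, pl = Counter(clusters), Counter(classes)
    mi = sum(nij / n * math.log((nij / n) / ((pc[i] / n) * (pl[j] / n)))
             for (i, j), nij in joint.items())
    denom = (entropy(clusters) + entropy(classes)) / 2
    # Both partitions trivial (single cluster): conventionally perfect.
    return mi / denom if denom else 1.0

clusters = [0, 0, 1, 1]
truth = ["A", "A", "B", "B"]
print(nmi(clusters, truth))  # → 1.0 (perfect agreement)
```

A random assignment of the same four documents would score near 0, which is why NMI is preferred over raw accuracy when cluster identifiers do not line up with class names.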



Proposed System:
By understanding the problems of the existing system, we introduce a new approach for clustering documents using side information. We will work on real data sets which contain unstructured data. There are a number of algorithms used for clustering such data, but individual documents also carry various kinds of side information which can sometimes help to form better clusters. In this project we therefore focus on finding the side information in the given documents. For that purpose we will use TF-IDF scores to find the keywords, and we treat the remaining terms, separated by a threshold value, as side information. For forming clusters we will develop a text clustering method based on both content and auxiliary attributes.
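The keyword/side-information split described above can be sketched as follows. The smoothing-free TF-IDF formula and the threshold value 0.1 are illustrative assumptions, not the project's final settings:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one term->TF-IDF dict per document.
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return weights

def split_keywords(weights, threshold):
    # Terms scoring above the threshold are treated as content keywords;
    # everything else is treated as auxiliary (side) information.
    keys = {t for t, w in weights.items() if w > threshold}
    return keys, set(weights) - keys

docs = [["text", "cluster", "mining"],
        ["text", "graph", "mining"],
        ["text", "side", "information"]]
w = tf_idf(docs)
keys, side = split_keywords(w[0], 0.1)
print(sorted(keys), sorted(side))  # → ['cluster', 'mining'] ['text']
```

Note that a term occurring in every document gets a zero IDF and therefore always falls below the threshold, which matches the intuition that ubiquitous terms carry no content signal.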



(i)       Method of Data Collection

The data sets used were as follows:


(1)     Cora Data Set:

The Cora data set contains 19,396 scientific publications in the computer science domain. Each research paper in the Cora data set is classified into a topic hierarchy; on the leaf level, there are 73 classes in total. We used the second-level labels in the topic hierarchy, which comprise 10 classes: Information Retrieval, Databases, Artificial Intelligence, Encryption and Compression, Operating Systems, Networking, Hardware and Architecture, Data Structures Algorithms and Theory, Programming, and Human Computer Interaction. We further obtained two types of side information from the data set, citation and authorship, which were used as separate attributes to assist in the clustering process. There are 75,021 citations and 24,961 authors. Each paper has 2.58 authors on average, and there are 50,080 paper-author pairs in total.


(2) DBLP-Four-Area Data Set:

The DBLP-Four-Area data set [13] is a subset extracted from DBLP that covers four data mining related research areas: databases, data mining, information retrieval, and machine learning. This data set contains 28,702 authors, and the texts are the important terms associated with the papers that were published by these authors. In addition, the data set contains information about the conferences in which each author published. There are 20 conferences in these four areas and 44,748 author-conference pairs. Besides the author-conference attribute, we also used co-authorship as another type of side information, with 66,832 co-author pairs in total.


(3) IMDB Data Set:

The Internet Movie Database (IMDB) is an online collection of movie information. We obtained ten years of movie data, from 1996 to 2005, from IMDB in order to perform text clustering. We used the plot of each movie as the text for pure text clustering, and the genre of each movie is regarded as its class label. We extracted movies from the top four genres in IMDB, labeled Short, Drama, Comedy, and Documentary, and removed movies belonging to more than two of these genres. There were 9,793 movies in total: 1,718 from the Short genre, 3,359 from the Drama genre, 2,324 from the Comedy genre, and 2,392 from the Documentary genre. The names of the directors, actors, actresses, and producers were used as categorical attributes corresponding to side information. The IMDB data set contains 14,374 movie-director pairs, 154,340 movie-actor pairs, 86,465 movie-actress pairs, and 36,925 movie-producer pairs.



Conclusion:
We presented methods for mining text data with the use of side information. Many forms of text databases contain a large amount of side information or meta-information, which may be used to improve the clustering process. To design the clustering method, we combined an iterative partitioning technique with a probability estimation process which computes the importance of different kinds of side information. This general approach is used to design both clustering and classification algorithms. We presented results on real data sets illustrating the effectiveness of our approach. The results show that the use of side information can greatly enhance the quality of text clustering and classification, while maintaining a high level of efficiency.



References:

1.        C. C. Aggarwal and H. Wang, Managing and Mining Graph Data. New York, NY, USA: Springer, 2010.

2.        C. C. Aggarwal, Social Network Data Analytics. New York, NY, USA: Springer, 2011.

3.        C. C. Aggarwal and C.-X. Zhai, Mining Text Data. New York, NY, USA: Springer, 2012.

4.        C. C. Aggarwal and C.-X. Zhai, “A survey of text classification algorithms,” in Mining Text Data. New York, NY, USA: Springer, 2012.

5.        C. C. Aggarwal and P. S. Yu, “A framework for clustering massive text and categorical data streams,” in Proc. SIAM Conf. Data Mining, 2006, pp. 477–481.

6.        C. C. Aggarwal, S. C. Gates, and P. S. Yu, “On using partial supervision for text categorization,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 2, pp. 245–255, Feb. 2004.

7.        C. C. Aggarwal and P. S. Yu, “On text clustering with side  information,” in Proc. IEEE ICDE Conf., Washington, DC, USA, 2012.

8.        R. Angelova and S. Siersdorfer, “A neighborhood-based approach for clustering of linked document collections,” in Proc. CIKM Conf., New York, NY, USA, 2006, pp. 778–779.

9.        A. Banerjee and S. Basu, “Topic models over text streams: A study of batch and online unsupervised learning,” in Proc. SDM Conf., 2007, pp. 437–442.

10.     J. Chang and D. Blei, “Relational topic models for document networks,” in Proc. AISTATS, Clearwater, FL, USA, 2009, pp. 81–88.

11.     D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: A cluster-based approach to browsing large document collections,” in Proc. ACM SIGIR Conf., New York, NY, USA, 1992, pp. 318–329.

12.     I. Dhillon, “Co-clustering documents and words using bipartite  spectral graph partitioning,” in Proc. ACM KDD Conf., New York, NY, USA, 2001, pp. 269–274.




Received on 02.03.2015            Accepted on 13.04.2015     

© All Rights Reserved

Int. J. Tech. 5(1): Jan.-June 2015; Page 21-24