A Comprehensive Survey Paper on Sensor Data Mining Based on Sensor Networks
Sweta Kumari and Varsha Singh
National Institute of Technology, Raipur
*Corresponding Author E-mail: swetakumari9@gmail.com, varsha_x_singh@yahoo.com
ABSTRACT:
Wireless sensor networks (WSNs) aim to improve different aspects of our lives, and the technology has expanded rapidly over the past decades. The central concept in these systems is the sensor node, a small microprocessor integrated with a number of sensors. The functions of the sensors determine how a WSN is used, and the possibilities are enormous. This paper focuses on sensor data mining based on wireless sensor networks. Sensor data mining is useful in applications such as traffic control, disease diagnosis, and animal tracking.
KEYWORDS: Wireless Sensor Networks (WSNs), Sensor Nodes, Utilization, Existing Medical Applications (EXMAs), Healthcare.
INTRODUCTION:
A Wireless Sensor Network (WSN) is a set of small, autonomous devices working together to solve different problems. It is a relatively new technology that has expanded rapidly over the past decade. Research in the field of nanostructures and sensors has brought real opportunities for the development of WSNs. Integrating small, cheap microcontrollers with sensors yields extremely useful devices, which can be used as integral parts of sensor nets. These devices are called sensor nodes. Nodes are able to communicate with each other over different protocols. Studies of communication protocols for wireless sensor networks are particularly interesting and rely on various network topologies. Issues addressed by communication among nodes include power management, data transfer, and mobility patterns. As mentioned above, WSNs are a new technology. The history of the concept begins at the University of California (UC), Berkeley, with the Smart Dust project, which was funded by the Defense Advanced Research Projects Agency (DARPA) [1]. The aim of this project was to develop a self-organized, millimeter-scale hardware platform for distributed WSNs.
Primarily, this was a military application, which resulted in the development of relatively large sensor nodes. Later miniaturization brought much smaller devices with solid sensing and communication capabilities.
One of the key points in the history of wireless sensor networks was the implementation of an energy-efficient software platform, the TinyOS operating system, also developed at UC. Further development led to the implementation of different software platforms for WSNs. It was soon understood that putting sensor nodes to work together can improve infrastructure and solve problems in many fields. The generally accepted opinion points to the low cost of this technology and its many benefits. Today, sensor nets are used in agriculture, ecology, and tourism, but medicine is the area where they certainly have the greatest potential.
The current limitation of data mining applications in sensor networks is that existing distributed data mining techniques impose heavy demands on computation and/or communication. In addition to the trivial approach of sending all collected data from the individual sensors to the root, where any standard data mining technique can then be applied, two techniques for distributed data mining are known: the Collective Data Mining (CDM) framework introduced by Kargupta et al. [8], and meta-learning [5].
Fig 1 : Wireless Sensor Network
Sensor Field: A sensor field can be considered as the area in which the nodes are placed.
Sensor Nodes: Sensor nodes are the heart of the network. They are in charge of collecting data and routing this information back to a sink.
Sink: A sink is a sensor node with the specific task of receiving, processing and storing data from the other sensor nodes. Sinks reduce the total number of messages that need to be sent, and hence the overall energy requirements of the network. The network usually assigns such points dynamically. Regular nodes can also be considered sinks if they delay outgoing messages until they have aggregated enough sensed information. Sinks are also known as data aggregation points.
Task Manager: The task manager, also known as the base station, is a centralized point of control within the network, which extracts information from the network and disseminates control information back into it. It also serves as a gateway to other networks, a powerful data processing and storage centre, and an access point for a human interface. The base station is either a laptop or a workstation. Data is streamed to these workstations via the internet, wireless channels, satellite, etc. Hundreds to several thousand nodes are deployed throughout a sensor field to create a wireless multi-hop network. Nodes can use wireless communication media such as infrared, radio, optical media or Bluetooth for their communications. The transmission range of the nodes varies according to the communication protocol in use.
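The message-saving role of a sink described above can be illustrated with a minimal Python sketch. It compares forwarding every reading as its own message with buffering readings and forwarding one summary per batch. The function names, the batch size, and the choice of (count, mean) as the summary are illustrative assumptions and do not correspond to any particular WSN protocol.

```python
from statistics import mean

def forward_raw(readings):
    """Baseline: every reading is sent on as its own message."""
    return [[r] for r in readings]

def forward_aggregated(readings, batch_size):
    """Sink behaviour (assumed): buffer readings, send one (count, mean) per batch."""
    messages = []
    for start in range(0, len(readings), batch_size):
        batch = readings[start:start + batch_size]
        messages.append((len(batch), mean(batch)))
    return messages

# Hypothetical temperature readings collected by one node.
readings = [21.0, 21.5, 22.0, 22.5, 23.0, 23.5]
raw = forward_raw(readings)
agg = forward_aggregated(readings, batch_size=3)
print(len(raw), "messages without aggregation,", len(agg), "with aggregation")
```

The trade-off in this sketch is that the base station receives summaries only, so any later mining step that needs individual readings would require a different aggregate.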
The Sensor Node
A sensor node is a small device that combines micro-sensor technology, low-power signal processing, low-power computation and a short-range communications capability. Sensor nodes are conventionally made up of four basic components as shown in Figure 2: a sensor, a processor, a radio transceiver and a power supply/battery [11]. Additional components may include an Analog-to-Digital Converter (ADC), location finding systems, mobilizers that are required to move the node in specific applications, and power generators. The analog signals measured by the sensors are digitized via an ADC and in turn fed into the processor. The processor and its associated memory, commonly RAM, are used to manage the procedures that make the sensor node carry out its assigned sensing and collaboration tasks. The radio transceiver connects the node with the network and serves as the communication medium of the node.
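The path from analog signal to processor can be sketched as follows. This is a simplified model under stated assumptions: an ideal ADC with a hypothetical 4.0 V reference and 10-bit resolution, and a processor step that merely filters digitized samples by a threshold. Real nodes differ in reference voltage, resolution and processing.

```python
def adc_convert(voltage, v_ref=4.0, bits=10):
    """Map an analog voltage in [0, v_ref] to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits
    clamped = min(max(voltage, 0.0), v_ref)   # out-of-range inputs saturate
    code = int(clamped / v_ref * levels)
    return min(code, levels - 1)              # full-scale input maps to the top code

def process(samples, threshold_code):
    """Processor step (assumed): keep digitized samples at or above a threshold."""
    return [c for c in (adc_convert(v) for v in samples) if c >= threshold_code]

print(process([0.5, 1.0, 2.0], threshold_code=256))
```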
Knowledge Discovery in Databases
Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. In order to get this information, we try to find patterns in the given data set. To know if a pattern is valuable, the assessment of its interestingness and certainty is crucial. Patterns that are interesting and certain enough according to the user's measures are called knowledge. The output of a program that discovers such useful patterns is called discovered knowledge. According to [28], KDD exhibits four main characteristics:
· High-Level Language: The discovered knowledge is represented in a language that does not necessarily have to be directly used by humans, but its expression should be comprehensible.
· Accuracy: The measure of certainty implies whether the discovered patterns portray the contents of a database properly or not.
· Interestingness: Discovered knowledge is considered interesting if it fulfills the predefined biases. By denoting a pattern interesting, we mean that it is novel, potentially useful and the discovery process is nontrivial.
· Efficiency: Even for large data sets, the running time of the algorithm is acceptable and predictable.
Data and patterns are defined in [29]: “Here, data is a set of facts and pattern is an expression in some language describing a subset of the data or a model applicable to that subset. Patterns should be understandable, if not immediately then after some post-processing.” The so-called KDD process consists of several steps that are in place to achieve the defined goals for knowledge discovery.
The KDD Process:
Fig 2: Knowledge Discovery
· Data Understanding: learning the application domain for prior knowledge and goals of the application.
· Creating a target data set: selecting the subset of the data on which the data mining will be performed.
· Data cleaning and preprocessing: removing noise or outliers, developing strategies for handling missing data.
· Data reduction: reducing the dimensionality of the data set in order to get rid of data that is unnecessary for completing the mining task and thereby keep the computing time low.
· Selecting the data mining method: the most important task here is to find the method that will best suit the completion of the KDD goals.
· Choosing the data mining algorithm: there are many different data mining algorithms. Deciding on an efficient one to search for patterns in data is critical and includes decisions about appropriate models and parameters.
· Data mining: applying the previously chosen algorithm to the data set and searching for interesting patterns in a particular representational form.
· Interpreting mined patterns: visualizing the mined patterns and possibly returning to any of steps 1-7 if the results are unsatisfactory.
· Consolidating discovered knowledge: documenting the results and incorporating them into another system.
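The steps listed above can be sketched as a small pipeline. The example below is a minimal illustration under stated assumptions: the records, the node identifiers, the plausible temperature range used for cleaning, and the choice of "mean temperature per node" as the mining method are all hypothetical and stand in for whatever a real application would select.

```python
from statistics import mean

# Hypothetical raw records: (node_id, temperature, humidity); None marks a missing value.
raw = [
    ("n1", 21.0, 40.0),
    ("n1", 21.4, 41.0),
    ("n1", None, 42.0),
    ("n2", 35.0, 20.0),
    ("n2", 500.0, 21.0),   # implausible spike, treated as noise
    ("n2", 35.6, 19.0),
]

def select_target(records, nodes):
    """Creating a target data set: keep only the nodes of interest."""
    return [r for r in records if r[0] in nodes]

def clean(records, low=-40.0, high=85.0):
    """Cleaning: drop records with a missing or out-of-range temperature."""
    return [r for r in records if r[1] is not None and low <= r[1] <= high]

def reduce_dims(records):
    """Data reduction: keep only node id and temperature."""
    return [(node, temp) for node, temp, _ in records]

def mine(records):
    """Data mining (assumed method): mean temperature per node."""
    by_node = {}
    for node, temp in records:
        by_node.setdefault(node, []).append(temp)
    return {node: mean(temps) for node, temps in by_node.items()}

def interpret(pattern, hot=30.0):
    """Interpretation: report nodes whose mean temperature exceeds a threshold."""
    return sorted(node for node, avg in pattern.items() if avg > hot)

target = select_target(raw, {"n1", "n2"})
pattern = mine(reduce_dims(clean(target)))
hot_nodes = interpret(pattern)
print(pattern, hot_nodes)
```

If the interpretation step gave unsatisfactory results, any earlier function could be rerun with different parameters, which mirrors the possible return to earlier steps in the process.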
Data Mining Strategies:
Depending on the goal we want to achieve with data mining, there are several data mining strategies to choose from. These strategies can be broadly classified into supervised learning, unsupervised learning, and market basket analysis. Supervised learning is mainly used for prediction. Several input variables are used to build models which predict a specified output variable. Supervised learning methods allow either a single output attribute or several. Unsupervised learning does not have any output variable but rather tries to find structures in the data by grouping the instances into different classes. The purpose of market basket analysis is to find regularities in data in order to explore customer behavior. Results can help retailers design promotions or recommendations.
Supervised Learning:
Supervised learning is used in almost any domain, mainly for the purpose of prediction. It is also called classification or inductive learning when used in association with machine learning. The goal is to learn a function from a given set of historical training data. This function generates the desired output; for example, it can predict whether a customer will buy a certain product or not. To be able to compute the function, we need enough training data to make an accurate prediction. Historical data about customers who either bought or did not buy the product after a promotion will enable us to find out which potential customers will respond to a promotion campaign. The data about a customer’s reaction to the campaign serves as the output variable.
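As a minimal sketch of learning a prediction function from historical data, the example below uses a 1-nearest-neighbour classifier, chosen here only because it is short; it is one of many supervised methods. The two features (store visits per month, past purchases) and all training records are hypothetical.

```python
import math

# Hypothetical training data: ((visits_per_month, past_purchases), reaction to promotion).
train = [
    ((1.0, 0.0), "no_buy"),
    ((2.0, 1.0), "no_buy"),
    ((8.0, 5.0), "buy"),
    ((9.0, 6.0), "buy"),
]

def nearest_neighbor_predict(train, query):
    """Return the label of the training example closest to the query (Euclidean distance)."""
    _, label = min(train, key=lambda example: math.dist(example[0], query))
    return label

print(nearest_neighbor_predict(train, (8.5, 5.5)))
print(nearest_neighbor_predict(train, (1.5, 0.5)))
```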
Unsupervised Learning:
Unlike in supervised learning, we do not want to predict a specific output here, but rather discover unknown structures that exist within a data set. The technique used for unsupervised learning is clustering. This technique orders the instances of a data set into groups with similar attributes and values. These groups of items are called clusters. It is important to note that instances within a single cluster are similar to each other, whereas instances of different clusters differ strongly from each other. By clusters we mean subsets of the overall data set that is being mined. Clusters are created in the mining process without a priori knowledge of cluster attributes.
The way sensor data is collected will undergo a revolution when the newly developing technology of distributed sensor networks becomes fully functional and widely available. Smart sensors will acquire full interconnection capabilities with similar devices, so that run-time data aggregation, parallel computing, and distributed hypothesis formation will become reality with off-the-shelf components and sensor boards. This revolution started around ten years ago, and now hardware and networks are converging on the first convincing solutions. Exploring and exploiting this paradigm is a renewed challenge for the pattern recognition and data mining community.
From the computational aspect, various data analysis techniques have been used, including classification algorithms such as decision trees (Chen et al., 2007; Gil et al., 2007), SVM (Chen et al., 2007; Williams et al., 2007), LogitBoost (Chen et al., 2007), and rule-based approaches (Alwan et al., 2006); mixture models such as Gaussian mixtures (Eagle & Pentland, 2006); pattern recognition algorithms (Fogarty et al., 2006); and more sophisticated data mining and machine learning techniques such as Markov chains (Eagle & Pentland, 2006), artificial neural networks (Sixsmith & Johnson, 2004), and Bayesian models (Tapia et al., 2004). Another approach, recently used to discriminate patterns generated from healthy and pathological states as well as aging, is based on frequency and rank order statistics of symbolic sequences, because complex physiological signals may carry unique dynamic signatures related to their underlying mechanisms (Shieh et al., 2006).
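A minimal clustering sketch is given below using k-means (Lloyd's algorithm), one common clustering method picked here for illustration. The data points, the number of clusters, and the fixed initial centroids are assumptions made so that the run is deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical two-dimensional sensor readings forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No class labels are supplied; the two groups emerge only from the distances between the instances, which is the sense in which clusters are created without a priori knowledge.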
Classification and Association:
Information is gathered almost everywhere in our everyday lives. For example, at supermarket checkouts, information about customer purchases is recorded. When payback or discount cards are used, information about customer purchasing behavior can be linked with personal details. Evaluation of this information can help retailers devise more efficient and personalized marketing strategies. The amount of information stored in modern databases makes manual analysis intractable.
Data mining provides tools to reveal previously unknown information in large databases. A well-known data mining technique is association rule mining. It is able to find all interesting relationships (called associations) in a database. Nowadays the discriminative power of these descriptive patterns—the association rules—is used to build accurate classifiers. This thesis compares different association rule mining algorithms and evaluates them in a new way by using the discriminative power of association rules. As a consequence we build and compare classifiers based on association rules.
Association rule mining is a widely-used approach in data mining. Association rules are capable of revealing all interesting relationships in a potentially large database. The abundance of information captured in the set of association rules can be used not only for describing the relationships in the database, but also for discriminating between different kinds or classes of database instances.
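To make the notion of an association rule concrete, the sketch below enumerates rules of the form {a} -> {b} over a handful of hypothetical transactions and keeps those that pass a minimum support and a minimum confidence. Support and confidence are used here as the interestingness measures; the exhaustive enumeration of item pairs is a simplification for illustration and is not an efficient mining algorithm such as Apriori.

```python
from itertools import permutations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_pair_rules(transactions, min_support, min_confidence):
    """Enumerate rules {a} -> {b} that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        sup = support({a, b}, transactions)
        if sup < min_support:
            continue  # prune: the pair occurs too rarely
        conf = sup / support({a}, transactions)
        if conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

rules = mine_pair_rules(transactions, min_support=0.5, min_confidence=0.9)
print(rules)
```

With these thresholds only the rule butter -> bread survives: every transaction containing butter also contains bread, whereas bread -> butter holds in only two of three bread transactions.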
However, a major problem in association rule mining is its complexity. Even for moderately sized databases it is intractable to find all the relationships. This is why a mining approach defines an interestingness measure to guide the search and prune the search space. Therefore, the result of an arbitrary association rule mining algorithm is not the set of all possible relationships, but the set of all interesting ones. The definition of the term interesting, however, depends on the application. The different interestingness measures and the large number of rules make it difficult to compare the output of different association rule mining algorithms. There is a lack of comparison measures for the quality of association rule mining algorithms and their interestingness measures. Association rule mining algorithms are often compared using time complexity. That is an important aspect of the mining process, but it ignores the quality of the resulting rule set.
On the other hand, there are approaches that investigate the discriminating power of association rules and use it to solve a classification problem. This research area is called classification using association rules. It has to deal with a large number of rules; therefore, rule selection and rule weighting are essential for these approaches. An important aspect of classification using association rules is that it can provide quality measures for the output of the underlying mining process. The properties of the resulting classifier can be the basis for comparisons between different association rule mining algorithms. A certain mining algorithm is preferable when the mined rule set forms a more accurate, compact and stable classifier in an efficient way. The introduction of these quality measures, particularly the accuracy of the classifier, serves two purposes. First, in this thesis we are interested in comparing the quality of different mining algorithms; therefore, we use classification using association rules. Second, classification using association rules can itself be improved by using a mining algorithm that prefers highly accurate rules.
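A simplified sketch of classification using association rules is given below. It assumes a three-stage structure (mine class rules, rank them, classify with the first matching rule); this staging, the single-item antecedents, the ranking by confidence and then support, and the sensor-style attribute values are all illustrative assumptions rather than a reproduction of any published algorithm.

```python
from collections import Counter

# Hypothetical labelled instances: (set of attribute=value items, class label).
data = [
    ({"temp=high", "motion=yes"}, "alert"),
    ({"temp=high", "motion=no"}, "alert"),
    ({"temp=low", "motion=yes"}, "normal"),
    ({"temp=low", "motion=no"}, "normal"),
    ({"temp=low", "motion=yes"}, "normal"),
]

def mine_class_rules(data, min_support, min_confidence):
    """Stage 1: mine rules {item} -> label that pass both thresholds."""
    n = len(data)
    item_counts = Counter(item for items, _ in data for item in items)
    pair_counts = Counter((item, label) for items, label in data for item in items)
    rules = []
    for (item, label), count in pair_counts.items():
        sup, conf = count / n, count / item_counts[item]
        if sup >= min_support and conf >= min_confidence:
            rules.append((item, label, sup, conf))
    return rules

def rank_rules(rules):
    """Stage 2: order rules by confidence, then support (highest first)."""
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))

def classify(items, ranked_rules, default):
    """Stage 3: predict with the first rule whose antecedent occurs in the instance."""
    for item, label, _, _ in ranked_rules:
        if item in items:
            return label
    return default  # no rule applies

ranked = rank_rules(mine_class_rules(data, min_support=0.3, min_confidence=0.8))
print(classify({"temp=high", "motion=yes"}, ranked, default="normal"))
```

The accuracy of such a classifier on held-out instances is the kind of quality measure that could then be used to compare the rule sets produced by different mining algorithms.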
Fig 3: The three algorithmic steps in classification using association rules
Fig 4: Overview of Pruning Steps.
CONCLUSION:
To conclude, we emphasize the expansion of WSNs over the past few years. This is certainly a rising technology. As stated above, a great advantage of sensor nets is their compatibility with existing infrastructures. Another advantage is that these networks serve a number of completely different applications: agricultural, ecological, and especially medical. Miniaturization of sensor nodes and the resolution of the issues discussed will bring more sophisticated solutions and applications. Every new technology faces the question of success or failure. Despite the promising features, the risk of failure is always present, and overcoming it requires that these issues be addressed.
REFERENCES:
1. Agrawal R. and Srikant R. Fast Algorithms for Mining Association Rules. In M. Jarke J. Bocca and C. Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), pages 475–486, Santiago de Chile, Chile, September 1994. Morgan Kaufmann.
2. Blake C. and Merz C. UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998. University of California, Irvine, USA.
3. Der Brockhaus Computer und Informationstechnologie. F.A. Brockhaus, Mannheim, Germany, 2002.
4. Cohen W. Fast Effective Rule Induction. In A. Prieditis and S. Russell, editors, Machine Learning: Proceedings of the 12th International Conference (ICML’95), pages 115–123, Tahoe City, CA, USA, 1995. Morgan Kaufmann Publishers.
5. Dong G., Zhang X., Wong L. and Li J. CAEP: Classification by Aggregating Emerging Patterns. In Proceedings of the Second International Conference on Discovery Science, pages 30–42, Tokyo, Japan, 1999.
6. Fayyad U. and Irani K. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI’93), pages 1022–1027, Chambéry, France, 1993. Morgan Kaufmann.
7. Fayyad U., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R., editors. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, Massachusetts, USA, 1996.
8. Frank E. and Witten I. Generating Accurate Rule Sets Without Global Optimization. In J. Shavlik, editor, Machine Learning: Proceedings of the 15th International Conference (ICML’98), pages 152–160, San Francisco, USA, 1998. Morgan Kaufmann Publishers.
9. Hand D., Mannila H. and Smyth P. Principles of Data Mining. MIT Press, Cambridge, Massachusetts, USA, 2001.
10. Li W., Han J. and Pei J. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM’01), pages 369–376, San Jose, California, USA, 2001.
11. Liu B., Hsu W. and Ma Y. Integrating Classification and Association Rule Mining. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), pages 80–86, New York, USA, August 1998. The AAAI Press.
12. Mitchell T. Machine Learning. McGraw-Hill, 1997.
13. Nadeau C. and Bengio Y. Inference for the generalization error. Advances in Neural Information Processing Systems, 12:307–313, 1999.
14. Quinlan J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
15. Salzberg S. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, 1(3):317–327, 1997.
16. Scheffer T. Finding Association Rules That Trade Support Optimally against Confidence. Unpublished manuscript, more detailed version, algorithm’s pruning method changed.
17. Scheffer T. Finding Association Rules That Trade Support Optimally against Confidence. In L. De Raedt and A. Siebes, editors, Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’01), pages 424–435, Freiburg, Germany, September 2001. Springer-Verlag.
18. Stein J., editor. The Random House Dictionary of the English Language—the unabridged edition. Random House, Inc, New York, USA, 1967.
19. Witten I. and Frank E. Data Mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, 2000.
Received on 05.04.2011 Accepted on 12.04.2011
© EnggResearch.net All Right Reserved
Int. J. Tech. 1(1): Jan.-June. 2011; Page 37-41