Extraction of Meaningful Rules in a Medical Database

Clustering enhances the value of existing databases by revealing rules in the data. These rules are useful for understanding trends, predicting future events from historical data, or synthesizing data records into meaningful clusters. Clustering groups similar data items together. Clustering algorithms usually employ a similarity measure based on a distance metric (e.g., Euclidean) to partition the database so that data points in the same partition are more similar to each other than to points in different partitions. In this paper, we study clustering algorithms for data with categorical attributes. Traditional clustering algorithms rely on distances between points, which is not an appropriate concept for Boolean and categorical attributes. Instead, we propose a novel concept, HAC (hierarchy of attributes and concepts), to measure the similarity or proximity between a pair of data points. In this study, HAC is used as an aid to represent substructures of medical domain knowledge and thereby simplify the generation process of the databases through clustering. As a result, the research identifies interesting relationships and patterns among the data and represents them in the form of association rules.

(binary) tree. From this tree we can extract a set of k clusters, for example by terminating the merging when only k clusters remain, or when the closest pair of clusters is at a distance exceeding some threshold. The crucial part of this algorithm is defining a metric to measure the distance between two clusters of multiple points. Common mathematical definitions are the smallest distance between a point in one cluster and a point in the other, the greatest distance between such points, or the average distance. Each definition has its own advantages and disadvantages. This research focuses on hierarchical conceptual clustering in structured, discrete-valued databases. By structured data, we refer to information consisting of data points and relationships between the data points. This differs from the definition in which unstructured data contains free text and structured data contains feature vectors. Conceptual clustering is an important way of summarizing and explaining data [1,6]. However, the recent formulation of this paradigm has allowed little exploration of conceptual clustering as a means of improving performance. Furthermore, previous work in conceptual clustering has not explicitly dealt with constraints imposed by real-world environments. This chapter presents clustering using HAC (Hierarchy of Attributes and Concepts), a hierarchical conceptual clustering system that organizes data so as to maximize inference ability. The algorithm uses both hierarchical and conceptual clustering methods: it discovers substructures in the database that compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. 
Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. An important property of conceptual clustering is that it can enhance the value of existing databases by revealing patterns in the data. These patterns may be useful for understanding trends, for making predictions of future occurrences from historical evidence, or for synthesizing data records into meaningful clusters. A conceptual clustering system accepts a set of object descriptions (events, observations, facts) and produces a classification scheme over the observations. These systems use an evaluation function to determine classes with "good" conceptual descriptions. Learning of this kind is referred to as learning from observation (as opposed to learning from examples). Typically, conceptual clustering systems assume that the observations are available indefinitely, so that batch processing is possible using all observations. In this study, HAC is used as an aid to represent substructures of medical domain knowledge and thereby simplify the generation process of the databases through clustering. As a result, the research identifies interesting relationships and patterns among the data and represents them in the form of association rules.
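To make the traditional distance-based procedure described earlier in this section concrete, the following minimal Python sketch shows bottom-up merging with the three cluster-distance definitions mentioned (smallest, greatest and average point-to-point distance). It is only an illustration of the conventional approach, not of HAC; the function names, the Euclidean metric and the sample points are assumptions chosen for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_link(a, b):
    """Smallest distance between a point in cluster a and a point in cluster b."""
    return min(dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Greatest distance between a point in cluster a and a point in cluster b."""
    return max(dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Average distance over all cross-cluster point pairs."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, k, linkage=single_link, threshold=float("inf")):
    """Merge the closest pair of clusters until k clusters remain,
    or until the closest pair is farther apart than threshold."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > k:
        # Index pair (i, j), i < j, of the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical sample data, used only to exercise the sketch.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, k=3)
```

Because every step rescans all cluster pairs, this sketch is suitable only for small inputs. It also presupposes a numeric distance, which is exactly the assumption that does not carry over to Boolean and categorical attributes.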

Related Work
Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts among communities have made the transfer of useful generic concepts and methodologies slow to occur [2]. Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, in the hope that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world [5]. There are two general approaches to hierarchical clustering: top-down and bottom-up. The top-down approach starts with a cluster containing all points, which is recursively split until enough subclusters have been created. Heckel et al. [10,11] use this approach on vector fields where each point has a position and a vector. The cluster with the highest error value is split into two clusters, recursively, so that the error values of the remaining clusters decrease with each split. The bottom-up approach starts with all points as individual clusters and merges the two clusters with the least difference until one big cluster has been formed from all clusters. Telea and van Wijk [10] use this approach to simplify complex vector fields using an elliptic similarity function. They merge the pair of vectors with the least position, magnitude and direction differences until all vectors have been merged into one single vector. Conceptual clustering is used to summarize the result. 
It enhances the value of existing databases by revealing patterns in the data. These patterns may be useful for understanding trends, for making predictions of future occurrences from historical evidence, or for synthesizing data records into meaningful clusters. Hybrid conceptual clustering [3] successfully handles both the incremental and non-incremental clustering problems, but it is computationally expensive and, moreover, can be applied only to small data sets. In past decades, many conceptual clustering algorithms have been proposed that automatically acquire knowledge or concepts from large amounts of information gained from experience or observation [1,7,8,9]. Concepts in COBWEB are represented by probabilistic expressions and are acquired using four learning operators and an evaluation function called category utility. However, the category utility used in the original COBWEB is biased toward larger classes in the concept hierarchy. This bias produces spurious intermediate nodes in the concept hierarchy (classification tree). These nodes make the tree deeper and more complex, so the concepts within its nodes are harder to understand [9]. This chapter presents an efficient non-metric measure called HAC (Hierarchy of Attributes and Concepts) for clustering categorical as well as non-categorical (quantitative) data, through which the proximity and relationships between data items can be identified.

Hierarchy of Attributes and Concepts
The present paper introduces a hierarchical description of concepts by attributes. The mathematical formalization presents the concepts as matrices whose columns represent terms constructed from attributes. A spherical model is developed to present a vocabulary (spanned space) of concepts.

www.intechopen.com
Machine Learning 414

In the beginning of this section the following definitions are introduced.
Attribute: a basic characteristic or feature of a term.
Term: a set of connected attributes.
Concept: a language-independent meaning associated with at least one term or set of terms.
Vocabulary: a set of terms and concepts.

HAC is both a hierarchical and a conceptual clustering system that organizes data to maximize inference ability. The algorithm implements clustering by discovering substructures in the database that compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. HAC accepts a database of structured data (concepts) as input. This type of data is naturally represented in graph or diagrammatic form. The graph representation includes labeled vertices with vertex ID (identification) numbers, attributes on the X-axis and downward directed edges (see Fig. 1). Each vertex represents a concept, and the value of that concept is given by the directed edges (which usually map to an attribute's value on the X-axis or to some other concept).

Table (Table 2). Every row in this table (Table 2) consists of three fields. The first field gives the concept name, and the second and third fields give the concept value and the concept attributes. For example, the concept named C1, whose value is sense, is formed from the attributes F1, F2, F12, F17, F21, F36, F54, which are Food IDs.
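The three-field concept table described above can be sketched as a small data structure. The following Python fragment is a minimal illustration under stated assumptions: the class name `ConceptRow`, the helper `expand`, and the rows `CX` and `CY` with their members are hypothetical and appear only to show how a concept may refer either to attributes or to other concepts. Only the row for C1 is taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptRow:
    """One row of a concept table: name, value, and member attributes/concepts."""
    name: str
    value: str
    members: tuple  # attribute IDs or names of lower-level concepts

def expand(name, table):
    """Return the set of base attribute IDs reachable from the named concept."""
    leaves = set()
    for member in table[name].members:
        if member in table:          # member is itself a concept: descend
            leaves |= expand(member, table)
        else:                        # member is a base attribute
            leaves.add(member)
    return leaves

concept_table = {
    # Row taken from the text: C1 is formed from these Food IDs.
    "C1": ConceptRow("C1", "sense", ("F1", "F2", "F12", "F17", "F21", "F36", "F54")),
    # Hypothetical rows, included only to show a concept built from concepts.
    "CX": ConceptRow("CX", "hypothetical value", ("F3", "F4")),
    "CY": ConceptRow("CY", "hypothetical parent", ("C1", "CX")),
}

all_foods_under_cy = expand("CY", concept_table)
```

Storing members by name keeps each row small and mirrors the idea of replacing a substructure with a pointer to its definition; the recursive lookup recovers the underlying attributes when they are needed.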
Recall that HAC can be represented as a closed diagrammatical entity shown in

Approach
HAC is both a hierarchical and a conceptual clustering system that organizes data so as to maximize inference ability. The algorithm implements clustering by discovering substructures in the database that compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. HAC accepts a database of structured data (concepts) as input. This type of data is naturally represented using a graph. The graph representation includes labeled vertices with vertex ID numbers, attributes on the X-axis and downward directed edges. Each vertex represents a concept, and the value of that concept is given by the directed edges (which usually map to an attribute's value on the X-axis or to some other concept). With this graph we can make a concept

Spherical Model
The graph model developed in the previous section is useful for connection purposes. To provide an opportunity to generate and manipulate concepts, a matrix form is considered hereafter.

$$C_{k \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kn} \end{pmatrix}$$

Every column in $C_{k \times n}$ represents a term and every $a_{ij}$ represents an attribute. Thus every concept can be decomposed into a set of independent terms, and every term can be generated by a pivot element. Therefore every concept (matrix) can be transformed into a set of linearly independent terms (columns), created by linearly independent attributes (entries). As a consequence, every concept has an invertible minor, and the dimension of this minor is called the rank of the concept. The set of linearly independent terms $T_i$, where i = 1, 2, …, n, may span the entire set of concepts over the field of real numbers.
If the dimension of the set T is too large, the matrix form of concept presentation becomes memory inefficient. To overcome this obstacle, a spherical model is developed to represent the generation of terms and concepts. The central section of the sphere represents the entire set of linearly independent attributes used to build up the basis $T_i$, where i = 1, 2, …, m. The next part forms the terms and is followed by the concepts. Finally, the model ends with a single highest point, as shown in Figure 3. If the total number of terms, coming from a certain domain, is n then the maximum number of concepts to be generated by (n-1) terms over a numerical field of 1 element (number) is But if n-2 elements are used over the same numeric field the number of composed concepts

Fig. 3. The Spherical Model
Therefore the total number of concepts to be generated over a field with 1 element and n terms is $2^n$. If the numeric field consists of j elements (numbers), and the maximum size of a term (column) is k then the total number of concepts to be generated is n k j 2 . From a theoretical viewpoint this could be a very big number, but in practice the number of concepts is smaller because some terms may be mutually contradictory and cannot be linked in a single concept.
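One reading that is consistent with the stated total of $2^n$ is that, over a one-element field, a concept corresponds to a choice of a subset of the n terms. Under that assumption (which is an interpretation, not a statement from the text), the number of concepts built from exactly r terms is the binomial coefficient, and summing over all r gives $2^n$. The short Python sketch below checks this arithmetic; the function names are illustrative, and the count includes the empty selection.

```python
from math import comb

def concepts_using(n, r):
    """Number of ways to choose r of the n terms (assumed subset reading)."""
    return comb(n, r)

def total_concepts(n):
    """Sum over all subset sizes r = 0..n; equals 2**n."""
    return sum(comb(n, r) for r in range(n + 1))

n = 5                                  # hypothetical number of terms
with_n_minus_1 = concepts_using(n, n - 1)
with_n_minus_2 = concepts_using(n, n - 2)
total = total_concepts(n)
```

Even for moderate n this total grows quickly, which matches the remark that the theoretical count is very large while contradictory term combinations reduce it in practice.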

Algorithm
On the basis of the theoretical concepts an algorithm is developed to form an HAC.

Algorithm Forming_of_HAC_Clusters (A_Name [], A_Values [])
1. Make concept tables for the different attributes using HAC, such that each level or concept is meaningful.
Table" which will have attribute name, attribute value and relationship with different concepts. The attribute name is the name of the cluster, and every attribute name is associated with an attribute value.
3. The concept gives the value of the cluster, which is a combination of different attributes and concepts.
7. For i = 1 to num:
   V_i = total number of records under the different combinations of C1 (j, j+1, ..., m) and C2 (k, k+1, ..., n)
Here, V (i, i+1, ..., total_num) are the different clusters or attributes in the Cluster Point Table, C1 (j, j+1, ..., m) are the different concepts of concept table 1, and C2 (k, k+1, ..., n) are the different concepts of concept table 2.

From this Cluster Point Table and the concept tables (made through HAC) we can generate rules. The above algorithm is used to form the concepts and form an HAC from the given attribute values. The same algorithm is used if an update of an HAC is needed. The algorithm is capable of adding new concepts, terms and attributes.
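The counting step of the algorithm (step 7) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function name `build_cluster_points`, the representation of a concept as a set of attribute IDs, the concept names `D_other` and `C_other`, the disease attribute IDs `A1` to `A3`, and all records are hypothetical.

```python
from collections import Counter
from itertools import product

def build_cluster_points(records, concepts1, concepts2):
    """Count, for every pair of concepts (one from each concept table),
    the records whose two attribute values fall under that pair.

    records   -- iterable of (value1, value2) attribute-value pairs
    concepts1 -- dict: concept name -> set of attribute values it covers
    concepts2 -- dict: concept name -> set of attribute values it covers
    """
    support = Counter()
    for v1, v2 in records:
        for (c1, members1), (c2, members2) in product(
            concepts1.items(), concepts2.items()
        ):
            if v1 in members1 and v2 in members2:
                support[(c1, c2)] += 1
    # Label each non-empty combination V1, V2, ... in sorted order.
    return {
        f"V{i}": (pair, support[pair])
        for i, pair in enumerate(sorted(support), start=1)
    }

# Hypothetical concept tables and records, used only to exercise the sketch.
disease_concepts = {"D13": {"A1", "A2"}, "D_other": {"A3"}}
food_concepts = {"C1": {"F1", "F2", "F12"}, "C_other": {"F3"}}
records = [("A1", "F1"), ("A2", "F2"), ("A1", "F12"), ("A3", "F3"), ("A3", "F1")]

cluster_point_table = build_cluster_points(records, disease_concepts, food_concepts)
```

Each entry of the resulting table pairs a combination of concepts with its record count, which is the support value later attached to the rule derived from that cluster point.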

Case Study: Clustering Using the HAC
This example relates various allergy diseases to food categories in an allergy database. First, a chart of concepts is formed in which each concept has an allergy disease name as its value. Note that it could include multiple entries.

HAC1
In Table 1, which is the attribute table for allergy diseases, the attribute names are mapped to their values for reference purposes. The concepts D1 through D20 in the TACR for the attribute "Allergy diseases", shown in Table 3, summarize the concepts in the HAC hierarchy of Figure 1 that hold among different attributes and concepts of diseases. Note that the root concept D20 embraces all concepts and attributes of "Food related allergy diseases."

Table 3. TACR for Allergy Diseases

HAC2
Similarly, we can build the HAC structure for another attribute, "Food category", as shown in Figure 4. The hierarchy shows the conceptual relationships among the values and concepts related to the attribute "Foods". For this purpose the attribute table in Table 2

The concepts C1 through C14 in the TACR for the attribute "Food Category", shown in Table 4, summarize the concepts in the HAC hierarchy of Figure 4, showing the relationships among the different values and concepts associated with Food Category. Using the TACRs for the selected attributes, Diseases and Food Category in this case, a graph can be constructed to represent clusters at the concept level. The graph is shown in Figure 5. Each point in the graph represents a concept that is formed by a combination of attribute values. We call such a point a Concept Cluster Point (CCP). Since a cluster point represents a high-level concept, it converts naturally to a rule that matches the concept. Furthermore, the support of each rule is obtained by counting the entries in the original relational table that contribute to the concept.

Table 5. Cluster Point Table

In Table 5, the CCPs V1 through V9 each represent a concept cluster, that is, a high-level concept associated with allergy diseases and food categories, together with a support value. The next step is to convert these CCPs into characteristic rules. To help this process, the CCP graph in Figure 5 is used. Note that each CCP is associated with a support value, which is the weight supporting the converted rule that represents the concept. Each CCP in Figure 5 yields a rule stating how many people with a specific allergy are affected by a given food category or concept. A rule can directly be generated from the cluster point

Fig. 4. HAC for Food and Foodtypes
In this example, milk allergy is concept D13 and dairy products is concept C1; the combination of D13 and C1 is represented by V3, which has support 5.
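The conversion of a cluster point into a rule can be sketched in a few lines of Python. The example uses the values given in the text (D13 is milk allergy, C1 is dairy products, V3 has support 5) and the rule wording from the Result section; the function name `rule_from_ccp` and the dictionary layout are assumptions made for illustration.

```python
def rule_from_ccp(disease_value, food_value, support):
    """Render a cluster point as a characteristic rule with its support."""
    return (
        f"There are {support} persons affected by '{disease_value}' "
        f"due to '{food_value}'"
    )

# Concept values and the cluster point taken from the example in the text.
concept_values = {"D13": "Milk Allergy", "C1": "Dairy Products"}
cluster_points = {("D13", "C1"): ("V3", 5)}

label, support = cluster_points[("D13", "C1")]
rule = rule_from_ccp(concept_values["D13"], concept_values["C1"], support)
```

Keying the cluster points by the pair of concept names makes the lookup for a user-selected combination a single dictionary access.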

Implementation
We have used a medical database which contains information about patient causes and food types. It was generated synthetically, based on the case studies found on the websites [13,14].   Table' which has all the possible clusters, each cluster is a combination of 'Allergy disease' attribute and 'Food categories' attribute. We used non-numeric (categorical) data for our clustering algorithm. The algorithm is implemented in Java with a simple user interface that makes it easy to use. The user only needs to select concepts of the different attributes; the concepts are generated dynamically from the concept tables in the database. After selecting the concepts, the user presses the 'Submit' button, and the result, that is, the rules for our 'Cluster Point Graph', is displayed on an additional results page. The 'Cluster Point Graph' represents all the clusters possible from the different combinations of concepts.

Result
Depending on the concept tables, the results are generated in the form of rules. Each row in the 'cluster point table' represents a cluster that takes the form of a rule. The total number of rules depends on the number of concept tables. Here we use two concept tables, which generate rules as follows:
1. The two concept tables used are Allergy Diseases and Food Categories. Sample rules generated are of the following kinds:
a) Most likely, the persons are affected by a specific allergy due to a specific food category.
b) The total number of persons with specific values of the different concepts.
From the user's perspective, the user selects the disease and the food category, and the system shows the resulting cluster value from the cluster point table. If the user selects the concept 'Milk Allergy' and the concept 'Dairy Products' and submits the query, the system generates the rule 'There are 5 persons affected by 'Milk Allergy' due to 'Dairy Products''; a rule of the other kind can also be generated, such as 'There are 45 persons affected with different symptoms'. The sample result is shown in Figure 6.
2. Furthermore, the algorithm can be implemented on three concept tables. These generate rules based on the concept tables and input tables. The HAC generates different rules based on the concept tables with different combinations of concepts. Each concept table represents conceptual hierarchies of categorical attributes. This algorithm specifies the clustering on categorical attributes on the concept tables to derive the association rules in the form of cluster point

Conclusion
In this paper, we studied hierarchical conceptual clustering applied to structured databases. We have presented a new method, HAC, combined with conceptual clustering to explore categorical data. The main contribution of the paper is the HAC algorithm, which is applied to categorical attributes in place of traditional algorithms that rely on distance metrics. The results show that the data is categorized using hierarchical conceptual clustering with HAC. There are numerous types of clustering techniques, and most of them are applicable only to unstructured data. Sometimes clustering must be applied to categorical attributes, for which these techniques are not suitable. Non-metric measures are therefore used to cluster categorical attributes; these measures represent the proximity between the data attributes. HAC represents databases in the form of concept tables for categorical data, which contain the concepts formed on the domain. This technique can be applied to any field that has structured data. Structured data is given as input to the HAC algorithm, and the results are rules extracted from the data. Clustering is important for both types of data. Modern data mining mechanisms can then be applied to the data. From a machine learning standpoint, this research has been greatly influenced by work in conceptual clustering. HAC seeks classifications that maximize a heuristic measure (as in conceptual clustering systems) and uses a search strategy abstracted from incremental systems such as UNIMEM [8].

Future Work
Our algorithm can be effective in medical areas related to allergic disease. Implementing the algorithm with multiple concept tables is a potential extension of the approach; in our case we have implemented it on only two concept tables. Another important enhancement would be its application to other domains in medicine and to domains outside the medical area.

Acknowledgement
This research has been partially supported by L-3 Communication Corporation, ComCept Division under Project Corvus and TAMU-C research grant #140854-20300.