Hybrid Clustering for Validation and Improvement of Subject-Classification Schemes

A hybrid text/citation-based method is used to cluster journals covered by the Web of Science database in the period 2002-2006. The objective is to use this clustering to validate and, if possible, to improve existing journal-based subject-classification schemes. Cross-citation links are determined on a paper-by-paper basis and then aggregated to the journals in which the papers appeared. Text mining for the textual component follows the same principle: textual characteristics of individual papers are attributed to the journals in which they have been published. In a first step, the 22-field subject-classification scheme of the Essential Science Indicators (ESI) is evaluated and visualised. In a second step, the hybrid clustering method is applied to classify the roughly 8,300 journals meeting the selection criteria concerning continuity, size and impact. The hybrid method proves superior to its two components applied separately. The choice of 22 clusters also allows a direct field-to-cluster comparison, and we substantiate that the science areas resulting from cluster analysis form a more coherent structure than the 'intellectual' reference scheme, the ESI subject scheme. Moreover, the textual component of the hybrid method allows labelling the clusters using cognitive characteristics, while the citation component allows visualising the cross-citation graph and determining representative journals suggested by the PageRank algorithm. Finally, the analysis of journal 'migration' allows the improvement of existing classification schemes on the basis of the concordance between fields and clusters.


Introduction
The history of cognitive mapping of science is as long as the history of computerised scientometrics itself. While the first visualisations of the structure of science were considered part of information services, i.e., an extension of scientific review literature (Garfield, 1975, 1988), bibliometricians soon recognised the potential value of structural science studies for science policy and research evaluation as well. At present, the identification of emerging and converging fields and the improvement of subject delineation are in the foreground. The main bibliometric techniques are characterised by three major approaches: the analysis of citation links (cross-citations, bibliographic coupling, co-citations), the lexical approach (text mining), and their combination. The widely used method of co-citation clustering was introduced independently by Small (1973, 1978) and Marshakova (1973). Although the principle of bibliographic coupling had already been discovered earlier by Fano (1956) and Kessler (1963), coupling-based techniques have been used for mapping the structure of science only decades after co-citation analysis had become a standard tool in visualising the structure of science (e.g., Glänzel & Czerwon, 1996; Small, 1998). Cross-citation based cluster analysis for science mapping has to be distinguished from the previous two methods; while the former two types can be (and usually are) based on links connecting individual documents, the latter approach requires aggregation of documents to units like journals, subject categories, etc., among which cross-citation links are established. The obvious advantages of this method (e.g., the possibility to analyse directed information flows among these units or the assignment/aggregation of units to larger structures) are contrasted by some limitations and shortcomings such as possible biases caused by the use of predefined units. Thus, for instance, Leydesdorff (2006), Leydesdorff and Rafols (2008), and Boyack et al. (2008) used journal cross-citation matrices, while Moya-Anegón (2007) used subject co-citation analysis to visualise the structure of science and its dynamics.
Earlier, a completely different approach was introduced by Callon et al. (1983) and Callon, Law and Rip (1986). Their mapping and visualisation tool Leximappe was based on a lexical approach, particularly co-word analysis. The notion of the lexical approach, which was originally based on extracting keywords from records in indexing databases, was later deepened and extended by advanced text-mining techniques applied to full texts (cf. Kostoff et al., 2001, 2005; Glenisson et al., 2005a,b).

Whatever method is used to study the structure of science, cluster algorithms have beyond doubt become the most popular technique in science mapping. The sudden, large interest these techniques have found in the community is contrasted by objections and criticism from the viewpoint of information use in the framework of research evaluation (e.g., Noyons, 2001; Jarneving, 2005). For instance, clustering based on co-citation and bibliographic coupling has to cope with several severe methodological problems. This has been reported, among others, by Hicks (1987) in the context of co-citation analysis and by Janssens et al. (2008) with regard to bibliographic coupling. One promising solution is to combine these techniques with other methods such as text mining (e.g., combined co-citation and word analysis: Braam et al., 1991; combination of coupling and co-word analysis: Small, 1998; hybrid coupling-lexical approach: Janssens et al., 2007b, 2008). Most applications were designed to map and visualise the cognitive structure of science and its change in time, and, from a policy-relevant perspective, to detect new, emerging disciplines. Improvement of subject-classification schemes was in most cases not intended. Jarneving (2005) proposed a combination of bibliometric structure-analytical techniques with statistical methods to generate and visualise subject-coherent and meaningful clusters. His conclusions drawn from the comparison with 'intellectual' classification were rather sceptical. Despite several limitations, which will be discussed further in the course of the present study, cognitive maps have proved useful tools in visualising the structure of science and can be used to adjust existing subject-classification schemes even on the large scale, as we will demonstrate in the following.

The main objective of this study is to compare (hybrid) cluster techniques for cognitive mapping with traditional 'intellectual' subject-classification schemes. The most popular subject-classification schemes created by Thomson Scientific (Philadelphia, PA, USA) are based on journal assignment. Therefore, journal cross-citation analysis suggests itself as the underlying method, and we will cluster the document space using journals as predefined units of aggregation. In contrast to the method applied by Leydesdorff (2006), who uses the Journal Citation Reports (JCR), we calculate citations on a paper-by-paper basis and then assign individual papers indexed in the Web of Science (WoS) database to the journals in which they have been published. The use of the JCR would confine us to the data as available in the JCR and prevent us from combining cross-citation analysis with a textual approach. What is more, proceeding from the document level allows us to control for document types and citation windows, and to combine bibliometrics-based techniques with other methods like text mining. This results in a higher precision, since irrelevant document types and 'low-weight journals' can be excluded. This way we can present the results of a hybrid (i.e., combined/integrated) citation-textual cluster analysis and compare them with the structure of an existing 'intellectual' subject-classification scheme created and used by Thomson Scientific. The aim of this comparison is to explore the possibility of using the results of the cluster analysis to improve the subject-classification scheme in question.

Cognitive mapping vs. subject classification
The objective of the present study is two-fold. The first task is not merely to visualise the field structure of science by presenting yet another map based on an alternative approach, but to validate and improve existing subject classifications used for research evaluation. In particular, the question arises of how far the observed 'migration' of journals among science fields can be used to improve classification. The second issue is a methodological one, namely to evaluate improved methods of hybrid clustering. The 22-field subject-classification scheme of the Essential Science Indicators (ESI) of Thomson Scientific, which actually forms a partition of the Web of Science universe with practically unique subject assignment, is used as the "control structure". In particular, we propose the following seven-step approach to integrate cluster analysis and cognitive mapping into subject classification.

1. Evaluation of existing subject-classification schemes and visualisation of their cross-citation graph
2. Labelling subject fields using cognitive characteristics
3. Studying the cognitive structure based on hybrid cluster analysis and visualisation of the cross-citation graph
4. Evaluation of science areas resulting from cluster analysis
5. Labelling clusters using cognitive characteristics and representative journals suggested by the PageRank algorithm
6. Comparison of subject fields and cluster structure
7. Migration of journals among subject fields

Data sources and data processing
In order to accomplish the above objectives, more than six million papers of the types article, letter, note and review indexed in the Web of Science (WoS) in the period 2002-2006 have been taken into consideration. Citations received by these papers have been determined for a variable citation window beginning with the publication year, up to 2006, on the basis of an item-by-item procedure using special identification keys made up of bibliographic data elements extracted from first-author names, journal title, publication year, volume and first page. The complete database has been indexed, and all terms extracted from titles, abstracts and keywords have been used for 'labelling' the obtained clusters. In a first step, journals had to be checked for name changes, merging or splitting, and identified accordingly. Journals which were not covered in the entire period have been omitted. Furthermore, only journals that have published at least 50 papers in the period under study were considered. A second threshold was applied afterwards to remove all journals for which the sum of references and citations was lower than 30. The resulting number of retained journals was 8,305. Most of the subsequent analyses were performed in Java and MATLAB. We also made use of the MATLAB Tensor Toolbox (Bader, 2006).
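The journal selection rules described above can be sketched as a simple filter. The function name and argument layout below are illustrative, not the authors' actual (Java/MATLAB) implementation:

```python
def retain_journal(years_covered, n_papers, n_references, n_citations,
                   period=range(2002, 2007)):
    """Sketch of the selection criteria: coverage over the whole 2002-2006
    period, at least 50 papers, and references + citations of at least 30."""
    if not all(y in years_covered for y in period):
        return False  # journal not covered in the entire period
    if n_papers < 50:
        return False  # size threshold
    if n_references + n_citations < 30:
        return False  # impact threshold
    return True
```

Applying these three filters to the full WoS journal list is what yields the 8,305 retained journals mentioned above.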

Methods
In this section we briefly describe the methodological background and the algorithms and procedures that have been applied. The first subsection outlines the textual approach; this is followed by a description of the cross-citation analysis. The journal clustering techniques described in the subsequent paragraphs are applied to the textual and citation data separately and are used for combined (hybrid) clustering as well. This procedure is described step by step in the following.

Text analysis
All textual content was indexed with the Jakarta Lucene platform (Hatcher, 2004) and encoded in the Vector Space Model using the TF-IDF weighting scheme (Baeza-Yates, 1999). Stop words were removed during indexing, and the Porter stemmer was applied to all remaining terms from titles, abstracts, and keyword fields. The resulting term-by-document matrix contained nine and a half million term dimensions (9,473,061), but by ignoring all tokens that occurred in one sole document, only 669,860 term dimensions were retained. Terms with a document frequency equal to one are useless for clustering purposes. The dimensionality was further reduced from 669,860 term dimensions to 200 factors by Latent Semantic Indexing (LSI) (Deerwester, 1990; Berry, 1995), which is based on the Singular Value Decomposition (SVD). The reduction of the number of features in a vector space by application of LSI improves the performance of retrieval, clustering, and classification algorithms. Text-based similarities were calculated as the cosine of the angle between the vector representations of two papers (Salton, 1986).
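A minimal sketch of the TF-IDF weighting, the pruning of single-document terms, and the cosine similarity described above (stemming, stop-word removal and the LSI/SVD step are omitted; all names are illustrative, not the study's Lucene-based pipeline):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weighting of tokenised documents, dropping every term whose
    document frequency is 1, as in the preprocessing described above."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequencies
    vocab = {t for t, f in df.items() if f > 1}  # prune single-document terms
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(w for w in doc if w in vocab)
        vecs.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In the actual study these vectors are additionally projected to 200 LSI factors via SVD before the cosine similarities are computed.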

Citation analysis
Since the present study analyses the structure of science on the level of journals, all local citations between papers are aggregated to form a journal cross-citation graph. For cluster analysis we ignored the direction of citations by symmetrising the journal cross-citation matrix. At the level of journal clusters, the journal cross-citations can be further aggregated into inter-cluster citations. From the raw number of cross-citations between two journals (or clusters, respectively), a normalised similarity can be calculated by dividing it by the square root of the product of the total number of citations to or from the first journal (cluster) and the total number of citations to or from the second. Intra-cluster 'self-citations' are counted only once. For visualisation of the networks we use the similarities just described as edge weights between two clusters or fields (see Figure 2 for an example). For clustering, however, we calculated the similarity of two journals somewhat differently, because we did not want to ignore, for instance, that both journals could be highly cited by a third one. For this reason we opted for 'second-order' journal cross-citation similarities. The journal cross-citation numbers are stored in a square, symmetric matrix. By 'second-order similarities' we mean that the cross-citation values between a journal and all other journals (i.e., a row or column of this matrix) are used as input for another step of pairwise similarity calculation. The second-order similarities are found by calculating the cosine of the angle between pairs of vectors containing all symmetric journal cross-citation values between the two respective journals and all other journals. Hence, the ultimate similarity of two journals is based on their respective similarities with all other journals. The journal cross-citation graph is also analysed to identify important high-impact journals: we use the PageRank algorithm (Brin, 1998) to determine representative journals in each cluster. Besides, the graph can also be used to evaluate the quality of a clustering outcome.
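The two similarity notions just described can be sketched as follows, assuming `C` is the symmetrised journal cross-citation matrix (a list of rows); the function names are illustrative:

```python
import math

def normalised_similarity(C, i, j):
    """Edge-weight similarity used for visualisation: raw cross-citations
    divided by the geometric mean of the total citation degrees of the
    two journals (or clusters)."""
    deg_i, deg_j = sum(C[i]), sum(C[j])
    return C[i][j] / math.sqrt(deg_i * deg_j) if deg_i and deg_j else 0.0

def second_order_similarity(C, i, j):
    """Second-order similarity used for clustering: cosine between the full
    cross-citation profiles (rows) of journals i and j, so two journals are
    similar if they relate similarly to all other journals."""
    dot = sum(a * b for a, b in zip(C[i], C[j]))
    ni = math.sqrt(sum(a * a for a in C[i]))
    nj = math.sqrt(sum(b * b for b in C[j]))
    return dot / (ni * nj) if ni and nj else 0.0
```

Note that two journals that never cite each other can still obtain a high second-order similarity if a third journal cites both heavily, which is exactly the effect motivated above.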

Clustering
In order to subdivide the journal set into clusters, we used agglomerative hierarchical clustering with Ward's method (Jain, 1988). It is a hard clustering algorithm, which means that each individual journal is assigned to exactly one cluster.
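A compact illustration of agglomerative clustering with Ward's method, implemented via the Lance-Williams update on squared distances. This is a didactic sketch only; an actual analysis of 8,305 journals would use an optimised library routine (e.g., SciPy's `scipy.cluster.hierarchy`):

```python
def ward_clustering(D, k):
    """Agglomerative clustering with Ward's update rule; D is a full matrix
    of squared distances. Repeatedly merges the closest pair of clusters
    until k clusters remain, returned as sorted lists of object indices."""
    clusters = {i: [i] for i in range(len(D))}
    dist = {(i, j): float(D[i][j])
            for i in range(len(D)) for j in range(i + 1, len(D))}
    next_id = len(D)                       # ids for newly merged clusters
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)     # closest pair of cluster ids
        ni, nj = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        dij = dist.pop((i, j))
        for m in list(clusters):           # Lance-Williams/Ward recurrence
            nm = len(clusters[m])
            dmi = dist.pop((min(i, m), max(i, m)))
            dmj = dist.pop((min(j, m), max(j, m)))
            dist[(m, next_id)] = (
                ((ni + nm) * dmi + (nj + nm) * dmj - nm * dij)
                / (ni + nj + nm))
        clusters[next_id] = merged
        next_id += 1
    return [sorted(c) for c in clusters.values()]
```

Because each object ends up in exactly one returned cluster, the procedure is 'hard' in the sense described above.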

Number of clusters
Determination of the optimal number of clusters in a data set is a difficult issue and depends on the adopted validation and chosen similarity measures, as well as on the data representation. In general, the number of clusters is determined by comparing the quality of different clustering solutions based on various numbers of clusters. Cluster quality can be assessed by internal or external validation measures. Internal validation solely considers the statistical properties of the data and clusters, whereas external validation compares the clustering result to a known gold-standard partition. Halkidi, Batistakis and Vazirgiannis (2001) gave an overview of quality assessment of clustering results and cluster-validation measures. The strategy that we adopted to determine the number of clusters is a combination of distance-based and graph-based methods. This compound strategy encompasses observation of a dendrogram, text- and citation-based mean Silhouette curves, and modularity curves. Besides, the Jaccard similarity coefficient and the Rand index are used to compare the obtained results with an intellectual classification scheme.

Dendrogram
A preliminary judgment is offered by a dendrogram, which provides a visualisation of the distances between (sub-)clusters (see Figure 4 for an example). It shows the iterative grouping or splitting of clusters in a hierarchical tree. A candidate number of clusters can be determined visually by looking for a cut-off point where an imaginary vertical line would cut the tree such that the resulting clusters are well separated. Because of the difficulty of defining the optimal cut-off point on a dendrogram (Jain, 1988), we complement this method with other techniques.

Silhouette curves
A second appraisal of the number of clusters is given by the curve of mean Silhouette values. The Silhouette value for a document ranges from -1 to +1 and measures how similar it is to documents in its own cluster vs. documents in other clusters (Rousseeuw, 1987). The average Silhouette value over all clustered objects (e.g., journals) is an intrinsic measure of the overall quality of a clustering solution with a specific number of clusters. Since Silhouette values are based on distances, different Silhouette values can be calculated depending on the chosen distance measure and reference data. For instance, we use the complement of cosine similarity applied to text and citation data. The quality of a specific partition can be visualised in a Silhouette plot. In a Silhouette plot (see Figures 1 and 5), the sorted Silhouette values of all members of each cluster (or field) are indicated with horizontal lines. The more the Silhouette profile of a cluster (field) lies to the right of the vertical line at the value 0, the more coherent the cluster (field) is, whereas negative values indicate that the corresponding objects should rather belong to another cluster (field).
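The mean Silhouette value described above can be sketched directly from its definition: for each object, `a` is the mean distance to its own cluster and `b` the smallest mean distance to any other cluster, and the Silhouette value is `(b - a) / max(a, b)`. The function name is illustrative:

```python
def mean_silhouette(labels, D):
    """Mean Silhouette value of a partition. labels[i] is the cluster of
    object i; D[i][j] is a distance (e.g. 1 - cosine similarity)."""
    n = len(labels)
    vals = []
    for i in range(n):
        same = [D[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            vals.append(0.0)            # common convention for singletons
            continue
        a = sum(same) / len(same)       # mean intra-cluster distance
        b = min(                        # nearest other cluster, on average
            sum(D[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i])
        vals.append((b - a) / max(a, b))
    return sum(vals) / n
```

Evaluating the same partition with a text-based and with a citation-based distance matrix yields the two families of Silhouette curves used in this study.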

Modularity curves
The quality of a clustering can also be evaluated by calculating the modularity of the corresponding partition of the cross-journal citation graph (Newman & Girvan, 2004; Newman, 2006). Up to a multiplicative constant, modularity measures the number of intra-cluster citations minus the expected number in an equivalent network with the same clusters but with citations given at random. Intuitively, in a good clustering there are more citations within (and fewer citations between) clusters than could be expected from random citing. The expected number of citations between two journals is based on their respective degrees and on the total number of citations in the network. For an additional 'external validation' of clustering results, we also use modularity curves computed from a network containing all journals as nodes, but with edge weights equal to the number of ISI Subject Categories commonly assigned to both journals by Thomson Scientific (out of the total of 254).
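For a weighted undirected graph with adjacency matrix `A` (here the symmetrised cross-citation matrix), the Newman-Girvan modularity of a partition sums, over all pairs in the same cluster, the observed weight minus the weight expected from the nodes' degrees. A direct sketch:

```python
def modularity(labels, A):
    """Newman-Girvan modularity Q of a partition of a weighted undirected
    graph A: Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) for i, j in the same
    cluster, where k_i is the weighted degree and 2m the total weight."""
    n = len(A)
    two_m = sum(sum(row) for row in A)   # twice the total edge weight
    k = [sum(row) for row in A]          # weighted degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m
```

The 'external validation' variant mentioned above uses the same formula but with edge weights equal to the number of shared ISI Subject Categories instead of cross-citations.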

Jaccard similarity coefficient and Rand index
The Jaccard index is the ratio of the cardinality of the intersection of two sets to the cardinality of their union. The Jaccard similarity coefficient is an extension of the Jaccard index and can be used as a measure for external cluster validation. The Rand index is another external validation measure to quantify the correspondence between a clustering outcome and a ground-truth categorisation (Jain, 1988). In contrast to the Jaccard coefficient, the Rand index takes negative matches into account as well. Both measures result in a value between 0 and 1, with 1 indicating identical partitions. In Figure 8, we use the Jaccard index to compare each cluster with every field from the intellectual ESI classification, in order to detect the best-matching fields for each cluster.

Hybrid clustering
As mentioned at the outset, four major approaches are in general used for clustering sets of scientific papers: the lexical approach and three citation-based methods, namely cross-citation, bibliographic coupling, and co-citation analysis. Each of these methods alone suffers from severe shortcomings. Typical problems with bibliographic coupling and co-citations are sparse matrices, the lack of consensual referencing in some areas (Braam et al., 1991b; Jarneving, 2007), document types with an insufficient number of references (e.g., letters) that have to be excluded (bibliographic coupling), incompleteness due to missing citations to recent years (co-citation analysis), the missing 'critical mass' for emerging-field detection (co-citation analysis, cf. Hicks, 1987), and the bias towards high-impact journals (co-citation analysis). If strict citation-based criteria are applied, the resulting citations-by-document matrix is extremely sparse. In this case, the rejection of a relationship between two entities (e.g., journals or documents) tends to be unreliable. On the other hand, any lexical (text-based) approach is usually based on rather rich vocabularies and the peculiarities of natural language. The result is, according to our observations, a rather 'smooth' or gradual transition between what is related and what is not. Therefore, the relationship is somewhat fuzzy and not always reliable. Hence, the textual and citation-based approaches provide different perceptions of similarities among the same data. Textual information might indicate similarities that are not visible to bibliometric techniques, but true document similarity can also be obscured by differences in vocabulary use, or spurious similarities might be introduced as a result of textual preprocessing, or because of polysemous words or words with little semantic value. The combination of the two worlds helps to improve the reliability of the relationships and therefore of the clustering algorithm as well.

The present study therefore combines cross-citation analysis with text mining. The former can be applied to directed links as well as to the symmetrised transaction matrix. Symmetrisation also compensates for the incompleteness caused by the lack of citations to recent years and allows links between journals to be considered strong and subject-relevant even if they are asymmetric or even unidirectional. In order to reduce noise caused by 'small' journals and extremely weak citation links, thresholds have been applied to both citation links and the number of papers (see previous section). The text-mining analysis supplements the citation analysis. In particular, the textual information is integrated with the bibliometric information before the clustering algorithm is applied. In the present study, the actual integration is achieved by a weighted linear combination of the corresponding distance matrices. The methodology and advantages of hybrid clustering have been substantiated in more detail in earlier studies devoted to the analysis of different research fields (see Glenisson et al., 2005; Janssens et al., 2007a, 2007b, 2008). In addition, the lexical approach allows clusters to be 'labelled' using automatically detected salient terms. In Section 4.3, Silhouette and modularity curves will be used to compare the results of text-based, citation-based and hybrid clustering, and we will substantiate that the hybrid method in general outperforms the other two.
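The weighted linear combination of distance matrices mentioned above amounts to a one-line mixing step before clustering. The function name and the default weight are illustrative; the weight actually used is a tunable parameter of the method:

```python
def hybrid_distance(D_text, D_cite, weight=0.5):
    """Integrate a textual and a citation-based distance matrix by a
    weighted linear combination, element by element. weight=0.5 mixes the
    two sources equally; other weights shift the balance between them."""
    n = len(D_text)
    return [[weight * D_text[i][j] + (1 - weight) * D_cite[i][j]
             for j in range(n)] for i in range(n)]
```

The resulting matrix is then fed to the same agglomerative Ward procedure as the text-only and citation-only distances, which is what makes the three variants directly comparable.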

Multidimensional scaling
Multidimensional scaling (MDS) can be used to represent high-dimensional vectors (for example, the centroids of journal clusters) in a lower-dimensional space by explicitly requiring that the pairwise distances between the points approximate the original high-dimensional distances as precisely as possible (Mardia, 1979). If the dimensionality is reduced to two or three dimensions, these mutual distances can be visualised directly. It should, however, be stressed that interpretations concerning such a low-dimensional approximation of very high-dimensional distances must be handled with care.
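One standard way to obtain such an embedding is classical (Torgerson) MDS, which double-centres the squared distance matrix and keeps the leading eigenvectors. This is a sketch of one common MDS variant, not necessarily the exact algorithm used in the study:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical (Torgerson) MDS: embed n points in `dims` dimensions so
    that their pairwise Euclidean distances approximate the given distance
    matrix D as closely as possible."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dims]           # largest eigenvalues first
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

For a map such as Figure 7, `D` would hold the pairwise distances between cluster and field centroids and `dims` would be 2.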

Evaluation of existing 'intellectual' subject-classification schemes
The multidisciplinary databases Science Citation Index Expanded (SCIE) and Social Sciences Citation Index (SSCI) of Thomson Reuters (formerly Institute for Scientific Information, ISI, Philadelphia, PA, USA) traditionally did not provide a direct subject assignment for indexed papers. The annual Science Citation Index Guides, the Journal Citation Reports (JCR) and, more recently, the website of Thomson Scientific, however, contain regularly updated lists of (S)SCI journals, each assigned to one or more subject matters (ISI Subject Categories). For lack of an appropriate subject-heading system, more or less modified versions of this Subject Category scheme were often used in bibliometric studies too, namely as an indirect subject assignment to individual papers based on the journals in which they had been published. Such assignment systems based on journal classification have been developed, among others, by Narin and Pinski (see, for instance, Narin, 1976; Pinski & Narin, 1976). This was followed by classification schemes developed by other institutes as well. Nowadays two ISI systems are widely used: the ISI Subject Categories, which are available in the JCR and through journal assignment in the Web of Science, and the Essential Science Indicators (ESI).

While the first system assigns multiple categories to each journal and is too fine-grained (254 categories) for comparison with cluster analysis, the ESI scheme forms a partition (with practically unique journal assignment) and its 22 fields are large enough. Therefore, the ESI classification seems to be a good choice for our analysis. Subject fields will be treated like automatically generated clusters. One precondition for an easy comparison with results from hard clustering is that the reference classification system must form a partition of the WoS universe, while most schemes allow multiple assignments (e.g., the above-mentioned ISI Subject Categories). The only commonly known subject scheme for ISI products that meets this criterion is the ESI classification system. This subject-classification scheme is in principle based on unique assignment; only about 0.6% of all journals were assigned to more than one field over a five-year period. For the present exercise, the assignment had to be de-duplicated in the case of journals which merged or split up during the five-year period, admittedly a somewhat arbitrary procedure. Nonetheless, the assignment remains correct and results in no more than a slightly narrower scope for several journals. The field structure of the ESI scheme is presented in Table 1.

Cluster analysis: text-based, citation-based and hybrid
Figure 3 compares the performance of text-based, cross-citation and hybrid clustering by several evaluation methods, for various numbers of clusters. For each of the three clustering types, Figure 3(1) presents for various cluster numbers (2 to 30) the modularity calculated from the journal cross-citation graph. Since this evaluation is based on cross-citation data, it is not a surprise that the text-only clustering provides worse results than cross-citation clustering, which performs best here. However, it is very interesting to note that the hybrid clustering (integrating text and cross-citation information) provides results highly comparable to those from cross-citation clustering, especially for 7 or for more than 12 clusters. The modularity scores for cross-citation clustering indicate that any number of clusters larger than 9 is acceptable. On the other hand, the modularity curve for text-only clustering has a maximum at eight clusters.
In Figure 3(2), Silhouette curves based on (the complement of) cross-citation values show the somewhat counter-intuitive but beneficial result that hybrid clustering always performs better than cross-citation clustering, although the evaluation considers only citations here. This again demonstrates the power of hybrid clustering: the combined heterogeneous citation-textual approach is superior to both methods applied separately. Nevertheless, this figure does not provide a clear clue with respect to the number of clusters to choose. Silhouette curves based on the complement of second-order cross-citations are shown in Figure 3(3). Again, the hybrid clustering almost always performs best.
In Figure 3(4), the Silhouette values are computed only from textual distances. Naturally, the citation-based clustering performs worst here, while the integrated clustering scores almost as well as the text-only clustering, and for some cluster numbers even better. Figure 3(5) shows Silhouette curves based on linearly combined text-based and citation-based distances (with equal weight). Here, combined data and mere citations give comparable results, which might be an indication that there is a preponderance of citation over text data in the combined Silhouette values.

In Table 3 we compare the quality of the partition into 22 ESI fields with the quality of the 22 clusters resulting from citation-based, text-based and hybrid clustering. The only evaluation measure for which the 22 human-made ESI fields score best is modularity based on ISI Subject Categories. As explained before, this evaluation type computes modularity from a network containing all journals as nodes and with edge weights equal to the number of ISI Subject Categories commonly assigned to the corresponding journals by ISI/Thomson Scientific (out of the total of 254 categories). Since there is a direct correspondence between the 22 ESI fields and these 254 Subject Categories (a field is an aggregation of multiple subject categories), it is not at all surprising (not to mention unfair) that the ESI fields outperform the clusters for this type of evaluation. For all other data-driven evaluation types it is clear that automatic clustering does better than human expert classification.
Hybrid clustering always performs at least as well as text-based or citation-based clustering, except for the evaluation by second-order cross-citations. However small the difference, the last column shows that the 22 hybrid clusters correspond best to the 22 ESI fields. It should be noted that the values in Table 3 can differ somewhat from those in Figure 3 because, for the sake of a fair comparison with the ESI fields, only the 7,729 journals for which a field assignment was available were considered in the table.

Evaluation of hybrid clusters
The cluster dendrogram shows the structure in hierarchical order (see Figure 4). We visually find a first clear cut-off point at three clusters, a second one around seven, and 22 clusters also seemed to be an appropriate number. This value coincides with the number of fields in the ESI classification scheme. The Silhouette plots in Figure 5 and the mean Silhouette values in Table 3 substantiate that the 22 hybrid clusters are, furthermore, acceptable for both the citation and the text-mining approach. The same conclusion can be drawn from the computed modularity scores.
Three clusters result in an almost trivial classification. Intuitively, these three high-level clusters should comprise the natural and applied sciences, the medical sciences, and the social sciences and humanities. The solutions with 3 and 22 clusters will be analysed in more detail in Section 4.5. The solution comprising seven clusters results in a non-trivial classification. The best TF-IDF terms (see Table 5) show that three of these clusters represent the natural/applied sciences, whereas two classes each stand for the life sciences and the social sciences and humanities. This situation is also reflected by the cluster dendrogram in Figure 4. A closer look at the best TF-IDF terms reveals that the social-sciences cluster (#1 of the 3-cluster solution) is split into cluster #1 (economics, business and political science) and cluster #6 (psychology, sociology, education); the life-science cluster (#3 in the 3-cluster scheme) is split into clusters #3 (biosciences and biomedical research) and #7 (clinical and experimental medicine and neurosciences); and, finally, the sciences cluster (#2 of the 3-cluster scheme) is distributed over three clusters in the 7-cluster solution, namely the cluster comprising biology, agriculture and environmental sciences (#2), physics, chemistry and engineering (#4), and mathematics and computer science (#5). The hybrid, i.e. combined citation-textual, clustering yields acceptable results (see Figure 5) and is distinctly superior to both methods applied separately. Nonetheless, we must not conceal that clusters of lesser quality, notably cluster #1, can also be found in the hybrid classification.

Cognitive characteristics of clusters
As already mentioned in the previous section, another natural point to cut off the dendrogram is at three clusters (cf. the right-most vertical line in Figure 4). Although this refers to a rather trivial case, it is worthwhile to look at the term representation of this structure before we deal with 'labelling' the 22 clusters obtained from the hybrid algorithm. This will also help us to understand the hierarchical architecture of the subject structure of science. Table 4 lists the best 50 terms for each of the three top-level clusters, which clearly confirm the presence of the expected clusters. Indeed, cluster #1 comprises the social sciences, cluster #2 the natural and applied sciences, and cluster #3 the medical sciences. The distribution of journals over clusters is surprisingly well balanced.

Comparison of subject and cluster structure
In this subsection we compare the structure resulting from the hybrid clustering with the ESI subject classification.This comparison is based on the centroids of the clusters and fields.
The centroid of a cluster or field is defined as the linear combination of all documents assigned to it and is thus a vector in the same vector space. For each cluster and each field, the centroid was calculated, and the MDS map of the pairwise distances between all centroids is shown in Figure 7. In Figure 8, we use the Jaccard index to determine the concordance between our clustering solution and the ESI scheme by comparing each cluster with every field, in order to detect the best matching fields for each cluster. The darker a cell in the matrix, the higher the Jaccard index, and hence the more pronounced the overlap between the corresponding cluster and ESI field. For example, cluster #4 (chemistry) clearly corresponds to ESI field #3 (Chemistry); the same applies to field and cluster #6 (Economics and business). ESI field #21 (Social Sciences, general), defined as one single field covering the social sciences, shows the least concordance, as it is spread over seven clusters. It is no surprise that its strongest match is found with our somewhat 'fuzzy' multidisciplinary social-sciences cluster. On the other hand, clusters #13 and #14 are similarly spread over four ESI fields each.
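The concordance matrix underlying Figure 8 can be sketched with the Jaccard index |A ∩ B| / |A ∪ B| applied to the journal sets of each cluster and each ESI field. The journal assignments below are toy assumptions; the real computation runs over all 8305 journals.

```python
# Sketch of the cluster-field concordance matrix based on the Jaccard index.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# toy journal sets (assumed, for illustration)
clusters = {4: {'J1', 'J2', 'J3'}, 6: {'J4', 'J5'}}
fields   = {3: {'J1', 'J2', 'J9'}, 6: {'J4', 'J5', 'J6'}}

concordance = {(c, f): jaccard(cj, fj)
               for c, cj in clusters.items()
               for f, fj in fields.items()}

# for each cluster, the best matching field is the 'darkest cell' in its row
best = {c: max(fields, key=lambda f: concordance[(c, f)]) for c in clusters}
print(best)   # cluster 4 -> field 3, cluster 6 -> field 6
```

Replacing the journal sets by paper sets yields the size-weighted variant shown in the lower part of Figure 8.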

Migration of journals among subject fields and clusters
If clustering algorithms are adjusted or changed, one can observe the following phenomenon: some units of analysis leave the clusters they formerly belonged to and end up in different clusters. This phenomenon is called 'migration'. We distinguish between 'good migration' and 'bad migration': 'good migration' is observed if the goodness of the unit's classification improves; otherwise we speak of 'bad migration'.
We can also apply this notion of migration to the comparison of clustering results with any reference classification. In the following, we will use the ESI scheme as the reference classification.
In the previous section we visualised the concordance between the clustering and the ESI classification. To determine for each ESI field the cluster that best matches the field, we used the Jaccard index on the basis of the number of overlapping journals (cf. upper part of Figure 8).
Out of the 8305 journals under study, more than one third, namely 3204 journals, were not assigned to the cluster that best matches their ESI field. As already mentioned above, we call these journals 'migrated journals'. To measure the quality of migrations, we calculated, for each migrated journal, the difference in Silhouette values before and after migration (based on textual and citation distances). Most migrated journals improved their Silhouette values. In the following, we give some examples of good and bad migrations. 'Good migrations' are observed if journals improved their Silhouette values after migration. Based on their titles and scopes (not shown), these journals apparently should indeed be assigned to the cluster to which they have moved. We observed numerous good migrations; the following cases serve just as examples. The Journal of Analytical Chemistry and Chemia Analityczna migrated from ESI field #7 (Engineering) to cluster #4 (chemistry); the best matching ESI field in this case is #3 (Chemistry) (cf. Figure 8). Similarly, Land Economics, Developing Economies and Economic Development and Cultural Change migrated from field #21 (Social Sciences, general) to the more specific cluster #6 (economics and business); here, the corresponding ESI field is #6 (Economics & business). In the life sciences, we found good migrations as well; an example involving a group of neurology journals is given below. Finally, we mention a migration between engineering and mathematics. The journals Quarterly of Applied Mathematics, Bit Numerical Mathematics, Siam Journal on Discrete Mathematics and Discrete Applied Mathematics, which were assigned to the ESI field Engineering (field #7), were found in our 'mathematics' cluster (#8), which in turn corresponds to ESI field #12 (Mathematics). In the case of bad migration, the Silhouette values decreased after migration, that is, the journals' Silhouette values in the ESI scheme were better than in the hybrid clustering. The reasons for this phenomenon are not always clear, and according to the journals' titles and scopes such migrations are not always convincing. For instance, Journal of Astrophysics and Astronomy, New Astronomy, Astrophysical Journal and Astronomy & Astrophysics migrated from ESI field #22 (Space Science) to cluster #2 (geosciences), corresponding to ESI field #9; here we have to admit that journals in astronomy and astrophysics are in general spread over the geosciences and physics clusters. Viral Immunology migrated from field #10 (Immunology) to cluster #13 (microbiology and veterinary science), and Canadian Journal of Microbiology migrated from field #13 (Microbiology) to cluster #15 (agricultural and environmental sciences); both clusters are spread over several ESI fields (see Figure 8). The distinction between good and bad migration makes a target-oriented adjustment of the existing classification scheme possible: good migrations can be used to reassign journals within the old scheme on the basis of the concordance with the clustering results.
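The migration-quality test can be sketched as follows: for a migrated journal, compare its Silhouette value under the reference (ESI) assignment with its value under the cluster assignment; a positive difference marks a good migration. The feature space and the two label vectors below are toy assumptions constructed so that one journal is misplaced in the reference scheme.

```python
# Sketch of classifying a migration as 'good' or 'bad' via the change in the
# journal's Silhouette value (toy data, two well-separated journal groups).
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),    # journals 0-4 near the origin
               rng.normal(3, 0.1, (5, 2))])   # journals 5-9 near (3, 3)

esi = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # journal 4 misplaced in field 1
clu = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # clustering moves it to cluster 0

s_esi = silhouette_samples(X, esi)
s_clu = silhouette_samples(X, clu)
delta = s_clu[4] - s_esi[4]
print(f"before: {s_esi[4]:.2f}, after: {s_clu[4]:.2f} -> "
      f"{'good' if delta > 0 else 'bad'} migration")
```

In the study this comparison was carried out for each of the 3204 migrated journals, using both the textual and the citation distances.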

Conclusions
The hybrid clustering using textual information and cross-citations provided good results and proved superior to its two components applied separately. The goodness of the resulting classification was even better than that of the "intellectual" reference scheme, the ESI subject scheme. Both classification systems form partitions of the Web of Science, so that a direct comparison of clusters and fields was possible. In line with our expectations, not all clusters have a unique counterpart in the ESI scheme and vice versa, although the number of clusters coincided with the number of ESI fields. Although the Silhouette and modularity values substantiate a more coherent structure of the hybrid clustering as compared with the ESI subject scheme, not all clusters are of high quality. Problems have been found, for instance, in clusters #1 and #12, where interdisciplinarity and strong links with other clusters distort the intra-cluster coherence. However, intellectual classification schemes usually contain a category "multidisciplinary sciences" as well. Although the result of a hard clustering algorithm often contains a cluster of objects (journals) not strongly related to any other cluster, forming a "multidisciplinary sciences" cluster is not an inherent goal of the algorithm, and it is not really meaningful either in the light of our goal to improve the classification of the sciences. Consequently, truly multidisciplinary journals are scattered over different clusters. Based on the external validation of the clustering results by the expert knowledge present in the ISI Subject Categories, seven clusters seem to yield the best results. Although there is no adequate subject-classification scheme with seven categories to be used as a reference system, a more detailed analysis of this solution will be part of future research. Additional ideas for future research are a further improvement of the hybrid clustering algorithm by iterative cleaning of clusters as a post-processing step; allowing multiple assignments by fuzzy clustering; evaluating other algorithms such as spectral clustering; and, finally, dynamic analysis by dynamic hybrid clustering. The continuous rise of computing power might one day allow a large-scale mapping of the scientific universe explorable at various levels of detail. What is more, the application of advanced natural language processing and machine summarisation at the scale of large bibliographic corpora might offer insight into semantics beyond mere statistical processing.
Figure 1 presents the evaluation of the 22 ESI fields based on the cross-citation-based (left) and text-based (right) Silhouette values (see Section 3.3.3). Several fields do not seem to be consistent enough from either perspective. Above all, the Silhouette values of fields #2 (Biology & Biochemistry), #4 (Clinical Medicine), #7 (Engineering), #19 (Plant & Animal Science) and #21 (Social Sciences) substantiate that at least five of the 22 fields are not sufficiently consistent.

Fig. 1. Silhouette plot for the 22 ESI fields based on journal cross-citations (left) and based on text (right)

Fig. 3. Performance evaluation of text-based, citation-based and hybrid clustering based on (1) modularity calculated from the journal cross-citation graph, and on Silhouette curves calculated from (2) journal cross-citations, (3) second-order journal cross-citations, (4) text-based distances, and (5) linearly combined distances. For an additional 'external validation' of the clustering results against the ISI Subject Categories, the lower-right panel (6) uses modularity computed from a network containing all journals as nodes, with edge weights equal to the number of ISI Subject Categories commonly assigned to the corresponding journals by ISI/Thomson Scientific (out of the total of 254 categories).

Fig. 4. Cluster dendrogram for the hybrid hierarchical clustering of 8305 journals, cut off at 22 clusters on the left-hand side. Two other vertical lines indicate the cut-off points for 7 and 3 clusters.

Fig. 5. Evaluation of the hybrid clustering solution with 22 clusters by the citation-based Silhouette plot (left), the text-based Silhouette plot (centre) and the Silhouette plot based on combined data (right).

Fig. 7. Three-dimensional MDS map visualising the distances between the centres (centroids) of the 22 ESI fields and the 22 clusters containing 8305 WoS journals.

Fig. 8. Concordance between our clustering solution and the ESI scheme, visualised by coloured cells representing the Jaccard index for each cluster-field pair. The darkest cells represent the best matching pairs of fields and clusters. In the upper figure, the Jaccard index is computed from the number of journals a cluster and a field have in common, while the lower figure takes the size of each journal into account by counting the number of overlapping papers.
The journals Neuropathology, Revista de Neurologia, Current Opinion in Neurology, Revue Neurologique, Lancet Neurology, European Journal of Neurology, Neurologist, Nervenheilkunde, Visual Neuroscience, Seminars in Neurology, Epilepsy & Behavior and Journal of Neuroimaging migrated from field #4 (Clinical Medicine) to cluster #7 (neuroscience and behaviour), which rather corresponds to ESI field #16 (Neuroscience & behavior).

Table 1. The 22 broad science fields according to the Essential Science Indicators (ESI)

Table 3. Evaluation of the 22 ESI fields and the 22 citation-based, text-based and hybrid clusters by modularities and mean Silhouette values (MSV). Highest values in each column are shown in bold.

Panel (6) of Figure 3 provides an external validation of the clustering results by the expert knowledge available in the ISI Subject Categories assigned to journals by ISI/Thomson Scientific. The modularity curves are computed from a network containing all journals as nodes, with edge weights equal to the number of ISI Subject Categories in common (out of the total of 254 categories). Interestingly, hybrid clustering again outperforms both text-only and citation-based clustering. The optimal number of clusters according to this type of evaluation is 7.
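This external validation can be sketched as the modularity of a candidate partition on a weighted graph whose edge weights count the subject categories two journals share. The tiny network and its weights below are illustrative assumptions.

```python
# Sketch of the 'external validation' by modularity: journals are nodes and
# edge weights count the ISI Subject Categories two journals have in common.
import networkx as nx
from networkx.algorithms.community import modularity

# shared-category counts between journal pairs (toy values)
edges = [('J1', 'J2', 3), ('J1', 'J3', 2), ('J2', 'J3', 2),
         ('J4', 'J5', 3), ('J3', 'J4', 1)]
G = nx.Graph()
G.add_weighted_edges_from(edges)

# candidate clustering to be validated against the category network
partition = [{'J1', 'J2', 'J3'}, {'J4', 'J5'}]
Q = modularity(G, partition, weight='weight')
print(f'modularity Q = {Q:.3f}')
```

Evaluating Q for the text-based, citation-based and hybrid partitions over a range of cluster numbers produces the modularity curves of panel (6).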

Table 7. The five most important journals of each cluster according to a modified version of Google's PageRank algorithm (see Equation 1).
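The ranking of representative journals per cluster can be sketched with PageRank on the directed cross-citation graph. Note that this is the standard power-iteration variant; the paper's modified version (Equation 1) is not reproduced here, and the citation counts are toy assumptions.

```python
# Sketch: rank journals by PageRank on a directed cross-citation graph.
import numpy as np

journals = ['J1', 'J2', 'J3', 'J4']
# C[i, j] = citations from journal i to journal j (toy values)
C = np.array([[0, 5, 1, 0],
              [3, 0, 2, 0],
              [1, 4, 0, 1],
              [0, 1, 2, 0]], dtype=float)

P = C / C.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

d, n = 0.85, len(journals)             # damping factor, number of journals
r = np.full(n, 1 / n)
for _ in range(100):                   # power iteration
    r = (1 - d) / n + d * (P.T @ r)

for name, score in sorted(zip(journals, r), key=lambda t: -t[1]):
    print(f'{name}: {score:.3f}')
```

Within each cluster, the five journals with the highest scores are reported as representative journals.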

Table 8. Top 10 strongest migration patterns

The largest 'exodus', comprising 226 migrating journals, occurred from the ESI field "Engineering" to cluster #18 (Computer science), whereas the best matching cluster for the Engineering field is actually cluster #5 (Engineering). The top 10 strongest migration patterns are listed in Table 8; they indicate possible improvements of journal assignments.

From ESI field 7 (Engineering) to Cluster 18
From ESI field 14 (Molecular Biology & Genetics) to Cluster 3
From ESI field 21 (Social Sciences, general) to Cluster 10
From ESI field 11 (Materials Science) to Cluster 20
From ESI field 4 (Clinical Medicine) to Cluster 7
From ESI field 19 (Plant & Animal Science) to Cluster 15
From ESI field 21 (Social Sciences, general) to Cluster 21
From ESI field 7 (Engineering) to Cluster 20
From ESI field 4 (Clinical Medicine) to Cluster 3
From ESI field 8 (Environment/Ecology) to Cluster 15