Expert Systems with Applications 41 (2014) 6755–6772
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem ⁎, Haytham Elghazel
Université de Lyon, CNRS, Université Lyon 1, LIRIS, UMR5205, F-69622, France
Abstract
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max–Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC, in the form of local neighborhood induction, is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available. © 2014 Elsevier Ltd. All rights reserved.
Article history: Available online 9 May 2014
Keywords: Bayesian networks; Multi-label learning; Markov boundary; Feature subset selection
1. Introduction

A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data has attracted a great deal of interest in recent years (Gasse, Aussem, & Elghazel, 2012; Kojima, Perrier, Imoto, & Miyano, 2010; Peña, 2012; Perrier, Imoto, & Miyano, 2008; Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Villanueva & Maciel, 2012). The inferred DAG is very useful for many applications, including feature selection (Aliferis, Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010; Peña, Nilsson, Björkegren, & Tegnér, 2007; Rodrigues de Morais & Aussem, 2010b), causal relationships inference from observational
⁎ Corresponding author. Tel.: +33 426234466.
E-mail address: aaussem@univ-lyon1.fr (A. Aussem).
http://dx.doi.org/10.1016/j.eswa.2014.04.032
0957-4174/© 2014 Elsevier Ltd. All rights reserved.
data (Aliferis et al., 2010; Aussem, Rodrigues de Morais, & Corbex, 2012, 2010; Brown & Tsamardinos, 2008; Cawley, 2008; Ellis & Wong, 2008; Prestat et al., 2013) and, more recently, multi-label learning (Dembczyński, Waegeman, Cheng, & Hüllermeier, 2012; Guo & Gu, 2011; Zhang & Zhang, 2010). Ideally the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approaches and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and constrain the DAGs considered during the SS phase.
In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC.¹ It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may arise at the expense of the number of false positives, we control the expected proportion of false discoveries (i.e., false positive nodes) among all the discoveries made in PC_T. We use a modification of the incremental association Markov boundary algorithm (IAMB), initially developed by Tsamardinos, Aliferis, and Statnikov (2003) and later modified by Peña (2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max–Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos, Brown, & Aliferis, 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data. We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem.
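The FDR-control idea mentioned above is in the spirit of step-up multiple-testing corrections. As a rough illustration only (in Python rather than the paper's R, and using the standard Benjamini–Hochberg procedure, which is not necessarily the exact correction embedded inside the incremental search of IAMBFDR):

```python
def bh_fdr(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of rejected
    null hypotheses so that the expected proportion of false discoveries
    among them is at most q. Illustrative sketch only."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank r with p_(r) <= r * q / m
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return set(order[:k])

# Hypothetical p-values for candidate links around a target node.
rejected = bh_fdr([0.001, 0.008, 0.039, 0.041, 0.2, 0.6], q=0.05)
```

Here only the two smallest p-values survive the correction; a plain per-test threshold of 0.05 would have accepted four of the six candidates.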
Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Kocev, Vens, Struyf, & Džeroski, 2007; Madjarov, Kocev, Gjorgjevikj, & Džeroski, 2012; Read, Pfahringer, Holmes, & Frank, 2009; Tsoumakas, Katakis, & Vlahavas, 2010b). Recent research in MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: (i) the minimal label powerset decomposition, i.e., into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥ (Y \ Y_LP) | X, and (ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context. The rest of the paper is organized as follows: In Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs.
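Returning to the label powerset decomposition sketched above: once groups of mutually dependent labels are identified, each group is replaced by a single multi-class variable whose classes are the label combinations observed for that group. A minimal Python sketch (the grouping below is hypothetical and hand-picked; in the paper the groups are derived from the learned DAG):

```python
# Hypothetical multi-label data: rows are examples, columns are labels y1..y4.
Y = [(1, 0, 0, 1),
     (1, 0, 1, 1),
     (0, 1, 0, 1),
     (0, 1, 1, 1)]

# Suppose the minimal label powersets found were {y1, y2}, {y3} and {y4}
# (a made-up grouping, for illustration only).
powersets = [(0, 1), (2,), (3,)]

def encode(Y, groups):
    """Replace each label group by one multi-class variable whose classes
    are the distinct label combinations observed for that group."""
    encoded = []
    for g in groups:
        combos = sorted({tuple(row[i] for i in g) for row in Y})
        classes = {c: k for k, c in enumerate(combos)}
        encoded.append([classes[tuple(row[i] for i in g)] for row in Y])
    return encoded

Z = encode(Y, powersets)
```

Each multi-class column in Z can then be predicted by an ordinary multi-class classifier, which is exactly the reduction used in the experiments.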
Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context, so as to provide empirical support for the proposed methodology. The main theoretical results appear formally as two theorems (Theorems 6 and 7) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work and we conclude in Section 7 with a summary of our contribution.

2. Preliminaries

We define next some key concepts used throughout the paper and state some results that will support our analysis. Upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).

2.1. Bayesian networks

Formally, a BN is a tuple ⟨G, P⟩, where G = ⟨U, E⟩ is a directed acyclic graph (DAG) with nodes representing the random variables U, and P is a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable X ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:
P(U) = P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa_i^G)    (1)
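Eq. (1) can be checked numerically on a toy network. A minimal sketch in Python (not the paper's R implementation); the three-node DAG X1 → X3 ← X2 and its conditional probability tables are invented for illustration:

```python
from itertools import product

# Hypothetical 3-node DAG: X1 -> X3 <- X2, all variables binary.
parents = {"X1": [], "X2": [], "X3": ["X1", "X2"]}

# Conditional probability tables P(Xi = 1 | parent values); numbers made up.
p1 = {"X1": {(): 0.3},
      "X2": {(): 0.6},
      "X3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}}

def joint(assignment):
    """P(x1, x2, x3) via Eq. (1): the product of P(xi | pa_i)."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = p1[var][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

# The factored joint must sum to 1 over all 2^3 configurations.
total = sum(joint(dict(zip(["X1", "X2", "X3"], vals)))
            for vals in product([0, 1], repeat=3))
```

Only 5 parameters are stored here instead of the 7 needed for an unconstrained joint over three binary variables, which is the parsimony Eq. (1) buys.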
Eq. (1) allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few. A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥_G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥_G Y | Z is true when, for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants are in Z. If ⟨G, P⟩ is a BN, X ⊥_P Y | Z if X ⊥_G Y | Z. The converse does not necessarily hold. We say that ⟨G, P⟩ satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥_P Y | Z iff X ⊥_G Y | Z.

A Markov blanket M_T of T is any set of variables such that T is conditionally independent of all the remaining variables given M_T. By extension, a Markov blanket of T in V guarantees that M_T ⊆ V and that T is conditionally independent of the remaining variables in V, given M_T. A Markov boundary, MB_T, of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T. We denote by PC_T^G the set of parents and children of T in G, and by SP_T^G the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that ⟨G, P⟩ satisfies the faithfulness condition, and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.

Theorem 1. Suppose ⟨G, P⟩ satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃Z ⊆ U \ {X, Y} such that X ⊥ Y | Z. Moreover, MB_X = PC_X ∪ SP_X. A proof can be found, for instance, in Neapolitan (2004).
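The graphical characterization MB_X = PC_X ∪ SP_X in Theorem 1 is easy to illustrate on a toy DAG. A minimal Python sketch (the edge list is our own invented example, not from the paper):

```python
# Toy DAG, invented for illustration: edges given as (parent, child) pairs.
edges = [("A", "T"), ("B", "T"), ("T", "C"), ("D", "C"), ("C", "E")]

def markov_boundary(target, edges):
    """MB_T = parents(T) ∪ children(T) ∪ spouses(T), the graphical reading
    of Theorem 1 under the faithfulness condition."""
    parents = {u for (u, v) in edges if v == target}
    children = {v for (u, v) in edges if u == target}
    # Spouses: the other parents of T's children.
    spouses = {u for (u, v) in edges if v in children and u != target}
    return parents | children | spouses

mb = markov_boundary("T", edges)
```

For target T the boundary is {A, B} (parents) ∪ {C} (child) ∪ {D} (spouse through C); node E, a grandchild, is correctly excluded.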
¹ A first version of H2PC without FDR control was discussed in a paper that appeared in the Proceedings of ECML-PKDD, pp. 58–73, 2012.
Two graphs are said to be equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, established by Pearl (1988), shows that equivalent graphs have the same undirected graph but might disagree on the direction of some of the arcs.

Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e., converging edges into the same node, such as X → Y ← Z).

Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to the DAGs in the equivalence class. A structure learning algorithm is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.

2.2. Conditional independence properties

The following three theorems, borrowed from Peña et al. (2007), are proven in Pearl (1988):

Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties: Symmetry, X ⊥ Y | Z ⇒ Y ⊥ X | Z; Decomposition, X ⊥ (Y ∪ W) | Z ⇒ X ⊥ Y | Z; Weak Union, X ⊥ (Y ∪ W) | Z ⇒ X ⊥ Y | (Z ∪ W); and Contraction, X ⊥ Y | (Z ∪ W) ∧ X ⊥ W | Z ⇒ X ⊥ (Y ∪ W) | Z.
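Theorem 2 yields a direct procedure for testing Markov equivalence: compare the two skeletons and the two sets of v-structures. A minimal Python sketch (the edge-list encoding and the three example DAGs are our own, for illustration):

```python
def skeleton(edges):
    """Undirected version of a DAG: a set of frozenset pairs {u, v}."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples X -> Y <- Z with X and Z non-adjacent, stored as
    (smaller parent, child, larger parent) to make comparison order-free."""
    skel = skeleton(edges)
    vs = set()
    for (x, y) in edges:
        for (z, y2) in edges:
            if y2 == y and x < z and frozenset((x, z)) not in skel:
                vs.add((x, y, z))
    return vs

def markov_equivalent(e1, e2):
    """Theorem 2: same skeleton and same v-structures."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

# A -> B -> C and A <- B -> C encode the same independencies (no v-structure);
# the collider A -> B <- C does not.
chain    = [("A", "B"), ("B", "C")]
fork     = [("B", "A"), ("B", "C")]
collider = [("A", "B"), ("C", "B")]
```

All three graphs share the skeleton A–B–C, so only the v-structure in the collider separates the equivalence classes.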
Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the Intersection property, X ⊥ Y | (Z ∪ W) ∧ X ⊥ W | (Z ∪ Y) ⇒ X ⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary, MB_X.

Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the Composition property, X ⊥ Y | Z ∧ X ⊥ W | Z ⇒ X ⊥ (Y ∪ W) | Z, and the local Markov property, X ⊥ (ND_X \ Pa_X) | Pa_X for each X ∈ U, where ND_X denotes the non-descendants of X in G.

2.3. Structure learning strategies

The number of DAGs, G, is super-exponential in the number of random variables in the domain, and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering, Heckerman, & Meek, 2004). One needs to resort to heuristic methods in order to solve very large problems effectively. Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures, and are also capable of dealing with incomplete records in the database (e.g., with the EM technique).
Although SS methods are favored in practice when dealing with small dimensional data sets, they are slow to converge and the computational complexity often prevents us from finding optimal BN structures (Kojima et al., 2010; Perrier et al., 2008). With currently available exact algorithms (Cussens & Bartlett, 2013; Koivisto & Sood, 2004; Silander & Myllymäki, 2006; Studený & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive and restrictions on the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Peña et al., 2007; Peña, 2008; Rodrigues de Morais & Aussem, 2010a, 2010b; Tsamardinos et al., 2006).

With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. (2006) proposed the Max–Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance.
More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC (Spirtes, Glymour, & Scheines, 2000), the Sparse Candidate (Friedman, Nachman, & Pe'er, 1999b), the Three Phase Dependency Analysis (Cheng, Greiner, Kelly, Bell, & Liu, 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the greedy hill-climbing search, on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time. In order to enhance its performance on small dimensional data sets, Perrier et al. (2008) proposed a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, when a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. (2010) proposed to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 10³ times longer than MMHC on average, which is still prohibitive.
Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, these methods make the strong assumption that the skeleton (also called the super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Peña, 2008; Tsamardinos & Brown, 2008), to our knowledge there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).
2.4. Constraint-based structure learning

Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods through the identification of local neighborhoods (i.e., PC_X), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete, and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence.

Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk; i = 1, …, R; j = 1, …, C; k = 1, …, L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal Σ_j n_ijk, and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information. The mutual information is an information-theoretic distance measure defined as

MI(X; Y | Z) = Σ_{i=1}^{R} Σ_{j=1}^{C} Σ_{k=1}^{L} (n_ijk / n) log( (n_ijk n_++k) / (n_i+k n_+jk) )

It is proportional to the log-likelihood ratio test G² (they differ by a factor 2n, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to Agresti (2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Rodrigues de Morais & Aussem, 2010a; Spirtes et al., 2000; Tsamardinos & Borboudakis, 2010; Tsamardinos et al., 2006) to overcome some shortcomings of the asymptotic tests. In this study we use the following two heuristics, which are also used in MMHC. First, we do not perform the test MI(X; Y | Z) and assume independence if there are not enough samples to achieve large enough power. We require that the average sample count per cell is above a user-defined parameter, equal to 5, as in Tsamardinos et al. (2006). This heuristic is called the power rule. Second, we consider as a structural zero either case n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider the jth value as structurally forbidden for Y when Z = z_k, and we reduce the table dimension by 1 (as if we had one column less in the contingency table where Z = z_k). This is known as the degrees of freedom adjustment heuristic.

3. The H2PC algorithm

In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move to H2PC.

3.1. Parents and children discovery

HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on IAMBFDR (Peña, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input, and returns an estimation of PC_T. It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS_T of PC_T (line 1) and a superset SPS_T of SP_T (line 2), with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner FDR-IAPC on PCS_T ∪ SPS_T (line 3). The key idea is the decentralized search at lines 4–8, which includes in the candidate PC set all variables in the superset PCS_T \ PC_T that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Peña, 2008). By loosening the criteria by which two nodes are said to be adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC, as we shall see in the experiments. We proved in Rodrigues de Morais and Aussem (2010a) that the original HPC(T) is consistent, i.e., its output converges in probability to PC_T if the hypothesis tests are consistent. The proof also applies to the modified version presented here.

We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimation of PC_T, hence the term ''weak'' PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC_T) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by Peña (2008), which is itself a modification of the incremental association Markov boundary algorithm (IAMB) (Tsamardinos et al., 2003), to control the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3–6, the spouses SP_T from the estimated Markov boundary MB_T output by IAMBFDR at line 1, and returns PC_T, assuming the faithfulness condition. The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC_T and SP_T respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: (i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and (ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in Rodrigues de Morais and Aussem (2010a, 2010b). DE-SPS works in two steps. First, a growing phase (lines 4–8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS_T. The
shrinking phase (lines 9–16) discards irrelevant variables that are ancestors or descendants of a target’s spouse. Pruning such irrelevant variables speeds up HPC. Algorithm 1. HPC Require: T: target; D: data set; U: set of variables Ensure PCT : Parents and Children of T 1: 2: 3: 4: 5: 6: 7: 8: DE ? PCS?T ; D; U? ?PCST ; dSep? DE-SPS?T ; D; U; PCST ; dSep? SPST PCT FDR-IAPC ?T ; D; ?T [ PCST [ SPST ?? for all X 2 PCST n PCT do if T 2 FDR-IAPC ?X ; D; ?T [ PCST [ SPST ?? then PCT PCT [ X then end if end for
Algorithm 4. DE-SPS Require: T: target; D: data set; U: the set of variables; PCST : parents and children superset of T; dSep: d-separating sets; Ensure: SPST : Superset of the spouses of T; ; 1: SPST 2: for all X 2 PCST do 3: 4: 5: 6: 7: 8: 9: 10: 11: ; SPSX T for all Y 2 U n fT [ PCST g do if ?T Y jdSep?Y ? [ X ? then SPSX T end if end for SPSX T [Y
for all Y 2 SPSX T do for all Z 2 SPSX T n Y do if ?T ? Y j X [ Z ? then
Algorithm 2. FDR-IAPC Require: T: target; D: data set; U: set of variables Ensure: PCT : Parents and children of T; ? Learn the Markov boundary of T 1: MBT IAMBFDR?X ; D; U? ? Remove spouses of T from MBT 2: PCT MBT 3: for all X 2 MBT do 4: if 9Z # ?MBT n X ? such that T ? X j Z then 5: PCT PCT n X 6: end if 7: end for
12: SPSX SPSX T T nY 13: break loop FOR 14: end if 15: end for 16: end for 17: SPST SPST [ SPSX T 18: end for
Algorithm 5. Hybrid HPC Require: D: data set; U: set of variables Ensure: A DAG G on the variables U 1: for all pair of nodes X ; Y 2 U do 2: Add X in PCY and Add Y in PCX if X 2 HPC ?Y ? and Y 2 HPC ?X ? 3: end for 4: Starting from an empty graph, perform greedy hillclimbing with operators add-edge, delete-edge, reverse-edge. Only try operator add-edge X ! Y if Y 2 PCX
Algorithm 3. DE-PCS Require: T: target; D: data set; U: set of variables; Ensure: PCST : parents and children superset of T; dSep: dseparating sets; 1: 2: 3: 4: 5: 6: 7: Phase I: Remove X if T ? X PCST UnT for all X 2 PCST do if ?T ? X ? then PCST PCST n X dSep?X ? ; end if end for
3.2. Hybrid HPC (H2PC) In this section, we discuss the SS phase. The following discussion draws strongly on Tsamardinos et al. (2006) as the SS phase in Hybrid HPC and MMHC are exactly the same. The idea of constraining the search to improve time-ef?ciency ?rst appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in ef?ciency improvements over the (unconstrained) greedy search. All recent hybrid algorithms build on this idea, but employ a sound algorithm for identifying the candidate parent sets. The Hybrid HPC ?rst identi?es the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BN. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken and the search continues in a similar fashion recursively. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the ?rst phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100
Phase II: Remove X if T ? X j Y 8: for all X 2 PCST do 9: for all Y 2 PCST n X do 10: if ?T ? X j Y ? then 11: PCST PCST n X 12: dSep?X ? Y 13: break loop FOR 14: end if 15: end for 16: end for
6760
M. Gasse et al. / Expert Systems with Applications 41 (2014) 6755–6772
Table 1 Description of the BN benchmarks used in the experiments. Network # var. # edges Max. degree in/out Child Insurance Mildew Alarm Hail?nder Munin1 Pigs Link 20 27 35 37 56 186 441 724 25 52 46 46 66 273 592 1125 2/7 3/7 3/3 4/5 4/16 3/15 2/39 3/14 Domain Range 2–6 2–5 3–100 2–4 2–11 2–21 3–3 2–4 min/med/max PC set size 1/2/8 1/3/9 1/2/5 1/2/6 1/1.5/17 1/3/15 1/2/41 0/2/17
structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during the search, the algorithm terminates and the overall best-scoring structure is returned. Clearly, the more false positives the heuristic allows into the candidate PC set, the greater the computational burden imposed on the SS phase.

4. Experimental validation on synthetic data

Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes, in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network. We do not claim that these data sets resemble real-world problems; however, they make it possible to compare the outputs of the algorithms with the known structure.

We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from Scutari (2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available.2 The threshold considered for the type I error of the test is 0.05. Our experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470M CPU @ 3.20 GHz and 4 GB RAM, running 64-bit Linux. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository.3 Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10,000, 20,000 and 50,000. All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.

4.1. Performance indicators

We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges) and a combination of precision and recall, sqrt((1 - precision)^2 + (1 - recall)^2), which measures the Euclidean distance from perfect precision and recall, as proposed in Peña et al. (2007). Second, to assess the quality of the final DAG output at the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:

- the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from Heckerman, Geiger, and Chickering (1995) and Buntine (1991) and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10 as suggested in Koller and Friedman (2009);
- the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;
- the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;
- the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;
- the Structural Hamming Distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.

For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC and then compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50,000 observations. Notice that using the BDeu score as a metric of reconstruction quality has the following two problems. First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyperparameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold-standard network at all; instead it employs several assumptions to score the networks. For those reasons, in addition to the BDeu score we also report the BIC score and the SHD metric.

4.2. Results

In Fig. 1, we report the quality of the skeleton obtained with HPC relative to that obtained with MMPC (before the SS phase) as a function of the sample size. Detailed results for each benchmark are not shown here due to space restrictions; for the sake of conciseness, the performance values are averaged over the 8 benchmarks described in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC. Note that for some indicators an increase is actually not an improvement but a deterioration (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC over MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected, this benefit comes at little expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect
2 https://github.com/madbix/bnlearn-clone-3.4
3 http://www.bnlearn.com/bnrepository
M. Gasse et al. / Expert Systems with Applications 41 (2014) 6755–6772
Fig. 1. Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase): false negative rate, false positive rate, recall, precision, Euclidean distance, and number of statistical tests, as a function of the sample size. The first two panels present mean values aggregated over the 8 benchmarks. The last four panels present increase factors of HPC/MMPC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time of the CB phase (see the number of statistical tests). It is worth noting that HPC is capable of reducing the Euclidean distance by 50% with 50,000 samples (lower left plot). These results are very much in line with other experiments presented in Rodrigues de Morais and Aussem (2010a) and Villanueva and Maciel (2012).

In Fig. 2, we report the quality of the final DAG obtained with H2PC relative to that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), the two algorithms are nearly tied on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50,000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a noticeable computational overhead compared to MMHC. The running time increase factor grows roughly linearly with the sample size; with 50,000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.

5. Application to multi-label learning

In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned
Fig. 2. Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase): log(BDeu posterior) and BIC on training data, log(BDeu posterior) and BIC on test data, Structural Hamming Distance, and number of scores, as a function of the sample size. The 6 panels present increase factors of H2PC/MMHC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
simultaneously to multiple binary labels (Dembczyński et al., 2012; Kocev et al., 2007; Madjarov et al., 2012; Read et al., 2009; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0,1}^n) is the label set, U = X ∪ Y, and p is a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).

5.1. Related work

This MLC problem may be tackled in various ways (Alvares-Cherman, Metz, & Monard, 2011; Blockeel, Raedt, & Ramon, 1998; Kocev et al., 2007; Luaces, Díez, Barranquero, del Coz, & Bahamonde, 2012; Read et al., 2009). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas et al., 2010b; Tsoumakas & Vlahavas, 2007). The two methods are opposites in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem has attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to Dembczyński et al. (2012) and Tsoumakas et al. (2010b) for a review. The second fundamental problem that we wish to
Table 2
Total running time ratio R (H2PC/MMHC). White cells indicate a ratio R < 1 (in favor of H2PC), while shaded cells indicate a ratio R > 1 (in favor of MMHC); the darker the cell, the larger the ratio.
Table 3
Data sets characteristics.

Data set   Domain    |D|    dim(D)  F(D)   L(D)  DL(D)
Emotions   Music      593      72   cont.     6     27
Yeast      Biology   2417     103   cont.    14    198
Image      Images    2000     135   cont.     5     20
Scene      Images    2407     294   cont.     6     15
Slashdot   Text      3782    1079   disc.    22    156
Genbase    Biology    662    1186   disc.    27     32
Medical    Text       978    1449   disc.    45     94
Enron      Text      1702    1001   disc.    53    753
Bibtex     Text      7395    1836   disc.   159   2856
Corel5k    Images    5000     499   disc.   374   3175
address involves finding an optimal feature subset selection of a label set, w.r.t. an Information Theory criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu, Li, & Han, 2011; Spolaôr, Cherman, Monard, & Lee, 2013).

5.2. Label powerset decomposition

We shall first introduce the concept of label powerset that will play a pivotal role in the factorization of the conditional distribution p(Y | X).

Definition 1. Y_LP ⊆ Y is called a label powerset iff Y_LP ⊥ Y \ Y_LP | X. Additionally, Y_LP is said to be minimal if it is non-empty and has no label powerset as a proper subset.

Let LP denote the set of all label powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors

p(Y | X) = ∏_{j=1}^{L} p(Y_LPj | X) = ∏_{j=1}^{L} p(Y_LPj | M_LPj)

where {Y_LP1, ..., Y_LPL} is a partition of label powersets and M_LPj is a Markov blanket of Y_LPj in X. From the above definition, we have Y_LPi ⊥ Y_LPj | X, ∀ i ≠ j. In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non-label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for the subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

max_y p(y | x) = ∏_{j=1}^{L} max_{y_LPj} p(y_LPj | x) = ∏_{j=1}^{L} max_{y_LPj} p(y_LPj | m_LPj)

The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_LPj and their Markov boundaries M_LPj from a DAG, in order to be able to estimate the MPE more effectively.

5.3. Label powerset characterization

In this section, we show that minimal label powersets can be depicted graphically when p is faithful to a DAG.

Theorem 6. Suppose p is faithful to a DAG G. Then, Y_i and Y_j belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Y_i and Y_j in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Y_p → X ← Y_q).

The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG.
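The graphical criterion of Theorem 6 can be turned into a small connected-components routine: build an auxiliary undirected graph over the labels and the feature colliders with at least two label parents, then group labels by component. The sketch below is our own illustration, not the authors' implementation:

```python
from collections import defaultdict

def minimal_label_powersets(edges, labels):
    """Group labels into minimal label powersets following Theorem 6
    (sketch). edges: directed edges (u, v) of the DAG G; labels: set
    of label nodes. Two labels are grouped iff they are connected by
    a path whose intermediate nodes are labels, or feature colliders
    with at least two label parents."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    # Nodes allowed as path intermediates: labels, and features whose
    # parent set contains >= 2 labels (colliders Y_p -> X <- Y_q).
    allowed = set(labels) | {v for v, ps in parents.items()
                             if v not in labels and len(ps & set(labels)) >= 2}
    adj = defaultdict(set)
    for u, v in edges:
        if u in allowed and v in allowed:
            adj[u].add(v)
            adj[v].add(u)
    # Connected components, restricted to label nodes.
    seen, powersets = set(), []
    for y in labels:
        if y in seen:
            continue
        stack, comp = [y], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in labels:
                comp.add(n)
            stack.extend(adj[n] - seen)
        powersets.append(comp)
    return powersets
```

For example, with Y1 → X1 ← Y2 (a feature collider) and Y2 → Y3, the labels Y1, Y2, Y3 form one powerset, while an isolated label forms a singleton.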
Theorem 7. Suppose p is faithful to a DAG G. Let Y_LP = {Y_1, Y_2, ..., Y_n} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} (PC_{Y_j} ∪ SP_{Y_j}) \ Y.

The proof is given in the Appendix.

5.4. Experimental setup

The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible combinations of labels, or as many as are present in the training data. At this point, it should be noted that the LP method is a special case of this framework, since the whole label set is a particular label powerset (though not necessarily a minimal one). The above procedure can be summarized as follows:

1. Learn the BN local graph G around the label set;
2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e. arg max_y p(y | x).

To assess separately the quality of the minimal label powerset decomposition and of the feature subset selection with Markov boundaries, we investigate 4 scenarios:

BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.

BR with feature selection (denoted BR + MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.
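Step 2 of the procedure reads each Markov boundary directly off the DAG; by Theorem 7 it is the union of the parents, children and spouses of the labels in the powerset, minus the label nodes themselves. An illustrative sketch (function and variable names are ours):

```python
from collections import defaultdict

def markov_boundary_in_X(edges, powerset, labels):
    """Theorem 7 (sketch): M = union over Y_j in the powerset of
    PC(Y_j) (parents and children) and SP(Y_j) (spouses, i.e. the
    other parents of Y_j's children), with all label nodes removed."""
    parents, children = defaultdict(set), defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
        children[u].add(v)
    boundary = set()
    for y in powerset:
        boundary |= parents[y] | children[y]   # PC(Y_j)
        for c in children[y]:
            boundary |= parents[c] - {y}       # SP(Y_j)
    return boundary - set(labels)
```

For instance, with X1 → Y1 → X2 and X3 → X2, the boundary of {Y1} is {X1, X2, X3}: its parent, its child, and its child's other parent.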
Fig. 3. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions and Yeast.
Table 4
Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels L(D) and distinct label combinations DL(D) is recalled in parentheses for convenience.

H2PC
Data set   # LPs          # labels/LP  (L(D))   # classes/LP  (DL(D))
Emotions     1.0 ± 0.0    6/6/6        (6)      26/27/27      (27)
Yeast        2.0 ± 0.0    2/7/12       (14)     3/64/135      (198)
Image        1.0 ± 0.0    5/5/5        (5)      19/20/20      (20)
Scene        1.0 ± 0.0    6/6/6        (6)      14/15/15      (15)
Slashdot     9.5 ± 0.8    1/1/14       (22)     1/2/111       (156)
Genbase     26.9 ± 0.3    1/1/2        (27)     1/2/2         (32)
Medical     38.6 ± 1.8    1/1/6        (45)     1/2/10        (94)
Enron       24.5 ± 1.9    1/1/30       (53)     1/2/334       (753)
Bibtex      39.6 ± 4.3    1/1/112      (159)    2/2/1588      (2856)
Corel5k     97.7 ± 3.9    1/1/263      (374)    1/2/2749      (3175)

MMHC
Data set   # LPs          # labels/LP  (L(D))   # classes/LP  (DL(D))
Emotions     1.2 ± 0.4    1/6/6        (6)      2/26/27       (27)
Yeast        2.0 ± 0.0    1/7/13       (14)     2/89/186      (198)
Image        1.0 ± 0.0    5/5/5        (5)      19/20/20      (20)
Scene        1.0 ± 0.0    6/6/6        (6)      14/15/15      (15)
Slashdot    15.1 ± 0.6    1/1/9        (22)     1/2/46        (156)
Genbase     27.0 ± 0.0    1/1/1        (27)     1/2/2         (32)
Medical     41.7 ± 1.4    1/1/4        (45)     1/2/7         (94)
Enron       36.8 ± 2.7    1/1/12       (53)     1/2/119       (753)
Bibtex      77.2 ± 3.1    1/1/28       (159)    2/2/233       (2856)
Corel5k    165.3 ± 5.5    1/1/177      (374)    1/2/2294      (3175)

Table 6
DAG learning time (in seconds) and average degree of the label nodes.

           MMHC                        H2PC
Data set   Time     Label degree       Time        Label degree
Emotions      1.5   2.6 ± 1.1              16.2    4.7 ± 1.8
Yeast         9.0   3.0 ± 1.6              90.9    5.0 ± 2.9
Image         3.1   4.0 ± 0.8              28.4    4.6 ± 1.0
Scene         9.9   3.1 ± 1.0             662.9    6.6 ± 1.8
Slashdot     44.8   2.7 ± 1.4             822.3    4.0 ± 5.2
Genbase      12.5   1.0 ± 0.3              12.4    1.1 ± 0.3
Medical      43.8   1.7 ± 1.1             334.8    2.9 ± 2.4
Enron       199.8   1.2 ± 1.5           47863.0    4.3 ± 5.2
Bibtex      960.8   2.6 ± 1.3          159469.6    6.4 ± 10.3
Corel5k     169.6   1.5 ± 1.4            2513.0    2.9 ± 2.8
Table 5
Global classification accuracies using 4 learning methods (BR, MLP, BR + MB, MLP + MB) based on the DAG obtained with H2PC and MMHC. Best values between H2PC and MMHC are boldfaced.

                    MLP               BR + MB           MLP + MB
Data set   BR       MMHC    H2PC      MMHC    H2PC      MMHC    H2PC
Emotions   0.327    0.371   0.391     0.147   0.266     0.189   0.285
Yeast      0.169    0.273   0.271     0.062   0.155     0.073   0.233
Image      0.342    0.504   0.504     0.227   0.252     0.308   0.360
Scene      0.580    0.733   0.733     0.123   0.361     0.199   0.565
Slashdot   0.355    0.419   0.474     0.274   0.373     0.278   0.439
Genbase    0.069    0.069   0.069     0.974   0.976     0.974   0.976
Medical    0.454    0.499   0.502     0.630   0.651     0.636   0.669
Enron      0.134    0.165   0.180     0.024   0.079     0.079   0.157
Bibtex     0.108    0.115   0.167     0.120   0.133     0.121   0.177
Corel5k    0.002    0.030   0.043     0.000   0.002     0.010   0.034
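The global (subset) accuracy reported in Table 5 requires an exact match of the whole predicted label set. A minimal sketch of this measure, together with the DL(D) statistic of Table 3 (illustrative code, names ours):

```python
def distinct_label_combinations(label_rows):
    """DL(D): number of distinct label combinations in the data set."""
    return len({tuple(y) for y in label_rows})

def global_accuracy(y_true, y_pred):
    """Subset accuracy: fraction of examples whose whole label set is
    predicted exactly (complement of the subset 0/1 loss)."""
    exact = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)
```

A single wrongly predicted label makes the whole example count as an error, which is what makes this measure so strict.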
5.4.1. Data and performance indicators

A total of 10 multi-label data sets are collected for the experiments in this section; their characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, and music. They can be found in the Mulan5 repository, except for image, which comes from Zhou6 (Maron & Ratan, 1998). Let D be a multi-label data set; we use |D|, dim(D), L(D) and F(D) to denote the number of examples, the number of features, the number of possible labels, and the feature type, respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.

The performance of a multi-label classifier can be assessed by several evaluation measures (Tsoumakas, Katakis, & Vlahavas, 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (an exact match of all labels is required). Note that the global accuracy implicitly takes into account the label correlations. It is therefore a very strict evaluation measure, as it requires an exact match of the predicted and the true set of labels. It was recently proved in Dembczyński et al. (2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g., the subset 0/1 loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.

5.4.2. Results

Table 4 reports the outputs of H2PC and MMHC, and Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figs. 3–7 display graphically the local DAG structures around the labels, obtained with H2PC and MMHC, for illustration purposes.

Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, and corel5k. The DAGs displayed in Figs. 3–7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-separation criterion can be used to read them off the graph. Many of the label powersets reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Fig. 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare
Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.

Minimal label powerset method with feature selection (denoted MLP + MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in each powerset.

We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size of RF to 100 trees and left the other parameters at their default values.
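The LP-style encoding used in step 3 of the procedure maps each distinct label combination within a powerset to one class of a multi-class problem, and back again at prediction time. A minimal sketch (illustrative, names ours):

```python
def encode_powerset(label_rows):
    """LP transformation (sketch): map each distinct label combination,
    restricted to one powerset, to a class id. Returns the class ids
    and the decoding table used to recover label vectors."""
    table = {}    # label combination -> class id
    classes = []
    for y in label_rows:
        key = tuple(y)
        if key not in table:
            table[key] = len(table)
        classes.append(table[key])
    decode = {c: key for key, c in table.items()}
    return classes, decode
```

A multi-class classifier is then trained on the class ids; its predictions are decoded back into label vectors through the `decode` table before aggregating the powersets into a full label set.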
4 http://cran.r-project.org/web/packages/randomForest
5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx
Fig. 4. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image and Scene.
useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is however well beyond the scope of this paper to delve deeper into DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser than those recovered by MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular: on bibtex (resp. enron), the average label degree rose from 2.6 to 6.4 (resp. from 1.2 to 4.3). This result is in good agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate).

Second, Table 4 is very instructive as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average. In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. An interesting observation upon looking at the distribution of the label powerset sizes is that, for the remaining data sets, MLP mostly decomposes the label set into two parts: one on which it performs BR and another on which it performs LP. Take enron for instance: it can be seen from Table 4 that with H2PC there are approximately 23 label singletons and one powerset of 30 labels, for a total of 53 labels. The gain in performance of MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition, as BR ignores label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimal label powersets and their relevant features.

Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP + MB rose from 20% (with MMHC) to 56% (with H2PC). On yeast, it rose from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced but still in favor of H2PC. A Wilcoxon signed rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), i.e., with MLP + MB. Note however that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
Fig. 5. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot and Genbase.
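Structures such as those in Fig. 5 are the basis of the feature selection used in this study: the selected features for a label are its Markov boundary in the learned graph (parents, children, and spouses). As a minimal illustrative sketch (not part of the released R code), the blanket can be read off a DAG given as parent sets; node names below are hypothetical:

```python
def markov_blanket(dag, target):
    """Return the Markov blanket of `target` in a DAG.

    `dag` maps each node to the set of its parents. The blanket is the
    union of the node's parents, its children, and the children's other
    parents (spouses) -- the minimal set that shields `target` from the
    rest of the network under the faithfulness condition.
    """
    parents = set(dag[target])
    children = {n for n, ps in dag.items() if target in ps}
    spouses = {p for c in children for p in dag[c]} - {target}
    return parents | children | spouses

# Hypothetical toy DAG: label L has parents f1, f2; f3 is a child of L
# whose other parent f4 is therefore a spouse of L.
dag = {"f1": set(), "f2": set(), "L": {"f1", "f2"},
       "f3": {"L", "f4"}, "f4": set()}
print(sorted(markov_blanket(dag, "L")))  # -> ['f1', 'f2', 'f3', 'f4']
```

Restricting the classifier's input to this set is what yields the drastic dimensionality reductions reported for the feature selection experiments.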
Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, feeding the selected features into the classification model is not shown to be greatly beneficial in terms of global accuracy on average. BR and our MLP with all the features outperform their counterparts with the selected features on 6 data sets, while the feature selection leads to actual improvements on 3 data sets. The difference in accuracy with and without the feature selection was not statistically significant (p > 0.20 with a Wilcoxon signed rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set the increase in accuracy is the most impressive: it rose from 7% to 98%, which is atypical. This dramatic increase is due solely to the restricted feature set used as input to the classification model. The same is observed on medical, though to a lesser extent. Interestingly, on large and densely connected networks (e.g. bibtex, slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noticeable. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should thus be evaluated in view of its effectiveness at balancing the increase in error against the decrease in computational burden obtained by drastically reducing the feature space. We should also keep in mind that feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model. As far as the overall running time is concerned, we see from Table 6 that for both methods the running time grows roughly exponentially with the size of the Markov boundary
Fig. 6. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Medical and Enron.
and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is considerable but still affordable (44 h of single-CPU time on bibtex and 13 h on enron). We also observe that the size of the parent set with H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.

6. Discussion & practical applications

Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in the application of BN structure discovery approaches has been the lack of suitable computing resources and of relevant accessible software. The number of variables which can be included in an exact BN analysis is still limited; as a guide, it might be less than about 40 variables for exact structural search techniques (Kojima et al., 2010; Perrier et al., 2008). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a personal computer, while maintaining very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in Tsamardinos et al. (2006) and Villanueva and Maciel (2014). The advantages in terms of structure accuracy and the ability to scale to thousands of variables open up many avenues of possible future applications of H2PC in various domains, as we shall
discuss next. For example, BNs have proven to be especially useful abstractions in computational biology (Aussem et al., 2012; Aussem, Tchernof, Rodrigues de Morais, & Rome, 2010; Nagarajan, Scutari, & Lèbre, 2013; Peña, Björkegren, & Tegnér, 2005; Peña, 2008; Prestat et al., 2013; Scutari & Nagarajan, 2013). Identifying the gene network is crucial for understanding the behavior of the cell, which, in turn, can lead to better diagnosis and treatment of diseases. It is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or the development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou, 2006). Learning BN models of gene networks from these huge data sets is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman, Nachman, & Pe'er, 1999a; Ott, Imoto, & Miyano, 2004; Pe'er, Regev, Elidan, & Friedman, 2001). In these studies, the authors had to decide in advance which genes were included in the learning process (in all cases fewer than 1000) and which genes were excluded from it (Peña, 2008). H2PC overcomes this problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Peña et al., 2005). Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery for multi-label learning. In multi-label data, where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies, and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood and Markov blanket induction is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek, Worring, Gemert, Geusebroek, & Smeulders, 2006; Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou, 2006),
[Figure residue: node labels from the local BN structures learned by MMHC and H2PC on Bibtex (TAG_… nodes) and Corel5k (Cluster… nodes); the plots themselves are not recoverable in text form.]
Cluster338 glass Cluster215 Cluster486 Cluster426 Cluster302 Cluster222 Cluster226 Cluster372
perch
rice Cluster330 interior column
coyote
tundra
pattern
pond chairs facade perch entrance sails Cluster101
texture Cluster362 Cluster285 chrysanthemums
gate
Cluster247 Cluster207 Cluster424 Cluster353 Cluster327 dust Cluster51 pack Cluster497 pups arctic Cluster174 Cluster265 Cluster340 pond cottage Cluster412 seals canal Cluster474 head Cluster228 garden Cluster122 Cluster416 Cluster98 Cluster466 Cluster159 forest coyote Cluster411 Cluster385 river cat Cluster299 tundra farms Cluster365 Cluster167 Cluster89 window Cluster156 tree Cluster349 Cluster15 water beach Cluster423 Cluster433 grass plane Cluster477 Cluster202 Cluster11 Cluster66 Cluster168 sheep Cluster190 Cluster185 Cluster286 Cluster213 baby Cluster282 ground horses Cluster260 Cluster300 Cluster498 dall monks ruins Cluster308 Cluster436 Cluster170 Cluster128 relief column Cluster268 fence fawn Cluster26 squirrel horns stairs shadows Cluster147 Cluster117 Cluster389 porcupine rodent Cluster330 stone Cluster144 Cluster451 Cluster171 Cluster492 Cluster239 Cluster65 Cluster133 Cluster386 mosque Cluster25 Cluster335 Cluster221 pyramid slope Cluster376 Cluster283 Cluster333 Cluster235 Cluster151 dunes school Cluster219 Cluster448 Cluster468 clouds dance man umbrella hills girl Cluster53 black sand woman formation Cluster323 crowd Cluster350 indian goat dress scotland bear vineyard Cluster427 people oahu hats Cluster181 antelope runway jet prop Cluster495 steel Cluster127 desert polar house palm ships Cluster71 Cluster173 costume sky hut Cluster435 autumn Cluster115 plain wolf door fox Cluster430 Cluster273 Cluster116 face Cluster390 Cluster425 Cluster84 Cluster35 Cluster27 Cluster107 peaks snow rocks Cluster83 wall Cluster359 kit tails run den Cluster6 Cluster119 Cluster326 tracks Cluster58 ice Cluster494 Cluster421 prototype turn frozen Cluster5 Cluster45
ceiling
furniture Cluster210
shade black bear meadow grizzly sea sun polar
rose
Cluster144
Cluster23 sponges dance
Cluster67
branch
wings
mast
fall
Cluster4 Cluster480
Cluster17 Cluster296
Cluster257 Cluster130 Cluster187
snow calf vendor dress farms diver anemone drum
Cluster155 eagle pepper albatross flight Cluster146 stream
Cluster275
Cluster396 frost
Cluster304 Cluster1 Cluster394
ice pyramid frozen ruins
canoe race Cluster260 face
dust pack umbrella porcupine fly
wings museum Cluster457 art Cluster458 nest albatross birds
Cluster460 vines
Cluster479 stems
frost relief shrine
blossoms
nets
mushrooms decoration costume hats cafe sidewalk Cluster341 Cluster166
Cluster336
flight
balcony vineyard detail harvest Cluster227
Cluster420
Cluster492
clouds
decoration
booby
fly stick Cluster341 Cluster178
Cluster240 path petals Cluster415 drum Cluster209 Cluster293 pots Cluster50 Cluster62 Cluster467 poppies nest Cluster112 blooms Cluster496 Cluster493 leaf lawn Cluster383 landscape Cluster317 Cluster9 Cluster217
Cluster2 Cluster478
straightaway formula Cluster263 Cluster199 winter Cluster136 Cluster7 Cluster264 Cluster47
Cluster41 Cluster129 Cluster438 Cluster437 Cluster14 Cluster294
Cluster398
maui
rabbit Cluster498 mountain waves barn sign writing Cluster424 Cluster142 Cluster331 rodent
Cluster128
Cluster436
Cluster479
row
Cluster288 Cluster407 birds pottery Cluster465 branch Cluster142 Cluster255
straightaway
shrubs
dock Cluster70
Cluster402 Cluster110 Cluster399
Cluster241 tulip Cluster478 Cluster126 squirrel field prayer sand athlete dunes garden oahu sky Cluster29 valley scotland Cluster113 display hills plain blooms Cluster243 tusks coast lawn cactus Cluster347 trail trunk Cluster193 smoke mare post pottery floor stick Cluster42 horses ground flag pots desert room dog paintings mural Cluster461 Cluster157
plants booby
Cluster483
antelope
bench
needles parade cactus Cluster38
Cluster364 foals Cluster268 pair
cars Cluster231 Cluster397 Cluster224 Cluster489 Cluster162 canyon pool Cluster271 swimmers Cluster279 penguin island Cluster449 tower shore valley Cluster21 Cluster297 street Cluster408 Cluster96 harbor mountain bridge boats f?16 Cluster243 Cluster94 range Cluster352 Cluster106 Cluster118 coral park Cluster102 Cluster164 Cluster321 Cluster413 windmills Cluster194 Cluster459
athlete
Cluster19 Cluster393 balcony Cluster100 Cluster154
Cluster153
insect
Cluster192 Cluster491
Cluster49 Cluster93 Cluster56 Cluster235 sailboats church ceremony lake petals Cluster362 flowers
butterfly
f?16
Cluster267
chrysanthemums
elephant
Cluster55 Cluster166
Cluster470
Cluster442 sheep beach canyon water Cluster308 palm grass zebra mist dall Cluster62 stems herd railing
Cluster432
Cluster366 Cluster12
cow Cluster37 Cluster295
close?up
Cluster165
Cluster148 Cluster267 chapel Cluster266 Cluster278 sign Cluster369 Cluster291 Cluster481 reefs Cluster60 fish Cluster287 Cluster276 Cluster331 Cluster91 Cluster381 Cluster499 Cluster463 lion Cluster175 Cluster395 Cluster476 Cluster454 Cluster329 Cluster320 writing
tree Cluster390 reflection
jet carvings plane doorway basket wood stairs penguin
Cluster81
Cluster281
Cluster403 rose Cluster464 clearing
flowers truck bengal moose Cluster179 Cluster242 antlers Cluster121 Cluster262 Cluster180 herd Cluster325 kauai tiger bulls meadow Cluster490 lynx Cluster241 caribou Cluster284 rockface hillside tulip row Cluster52 Cluster256 Cluster232 cubs Cluster138 field Cluster74 grizzly Cluster301
Cluster442
Cluster225 Cluster458
Cluster306
Cluster152 leaf Cluster165 river people rocks boats Cluster195 helicopter jeep shops plants bay baby mule run close?up cathedral fruit crops
elk
prop seals pups canal Cluster214 Cluster256 cubs deer ceiling furniture Cluster313 white?tailed vegetation head bush snake fawn bench
Cluster378
Cluster195
Cluster77 storm
Cluster351
Cluster404
pepper pool
lion
shell
Cluster342 Cluster211 Cluster46
slope goat lynx Cluster82
giraffe
store
harbor Cluster317 man woman turn butterfly insect reptile Cluster383 lizard island swimmers log Cluster205 Cluster351 tent cat tiger vegetables crab
market Cluster21
Cluster357 Cluster470
Cluster443
Cluster254
Cluster457
Cluster456
Cluster429
guard clothes
Cluster289
food
Cluster482
cheese street indian tables
Cluster9 Cluster399 path chapel
Cluster405
bengal
hawaii Cluster137
shade
Cluster322
nets Cluster56
village reflection train arch
Cluster422 town skyline smoke market locomotive ocean store Cluster450 cliff
truck gate Cluster482 Cluster173 restaurant fish plaza windmills shore
Cluster412
stream Cluster493 Cluster209
tower
Cluster286 school monument hawk terrace
barn
Cluster140
Cluster358 elephant Cluster82
Cluster420
crystals eagle cars kauai Cluster403 Cluster230 Cluster59 Cluster356
Cluster269
fall
crafts
trunk Cluster417 Cluster63
zebra Cluster377
cheese african waves Cluster22 coast
pole monks Cluster110 horns
pattern texture Cluster217 Cluster425
fog Cluster409 Cluster305 designs
castle
railroad shops
Cluster96
Cluster129
Cluster434
marsh festival ships Cluster353 Cluster326 architecture
Cluster292 Cluster152 Cluster227 Cluster233
waterfalls Cluster441 Cluster16 Cluster488 mist rainbow
town
tracks park storm peaks giant tortoise fence iguana marine poppies
Cluster280 Cluster111
maui
festival Cluster123 Cluster387 crab
buildings lighthouse architecture
Cluster291
parade
shadows
winter
runway
boeing
den
Cluster61 tusks room hawk Cluster230 Cluster191 model floor lichen Cluster343 Cluster322 Cluster76 Cluster270 Cluster332 Cluster137 Cluster214 animals shrubs foals Cluster419 mare Cluster356 pebbles crafts Cluster75 carpet Cluster157 Cluster252 Cluster169 Cluster309 Cluster8
courtyard ceremony Cluster316 sea
Cluster197 Cluster246 mural food cougar hotel
buildings Cluster1 hotel angelfish rockface road moon
Cluster394
flag
prototype cliff
crowd
glass
lily
cave
grouper
military
mushrooms
ladder steps model carpet vines cottage
Cluster81 Cluster485 Cluster124
Cluster88 palace city light Cluster471 fountain
prayer
skyline wolf whales animals rapids spider needles
star calf fan post display leopard
city
lichen
pebbles
vehicle
aerial
range
glacier
Cluster203
Cluster101 Cluster244
remains waterfalls
leopard
cougar
village
sun
Cluster39 night church statue lake Cluster380 tables Cluster143 Cluster361 Cluster249 bay shrine Cluster120 restaurant Cluster414 Cluster238 chairs facade remains railing detail iguana marine
Cluster182
landscape
rainbow
Cluster406 moss Cluster313 Cluster200
Cluster364 Cluster461 Cluster149 Cluster234 Cluster163 rabbit Cluster370 marsh Cluster345 Cluster208 Cluster193 Cluster48 Cluster43 Cluster258 Cluster277 Cluster259 Cluster375 Cluster348 Cluster440 Cluster72 Cluster204 Cluster462 pillar Cluster220 Cluster79 Cluster114 sunset figures Cluster318 temple Cluster310 Cluster23 pair Cluster444 Cluster384 sphinx Cluster20 Cluster68 Cluster186 Cluster150 land
house Cluster145
Cluster108 Cluster126 Cluster40 Cluster455 Cluster70 shirt Cluster85 Cluster3
Cluster363
Cluster57 sculpture
Cluster18
deer
Cluster274
Cluster113 buddha
Cluster344
window Cluster201
Cluster475
hut
Cluster281 Cluster104 dog mule
Cluster392
Cluster139
Cluster439 outside silhouette Cluster34 horizon buddhist
Cluster391
Cluster184 Cluster29 Cluster30 Cluster354 Cluster49
Cluster206 Cluster196
door
white?tailed Cluster379 Cluster223
clothes
Cluster445
road
Cluster311
Cluster400 Cluster382
roofs
Cluster160
Cluster446
Cluster360
sunrise pagoda
Cluster97 monastery
Cluster24
Cluster95
Cluster44
Fig. 7. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.
M. Gasse et al. / Expert Systems with Applications 41 (2014) 6755–6772
protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani, Bielza, Toro, & Larrañaga, 2013).

7. Conclusion & avenues for future research

We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in Perrier et al. (2008) and Kojima et al. (2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4. Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution, and their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles applied to the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: (i) the minimal label powerset decomposition (i.e. into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥ Y \ Y_LP | X), and (ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness.
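The label powerset decomposition discussed above turns each block of labels into one multi-class target: every distinct joint configuration of the labels in a block becomes one class. A minimal sketch of this encoding step (the function name and the dict-based data layout are illustrative, not taken from the H2PC source):

```python
def encode_powerset(label_rows, block):
    """Map the joint configuration of the labels in `block` to a single
    multi-class value, one class per distinct configuration observed."""
    classes = {}                      # joint configuration -> class id
    encoded = []
    for row in label_rows:            # row: dict mapping label -> 0/1
        key = tuple(row[l] for l in block)
        encoded.append(classes.setdefault(key, len(classes)))
    return encoded, classes
```

Each minimal label powerset would be encoded this way, and a separate multi-class classifier trained per block; predictions are decoded back to the original binary labels by inverting `classes`.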
H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy. We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: (1) ascertaining which independence test (e.g. tests targeting specific distributions, employing parametric assumptions, etc.) is most suited to the data at hand (Scutari, 2011; Tsamardinos & Borboudakis, 2010); (2) controlling the false discovery rate of the edges in the graph output by H2PC (Peña, 2008), especially when dealing with more nodes than samples, e.g. learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications to the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and to derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions on the joint distribution (e.g. Composition, Intersection). This needs further substantiation through more analysis.

Acknowledgments

The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European
ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhône-Alpes Complex Systems Institute (IXXI).

Appendix A

Lemma 8. If Y_LPi, Y_LPj ∈ LP, then Y_LPi ∩ Y_LPj ∈ LP.
Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition {Y1, Y2, Y3, Y4} of Y such that Y_LPi = Y1 ∪ Y2 and Y_LPj = Y2 ∪ Y3. From the label powerset assumption for Y_LPi and Y_LPj, we have (Y1 ∪ Y2) ⊥ (Y3 ∪ Y4) | X and (Y2 ∪ Y3) ⊥ (Y1 ∪ Y4) | X. From the Decomposition property, we have that Y2 ⊥ (Y3 ∪ Y4) | X. Using the Weak union property, we obtain that Y2 ⊥ Y1 | (Y3 ∪ Y4 ∪ X). From these two facts, we can use the Contraction property to show that Y2 ⊥ (Y1 ∪ Y3 ∪ Y4) | X. Therefore, Y2 = Y_LPi ∩ Y_LPj is a label powerset by definition. □

Lemma 9. Let Yi and Yj denote two distinct labels in Y, and denote by Y_LPi and Y_LPj their respective minimal label powersets. Then we have,
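The proofs in this appendix invoke the graphoid properties by name without restating them. For reference, in the notation X ⊥ Y | Z for "X is independent of Y given Z", these are the standard properties (Pearl, 1988):

```latex
\begin{align*}
\textbf{Decomposition:} \quad & X \perp (Y \cup W) \mid Z \;\Rightarrow\; X \perp Y \mid Z \\
\textbf{Weak union:}    \quad & X \perp (Y \cup W) \mid Z \;\Rightarrow\; X \perp Y \mid (Z \cup W) \\
\textbf{Contraction:}   \quad & X \perp Y \mid (Z \cup W) \;\wedge\; X \perp W \mid Z \;\Rightarrow\; X \perp (Y \cup W) \mid Z \\
\textbf{Intersection:}  \quad & X \perp Y \mid (Z \cup W) \;\wedge\; X \perp W \mid (Z \cup Y) \;\Rightarrow\; X \perp (Y \cup W) \mid Z \\
\textbf{Composition:}   \quad & X \perp Y \mid Z \;\wedge\; X \perp W \mid Z \;\Rightarrow\; X \perp (Y \cup W) \mid Z
\end{align*}
```

Note that Intersection holds only for strictly positive distributions and Composition does not hold in general, which is why the text assumes them explicitly where they are needed.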
∃Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸ {Yj} | (X ∪ Z) ⟹ Y_LPi = Y_LPj
Proof. Let us suppose Y_LPi ≠ Y_LPj. By the label powerset definition for Y_LPi, we have Y_LPi ⊥ Y \ Y_LPi | X. As Y_LPi ∩ Y_LPj = ∅ owing to Lemma 8, we have that Yj ∉ Y_LPi. ∀Z ⊆ Y \ {Yi, Yj}, Z can be decomposed as Zi ∪ Zj such that Zi = Z ∩ (Y_LPi \ {Yi}) and Zj = Z \ Zi. Using the Decomposition property, we have that ({Yi} ∪ Zi) ⊥ ({Yj} ∪ Zj) | X. Using the Weak union property, we have that {Yi} ⊥ {Yj} | (X ∪ Z). As this is true ∀Z ⊆ Y \ {Yi, Yj}, then ∄Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸ {Yj} | (X ∪ Z), which completes the proof. □

Lemma 10. Consider Y_LP a minimal label powerset. Then, if p satisfies the Composition property, there exists no nonempty proper subset Z of Y_LP such that Z ⊥ Y_LP \ Z | X.

Proof. By contradiction, suppose such a nonempty Z exists, i.e. Z ⊥ Y_LP \ Z | X. From the label powerset assumption of Y_LP, we have that Y_LP ⊥ Y \ Y_LP | X. From these two facts, we can use the Composition property to show that Z ⊥ Y \ Z | X, which contradicts the minimal label powerset assumption of Y_LP. This concludes the proof. □

Theorem 6. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq).

Proof. Suppose such a path exists. By conditioning on all the intermediate colliders Yk in Y of the form Yp → Yk ← Yq along an undirected path in G between nodes Yi and Yj, we ensure that ∃Z ⊆ Y \ {Yi, Yj}, ¬dSep(Yi, Yj | X ∪ Z). Due to the faithfulness, this is equivalent to {Yi} ⊥̸ {Yj} | (X ∪ Z). From Lemma 9, we conclude that Yi and Yj belong to the same minimal label powerset. To show the converse, note that owing to Lemma 10, we know that there exists no partition {Zi, Zj} of Z such that ({Yi} ∪ Zi) ⊥ ({Yj} ∪ Zj) | X. Due to the faithfulness, there exists at least one edge between ({Yi} ∪ Zi) and ({Yj} ∪ Zj) in the DAG; hence, by recursion, there exists a path between Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq). □
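Theorem 6 suggests a direct implementation: connect two labels whenever they are adjacent in G or are co-parents of a common feature node, then read the minimal label powersets off as connected components. A minimal union-find sketch under that reading, assuming the DAG is given as a list of directed edges (function and variable names are hypothetical, not from the H2PC source):

```python
from collections import defaultdict

def minimal_label_powersets(edges, labels):
    """Group labels per the graphical criterion of Theorem 6: two labels
    end up in the same group iff they are linked by a path whose
    intermediate nodes are labels, or feature colliders with two label
    parents (Yp -> X <- Yq)."""
    parent = {y: y for y in labels}

    def find(a):                       # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    labels = set(labels)
    label_parents = defaultdict(set)   # feature node -> its label parents
    for u, v in edges:                 # directed edge u -> v
        if u in labels and v in labels:
            union(u, v)                # direct label-label adjacency
        elif u in labels:
            label_parents[v].add(u)
    # a feature node with >= 2 label parents acts as a collider and
    # connects all of its label parents
    for ps in label_parents.values():
        ps = sorted(ps)
        for other in ps[1:]:
            union(ps[0], other)

    groups = defaultdict(set)
    for y in labels:
        groups[find(y)].add(y)
    return sorted(sorted(g) for g in groups.values())
```

Note that a feature node with a single label parent does not merge anything: a non-collider feature node on a path blocks the grouping, exactly as in condition (ii) of the theorem.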
Theorem 7. Suppose p is faithful to a DAG G. Let Y_LP = {Y1, Y2, ..., Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1..n} (PC_Yj ∪ SP_Yj) \ Y.

Proof. First, we prove that M is a Markov boundary of Y_LP in U. Define Mi as the Markov boundary of Yi in U, and M′i = Mi \ Y. From Theorem 5, Mi is given in G by Mi = PC_Yi ∪ SP_Yi. We may now prove that M′1 ∪ ... ∪ M′n is a Markov boundary of {Y1, Y2, ..., Yn} in U. We show first that the statement holds for n = 2, and then conclude that it holds for all n by induction. Let W denote U \ (Y1 ∪ Y2 ∪ M′1 ∪ M′2), and define Y1 = {Y1} and Y2 = {Y2}. From the Markov blanket assumption for Y1, we have Y1 ⊥ U \ (Y1 ∪ M1) | M1. Using the Weak union property, we obtain that Y1 ⊥ W | M′1 ∪ M′2 ∪ Y2. Similarly, we can derive Y2 ⊥ W | M′2 ∪ M′1 ∪ Y1. Combining these two statements yields Y1 ∪ Y2 ⊥ W | M′1 ∪ M′2 due to the Intersection property. Letting M = M′1 ∪ M′2, the last expression can be formulated as Y1 ∪ Y2 ⊥ U \ (Y1 ∪ Y2 ∪ M) | M, which is the definition of a Markov blanket of Y1 ∪ Y2 in U. We shall now prove that M is minimal. Let us suppose that it is not the case, i.e., ∃Z ⊂ M such that Y1 ∪ Y2 ⊥ Z ∪ (U \ (Y1 ∪ Y2 ∪ M)) | M \ Z. Define Z1 = Z ∩ M1 and Z2 = Z ∩ M2; we may apply the Weak union property to get Y1 ⊥ Z1 | (M \ Z) ∪ Y2 ∪ Z2 ∪ (U \ (Y1 ∪ Y2 ∪ M)), which can be rewritten more compactly as Y1 ⊥ Z1 | (M1 \ Z1) ∪ (U \ (Y1 ∪ M1)). From the Markov blanket assumption, we have Y1 ⊥ U \ (Y1 ∪ M1) | M1. We may now apply the Intersection property on these two statements to obtain Y1 ⊥ Z1 ∪ (U \ (Y1 ∪ M1)) | M1 \ Z1. Similarly, we can derive Y2 ⊥ Z2 ∪ (U \ (Y2 ∪ M2)) | M2 \ Z2. From the Markov boundary assumption of M1 and M2, we have necessarily Z1 = ∅ and Z2 = ∅, which in turn yields Z = ∅. To conclude for any n > 2, it suffices to set Y1 = ⋃_{j=1..n−1} {Yj} and Y2 = {Yn} to conclude by induction. Second, we prove that M is a Markov blanket of Y_LP in X.
Define Z = M ∩ Y. From the label powerset definition, we have Y_LP ⊥ Y \ Y_LP | X. Using the Weak union property, we obtain Y_LP ⊥ Z | U \ (Y_LP ∪ Z), which can be reformulated as Y_LP ⊥ Z | (U \ (Y_LP ∪ M)) ∪ (M \ Z). Now, the Markov blanket assumption for Y_LP in U yields Y_LP ⊥ U \ (Y_LP ∪ M) | M, which can be rewritten as Y_LP ⊥ U \ (Y_LP ∪ M) | (M \ Z) ∪ Z. From the Intersection property, we get Y_LP ⊥ Z ∪ (U \ (Y_LP ∪ M)) | M \ Z. From the Markov boundary assumption of M in U, we know that there exists no proper subset of M which satisfies this statement, and therefore Z = M ∩ Y = ∅. From the Markov blanket assumption of M in U, we have Y_LP ⊥ U \ (Y_LP ∪ M) | M. Using the Decomposition property, we obtain Y_LP ⊥ X \ M | M which, together with the fact M ∩ Y = ∅, is the definition of a Markov blanket of Y_LP in X. Finally, we prove that M is a Markov boundary of Y_LP in X. Let us suppose that it is not the case, i.e., ∃Z ⊂ M such that Y_LP ⊥ Z ∪ (X \ M) | M \ Z. From the label powerset assumption of Y_LP, we have Y_LP ⊥ Y \ Y_LP | X, which can be rewritten as Y_LP ⊥ Y \ Y_LP | (X \ M) ∪ (M \ Z) ∪ Z. Due to the Contraction property, combining these two statements yields Y_LP ⊥ Z ∪ (U \ (M ∪ Y_LP)) | M \ Z. From the Markov boundary assumption of M in U, we have necessarily Z = ∅, which suffices to prove that M is a Markov boundary of Y_LP in X. □

References
Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley.
Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, JMLR, 11, 171–234.
Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39, 1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to estimation and control of the false discovery rate in Bayesian network skeleton identification. In European symposium on artificial neural networks, ESANN.
Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis of nasopharyngeal carcinoma risk factors with Bayesian networks. Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S. (2010). Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence in large probabilistic networks: A constraint-based approach. In Proceedings of the sixteenth European conference on artificial intelligence (pp. 263–267).
Bernard, A., & Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Proceedings of the Pacific symposium on biocomputing (pp. 459–470).
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), International conference on machine learning, ICML (pp. 55–63). Morgan Kaufmann.
Borchani, H., Bielza, C., Toro, C., & Larrañaga, P. (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predictions under manipulation. Journal of Machine Learning Research, JMLR, 3, 35–52.
Buntine, W. (1991). Theory refinement on Bayesian networks. In B. D'Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in artificial intelligence, UAI (pp. 52–60). San Mateo, CA, USA: Morgan Kaufmann.
Cawley, G. (2008). Causal and non-causal feature selection for ridge regression. JMLR: Workshop and Conference Proceedings, 3, 107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, JMLR, 5, 1287–1330.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, JMLR, 3, 507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian network learning using integer programming. In Uncertainty in artificial intelligence (pp. 182–191).
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103, 778–789.
Friedman, N., Nachman, I., & Pe'er, D. (1999). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence (pp. 206–215).
Friedman, N., Nachman, I., & Pe'er, D. (1999b). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In K. B. Laskey & H. Prade (Eds.), Uncertainty in artificial intelligence, UAI (pp. 21–30). Morgan Kaufmann Publishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybrid algorithms for Bayesian network structure learning. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD. Lecture notes in computer science (Vol. 7523). Springer.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp. 1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditional dependency networks. In International joint conference on artificial intelligence, IJCAI (pp. 1300–1305). AAAI Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In J. N. Kok (Ed.), European conference on machine learning, ECML. Lecture notes in artificial intelligence (Vol. 4701, pp. 624–631). Springer.
Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, JMLR, 5, 549–573.
Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, JMLR, 11, 285–310.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In International conference on machine learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., & Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-instance learning for natural scene classification. In International conference on machine learning, ICML (Vol. 7, pp. 341–349). Citeseer.
McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI workshop on text learning.
Moore, A., & Wong, W. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In T. Fawcett & N. Mishra (Eds.), International conference on machine learning, ICML.
Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian networks in R: With applications in systems biology (Use R!). Springer.
Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Pearson Prentice Hall.
Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for small gene networks. In Proceedings of the Pacific symposium on biocomputing (pp. 557–567).
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA, USA: Morgan Kaufmann.
Pe'er, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, 215–224.
Peña, J., Nilsson, R., Björkegren, J., & Tegnér, J. (2007). Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45, 211–232.
Peña, J. M., Björkegren, J., & Tegnér, J. (2005). Growing Bayesian network models of gene networks from seed genes. Bioinformatics, 40, 224–229.
Peña, J. M. (2008). Learning Gaussian graphical models of gene networks with false discovery rate control. In European conference on evolutionary computation, machine learning and data mining in bioinformatics (Vol. 6, pp. 165–176).
Peña, J. M. (2012). Finding consensus Bayesian network structures. Journal of Artificial Intelligence Research, 42, 661–687.
Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, JMLR, 9, 2251–2286.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier, C., Cohen, P., et al. (2013). Learning the local Bayesian network structure around the ZNF217 oncogene in breast tumours. Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org>.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD. Lecture notes in computer science (Vol. 5782, pp. 254–269). Springer.
Rodrigues de Morais, S., & Aussem, A. (2010). An efficient learning algorithm for local Bayesian network structure discovery. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD (pp. 164–169).
Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.
Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12.
Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.
Scutari, M. (2011). Measures of variability for graphical models (Ph.D. thesis). School in Statistical Sciences, University of Padova.
Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics – Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.
Silander, T., & Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Uncertainty in artificial intelligence, UAI (pp. 445–452).
Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM international conference on multimedia (pp. 421–430). ACM Press.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). The MIT Press.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Studený, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).
Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida artificial intelligence research society conference, FLAIRS'03 (pp. 376–381).
Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European conference on machine learning and knowledge discovery in databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The max–min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65, 31–78.
Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI conference on artificial intelligence (pp. 1100–1105).
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining multi-label data. In O. Maimon & L. Rokach (Eds.), Data mining and knowledge discovery handbook (2nd ed.). Springer.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European conference on machine learning (Vol. 4701, pp. 406–417).
Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International conference on pattern recognition applications and methods, ICPRAM.
Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, 3–12.
Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In ACM SIGKDD international conference on knowledge discovery and data mining, KDD (pp. 999–1008). ACM Press.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.