
Expert Systems with Applications 41 (2014) 6755–6772


A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem *, Haytham Elghazel
Université de Lyon, CNRS, Université Lyon 1, LIRIS, UMR5205, F-69622, France

Abstract
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max–Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available. © 2014 Elsevier Ltd. All rights reserved.

Article history: Available online 9 May 2014
Keywords: Bayesian networks; Multi-label learning; Markov boundary; Feature subset selection

1. Introduction

A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data has attracted a great deal of interest in recent years (Gasse, Aussem, & Elghazel, 2012; Kojima, Perrier, Imoto, & Miyano, 2010; Peña, 2012; Perrier, Imoto, & Miyano, 2008; Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Villanueva & Maciel, 2012). The inferred DAG is very useful for many applications, including feature selection (Aliferis, Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010; Peña, Nilsson, Björkegren, & Tegnér, 2007; Rodrigues de Morais & Aussem, 2010b), causal relationship inference from observational
* Corresponding author. Tel.: +33 426234466.
E-mail address: aaussem@univ-lyon1.fr (A. Aussem). http://dx.doi.org/10.1016/j.eswa.2014.04.032 0957-4174/© 2014 Elsevier Ltd. All rights reserved.

data (Aliferis et al., 2010; Aussem, Rodrigues de Morais, & Corbex, 2012, 2010; Brown & Tsamardinos, 2008; Cawley, 2008; Ellis & Wong, 2008; Prestat et al., 2013) and, more recently, multi-label learning (Dembczyński, Waegeman, Cheng, & Hüllermeier, 2012; Guo & Gu, 2011; Zhang & Zhang, 2010). Ideally, the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approach and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and constrain the DAGs considered during the SS phase.



In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC.1 It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may arise at the expense of the number of false positives, we control the expected proportion of false discoveries (i.e. false positive nodes) among all the discoveries made in PC(T). We use a modification of the Incremental Association Markov Boundary algorithm (IAMB), initially developed by Tsamardinos, Aliferis, and Statnikov (2003) and later modified by Peña (2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max–Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos, Brown, & Aliferis, 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data. We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem.
Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Kocev, Vens, Struyf, & Džeroski, 2007; Madjarov, Kocev, Gjorgjevikj, & Džeroski, 2012; Read, Pfahringer, Holmes, & Frank, 2009; Tsoumakas, Katakis, & Vlahavas, 2010b). Recent research in MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: (i) the minimal label powerset decomposition, i.e. the decomposition into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥ (Y \ Y_LP) | X, and (ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context. The rest of the paper is organized as follows: in Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs.
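To make the powerset decomposition concrete ahead of the formal treatment in Section 5, here is a small Python sketch (the paper's own implementation is in R, and the exact graphical characterization is given by Theorems 6 and 7; this is an illustration only). It assumes some pairwise oracle `dep(a, b)` deciding whether two labels remain dependent given the features X, and groups labels by the transitive closure of pairwise dependence:

```python
def label_powersets(labels, dep):
    """Group labels into blocks Y_LP with Y_LP independent of the rest given X.
    `dep(a, b)` is an assumed pairwise label-dependence test (hypothetical);
    blocks are the connected components of the pairwise dependence graph."""
    parent = {l: l for l in labels}          # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path halving
            a = parent[a]
        return a

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if dep(a, b):
                parent[find(a)] = find(b)    # merge dependent labels

    blocks = {}
    for l in labels:
        blocks.setdefault(find(l), set()).add(l)
    return sorted(blocks.values(), key=sorted)
```

Each returned block then becomes one multi-class variable encoding a label powerset, as described above.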
Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context so as to provide empirical support for the proposed methodology. The

main theoretical results appear formally as two theorems (Theorems 6 and 7) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work and we conclude in Section 7 with a summary of our contribution.

2. Preliminaries

We define next some key concepts used throughout the paper and state some results that will support our analysis. In this paper, upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).

2.1. Bayesian networks

Formally, a BN is a tuple <G, P>, where G = <U, E> is a directed acyclic graph (DAG) whose nodes represent the random variables U, and P is a joint probability distribution over U. In addition, G and P must satisfy the Markov condition: every variable X ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa^G(X). From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa^G(X_i))        (1)


Eq. (1) allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few. A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥_G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥_G Y | Z is true when for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants is in Z. If <G, P> is a BN, X ⊥_P Y | Z if X ⊥_G Y | Z. The converse does not necessarily hold. We say that <G, P> satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥_P Y | Z iff X ⊥_G Y | Z. A Markov blanket M(T) of T is any set of variables such that T is conditionally independent of all the remaining variables given M(T). By extension, a Markov blanket of T in V guarantees that M(T) ⊆ V, and that T is conditionally independent of the remaining variables in V given M(T). A Markov boundary, MB(T), of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T. We denote by PC^G(T) the set of parents and children of T in G, and by SP^G(T) the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that <G, P> satisfies the faithfulness condition, and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.

Theorem 1. Suppose <G, P> satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃ Z ⊆ U \ {X, Y} such that X ⊥ Y | Z. Moreover, MB(X) = PC(X) ∪ SP(X). A proof can be found, for instance, in Neapolitan (2004).
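For small graphs, the d-separation criterion can be implemented directly by enumerating undirected paths and applying conditions (1) and (2) above verbatim. The following Python sketch is purely illustrative (the paper's code is in R); the DAG is a plain children-adjacency dictionary:

```python
def descendants(dag, w):
    """All nodes reachable from w via directed edges."""
    seen, stack = set(), [w]
    while stack:
        u = stack.pop()
        for v in dag.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def d_separated(dag, x, y, z):
    """True iff x is d-separated from y given the set z in the DAG
    (dag maps each node to a list of its children)."""
    nodes = set(dag) | {c for cs in dag.values() for c in cs}
    parents = {n: {u for u in nodes if n in dag.get(u, [])} for n in nodes}

    def paths(cur, target, visited):
        # enumerate all undirected paths (fine for small graphs)
        if cur == target:
            yield [cur]
            return
        for nxt in set(dag.get(cur, [])) | parents[cur]:
            if nxt not in visited:
                for rest in paths(nxt, target, visited | {nxt}):
                    yield [cur] + rest

    z = set(z)
    for p in paths(x, y, {x}):
        blocked = False
        for i in range(1, len(p) - 1):
            w = p[i]
            collider = p[i - 1] in parents[w] and p[i + 1] in parents[w]
            if collider:
                # condition (2): collider blocks unless w or a descendant is in z
                if w not in z and not (descendants(dag, w) & z):
                    blocked = True
                    break
            elif w in z:
                # condition (1): non-collider in z blocks the path
                blocked = True
                break
        if not blocked:
            return False      # at least one active path
    return True
```

On the v-structure X → W ← Z, for instance, X and Z are d-separated given the empty set but become d-connected once W (or a descendant of W) is observed.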

1. A first version of H2PC without FDR control was discussed in a paper that appeared in the Proceedings of ECML-PKDD, pp. 58–73, 2012.



Two graphs are said to be equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, shown by Pearl (1988), establishes that equivalent graphs have the same underlying undirected graph but might disagree on the direction of some of the arcs.

Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e. converging edges into the same node, such as X → Y ← Z).

Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and in which all and only the edges common to the DAGs in the equivalence class are oriented. A structure learning algorithm is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.

2.2. Conditional independence properties

The following three theorems, borrowed from Peña et al. (2007), are proven in Pearl (1988):

Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties:
Symmetry: X ⊥ Y | Z ⟹ Y ⊥ X | Z;
Decomposition: X ⊥ (Y ∪ W) | Z ⟹ X ⊥ Y | Z;
Weak union: X ⊥ (Y ∪ W) | Z ⟹ X ⊥ Y | (Z ∪ W);
Contraction: X ⊥ Y | (Z ∪ W) ∧ X ⊥ W | Z ⟹ X ⊥ (Y ∪ W) | Z.
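Theorem 2 yields a mechanical equivalence check for small DAGs: compare the undirected skeletons and the sets of v-structures. A Python sketch (illustrative, not the paper's code; DAGs are children-adjacency dictionaries):

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as {node: children}."""
    return {frozenset((u, v)) for u, cs in dag.items() for v in cs}

def v_structures(dag):
    """Unshielded colliders X -> Y <- Z with X and Z non-adjacent."""
    sk = skeleton(dag)
    pa = {}
    for u, cs in dag.items():
        for v in cs:
            pa.setdefault(v, set()).add(u)
    vs = set()
    for y, ps in pa.items():
        for x in ps:
            for z in ps:
                if x < z and frozenset((x, z)) not in sk:
                    vs.add((frozenset((x, z)), y))
    return vs

def markov_equivalent(g1, g2):
    """Theorem 2: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, the chains X → Y → Z and X ← Y ← Z are equivalent, while the v-structure X → Y ← Z is not equivalent to either.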

Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the intersection property: X ⊥ Y | (Z ∪ W) ∧ X ⊥ W | (Z ∪ Y) ⟹ X ⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary, MB(X).

Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the composition property X ⊥ Y | Z ∧ X ⊥ W | Z ⟹ X ⊥ (Y ∪ W) | Z and the local Markov property X ⊥ (ND(X) \ Pa(X)) | Pa(X) for each X ∈ U, where ND(X) denotes the non-descendants of X in G.

2.3. Structure learning strategies

The number of DAGs is super-exponential in the number of random variables in the domain, and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering, Heckerman, & Meek, 2004). One has to resort to heuristic methods in order to solve very large problems effectively. Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early on in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures, and are also capable of dealing with incomplete records in the database (e.g. with the EM technique).

Although SS methods are favored in practice when dealing with small-dimensional data sets, they are slow to converge and their computational complexity often prevents us from finding optimal BN structures (Kojima et al., 2010; Perrier et al., 2008). With currently available exact algorithms (Cussens & Bartlett, 2013; Koivisto & Sood, 2004; Silander & Myllymäki, 2006; Studený & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential; therefore, such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive, and restrictions on the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Peña et al., 2007; Peña, 2008; Rodrigues de Morais & Aussem, 2010a, 2010b; Tsamardinos et al., 2006). With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. (2006) proposed the Max–Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance.
More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC algorithm (Spirtes, Glymour, & Scheines, 2000), the Sparse Candidate algorithm (Friedman, Nachman, & Peer, 1999b), the Three-Phase Dependency Analysis (Cheng, Greiner, Kelly, Bell, & Liu, 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the greedy hill-climbing search, on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time. In order to enhance its performance on small-dimensional data sets, Perrier et al. (2008) proposed a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, when a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. (2010) proposed to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 10^3 times longer than MMHC on average, which is still prohibitive.
Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, such methods make the strong assumption that the skeleton (also called the super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Peña, 2008; Tsamardinos & Brown, 2008), to our knowledge there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).



2.4. Constraint-based structure learning

Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods through the identification of local neighborhoods (i.e., PC(X)), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete, and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk; i = 1, ..., R; j = 1, ..., C; k = 1, ..., L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal Σ_j n_ijk, and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information. The mutual information is an information-theoretic distance measure defined as
MI(X; Y | Z) = Σ_{i=1}^{R} Σ_{j=1}^{C} Σ_{k=1}^{L} (n_ijk / n) log( (n_ijk n_++k) / (n_i+k n_+jk) )        (2)

3.1. Parents and children discovery

HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on FDR-IAMB (Peña, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input, and returns an estimate of PC(T). It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS(T) of PC(T) (line 1) and a superset SPS(T) of SP(T) (line 2), with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner FDR-IAPC on PCS(T) ∪ SPS(T) (line 3). The key idea is the decentralized search at lines 4–8, which includes in the candidate PC set all variables in the superset PCS(T) \ PC(T) that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Peña, 2008). By loosening the criteria by which two nodes are said to be adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC, as we shall see in the experiments. We proved in Rodrigues de Morais and Aussem (2010a) that the original HPC(T) is consistent, i.e. its output converges in probability to PC(T), if the hypothesis tests are consistent. The proof also applies to the modified version presented here. We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimate of PC(T), hence the term "weak" PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC(T)) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by Peña (2008), which is itself a modification of the Incremental Association Markov Boundary algorithm (IAMB) (Tsamardinos et al., 2003) that controls the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3–6, the spouses SP(T) from the estimated Markov boundary MB(T) output by IAMBFDR at line 1, and returns PC(T) assuming the faithfulness condition. The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC(T) and SP(T) respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: (i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and (ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in Rodrigues de Morais and Aussem (2010a, 2010b). DE-SPS works in two steps. First, a growing phase (lines 4–8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS(T). The

It is proportional to the log-likelihood ratio test G² (they differ by a 2n factor, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to Agresti (2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Rodrigues de Morais & Aussem, 2010a; Spirtes et al., 2000; Tsamardinos & Borboudakis, 2010; Tsamardinos et al., 2006) to overcome some shortcomings of the asymptotic tests. In this study we use the following two heuristics, which are also used in MMHC. First, we do not compute MI(X; Y | Z) and assume independence if there are not enough samples to achieve large enough power. We require that the average sample count per cell be above a user-defined parameter, equal to 5, as in Tsamardinos et al. (2006). This heuristic is called the power rule. Second, we consider as a structural zero either case n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider y_j as a structurally forbidden value for Y when Z = z_k, and we reduce the number of levels of Y by one (as if we had one column less in the contingency table where Z = z_k). This is known as the degrees of freedom adjustment heuristic.
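The test statistic and the two heuristics can be sketched as follows. This is a minimal, illustrative Python version (the paper's implementation is in R): the record layout (a list of dictionaries keyed by variable name), the `hps=5` threshold, and the convention of returning None when the power rule declares the test unreliable are all choices made here for the sketch. It returns the G² statistic with the adjusted degrees of freedom; any χ² tail routine can then produce the p-value.

```python
import math
from collections import Counter

def g2_test(data, x, y, zs, hps=5):
    """G^2 conditional independence statistic with the power rule and the
    degrees-of-freedom adjustment for structural zeros.
    Returns (g2, dof), or None when the power rule applies."""
    n = len(data)
    xlv = sorted({r[x] for r in data})
    ylv = sorted({r[y] for r in data})
    zlv = sorted({tuple(r[z] for z in zs) for r in data})
    R, C, L = len(xlv), len(ylv), len(zlv)
    if n / (R * C * L) < hps:           # power rule: too few samples per cell
        return None
    nijk = Counter((r[x], r[y], tuple(r[z] for z in zs)) for r in data)
    g2, dof = 0.0, 0
    for k in zlv:
        ni = {i: sum(nijk[(i, j, k)] for j in ylv) for i in xlv}
        nj = {j: sum(nijk[(i, j, k)] for i in xlv) for j in ylv}
        nk = sum(ni.values())
        # df adjustment: drop structurally-zero rows/columns in this stratum
        r_eff = sum(1 for i in xlv if ni[i] > 0)
        c_eff = sum(1 for j in ylv if nj[j] > 0)
        dof += max(r_eff - 1, 0) * max(c_eff - 1, 0)
        for i in xlv:
            for j in ylv:
                o = nijk[(i, j, k)]
                if o > 0:
                    g2 += 2.0 * o * math.log(o * nk / (ni[i] * nj[j]))
    return g2, dof
```

Note that G² = 2n · MI(X; Y | Z), matching the 2n factor mentioned above.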

3. The H2PC algorithm

In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move on to H2PC.



shrinking phase (lines 9–16) discards irrelevant variables that are ancestors or descendants of a target's spouse. Pruning such irrelevant variables speeds up HPC.

Algorithm 1. HPC
Require: T: target; D: data set; U: set of variables
Ensure: PC(T): parents and children of T
1: [PCS(T), dSep] ← DE-PCS(T, D, U)
2: SPS(T) ← DE-SPS(T, D, U, PCS(T), dSep)
3: PC(T) ← FDR-IAPC(T, D, (T ∪ PCS(T) ∪ SPS(T)))
4: for all X ∈ PCS(T) \ PC(T) do
5:   if T ∈ FDR-IAPC(X, D, (T ∪ PCS(T) ∪ SPS(T))) then
6:     PC(T) ← PC(T) ∪ X
7:   end if
8: end for
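The decentralized search of Algorithm 1 (lines 4–8) can be sketched in a few lines of Python; here `weak_pc`, `pcs` and `sps` are hypothetical stand-ins for FDR-IAPC and for the outputs of DE-PCS and DE-SPS (the paper's implementation is in R):

```python
def hpc_sketch(target, pcs, sps, weak_pc):
    """OR-rule compensation for the weak PC learner's false negatives:
    keep X whenever the target appears in X's own candidate PC set."""
    candidates = {target} | pcs | sps
    pc = set(weak_pc(target, candidates))          # line 3: first candidate set
    for x in pcs - pc:                             # lines 4-8: decentralized search
        if target in weak_pc(x, candidates):
            pc.add(x)
    return pc
```

With an asymmetric weak learner that misses B from PC(T) but finds T in PC(B), the loop recovers B, which is exactly the compensation effect described above.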

Algorithm 2. FDR-IAPC
Require: T: target; D: data set; U: set of variables
Ensure: PC(T): parents and children of T
{Learn the Markov boundary of T}
1: MB(T) ← IAMBFDR(T, D, U)
{Remove the spouses of T from MB(T)}
2: PC(T) ← MB(T)
3: for all X ∈ MB(T) do
4:   if ∃ Z ⊆ (MB(T) \ X) such that T ⊥ X | Z then
5:     PC(T) ← PC(T) \ X
6:   end if
7: end for

Algorithm 4. DE-SPS
Require: T: target; D: data set; U: set of variables; PCS(T): parents and children superset of T; dSep: d-separating sets
Ensure: SPS(T): superset of the spouses of T
1: SPS(T) ← ∅
2: for all X ∈ PCS(T) do
3:   SPS_X(T) ← ∅
4:   for all Y ∈ U \ {T ∪ PCS(T)} do
5:     if (T ⊥̸ Y | dSep(Y) ∪ X) then
6:       SPS_X(T) ← SPS_X(T) ∪ Y
7:     end if
8:   end for
9:   for all Y ∈ SPS_X(T) do
10:    for all Z ∈ SPS_X(T) \ Y do
11:      if (T ⊥ Y | X ∪ Z) then
12:        SPS_X(T) ← SPS_X(T) \ Y
13:        break loop FOR
14:      end if
15:    end for
16:  end for
17:  SPS(T) ← SPS(T) ∪ SPS_X(T)
18: end for
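The spouse-removal step of Algorithm 2 (lines 3–6) can be sketched as follows; `mb` and `ci` are hypothetical stand-ins for IAMBFDR's estimated Markov boundary and for an (exact or empirical) conditional independence oracle such as the G² test:

```python
from itertools import combinations

def fdr_iapc_sketch(target, mb, ci):
    """Keep only parents and children: X is a spouse iff some subset of
    MB(T) \ {X} d-separates it from the target (assuming faithfulness)."""
    pc = set(mb)
    for x in mb:
        rest = [v for v in mb if v != x]
        if any(ci(target, x, set(z))
               for r in range(len(rest) + 1)
               for z in combinations(rest, r)):
            pc.discard(x)
    return pc
```

On the toy DAG P → T → C ← S, for instance, MB(T) = {P, C, S} and S is discarded because T ⊥ S holds as soon as the common child C is not conditioned on.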

Algorithm 5. Hybrid HPC
Require: D: data set; U: set of variables
Ensure: a DAG G on the variables U
1: for all pairs of nodes X, Y ∈ U do
2:   add X to PC(Y) and Y to PC(X) if X ∈ HPC(Y) and Y ∈ HPC(X)
3: end for
4: starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge and reverse-edge; only try operator add-edge X → Y if Y ∈ PC(X)

Algorithm 3. DE-PCS
Require: T: target; D: data set; U: set of variables
Ensure: PCS(T): parents and children superset of T; dSep: d-separating sets
{Phase I: remove X if T ⊥ X}
1: PCS(T) ← U \ T
2: for all X ∈ PCS(T) do
3:   if (T ⊥ X) then
4:     PCS(T) ← PCS(T) \ X
5:     dSep(X) ← ∅
6:   end if
7: end for
{Phase II: remove X if T ⊥ X | Y}
8: for all X ∈ PCS(T) do
9:   for all Y ∈ PCS(T) \ X do
10:    if (T ⊥ X | Y) then
11:      PCS(T) ← PCS(T) \ X
12:      dSep(X) ← Y
13:      break loop FOR
14:    end if
15:  end for
16: end for

3.2. Hybrid HPC (H2PC)

In this section, we discuss the SS phase. The following discussion draws strongly on Tsamardinos et al. (2006), as the SS phase in Hybrid HPC and MMHC is exactly the same. The idea of constraining the search to improve time-efficiency first appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in efficiency improvements over the (unconstrained) greedy search. All recent hybrid algorithms build on this idea, but employ a sound algorithm for identifying the candidate parent sets. Hybrid HPC first identifies the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BNs. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken, and the search continues recursively in a similar fashion. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the first phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100



Table 1. Description of the BN benchmarks used in the experiments.

Network      # var.  # edges  Max. degree (in/out)  Domain range  PC set size (min/med/max)
Child          20      25        2/7                   2–6           1/2/8
Insurance      27      52        3/7                   2–5           1/3/9
Mildew         35      46        3/3                   3–100         1/2/5
Alarm          37      46        4/5                   2–4           1/2/6
Hailfinder     56      66        4/16                  2–11          1/1.5/17
Munin1        186     273        3/15                  2–21          1/3/15
Pigs          441     592        2/39                  3–3           1/2/41
Link          724    1125        3/14                  2–4           0/2/17

the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:

• the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from Heckerman, Geiger, and Chickering (1995) and Buntine (1991), and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10, as suggested in Koller and Friedman (2009);
• the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;
• the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;
• the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;
• the structural Hamming distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.

For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC, and then we compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50,000 observations. Notice that using the BDeu score as a metric of reconstruction quality has the following two problems.
First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyperparameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold standard network at all; instead it employs several assumptions to score the networks. For these reasons, in addition to the BDeu score we also report the BIC score and the SHD metric.

4.2. Results

In Fig. 1, we report the quality of the skeleton obtained with HPC over that obtained with MMPC (before the SS phase) as a function of the sample size. Results for each benchmark are not shown here in detail due to space restrictions. For the sake of conciseness, the performance values are averaged over the 8 benchmarks described in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC. Note that for some indicators, an increase is actually not an improvement but a degradation (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC over MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected, this benefit comes at little expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect

structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during the search, the algorithm terminates and the overall best-scoring structure is returned. Clearly, the more false positives the heuristic allows to enter the candidate PC set, the greater the computational burden imposed in the SS phase.

4. Experimental validation on synthetic data

Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes, in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network.

We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from Scutari (2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available.2 The threshold considered for the type I error of the tests is 0.05. Our experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470 CPU @ 3.20 GHz and 4 GB RAM running 64-bit Linux. We do not claim that these data sets resemble real-world problems; however, they make it possible to compare the outputs of the algorithms with the known structure. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository.3 Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10,000, 20,000 and 50,000.
All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.

4.1. Performance indicators

We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges), and a combination of precision and recall defined as √((1 − precision)² + (1 − recall)²), which measures the Euclidean distance from perfect precision and recall, as proposed in Peña et al. (2007). Second, to assess the quality of the final DAG output at
2 https://github.com/madbix/bnlearn-clone-3.4
3 http://www.bnlearn.com/bnrepository

M. Gasse et al. / Expert Systems with Applications 41 (2014) 6755–6772


[Fig. 1 panels (plots not reproduced): mean false negative rate and mean false positive rate (lower is better) as a function of the sample size; increase factors (HPC/MMPC) for recall and precision (higher is better), Euclidean distance (lower is better), and number of statistical tests, over sample sizes 50 to 50,000.]
Fig. 1. Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The first 2 figures present mean values aggregated over the 8 benchmarks. The last 4 figures present increase factors of HPC/MMPC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
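The skeleton-quality indicators of Section 4.1 are simple to compute from edge lists. The following is an illustrative Python sketch (the paper's implementation is in R); the edge lists in the example are hypothetical:

```python
# Skeleton indicators: precision, recall, and the Euclidean distance from
# perfect precision and recall (Pena et al., 2007). Skeleton edges are
# undirected, so each edge is compared as an unordered pair.
from math import sqrt

def skeleton_indicators(learned_edges, true_edges):
    learned = {frozenset(e) for e in learned_edges}
    true = {frozenset(e) for e in true_edges}
    tp = len(learned & true)                      # true positive edges
    precision = tp / len(learned) if learned else 1.0
    recall = tp / len(true) if true else 1.0
    distance = sqrt((1 - precision) ** 2 + (1 - recall) ** 2)
    return precision, recall, distance

# toy example: one true edge missed, one spurious edge added
p, r, d = skeleton_indicators(
    learned_edges=[("A", "B"), ("B", "C"), ("C", "D")],
    true_edges=[("A", "B"), ("B", "C"), ("B", "D")])
```

With two of three edges correct, both precision and recall are 2/3 and the distance from perfect precision and recall is √2/3 ≈ 0.47.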

precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time in the CB phase (see the number of statistical tests). It is worth noting that HPC reduces the Euclidean distance by 50% with 50,000 samples (lower left plot). These results are very much in line with other experiments presented in Rodrigues de Morais and Aussem (2010a) and Villanueva and Maciel (2012).

In Fig. 2, we report the quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), it is pretty much a dead heat between the 2 algorithms on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50,000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a computational overhead compared to MMHC. The running time increase factor grows somewhat linearly with the sample size. With 50,000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.

5. Application to multi-label learning

In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned





[Fig. 2 panels (plots not reproduced): increase factors as a function of the sample size for log(BDeu posterior) on train data, BIC on train data, log(BDeu posterior) on test data, BIC on test data (all lower is better), structural Hamming distance (lower is better), and number of scores, over sample sizes 50 to 50,000.]
Fig. 2. Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The 6 figures present increase factors of H2PC/MMHC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
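The SHD metric plotted above follows the PDAG definition given in Section 4.1: every adjacency that is missing, extra, or differently oriented costs one operation. An illustrative Python sketch (not the bnlearn implementation), with a made-up edge encoding:

```python
# Structural Hamming Distance between two PDAGs. Each PDAG is given as a
# set of edges: a tuple ('A', 'B') encodes the directed edge A -> B, and
# frozenset({'A', 'B'}) encodes the undirected edge A - B.
def shd(pdag1, pdag2):
    def normalize(pdag):
        out = {}
        for e in pdag:
            if isinstance(e, frozenset):       # undirected edge
                out[e] = None
            else:                              # directed edge (tail, head)
                out[frozenset(e)] = e
        return out
    g1, g2 = normalize(pdag1), normalize(pdag2)
    # one operation per adjacency present in only one graph, or present
    # in both with a different (non-)orientation
    return sum(1 for pair in set(g1) | set(g2)
               if g1.get(pair, "absent") != g2.get(pair, "absent"))

# A -> B, B - C   versus   A -> B, C -> B, A - C
d = shd({("A", "B"), frozenset({"B", "C"})},
        {("A", "B"), ("C", "B"), frozenset({"A", "C"})})
```

Here A -> B matches, B - C must be oriented into C -> B, and A - C must be added, so the distance is 2.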

simultaneously to multiple binary labels (Dembczyński et al., 2012; Kocev et al., 2007; Madjarov et al., 2012; Read et al., 2009; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0, 1}^n) is the label set, U = X ∪ Y, and p is a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).

5.1. Related work

The MLC problem may be tackled in various ways (Alvares-Cherman, Metz, & Monard, 2011; Blockeel, Raedt, & Ramon, 1998; Kocev et al., 2007; Luaces, Díez, Barranquero, del Coz, & Bahamonde, 2012; Read et al., 2009). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas et al., 2010b; Tsoumakas & Vlahavas, 2007). The two methods can be regarded as opposites, in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem has attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to Dembczyński et al. (2012) and Tsoumakas et al. (2010b) for a review. The second fundamental problem that we wish to




Table 2 Total running time ratio R (H2PC/MMHC). White cells indicate a ratio R < 1 (in favor of H2PC), while shaded cells indicate a ratio R > 1 (in favor of MMHC). The darker, the larger the ratio.

Table 3
Data sets characteristics.

Data set   Domain    |D|    dim(D)   F(D)    L(D)   DL(D)
Emotions   Music      593      72    cont.      6      27
Yeast      Biology   2417     103    cont.     14     198
Image      Images    2000     135    cont.      5      20
Scene      Images    2407     294    cont.      6      15
Slashdot   Text      3782    1079    disc.     22     156
Genbase    Biology    662    1186    disc.     27      32
Medical    Text       978    1449    disc.     45      94
Enron      Text      1702    1001    disc.     53     753
Bibtex     Text      7395    1836    disc.    159    2856
Corel5k    Images    5000     499    disc.    374    3175

address involves finding an optimal feature subset selection of a label set, w.r.t. an Information Theory criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu, Li, & Han, 2011; Spolaôr, Cherman, Monard, & Lee, 2013).

5.2. Label powerset decomposition

We shall first introduce the concept of label powerset, which plays a pivotal role in the factorization of the conditional distribution p(Y | X).

Definition 1. Y_LP ⊆ Y is called a label powerset iff Y_LP ⊥ Y \ Y_LP | X. Additionally, Y_LP is said to be minimal if it is non-empty and has no label powerset as a proper subset.

Let LP denote the set of all label powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors:

p(Y | X) = ∏_{j=1}^{L} p(Y_{LP_j} | X) = ∏_{j=1}^{L} p(Y_{LP_j} | M_{LP_j})

where {Y_{LP_1}, ..., Y_{LP_L}} is a partition of label powersets and M_{LP_j} is a Markov blanket of Y_{LP_j} in X. From the above definition, we have Y_{LP_i} ⊥ Y_{LP_j} | X, ∀ i ≠ j. In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for the subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

max_y p(y | x) = ∏_{j=1}^{L} max_{y_{LP_j}} p(y_{LP_j} | x) = ∏_{j=1}^{L} max_{y_{LP_j}} p(y_{LP_j} | m_{LP_j})

The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_{LP_j} and their Markov boundaries M_{LP_j} from a DAG, in order to estimate the MPE more effectively.

5.3. Label powerset characterization

In this section, we show that minimal label powersets can be identified graphically when p is faithful to a DAG.

Theorem 6. Suppose p is faithful to a DAG G. Then, Y_i and Y_j belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Y_i and Y_j in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Y_p → Z ← Y_q).

The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG.


Theorem 7. Suppose p is faithful to a DAG G. Let Y_LP = {Y_1, Y_2, ..., Y_n} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} {PC_{Y_j} ∪ SP_{Y_j}} \ Y.

The proof is given in the Appendix.

5.4. Experimental setup

The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible label combinations, or those present in the training data. At this point, it should be noted that the LP method is a special case of this framework, since the whole label set is a particular label powerset (though not necessarily minimal). The above procedure can be summarized as follows:

1. Learn the local BN structure G around the label set;

2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e., arg max_y p(y | x).

To assess separately the quality of the minimal label powerset decomposition and the feature subset selection with Markov boundaries, we investigate 4 scenarios:

- BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach, as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.
- BR with feature selection (denoted BR + MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.

Fig. 3. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions and Yeast.

Table 4
Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels and distinct label combinations is recalled in parentheses for convenience.

Dataset     # LPs          # labels/LP min/med/max   # classes/LP min/med/max
H2PC
Emotions    1.0 ± 0.0      6/6/6      (6)            26/27/27    (27)
Yeast       2.0 ± 0.0      2/7/12     (14)           3/64/135    (198)
Image       1.0 ± 0.0      5/5/5      (5)            19/20/20    (20)
Scene       1.0 ± 0.0      6/6/6      (6)            14/15/15    (15)
Slashdot    9.5 ± 0.8      1/1/14     (22)           1/2/111     (156)
Genbase     26.9 ± 0.3     1/1/2      (27)           1/2/2       (32)
Medical     38.6 ± 1.8     1/1/6      (45)           1/2/10      (94)
Enron       24.5 ± 1.9     1/1/30     (53)           1/2/334     (753)
Bibtex      39.6 ± 4.3     1/1/112    (159)          2/2/1588    (2856)
Corel5k     97.7 ± 3.9     1/1/263    (374)          1/2/2749    (3175)
MMHC
Emotions    1.2 ± 0.4      1/6/6      (6)            2/26/27     (27)
Yeast       2.0 ± 0.0      1/7/13     (14)           2/89/186    (198)
Image       1.0 ± 0.0      5/5/5      (5)            19/20/20    (20)
Scene       1.0 ± 0.0      6/6/6      (6)            14/15/15    (15)
Slashdot    15.1 ± 0.6     1/1/9      (22)           1/2/46      (156)
Genbase     27.0 ± 0.0     1/1/1      (27)           1/2/2       (32)
Medical     41.7 ± 1.4     1/1/4      (45)           1/2/7       (94)
Enron       36.8 ± 2.7     1/1/12     (53)           1/2/119     (753)
Bibtex      77.2 ± 3.1     1/1/28     (159)          2/2/233     (2856)
Corel5k     165.3 ± 5.5    1/1/177    (374)          1/2/2294    (3175)

Table 6
DAG learning time (in seconds) and average degree of the label nodes.

            MMHC                        H2PC
Dataset     Time     Label degree       Time        Label degree
Emotions    1.5      2.6 ± 1.1          16.2        4.7 ± 1.8
Yeast       9.0      3.0 ± 1.6          90.9        5.0 ± 2.9
Image       3.1      4.0 ± 0.8          28.4        4.6 ± 1.0
Scene       9.9      3.1 ± 1.0          662.9       6.6 ± 1.8
Slashdot    44.8     2.7 ± 1.4          822.3       4.0 ± 5.2
Genbase     12.5     1.0 ± 0.3          12.4        1.1 ± 0.3
Medical     43.8     1.7 ± 1.1          334.8       2.9 ± 2.4
Enron       199.8    1.2 ± 1.5          47863.0     4.3 ± 5.2
Bibtex      960.8    2.6 ± 1.3          159469.6    6.4 ± 10.3
Corel5k     169.6    1.5 ± 1.4          2513.0      2.9 ± 2.8

Table 5
Global classification accuracies using 4 learning methods (BR, MLP, BR + MB, MLP + MB) based on the DAG obtained with H2PC and MMHC. Best values between H2PC and MMHC are boldfaced.

            BR       MLP               BR + MB           MLP + MB
Dataset              MMHC     H2PC     MMHC     H2PC     MMHC     H2PC
Emotions    0.327    0.371    0.391    0.147    0.266    0.189    0.285
Yeast       0.169    0.273    0.271    0.062    0.155    0.073    0.233
Image       0.342    0.504    0.504    0.227    0.252    0.308    0.360
Scene       0.580    0.733    0.733    0.123    0.361    0.199    0.565
Slashdot    0.355    0.419    0.474    0.274    0.373    0.278    0.439
Genbase     0.069    0.069    0.069    0.974    0.976    0.974    0.976
Medical     0.454    0.499    0.502    0.630    0.651    0.636    0.669
Enron       0.134    0.165    0.180    0.024    0.079    0.079    0.157
Bibtex      0.108    0.115    0.167    0.120    0.133    0.121    0.177
Corel5k     0.002    0.030    0.043    0.000    0.002    0.010    0.034
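The global (subset) accuracy reported in Table 5 counts a prediction as correct only when the whole label set matches exactly. A minimal Python sketch, for reference (not the paper's R code):

```python
# Global accuracy (subset accuracy, complement of the subset 0/1 loss):
# a prediction scores 1 only when every label matches the ground truth.
def global_accuracy(y_true, y_pred):
    """y_true, y_pred: lists of equal-length binary label vectors."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

acc = global_accuracy(
    y_true=[[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]],
    y_pred=[[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
# 2 of the 4 label sets match exactly, so acc = 0.5
```

Note how a prediction that gets 2 of 3 labels right still scores zero, which is what makes the measure so strict.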

5.4.1. Data and performance indicators

A total of 10 multi-label data sets are collected for the experiments in this section; their characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, music, and images. They can be found in the Mulan repository,5 except for image, which comes from Zhou6 (Maron & Ratan, 1998). For a multi-label data set D, we use |D|, dim(D), L(D), and F(D) to denote the number of examples, number of features, number of possible labels, and feature type, respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.

The performance of a multi-label classifier can be assessed by several evaluation measures (Tsoumakas, Katakis, & Vlahavas, 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (an exact match of all labels is required). Note that the global accuracy implicitly takes the label correlations into account. It is therefore a very strict evaluation measure, as it requires an exact match between the predicted and the true set of labels. It was recently proved in Dembczyński et al. (2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g., the subset 0/1 loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.

5.4.2. Results

Table 4 reports the outputs of H2PC and MMHC, and Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figs. 3–7 display graphically the local DAG structures around the labels obtained with H2PC and MMHC, for illustration purposes.

Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, and corel5k. The DAGs displayed in Figs. 3–7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-separation criterion can be used to read them off the graph. Many of the label powersets reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Fig. 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare

- Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.
- Minimal label powerset method with feature selection (denoted MLP + MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in each powerset.

We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size to 100 trees and left the other parameters at their default values.
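The four-step MLP + MB scheme above can be sketched in a few lines of Python. A toy nearest-example classifier stands in for the Random Forest base learner here, and all function names, data, and interfaces are ours, not the paper's:

```python
# MLP + MB meta-learning sketch: one multi-class classifier per minimal
# label powerset, inputs restricted to that powerset's Markov boundary,
# predictions aggregated into one joint label vector (the MPE).
class NearestExample:
    """Toy 1-nearest-neighbour stand-in for the RF base learner."""
    def fit(self, X, y):
        self.X, self.y = X, y
    def predict(self, rows):
        def nearest(r):
            dists = [sum((a - b) ** 2 for a, b in zip(r, x)) for x in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(r) for r in rows]

def train_mlp_mb(powersets, boundaries, X_rows, Y_rows, make_clf):
    models = {}
    for ps in powersets:                         # ps: tuple of label indices
        feats = boundaries[ps]                   # Markov boundary in X
        inputs = [[row[f] for f in feats] for row in X_rows]
        targets = [tuple(y[l] for l in ps) for y in Y_rows]   # label combos
        clf = make_clf()
        clf.fit(inputs, targets)
        models[ps] = (feats, clf)
    return models

def predict_mlp_mb(models, x_row, n_labels):
    y = [0] * n_labels
    for ps, (feats, clf) in models.items():
        combo = clf.predict([[x_row[f] for f in feats]])[0]
        for label, value in zip(ps, combo):
            y[label] = value
    return y

# labels 0 and 1 form one powerset driven by feature 0; label 2 depends
# on feature 1 only
X_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y_rows = [[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
models = train_mlp_mb([(0, 1), (2,)], {(0, 1): [0], (2,): [1]},
                      X_rows, Y_rows, NearestExample)
pred = predict_mlp_mb(models, [1.0, 1.0], n_labels=3)   # -> [1, 1, 1]
```

Because each powerset classifier predicts a whole label combination, the aggregation step recovers the mode of the joint conditional distribution rather than a product of per-label modes.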




5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx



Fig. 4. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image and Scene.

useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is, however, well beyond the scope of this paper to delve deeper into the DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser than those recovered by MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular. On bibtex (resp. enron), the average label degree rose from 2.6 to 6.4 (resp. from 1.2 to 4.3). This result is in nice agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate). Second, Table 4 is very instructive, as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average. In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. Looking at the distribution of the label powerset sizes shows that for the remaining data sets, the MLP approach mostly decomposes the label set in two parts:

one on which it performs BR and the other on which it performs LP. Take for instance enron: it can be seen from Table 4 that there are approximately 23 label singletons and a powerset of 30 labels with H2PC, for a total of 53 labels. The gain in performance with MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition, as BR ignores label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimal label powersets and their relevant features. Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP + MB rose from 20% (with MMHC) to 56% (with H2PC). On yeast, it rose from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced but still in favor of H2PC. A Wilcoxon signed-rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), using MLP + MB. Note, however, that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
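The significance claim can be illustrated with an exact one-sided Wilcoxon signed-rank test on the ten MLP + MB accuracy differences of Table 5 (H2PC minus MMHC). This is a self-contained Python sketch, not the authors' R procedure, and their exact test settings may differ:

```python
# Exact one-sided Wilcoxon signed-rank test: rank the absolute paired
# differences, sum the ranks of the positive ones, and compute the exact
# p-value by enumerating all 2^n sign assignments (n = 10 here).
from itertools import product

def wilcoxon_exact(diffs):
    d = [x for x in diffs if x != 0]                 # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):                            # average ranks on ties
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_pos = sum(r for r, x in zip(ranks, d) if x > 0)
    # exact p-value: P(positive rank sum >= w_pos) under the null
    hits = sum(1 for signs in product([0, 1], repeat=len(d))
               if sum(r for r, s in zip(ranks, signs) if s) >= w_pos)
    return w_pos, hits / 2 ** len(d)

h2pc = [0.285, 0.233, 0.360, 0.565, 0.439, 0.976, 0.669, 0.157, 0.177, 0.034]
mmhc = [0.189, 0.073, 0.308, 0.199, 0.278, 0.974, 0.636, 0.079, 0.121, 0.010]
w, p = wilcoxon_exact([a - b for a, b in zip(h2pc, mmhc)])
```

Since all ten differences favor H2PC, the positive rank sum attains its maximum (W = 55) and the exact one-sided p-value is 1/1024 ≈ 0.001, consistent with the reported p < 0.001.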

Fig. 5. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot and Genbase.
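The statistical comparison reported just below relies on a Wilcoxon signed-rank paired test over per-data-set accuracies. As a rough, hypothetical illustration, the following pure-Python sketch uses the normal approximation without tie or continuity corrections, so its p-values will differ from an exact small-sample test; the accuracy vectors are made up and are not the paper's results.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test, normal approximation (no tie or
    continuity correction). Assumes at least one non-zero difference.
    Returns (W+, two-sided p)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks over ties
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + j + 1) / 2.0       # mean of ranks i+1 .. j
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p

# Hypothetical paired accuracies on six data sets (made-up numbers).
with_fs = [0.62, 0.71, 0.55, 0.98, 0.47, 0.66]
without_fs = [0.65, 0.70, 0.58, 0.07, 0.50, 0.68]
w, p = wilcoxon_signed_rank(with_fs, without_fs)
print(f"W+ = {w}, two-sided p = {p:.3f}")
```

In practice one would use an exact implementation such as `wilcox.test` in R or `scipy.stats.wilcoxon`.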

Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, using the selected features as inputs to the classification model is not shown to be greatly beneficial in terms of global accuracy on average. BR and our MLP with all the features outperform their counterparts with the selected features on 6 data sets, while the feature selection leads to actual improvements on 3 data sets. The difference in accuracy with and without the feature selection was not shown to be statistically significant (p > 0.20 with a Wilcoxon signed-rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set, the increase in accuracy is the most impressive: it rose from 7% to 98%, which is atypical. This dramatic increase is due solely to the restricted feature set used as input to the classification model. The same is observed on medical, though to a lesser extent. Interestingly, on large and densely connected networks (e.g. bibtex,

slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noteworthy. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should be evaluated in view of its effectiveness at balancing the increase in error against the decrease in computational burden obtained by drastically reducing the feature space. We should also keep in mind that feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on the Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model. As far as the overall running time is concerned, we see from Table 6 that, for both methods, the running time grows somewhat exponentially with the size of the Markov boundary




Fig. 6. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Medical and Enron.
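The running-time discussion in this section hinges on the size of the Markov boundary of each label. Under the faithfulness assumption, the Markov boundary of a node can be read directly off the learned DAG as the union of its parents, children and spouses. The following minimal Python sketch illustrates that read-off on a toy DAG; it is not the authors' R implementation.

```python
# Illustrative sketch only: the Markov boundary of a target node in a DAG is
# the union of its parents, its children, and its children's other parents
# (spouses). Under faithfulness this set shields the target from all other
# variables.
def markov_boundary(dag, target):
    """dag maps each node to the set of its parents."""
    parents = set(dag.get(target, ()))
    children = {n for n, ps in dag.items() if target in ps}
    spouses = {p for c in children for p in dag[c]} - {target}
    return parents | children | spouses

# Toy DAG with edges A -> T, T -> C, B -> C (B is a spouse of T).
dag = {"A": set(), "B": set(), "T": {"A"}, "C": {"T", "B"}}
print(sorted(markov_boundary(dag, "T")))  # -> ['A', 'B', 'C']
```

For each label, restricting the classifier's inputs to this set is the kind of Markov-boundary feature selection evaluated in the experiments above.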

and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is considerable but still affordable (44 h of single-CPU time on bibtex and 13 h on enron). We also observe that the size of the parent set found by H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.

6. Discussion & practical applications

Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating

on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in applying BN structure discovery approaches has been a lack of suitable computing resources and relevant accessible software. The number of variables which can be included in exact BN analysis is still limited: as a guide, this might be fewer than about 40 variables for exact structural search techniques (Kojima et al., 2010; Perrier et al., 2008). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a personal computer, while maintaining a very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in Tsamardinos et al. (2006) and Villanueva and Maciel (2014). The advantages in terms of structure accuracy and the ability to scale to thousands of variables open up many avenues for future applications of H2PC in various domains, as we shall



discuss next. For example, BNs have proven especially useful abstractions in computational biology (Aussem et al., 2012; Aussem, Tchernof, Rodrigues de Morais, & Rome, 2010; Nagarajan, Scutari, & Lèbre, 2013; Peña, Björkegren, & Tegnér, 2005; Peña, 2008; Prestat et al., 2013; Scutari & Nagarajan, 2013). Identifying the gene network is crucial for understanding the behavior of the cell, which, in turn, can lead to better diagnosis and treatment of diseases. It is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or the development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou, 2006). Learning BN models of gene networks from these huge data sets is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman, Nachman, & Pe'er, 1999a; Ott, Imoto, & Miyano, 2004; Pe'er, Regev, Elidan, & Friedman, 2001). In these studies, the authors had to decide in advance which genes were included in the learning process (fewer than 1000 in all cases) and which genes were excluded from it (Peña, 2008). H2PC overcomes this problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Peña et al., 2005).

Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery for multi-label learning. In multi-label data, where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies, and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood and Markov blanket induction is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek, Worring, Gemert, Geusebroek, & Smeulders, 2006; Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou, 2006),



Cluster19 Cluster393 balcony Cluster100 Cluster154



Cluster192 Cluster491

Cluster49 Cluster93 Cluster56 Cluster235 sailboats church ceremony lake petals Cluster362 flowers






Cluster55 Cluster166


Cluster442 sheep beach canyon water Cluster308 palm grass zebra mist dall Cluster62 stems herd railing


Cluster366 Cluster12

cow Cluster37 Cluster295



Cluster148 Cluster267 chapel Cluster266 Cluster278 sign Cluster369 Cluster291 Cluster481 reefs Cluster60 fish Cluster287 Cluster276 Cluster331 Cluster91 Cluster381 Cluster499 Cluster463 lion Cluster175 Cluster395 Cluster476 Cluster454 Cluster329 Cluster320 writing

tree Cluster390 reflection

jet carvings plane doorway basket wood stairs penguin



Cluster403 rose Cluster464 clearing

flowers truck bengal moose Cluster179 Cluster242 antlers Cluster121 Cluster262 Cluster180 herd Cluster325 kauai tiger bulls meadow Cluster490 lynx Cluster241 caribou Cluster284 rockface hillside tulip row Cluster52 Cluster256 Cluster232 cubs Cluster138 field Cluster74 grizzly Cluster301


Cluster225 Cluster458


Cluster152 leaf Cluster165 river people rocks boats Cluster195 helicopter jeep shops plants bay baby mule run close?up cathedral fruit crops


prop seals pups canal Cluster214 Cluster256 cubs deer ceiling furniture Cluster313 white?tailed vegetation head bush snake fawn bench



Cluster77 storm



pepper pool



Cluster342 Cluster211 Cluster46
slope goat lynx Cluster82



harbor Cluster317 man woman turn butterfly insect reptile Cluster383 lizard island swimmers log Cluster205 Cluster351 tent cat tiger vegetables crab

market Cluster21

Cluster357 Cluster470






guard clothes




cheese street indian tables

Cluster9 Cluster399 path chapel



hawaii Cluster137


nets Cluster56

village reflection train arch

Cluster422 town skyline smoke market locomotive ocean store Cluster450 cliff

truck gate Cluster482 Cluster173 restaurant fish plaza windmills shore


stream Cluster493 Cluster209


Cluster286 school monument hawk terrace



Cluster358 elephant Cluster82


crystals eagle cars kauai Cluster403 Cluster230 Cluster59 Cluster356




trunk Cluster417 Cluster63

zebra Cluster377

cheese african waves Cluster22 coast

pole monks Cluster110 horns

pattern texture Cluster217 Cluster425

fog Cluster409 Cluster305 designs


railroad shops



marsh festival ships Cluster353 Cluster326 architecture

Cluster292 Cluster152 Cluster227 Cluster233

waterfalls Cluster441 Cluster16 Cluster488 mist rainbow


tracks park storm peaks giant tortoise fence iguana marine poppies

Cluster280 Cluster111

festival Cluster123 Cluster387 crab

buildings lighthouse architecture








Cluster61 tusks room hawk Cluster230 Cluster191 model floor lichen Cluster343 Cluster322 Cluster76 Cluster270 Cluster332 Cluster137 Cluster214 animals shrubs foals Cluster419 mare Cluster356 pebbles crafts Cluster75 carpet Cluster157 Cluster252 Cluster169 Cluster309 Cluster8

courtyard ceremony Cluster316 sea

Cluster197 Cluster246 mural food cougar hotel

buildings Cluster1 hotel angelfish rockface road moon



prototype cliff







ladder steps model carpet vines cottage

Cluster81 Cluster485 Cluster124

Cluster88 palace city light Cluster471 fountain

skyline wolf whales animals rapids spider needles

star calf fan post display leopard









Cluster101 Cluster244

remains waterfalls





Cluster39 night church statue lake Cluster380 tables Cluster143 Cluster361 Cluster249 bay shrine Cluster120 restaurant Cluster414 Cluster238 chairs facade remains railing detail iguana marine




Cluster406 moss Cluster313 Cluster200

Cluster364 Cluster461 Cluster149 Cluster234 Cluster163 rabbit Cluster370 marsh Cluster345 Cluster208 Cluster193 Cluster48 Cluster43 Cluster258 Cluster277 Cluster259 Cluster375 Cluster348 Cluster440 Cluster72 Cluster204 Cluster462 pillar Cluster220 Cluster79 Cluster114 sunset figures Cluster318 temple Cluster310 Cluster23 pair Cluster444 Cluster384 sphinx Cluster20 Cluster68 Cluster186 Cluster150 land

house Cluster145

Cluster108 Cluster126 Cluster40 Cluster455 Cluster70 shirt Cluster85 Cluster3


Cluster57 sculpture




Cluster113 buddha


window Cluster201


Cluster281 Cluster104 dog mule



Cluster439 outside silhouette Cluster34 horizon buddhist


Cluster184 Cluster29 Cluster30 Cluster354 Cluster49

Cluster206 Cluster196


white?tailed Cluster379 Cluster223




Cluster400 Cluster382





sunrise pagoda

Cluster97 monastery



Fig. 7. The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.


M. Gasse et al. / Expert Systems with Applications 41 (2014) 6755–6772

protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani, Bielza, Toro, & Larrañaga, 2013).

7. Conclusion & avenues for future research

We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in Perrier et al. (2008) and Kojima et al. (2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4.

Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution, along with their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles applied to the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: (i) the minimal label powerset decomposition, i.e. the decomposition into minimal subsets $Y_{LP} \subseteq Y$ such that $Y_{LP} \perp Y \setminus Y_{LP} \mid X$, and (ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness.
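The decomposition just described turns each minimal label powerset into a single multi-class target: every distinct combination of the labels in the powerset becomes one class. A minimal sketch of this label powerset encoding follows; it is illustrative only, and the function names are ours, not taken from the released R code.

```python
from collections import OrderedDict

def encode_powerset(Y_rows, label_idx):
    """Map each distinct combination of the labels indexed by
    `label_idx` to a single multi-class value (label powerset encoding).
    Returns the encoded column and the combination -> class mapping."""
    classes = OrderedDict()  # tuple of label values -> class id
    encoded = []
    for row in Y_rows:
        key = tuple(row[i] for i in label_idx)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes

def decode_powerset(encoded, classes):
    """Inverse mapping: class id -> tuple of label values."""
    inv = {c: key for key, c in classes.items()}
    return [inv[c] for c in encoded]

# Toy example: 4 instances, 3 labels; powerset over labels {0, 2}.
Y = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [1, 0, 1]]
enc, classes = encode_powerset(Y, [0, 2])
# enc == [0, 1, 0, 2]: three distinct combinations were observed.
```

At prediction time, each multi-class output is decoded back into the corresponding label subset before the per-powerset predictions are concatenated into the full label vector.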
H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy.

We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: (1) ascertaining which independence test (e.g. tests targeting specific distributions, employing parametric assumptions, etc.) is most suited to the data at hand (Scutari, 2011; Tsamardinos & Borboudakis, 2010); (2) controlling the false discovery rate of the edges in the graph output by H2PC (Peña, 2008), especially when dealing with more nodes than samples, e.g. learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications to the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and to derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions on the joint distribution (e.g. Composition, Intersection). This needs further substantiation through more analysis.

Acknowledgments

The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European

ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhône-Alpes Complex Systems Institute (IXXI).

Appendix A

Lemma 8. $\forall\, Y_{LP_i}, Y_{LP_j} \in \mathrm{LP}$, we have $Y_{LP_i} \cap Y_{LP_j} \in \mathrm{LP}$.
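For readability, the chain of conditional-independence statements used in the proof of Lemma 8 (with the partition $\{Y_1, Y_2, Y_3, Y_4\}$ of $Y$ such that $Y_{LP_i} = Y_1 \cup Y_2$ and $Y_{LP_j} = Y_2 \cup Y_3$) can be displayed step by step:

```latex
\begin{align*}
(Y_1 \cup Y_2) &\perp (Y_3 \cup Y_4) \mid X
  && \text{(label powerset assumption for } Y_{LP_i}\text{)}\\
(Y_2 \cup Y_3) &\perp (Y_1 \cup Y_4) \mid X
  && \text{(label powerset assumption for } Y_{LP_j}\text{)}\\
Y_2 &\perp (Y_3 \cup Y_4) \mid X
  && \text{(Decomposition, from the first line)}\\
Y_2 &\perp Y_1 \mid (Y_3 \cup Y_4 \cup X)
  && \text{(Weak union, applied twice to the second line)}\\
Y_2 &\perp (Y_1 \cup Y_3 \cup Y_4) \mid X
  && \text{(Contraction of the last two lines)}
\end{align*}
```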

Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition $\{Y_1, Y_2, Y_3, Y_4\}$ of $Y$ such that $Y_{LP_i} = Y_1 \cup Y_2$ and $Y_{LP_j} = Y_2 \cup Y_3$. From the label powerset assumption for $Y_{LP_i}$ and $Y_{LP_j}$, we have $(Y_1 \cup Y_2) \perp (Y_3 \cup Y_4) \mid X$ and $(Y_2 \cup Y_3) \perp (Y_1 \cup Y_4) \mid X$. From the Decomposition property, we have that $Y_2 \perp (Y_3 \cup Y_4) \mid X$. Using the Weak union property, we obtain that $Y_2 \perp Y_1 \mid (Y_3 \cup Y_4 \cup X)$. From these two facts, we can use the Contraction property to show that $Y_2 \perp (Y_1 \cup Y_3 \cup Y_4) \mid X$. Therefore, $Y_2 = Y_{LP_i} \cap Y_{LP_j}$ is a label powerset by definition. □

Lemma 9. Let $Y_i$ and $Y_j$ denote two distinct labels in $Y$, and denote by $Y_{LP_i}$ and $Y_{LP_j}$ their respective minimal label powersets. Then we have

$$\exists\, Z \subseteq Y \setminus \{Y_i, Y_j\},\ \{Y_i\} \not\perp \{Y_j\} \mid (X \cup Z) \;\Rightarrow\; Y_{LP_i} = Y_{LP_j}$$

Proof. Let us suppose $Y_{LP_i} \neq Y_{LP_j}$. By the label powerset definition for $Y_{LP_i}$, we have $Y_{LP_i} \perp Y \setminus Y_{LP_i} \mid X$. As $Y_{LP_i} \cap Y_{LP_j} = \varnothing$ owing to Lemma 8, we have that $Y_j \notin Y_{LP_i}$. For all $Z \subseteq Y \setminus \{Y_i, Y_j\}$, $Z$ can be decomposed as $Z_i \cup Z_j$ such that $Z_i = Z \cap (Y_{LP_i} \setminus \{Y_i\})$ and $Z_j = Z \setminus Z_i$. Using the Decomposition property, we have that $(\{Y_i\} \cup Z_i) \perp (\{Y_j\} \cup Z_j) \mid X$. Using the Weak union property, we have that $\{Y_i\} \perp \{Y_j\} \mid (X \cup Z)$. As this is true for all $Z \subseteq Y \setminus \{Y_i, Y_j\}$, we conclude $\nexists\, Z \subseteq Y \setminus \{Y_i, Y_j\},\ \{Y_i\} \not\perp \{Y_j\} \mid (X \cup Z)$, which completes the proof. □

Lemma 10. Consider $Y_{LP}$ a minimal label powerset. Then, if $p$ satisfies the Composition property, there exists no nonempty $Z \subset Y_{LP}$ such that $Z \perp Y_{LP} \setminus Z \mid X$.

Proof. By contradiction, suppose a nonempty $Z$ exists such that $Z \perp Y_{LP} \setminus Z \mid X$. From the label powerset assumption of $Y_{LP}$, we have that $Y_{LP} \perp Y \setminus Y_{LP} \mid X$. From these two facts, we can use the Composition property to show that $Z \perp Y \setminus Z \mid X$, which contradicts the minimality assumption of $Y_{LP}$. This concludes the proof. □

Theorem 6. Suppose $p$ is faithful to a DAG $G$. Then, $Y_i$ and $Y_j$ belong to the same minimal label powerset if and only if there exists an undirected path in $G$ between nodes $Y_i$ and $Y_j$ in $Y$ such that all intermediate nodes $Z$ are either (i) $Z \in Y$, or (ii) $Z \in X$ and $Z$ has two parents in $Y$ (i.e. a collider of the form $Y_p \rightarrow X \leftarrow Y_q$).

Proof. Suppose such a path exists. By conditioning on all the intermediate colliders $Y_k$ in $Y$ of the form $Y_p \rightarrow Y_k \leftarrow Y_q$ along an undirected path in $G$ between nodes $Y_i$ and $Y_j$, we ensure that $\exists\, Z \subseteq Y \setminus \{Y_i, Y_j\}$ such that $\neg\,\mathrm{dSep}(Y_i, Y_j \mid X \cup Z)$. Due to the faithfulness, this is equivalent to $\{Y_i\} \not\perp \{Y_j\} \mid (X \cup Z)$. From Lemma 9, we conclude that $Y_i$ and $Y_j$ belong to the same minimal label powerset. To show the converse, note that owing to Lemma 10, there exists no partition $\{Z_i, Z_j\}$ of $Z$ such that $(\{Y_i\} \cup Z_i) \perp (\{Y_j\} \cup Z_j) \mid X$. Due to the faithfulness, there exists at least one link between $\{Y_i\} \cup Z_i$ and $\{Y_j\} \cup Z_j$ in the DAG; hence, by recursion, there exists a path between $Y_i$ and $Y_j$ such that all intermediate nodes $Z$ are either (i) $Z \in Y$, or (ii) $Z \in X$ and $Z$ has two parents in $Y$ (i.e. a collider of the form $Y_p \rightarrow X \leftarrow Y_q$). □
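Theorem 6 suggests a simple graphical procedure for recovering the minimal label powersets from a learned DAG: link two labels whenever they are adjacent in $G$ or share a common child in $X$, then take connected components of the resulting auxiliary graph over $Y$. The sketch below is ours, not the paper's released code; it assumes the DAG is given as a mapping from each node to its set of parents.

```python
def minimal_label_powersets(parents, labels):
    """Group labels into minimal label powersets via the graphical
    criterion of Theorem 6 (illustrative sketch).

    `parents` maps every node of the DAG to the set of its parents;
    `labels` is the set of label nodes Y (all other nodes belong to X).
    """
    # Two labels are linked if adjacent in the DAG, or if they share a
    # common child in X (a collider of the form Y_p -> X <- Y_q).
    adj = {y: set() for y in labels}
    for node, pa in parents.items():
        pa_labels = pa & labels
        if node in labels:
            for p in pa_labels:          # direct Y-Y edges
                adj[node].add(p)
                adj[p].add(node)
        else:
            for p in pa_labels:          # X-collider with two label parents
                for q in pa_labels:
                    if p != q:
                        adj[p].add(q)
    # Connected components of the auxiliary label graph.
    components, seen = [], set()
    for y in labels:
        if y in seen:
            continue
        comp, stack = set(), [y]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return components
```

For example, with the DAG Y1 → X1 ← Y2, Y3 → Y4 and an isolated label Y5, the procedure returns the three minimal label powersets {Y1, Y2}, {Y3, Y4} and {Y5}.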



Theorem 7. Suppose $p$ is faithful to a DAG $G$. Let $Y_{LP} = \{Y_1, Y_2, \ldots, Y_n\}$ be a label powerset. Then, its Markov boundary $M$ in $U$ is also its Markov boundary in $X$, and is given in $G$ by $M = \bigcup_{j=1}^{n} \{PC_{Y_j} \cup SP_{Y_j}\} \setminus Y$.

Proof. First, we prove that $M$ is a Markov boundary of $Y_{LP}$ in $U$. Define $M_i$ the Markov boundary of $Y_i$ in $U$, and $M'_i = M_i \setminus Y$. From Theorem 5, $M_i$ is given in $G$ by $M_i = PC_{Y_i} \cup SP_{Y_i}$. We may now prove that $M'_1 \cup \cdots \cup M'_n$ is a Markov boundary of $\{Y_1, Y_2, \ldots, Y_n\}$ in $U$. We show first that the statement holds for $n = 2$ and then conclude that it holds for all $n$ by induction. Let $W$ denote $U \setminus (\mathbf{Y}_1 \cup \mathbf{Y}_2 \cup M'_1 \cup M'_2)$, and define $\mathbf{Y}_1 = \{Y_1\}$ and $\mathbf{Y}_2 = \{Y_2\}$. From the Markov blanket assumption for $\mathbf{Y}_1$ we have $\mathbf{Y}_1 \perp U \setminus (\mathbf{Y}_1 \cup M_1) \mid M_1$. Using the Weak union property, we obtain that $\mathbf{Y}_1 \perp W \mid M'_1 \cup M'_2 \cup \mathbf{Y}_2$. Similarly, we can derive $\mathbf{Y}_2 \perp W \mid M'_2 \cup M'_1 \cup \mathbf{Y}_1$. Combining these two statements yields $\mathbf{Y}_1 \cup \mathbf{Y}_2 \perp W \mid M'_1 \cup M'_2$ due to the Intersection property. Let $M = M'_1 \cup M'_2$; the last expression can be formulated as $\mathbf{Y}_1 \cup \mathbf{Y}_2 \perp U \setminus (\mathbf{Y}_1 \cup \mathbf{Y}_2 \cup M) \mid M$, which is the definition of a Markov blanket of $\mathbf{Y}_1 \cup \mathbf{Y}_2$ in $U$. We shall now prove that $M$ is minimal. Let us suppose that it is not the case, i.e., $\exists\, Z \subset M$ such that $\mathbf{Y}_1 \cup \mathbf{Y}_2 \perp Z \cup (U \setminus (\mathbf{Y}_1 \cup \mathbf{Y}_2 \cup M)) \mid M \setminus Z$. Define $Z_1 = Z \cap M_1$ and $Z_2 = Z \cap M_2$; we may apply the Weak union property to get $\mathbf{Y}_1 \perp Z_1 \mid (M \setminus Z) \cup \mathbf{Y}_2 \cup Z_2 \cup (U \setminus (Y \cup M))$, which can be rewritten more compactly as $\mathbf{Y}_1 \perp Z_1 \mid (M_1 \setminus Z_1) \cup (U \setminus (\mathbf{Y}_1 \cup M_1))$. From the Markov blanket assumption, we have $\mathbf{Y}_1 \perp U \setminus (\mathbf{Y}_1 \cup M_1) \mid M_1$. We may now apply the Intersection property on these two statements to obtain $\mathbf{Y}_1 \perp Z_1 \cup (U \setminus (\mathbf{Y}_1 \cup M_1)) \mid M_1 \setminus Z_1$. Similarly, we can derive $\mathbf{Y}_2 \perp Z_2 \cup (U \setminus (\mathbf{Y}_2 \cup M_2)) \mid M_2 \setminus Z_2$. From the Markov boundary assumption of $M_1$ and $M_2$, we have necessarily $Z_1 = \varnothing$ and $Z_2 = \varnothing$, which in turn yields $Z = \varnothing$. To conclude for any $n > 2$, it suffices to set $\mathbf{Y}_1 = \bigcup_{j=1}^{n-1} \{Y_j\}$ and $\mathbf{Y}_2 = \{Y_n\}$ and conclude by induction. Second, we prove that $M$ is a Markov blanket of $Y_{LP}$ in $X$.
Define $Z = M \cap Y$. From the label powerset definition, we have $Y_{LP} \perp Y \setminus Y_{LP} \mid X$. Using the Weak union property, we obtain $Y_{LP} \perp Z \mid U \setminus (Y_{LP} \cup Z)$, which can be reformulated as $Y_{LP} \perp Z \mid (U \setminus (Y_{LP} \cup M)) \cup (M \setminus Z)$. Now, the Markov blanket assumption for $Y_{LP}$ in $U$ yields $Y_{LP} \perp U \setminus (Y_{LP} \cup M) \mid M$, which can be rewritten as $Y_{LP} \perp U \setminus (Y_{LP} \cup M) \mid (M \setminus Z) \cup Z$. From the Intersection property, we get $Y_{LP} \perp Z \cup (U \setminus (Y_{LP} \cup M)) \mid M \setminus Z$. From the Markov boundary assumption of $M$ in $U$, we know that there exists no proper subset of $M$ which satisfies this statement, and therefore $Z = M \cap Y = \varnothing$. From the Markov blanket assumption of $M$ in $U$, we have $Y_{LP} \perp U \setminus (Y_{LP} \cup M) \mid M$. Using the Decomposition property, we obtain $Y_{LP} \perp X \setminus M \mid M$, which, together with the assumption $M \cap Y = \varnothing$, is the definition of a Markov blanket of $Y_{LP}$ in $X$. Finally, we prove that $M$ is a Markov boundary of $Y_{LP}$ in $X$. Let us suppose that it is not the case, i.e., $\exists\, Z \subset M$ such that $Y_{LP} \perp Z \cup (X \setminus M) \mid M \setminus Z$. From the label powerset assumption of $Y_{LP}$, we have $Y_{LP} \perp Y \setminus Y_{LP} \mid X$, which can be rewritten as $Y_{LP} \perp Y \setminus Y_{LP} \mid (X \setminus M) \cup (M \setminus Z) \cup Z$. Due to the Contraction property, combining these two statements yields $Y_{LP} \perp Z \cup (U \setminus (M \cup Y_{LP})) \mid M \setminus Z$. From the Markov boundary assumption of $M$ in $U$, we have necessarily $Z = \varnothing$, which suffices to prove that $M$ is a Markov boundary of $Y_{LP}$ in $X$. □

References
Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley.
Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and Markov blanket induction for causal discovery and feature selection for classification, part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, JMLR, 11, 171–234.
Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39, 1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to estimation and control of the false discovery rate in Bayesian network skeleton identification. In European symposium on artificial neural networks, ESANN.

Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis of nasopharyngeal carcinoma risk factors with Bayesian networks. Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S. (2010). Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence in large probabilistic networks: A constraint-based approach. In Proceedings of the sixteenth European conference on artificial intelligence (pp. 263–267).
Bernard, A., & Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Proceedings of the Pacific symposium on biocomputing (pp. 459–470).
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), International conference on machine learning, ICML (pp. 55–63). Morgan Kaufman.
Borchani, H., Bielza, C., Toro, C., & Larrañaga, P. (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predictions under manipulation. Journal of Machine Learning Research, JMLR, 3, 35–52.
Buntine, W. (1991). Theory refinement on Bayesian networks. In B. D'Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in artificial intelligence, UAI (pp. 52–60). San Mateo, CA, USA: Morgan Kaufman.
Cawley, G. (2008). Causal and non-causal feature selection for ridge regression. JMLR: Workshop and Conference Proceedings, 3, 107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, JMLR, 5, 1287–1330.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, JMLR, 3, 507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian network learning using integer programming. In Uncertainty in artificial intelligence (pp. 182–191).
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103, 778–789.
Friedman, N., Nachman, I., & Pe'er, D. (1999). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence (pp. 206–215).
Friedman, N., Nachman, I., & Pe'er, D. (1999b). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In K. B. Laskey & H. Prade (Eds.), Uncertainty in artificial intelligence, UAI (pp. 21–30). Morgan Kaufmann Publishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybrid algorithms for Bayesian network structure learning. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD. Lecture notes in computer science (Vol. 7523). Springer.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp. 1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditional dependency networks. In International joint conference on artificial intelligence, IJCAI (pp. 1300–1305). AAAI Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In J. N. Kok (Ed.), European conference on machine learning, ECML. Lecture notes in artificial intelligence (Vol. 4701, pp. 624–631). Springer.
Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, JMLR, 5, 549–573.
Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, JMLR, 11, 285–310.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In International conference on machine learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., & Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-instance learning for natural scene classification. In International conference on machine learning, ICML (Vol. 7, pp. 341–349). Citeseer.
McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI workshop on text learning.
Moore, A., & Wong, W. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In T. Fawcett & N. Mishra (Eds.), International conference on machine learning, ICML.


Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian networks in R: With applications in systems biology (Use R!). Springer.
Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Pearson Prentice Hall.
Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for small gene networks. In Proceedings of the Pacific symposium on biocomputing (pp. 557–567).
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA, USA: Morgan Kaufman.
Pe'er, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, 215–224.
Peña, J., Nilsson, R., Björkegren, J., & Tegnér, J. (2007). Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45, 211–232.
Peña, J. M., Björkegren, J., & Tegnér, J. (2005). Growing Bayesian network models of gene networks from seed genes. Bioinformatics, 40, 224–229.
Peña, J. M. (2008). Learning Gaussian graphical models of gene networks with false discovery rate control. In European conference on evolutionary computation, machine learning and data mining in bioinformatics (Vol. 6, pp. 165–176).
Peña, J. M. (2012). Finding consensus Bayesian network structures. Journal of Artificial Intelligence Research, 42, 661–687.
Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, JMLR, 9, 2251–2286.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier, C., Cohen, P., et al. (2013). Learning the local Bayesian network structure around the ZNF217 oncogene in breast tumours. Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org>.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD. Lecture notes in computer science (Vol. 5782, pp. 254–269). Springer.
Rodrigues de Morais, S., & Aussem, A. (2010). An efficient learning algorithm for local Bayesian network structure discovery. In European conference on machine learning and principles and practice of knowledge discovery in databases, ECML-PKDD (pp. 164–169).
Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.
Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12.
Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.
Scutari, M. (2011). Measures of variability for graphical models (Ph.D. thesis). School in Statistical Sciences, University of Padova.
Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics – Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.
Silander, T., & Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Uncertainty in artificial intelligence, UAI (pp. 445–452).
Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM international conference on multimedia (pp. 421–430). ACM Press.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). The MIT Press.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Studený, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).
Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida artificial intelligence research society conference, FLAIRS'03 (pp. 376–381).
Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European conference on machine learning and knowledge discovery in databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The max–min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65, 31–78.
Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI conference on artificial intelligence (pp. 1100–1105).
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining multi-label data. Transformation, 135, 1–20.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European conference on machine learning (Vol. 4701, pp. 406–417).
Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International conference on pattern recognition applications and methods, ICPRAM.
Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, 3–12.
Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In ACM SIGKDD international conference on knowledge discovery and data mining, KDD (pp. 999–1008). ACM Press.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.
