当前位置：首页 > news >正文

[ICML 2024]Learning High-Order Relationships of Brain Regions

news 2025/7/13 17:00:26

论文网址：学习大脑区域的高阶关系 |打开评论

论文代码：GitHub - Graph-and-Geometric-Learning/HyBRiD: Official implementation of paper "Learning High-Order Relationships of Brain Regions" [ICML 2024]

英文是纯手打的！论文原文的summarizing and paraphrasing。可能会出现难以避免的拼写错误和语法错误，若有发现欢迎评论指正！文章偏向于笔记，谨慎食用

1. 心得

2. 论文逐段精读

2.1. Abstract

2.2. Introduction

2.3. Problem Definition & Notations

2.4. Related Work

2.5. Method

2.5.1. Learning the Hypergraph by Multi-head Masking

2.5.2. Optimization Framework

2.6. Experiments

2.6.1. Predictive Performance of Hypereges

2.6.2. Further Analysis

2.7. Conclusion

3. Reference

1. 心得

（1）oops哥们儿小标题太多了

（2）插播一个翻译笑话，因此告诫大家平时要多看英文喔：

（3）换了一个好少见的指标评估方式

2. 论文逐段精读

2.1. Abstract

①Existing problems: ignores the focus on high order relationship but only consider pairwise connection

②They reckon that the high order relations are maximally informative and minimally redundant (MIMR), so they proposed HYBRID

2.2. Introduction

①Hypergraph:

②Existing information bottleneck (IB) methods in MIMR prefer to extract the compressed representations of input rather than identifying the underlying structure of hyper graph

③They proposed Hypergraph of Brain Regions via multi-head Drop-bottleneck (HYBRID) and the pipeline shows:

2.3. Problem Definition & Notations

（1）Input

①The subject's feature and its phenotype can be paired as $\left ( X,Y \right )$ , where $X \in \mathbb{R}^{N \times d}$ denotes features come from Pearson correlation, $N$ is the number of brain region, $d$ denotes the dimension of features, $Y \in \mathbb{R}$ denotes phenotype

（2）Goal

①HYBRID identified the hyper graph set $H=\left ( \boldsymbol{h}^1,\boldsymbol{h}^2,...,\boldsymbol{h}^k \right )$ and assigned weights for them $\boldsymbol{w}=\left [ w^1,w^2,...,w^k \right ]^\top$

（3）Representation of Hyperedges

①They represent hyper edges by:

$\boldsymbol{h}^k=\boldsymbol{m}^k \odot X \in \mathbb{R}^{N \times d}$

where $\boldsymbol{m}^k \in \left \{ 0,1 \right \}^N$ denotes mask vector, $\odot$ denotes broad-casting element wise multiplication

②“换句话说，每个 $\boldsymbol{h}^k$ 都是 $X$ 的随机归零版本。”（在方法部分解释了具体的）

2.4. Related Work

（1）Hypergraph Construction

①They think methods like $k$ nearest neighbors or regulatization caused different number of hyperedges in different instances

②Clustering or iteration methods cannot learn MIMR hyperedges

（2）High-Order Relationships in fMRI

①Relevant methods describe the synergy and redundancy instead of measure the most informative methods for cognition score

（3）Information Bottleneck

①HYBRID prefers to identify the underlying structures

（4）Connectivity-based Phenotypic Prediction

①GNN relies on pairwise connections

2.5. Method

（1）Method Overview

①The pipeline of HYBRID:

$X\underset{\mathcal{F}_c}{\rightarrow}H\underset{\mathcal{F}_w}{\rightarrow}\boldsymbol{w}\underset{\mathcal{F}_t}{\rightarrow}Y$

这个是之前说过的，就是 $\mathcal{F}_c$ 是CONSTRUCTOR，构建出来超图，然后 $\mathcal{F}_w$ 是WEIGHTER给每个超边权重， $\mathcal{F}_t$ 是LINEARHEAD拿来预测label $Y$ 的

2.5.1. Learning the Hypergraph by Multi-head Masking

①Instance can be represented as $X=\left [ X_1,X_2,...,X_N \right ]^\top \in \mathbb{R}^{N \times d}$

（1）Hyperedges Construction

①Task function: $H=\mathcal{F}_c\left ( X \right )$

②The predefined number of hyperedges: $K$

③Assign a head to each hyperedge and construct hyper edges:

where $p_{\theta.i}^{k}\in[0,1],i=1,2,\cdots,N$ are learnable probabilities, $\mathbf{1}:[0,1]\mapsto\{0,1\}$ is indicator function:

$\mathbf{1}\left ( x \right )=\left\{\begin{matrix} 1\, \,\, \, if\, \,\, \, x> 0.5\\ 0\, \, \, \, if\, \, \, \, x\leq 0.5 \end{matrix}\right.$

and $\boldsymbol{m}^k$ denotes column vector corresponding to the $k$ -th hyperedge and approximates it by stop gradient technique

④So the hyperedges are:

$\begin{aligned}\boldsymbol{h}^{k}&=\boldsymbol{m}^{k}\odot X\\&=[\boldsymbol{m}_{1}^{k}X_{1},\boldsymbol{m}_{2}^{k}X_{2},\cdots,\boldsymbol{m}_{N}^{k}X_{N}]\in\mathbb{R}^{N\times d}\end{aligned}$

where $\boldsymbol{m}_{j}^{k}$ denotes the $j$ -th element of the vector $\boldsymbol{m}^k$ . Combining them to:

$H=(\boldsymbol{h}^1,\boldsymbol{h}^2,\cdots,\boldsymbol{h}^K)$

（2）Hyperedge Weighting

①Task function: $\boldsymbol{w}=\mathcal{F}_w\left ( H \right )$

②The weight of hyperedges obtained by sum all the correlated nodes fetures up and excute a dimension reduction:

$w^k=\mathrm{Readout}(\boldsymbol{h}^k)=\mathrm{DimReduction}(\boldsymbol{m}^{k^T}\boldsymbol{h}^k)\in\mathbb{R}$

where $w^k$ is the weight of the $k$ -th hyperedge, $\boldsymbol{w} = [w^{1},w^{2},\cdots,w^{K}]^{T} \in \mathbb{R}^{K}$ , DimReduction is MLP with ReLU activations with output dimension equals to 1

③The linear head:

$\hat{Y}=\mathcal{F}_l(\boldsymbol{w})\in\mathbb{R}$

（3）Computational Complexity

①The computational complexity of HYBRID: $O\left ( N^2K \right )$ , the same as MLP（芥末屌）

2.5.2. Optimization Framework

①By reckoning $X$ , $Y$ and $H$ are random variables in the Markovian chain as $X\leftrightarrow Y\leftrightarrow H$ , they optimize:

$\arg\max I(H;Y)-\beta I(H;X)$

where $I$ denotes mutual information, so $I\left ( H;Y \right )$ denotes informativeness and $I\left ( H;X \right )$ denotes redundancy, $\beta$ is a hyperparameter for trading

②"Since optimizing the mutual information for high-dimensional continuous variables is intractable, we instead optimize the lower bound of above equation":

$\begin{aligned} I(H;Y)& =\mathbb{H}[Y]-\mathbb{H}[Y|H] \\ &=\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log p(Y|H)] \\ &=\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log q_{\phi}(Y|H)] \\ &+\mathbb{E}_{p(H)}[\mathrm{KL}(p(Y|H)|q_{\phi}(Y|H))] \\ &\geq\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log q_{\phi}(Y|H)], \end{aligned}$

where $\mathbb{H}\left [ \cdot \right ]$ denotes entropy computation, they aim to optimize component in the second term, $q_\phi\sim \mathcal{F}_l \circ \mathcal{F}_w$ is a model to predict $Y$ based on $H$ , $\circ$ denotes function composition. They define the $q_\phi$ as a Gaussian model with variance

解释：

$I(H;Y) =\mathbb{H}[Y]-\mathbb{H}[Y|H]$

$\mathbb{H}\left [ Y \right ]$ 是 $Y$ 的熵，表示 $Y$ 的不确定性， $\mathbb{H}[Y|H]$ 是在给定 $H$ 的条件下 $Y$ 的条件熵，表示在已知 $H$ 的情况下， $Y$ 的不确定性。互信息 $I(H;Y)$ 可以看作是 $Y$ 的总不确定性减去在给定 $H$ 时的剩余不确定性。优化互信息的目标是减少条件熵（即减少在知道 $H$ 后对 $Y$ 的不确定性）。

$=\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log p(Y|H)]$

$\mathbb{E}_{p(Y,H)}[\log p(Y|H)]$ 是对于联合分布 $p(Y,H)$ 下，条件概率 $p(Y|H)$ 的对数似然的期望值。也就是说，这一项衡量了 $H$ 给定时 $Y$ 的可能性（即 $H$ 对 $Y$ 的信息量）。

$=\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log q_{\phi}(Y|H)]+\mathbb{E}_{p(H)}[\mathrm{KL}(p(Y|H)|q_{\phi}(Y|H))]$

在这一步中，条件概率 $p(Y|H)$ 被替换为了一个近似模型 $q_{\phi}(Y|H)$ ，其中 $\phi$ 是模型的参数。通常这种近似方法被用来推导出可优化的目标，尤其是在变分推理中。 $q_{\phi}(Y|H)$ 是一个我们学习的模型，旨在预测给定 $H$ 时的 $Y$ 。这可以视为通过某个神经网络（或其他模型）来表示的条件概率。而KL散度（Kullback-Leibler散度），它度量了真实条件概率 $p(Y|H)$ 与近似条件概率 $q_{\phi}(Y|H)$ 之间的差异。KL散度的形式是：

$\mathrm{KL}(p(Y|H)|q_{\phi}(Y|H))=\mathbb{E}_{p(Y|H)}[log\, p(Y|H)-log\, q_{\phi}(Y|H)]$

这表示在优化过程中需要考虑到真实分布与近似模型之间的偏差。如果 $q_{\phi}(Y|H)$ 和 $p(Y|H)$ 不相等，KL散度会给出一个正的值，这意味着需要减小这种差异。

$\geq\mathbb{H}[Y]+\mathbb{E}_{p(Y,H)}[\log q_{\phi}(Y|H)]$

是最终的下界，通过变分推理的方式得到了一个优化目标。KL散度始终大于等于零。想要最大化 $I(H;Y)$ ，但实际上只能通过最小化 $\mathbb{E}_{p(Y,H)}[\log q_\phi (Y|H)]$ 来优化近似模型 $q_{\phi}(Y|H)$ ，因为第二项（关于KL散度的期望）是非负的。

（1）Proposition 4.1.

①For the redundancy term:

$\begin{aligned} I(H;X)\leq\sum_{k=1}^{K}I(\boldsymbol{h}^{k};X)& \leq\sum_{k=1}^{K}\sum_{i=1}^{N}I(\boldsymbol h_{i}^{k};X_{i}) \\ &=\sum_{k=1}^K\sum_{i=1}^N\mathbb{H}[X_i](1-p_{\theta,i}^k) \end{aligned}$

the equation holds if and only if nodes are independent and hyperedges do not overlap.

我的补充解释（猜测好吧）：

Ⅰ. 互信息量与总信息量

在这个公式中， $I\left ( H;X \right )$ 代表了系统 $H$ 和 $X$ 之间的互信息。互信息反映了 $H$ 和 $X$ 之间共享的信息量，具体地，它量化了 $X$ 中节点特征信息对超图特征 $H$ 的预测能力。

公式开始的部分是：

$I(H;X)\leq\sum_{k=1}^{K}I(\boldsymbol{h}^{k};X)$

这一步可以理解为总互信息 $I\left ( H;X \right )$ 的上界。它表示如果将系统分 $H$ 成 $K$ 个部分（每个部分用 $\boldsymbol{h}^{k}$ 表示），那么总互信息 $I\left ( H;X \right )$ 的上界可以通过将每个部分的互信息相加得到。这里假设互信息具有子加性，即整体的互信息不大于单独部分的互信息之和。

Ⅱ. 节点的互信息

$\sum_{k=1}^{K}I(\boldsymbol{h}^{k};X)\leq\sum_{k=1}^{K}\sum_{i=1}^{N}I(\boldsymbol h_{i}^{k};X_{i})$

这里将系统的每个部分 $\boldsymbol{h}^{k}$ 再细分为多个节点 $\boldsymbol{h}^{k}_i$ 。换句话说，系统 $H$ 被划分为 $K$ 个超边，每个超边中包含 $N$ 个节点。公式显示，每个超边的互信息可以被分解为每个节点的互信息。此时，互信息的计算从超边级别转移到节点级别。这步转换表示将每个超边的整体信息量进一步细化为其组成部分（即节点）的信息量。

Ⅲ. 计算节点间的互信息

$\sum_{k=1}^{K}\sum_{i=1}^{N}I(\boldsymbol h_{i}^{k};X_{i}) =\sum_{k=1}^K\sum_{i=1}^N\mathbb{H}[X_i](1-p_{\theta,i}^k)$

$\mathbb{H}\left [ X_i\right ]$ 量化了节点的“信息量”或“随机性”。这一步的右边是基于节点的熵和概率计算得到的表达式。

Ⅳ. 公式成立的条件

节点独立性：公式中的每个节点互信息被独立计算，这意味着节点之间应该是独立的。独立性假设意味着一个节点的信息不会影响其他节点的信息，从而可以将整个系统的信息量分解为每个节点的信息量之和。

超边不重叠：超边之间的关系是通过节点的组合来体现的。如果超边之间有重叠（即某些节点在多个超边中同时出现），那么互信息的计算就不再可以简单地分解成每个节点的互信息的和。这是因为超边的重叠会引入额外的依赖关系，从而使得节点之间的互信息不再可以独立计算。

②Then the loss function will be:

2.6. Experiments

2.6.1. Predictive Performance of Hypereges

（1）Datasets

①ABIDE: 直接说官方预处理版本吗哈哈哈哈哈哈好吧虽然也不是不行 with Craddock 200 atlas. They used FIQ, VIQ and PIQ to predict

②ABCD: they chose baseline (release 2.0) and the 2-year followup (release 3.0), obtaining 8 sub-datasets from 2 timepoints under 4 tasks with AAL3v1. The prediction task is aim to FIQ

③Region features: Pearson correlation

（2）Evaluation Metric

①⭐Hyperedge quality assessment method: CPM（代码连接：GitHub - 耶鲁MRRC/CPM）. The formula of CPM:

$r'=CPM\left ( \boldsymbol{w}_p,Y \right )$

where $\boldsymbol{w}_p \in \mathbb{R}^{K_p}$ denotes the pairwise edge weights and $K_p$ denotes the total number of pairwise edges

②Due to the authors use hyperedge they change the formula to:

$r=CPM\left ( \boldsymbol{w}_h,Y \right )$

（3）Baselines

①They compare their model with standard methods, hypergraph construction methods, and connectivity-based phenotypic prediction methods

（4）Implementation & Training Details

①Hyperparameter setting:

（5）Results

① $r$ values comparison table on ABCD:

② $r$ values comparison table on ABIDE:

（6）Runtime

①很快，但细节在附录J，我懒得翻了

（7）Ablation Studies

①Module ablation:

HYBRIDNoMask	Do not mask at all, which means all nodes and their features are visible to each head
HYBRIDRndMask	Replace the learnable masks with random masks with the same sparsity, initialized at the beginning of training
HYBRIDSoftMask	Remove the indicator function and use $p^k_{\theta,i}$ directly in

（8）Additional Experiments on Synthetic Dataset

①“由于在面向特定结果的学习信息超边缘的真实数据集中没有基真超边缘，因此我们构建了一个合成数据集来验证我们的模型是否可以在MIMR目标下恢复正确的超边缘结构。我们使用精度、召回率和F1分数来衡量学习到的超边缘相对于基本事实的正确性。尽管仅在任务标签的监督下学习超边缘是具有挑战性的，但我们发现我们的模型达到了高性能，与最强基线相比，F1得分平均提高了28.3%。有关合成实验的详细信息可在附录G中找到。”以下是我扒来的G：