Example 1: Sparse Multiple Canonical Correlation Network (SmCCNet) Workflow with Graph Neural Network (GNN) Embeddings

This tutorial demonstrates a workflow that uses SmCCNet for graph generation, followed by GNN-based embedding generation to create node representations from the network. The generated embeddings are then integrated into subject-level omics data for use in downstream analyses.

Workflow Overview:

Network Construction (SmCCNet): Generates a network from multi-omics data using SmCCNet. The resulting adjacency matrix represents relationships between features.
GNN-Based Embedding Generation: Utilizes Graph Neural Networks (GNNs) to create embeddings from the constructed network, capturing intricate feature relationships.
Subject Representation Integration: Integrates the generated embeddings into subject-level omics data, enhancing the dataset for downstream analyses such as clustering or disease prediction.

Step-by-Step Guide:

Setup Input Data:

- Prepare your omics data (omics_data), phenotype data (phenotype_data), and clinical data (clinical_data) as Pandas DataFrames or Series.
- Load or create these data structures within your application or script before running the workflow.
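Because the three inputs describe the same subjects, it can help to confirm that they share sample IDs before running anything else. The following is a minimal sketch using only pandas; `check_sample_alignment` is a hypothetical helper written for this illustration and is not part of BioNeuralNet.

```python
import pandas as pd


def check_sample_alignment(omics_data, phenotype_data, clinical_data):
    """Return the sample IDs shared by all three inputs.

    Hypothetical helper for illustration; not part of BioNeuralNet.
    """
    shared = omics_data.index.intersection(phenotype_data.index)
    shared = shared.intersection(clinical_data.index)
    if shared.empty:
        raise ValueError("No sample IDs are shared across the inputs.")
    return shared


# Toy inputs whose indexes differ in coverage and order.
omics_data = pd.DataFrame(
    {"protein_feature1": [0.1, 0.2, 0.3], "metabolite_feature1": [0.5, 0.6, 0.7]},
    index=["Sample1", "Sample2", "Sample3"],
)
phenotype_data = pd.Series([1, 0], index=["Sample1", "Sample2"])
clinical_data = pd.DataFrame(
    {"clinical_feature1": [5, 3]}, index=["Sample2", "Sample1"]
)

shared = check_sample_alignment(omics_data, phenotype_data, clinical_data)

# Restrict every input to the shared samples so rows line up one-to-one.
omics_data = omics_data.loc[shared]
phenotype_data = phenotype_data.loc[shared]
clinical_data = clinical_data.loc[shared]
```

After this step all three objects contain the same samples in the same order, so later steps do not depend on how each file happened to be sorted.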

"""
Example 1: Sparse Multiple Canonical Correlation Network (SmCCNet) Workflow with Graph Neural Network (GNN) Embeddings
======================================================================================================================

This script demonstrates a comprehensive workflow where we first generate a graph using Sparse Multiple Canonical
Correlation Network (SmCCNet), and then use Graph Neural Network (GNN)-based embedding generation to create node
representations from the network.

Steps:
1. Generate an adjacency matrix using SmCCNet based on multi-omics and phenotype data.
2. Build node features from the protein and metabolite columns of the omics data.
3. Use a Graph Convolutional Network (GCN) to generate node embeddings.
4. Integrate the embeddings into the omics data for enhanced analysis.
"""


import pandas as pd
from bioneuralnet.graph_generation import SmCCNet
from bioneuralnet.network_embedding import GNNEmbedding
from bioneuralnet.subject_representation import GraphEmbedding

Run SmCCNet Workflow:

Running SmCCNet to generate the adjacency matrix.

    try:
        smccnet_instance = SmCCNet(
            phenotype_data=phenotype_data,
            omics_data=omics_data,
            data_types=['protein', 'metabolite'],
            kfold=5,
            summarization='PCA',
            seed=732,
        )

        adjacency_matrix = smccnet_instance.run()

This step instantiates the SmCCNet class and generates an adjacency matrix using your multi-omics and phenotype data.
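Before passing the network on, a quick structural check can catch obvious problems. The sketch below assumes the adjacency matrix is a pandas DataFrame with the same feature names on rows and columns and that the network is undirected (symmetric); both are assumptions made for this illustration. `validate_adjacency` is a hypothetical helper, and the toy matrix is hand-written rather than SmCCNet output.

```python
import numpy as np
import pandas as pd


def validate_adjacency(adjacency_matrix: pd.DataFrame, tol: float = 1e-8) -> bool:
    """Check basic structural properties of a feature-by-feature network.

    Hypothetical helper for illustration; not part of BioNeuralNet.
    """
    n_rows, n_cols = adjacency_matrix.shape
    if n_rows != n_cols:
        raise ValueError("Adjacency matrix must be square.")
    if not adjacency_matrix.index.equals(adjacency_matrix.columns):
        raise ValueError("Row and column labels must match.")
    values = adjacency_matrix.to_numpy(dtype=float)
    if not np.allclose(values, values.T, atol=tol):
        raise ValueError("Adjacency matrix must be symmetric.")
    return True


# Hand-written toy network over four omics features (not SmCCNet output).
features = [
    "protein_feature1",
    "protein_feature2",
    "metabolite_feature1",
    "metabolite_feature2",
]
toy_adjacency = pd.DataFrame(
    [
        [0.0, 0.8, 0.0, 0.2],
        [0.8, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.3],
        [0.2, 0.0, 0.3, 0.0],
    ],
    index=features,
    columns=features,
)

is_valid = validate_adjacency(toy_adjacency)
```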

Run GNN Embedding Generation:

Generating GNN Embeddings from the Adjacency Matrix.

        node_features = pd.concat([
            omics_data[['protein_feature1', 'protein_feature2']], 
            omics_data[['metabolite_feature1', 'metabolite_feature2']]  
        ], axis=1)

        gnn_embedding = GNNEmbedding(
            adjacency_matrix=adjacency_matrix,
            node_features=node_features,
            model_type='GCN', 
            gnn_hidden_dim=64,
            gnn_layer_num=2,
            dropout=True
        )
        embeddings_dict = gnn_embedding.run()
        embeddings_tensor = embeddings_dict['graph']
        embeddings_df = pd.DataFrame(embeddings_tensor.numpy(), index=node_features.index)

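In this snippet the node features are the selected protein and metabolite columns of omics_data, concatenated into a single DataFrame. They are passed, together with the adjacency matrix, to a two-layer Graph Convolutional Network (GCN) with a hidden dimension of 64, which generates the embeddings.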
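For intuition about how a GCN uses the adjacency matrix, the sketch below applies the standard graph-convolution propagation rule in plain NumPy. It is an illustration only: the weights are random and untrained, the toy matrices are made up, and nothing here reflects how GNNEmbedding is implemented internally. Note that each row of the feature matrix corresponds to one node of the network.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)                      # weighted node degrees
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalization keeps feature scales comparable across nodes.
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)


rng = np.random.default_rng(732)

# Toy undirected network with four nodes and non-negative edge weights.
adjacency = np.array(
    [
        [0.0, 0.8, 0.0, 0.2],
        [0.8, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.3],
        [0.2, 0.0, 0.3, 0.0],
    ]
)
node_features = rng.normal(size=(4, 3))  # one row per node, three features each

# Two stacked layers with random (untrained) weights.
w1 = rng.normal(size=(3, 8))
w2 = rng.normal(size=(8, 8))
hidden = gcn_layer(adjacency, node_features, w1)
embeddings = gcn_layer(adjacency, hidden, w2)
```

Each layer mixes a node's features with those of its neighbors, so stacking two layers lets information travel across two hops in the network.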

Integrate Embeddings into Subject Representation:

Integrating GNN Embeddings into Subject-Level Omics Data.

        # Step 4: Initialize and run GraphEmbedding
        graph_embedding = GraphEmbedding(
            adjacency_matrix=adjacency_matrix,
            omics_data=omics_data,
            phenotype_data=phenotype_data,
            clinical_data=clinical_data,
            embedding_method='GNNs'
        )
        enhanced_omics_data = graph_embedding.run()

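Here, GraphEmbedding receives the adjacency matrix together with the omics, phenotype, and clinical data and, with embedding_method='GNNs', returns subject-level omics data enriched with embedding information for downstream analyses such as clustering or disease prediction. Note that embeddings_df from the previous step is not passed to GraphEmbedding in this snippet.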
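One plausible way to fold node embeddings back into subject-level data is to reduce each node's embedding to a single weight and scale the matching omics column by it. The sketch below illustrates that idea with pandas only. It is an assumption made for illustration: the source does not state how GraphEmbedding performs the integration, and the embedding values here are invented.

```python
import pandas as pd

omics_data = pd.DataFrame(
    {
        "protein_feature1": [0.1, 0.2],
        "protein_feature2": [0.3, 0.4],
        "metabolite_feature1": [0.5, 0.6],
        "metabolite_feature2": [0.7, 0.8],
    },
    index=["Sample1", "Sample2"],
)

# Invented two-dimensional embeddings, one row per network node (omics feature).
node_embeddings = pd.DataFrame(
    [[0.2, 0.4], [0.6, 0.2], [0.1, 0.1], [0.5, 0.9]],
    index=omics_data.columns,
)

# Collapse each node's embedding to one scalar weight (here: the mean).
node_weights = node_embeddings.mean(axis=1)

# Scale each omics column by its node's weight; alignment is by feature name.
enhanced = omics_data.mul(node_weights, axis="columns")
```

The result keeps the original shape (subjects by features), so it can be used anywhere the original omics table was used.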

Complete Workflow Execution:

Complete SmCCNet Workflow Execution with Sample Data.

if __name__ == "__main__":
    try:
        print("Starting SmCCNet and GNNs Workflow...")

        omics_data = pd.DataFrame({
            'protein_feature1': [0.1, 0.2],
            'protein_feature2': [0.3, 0.4],
            'metabolite_feature1': [0.5, 0.6],
            'metabolite_feature2': [0.7, 0.8]
        }, index=['Sample1', 'Sample2'])

        phenotype_data = pd.Series([1, 0], index=['Sample1', 'Sample2'])

        clinical_data = pd.DataFrame({
            'clinical_feature1': [5, 3],
            'clinical_feature2': [7, 2]
        }, index=['Sample1', 'Sample2'])

        enhanced_omics = run_smccnet_workflow(omics_data, phenotype_data, clinical_data)

        print("Enhanced Omics Data:")
        print(enhanced_omics)

        print("SmCCNet Workflow completed successfully.\n")
    except Exception as e:
        print(f"An error occurred during the execution: {e}")

This section demonstrates the full execution of the workflow using sample data. It initializes the input data, runs the SmCCNet workflow, and outputs the enhanced omics data integrated with GNN embeddings.
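The two-sample frames above are placeholders that show the expected layout. For a dry run with more rows (for example, when cross-validation with several folds is requested), synthetic inputs of the same shape can be generated as sketched below. The values are random and carry no biological meaning, and the sample count of 50 is an arbitrary choice for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(732)
n_samples = 50  # arbitrary size chosen for illustration
sample_ids = [f"Sample{i + 1}" for i in range(n_samples)]

# Random omics values with the same column names as the example.
omics_data = pd.DataFrame(
    rng.normal(size=(n_samples, 4)),
    index=sample_ids,
    columns=[
        "protein_feature1",
        "protein_feature2",
        "metabolite_feature1",
        "metabolite_feature2",
    ],
)

# Random binary phenotype, indexed by the same sample IDs.
phenotype_data = pd.Series(
    rng.integers(0, 2, size=n_samples), index=sample_ids, name="phenotype"
)

# Random integer clinical covariates.
clinical_data = pd.DataFrame(
    {
        "clinical_feature1": rng.integers(1, 10, size=n_samples),
        "clinical_feature2": rng.integers(1, 10, size=n_samples),
    },
    index=sample_ids,
)
```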

Running the Example:

The complete script, which combines the snippets above, is shown below.

"""
Example 1: Sparse Multiple Canonical Correlation Network (SmCCNet) Workflow with Graph Neural Network (GNN) Embeddings
======================================================================================================================

This script demonstrates a comprehensive workflow where we first generate a graph using Sparse Multiple Canonical
Correlation Network (SmCCNet), and then use Graph Neural Network (GNN)-based embedding generation to create node
representations from the network.

Steps:
1. Generate an adjacency matrix using SmCCNet based on multi-omics and phenotype data.
2. Build node features from the protein and metabolite columns of the omics data.
3. Use a Graph Convolutional Network (GCN) to generate node embeddings.
4. Integrate the embeddings into the omics data for enhanced analysis.
"""


import pandas as pd
from bioneuralnet.graph_generation import SmCCNet
from bioneuralnet.network_embedding import GNNEmbedding
from bioneuralnet.subject_representation import GraphEmbedding

def run_smccnet_workflow(omics_data: pd.DataFrame,
                         phenotype_data: pd.Series,
                         clinical_data: pd.DataFrame) -> pd.DataFrame:
    """
    Executes the SmCCNet-based workflow for generating enhanced omics data.

    This function performs the following steps:
        1. Instantiates the SmCCNet, GNNEmbedding, and GraphEmbedding components.
        2. Generates an adjacency matrix using SmCCNet.
        3. Builds node features from the protein and metabolite columns of the omics data.
        4. Generates embeddings using GNNEmbedding.
        5. Integrates embeddings into omics data to produce enhanced omics data.

    Args:
        omics_data (pd.DataFrame): DataFrame containing omics features (e.g., proteins, metabolites).
        phenotype_data (pd.Series): Series containing phenotype information.
        clinical_data (pd.DataFrame): DataFrame containing clinical data.

    Returns:
        pd.DataFrame: Enhanced omics data integrated with GNN embeddings.
    """
    try:
        smccnet_instance = SmCCNet(
            phenotype_data=phenotype_data,
            omics_data=omics_data,
            data_types=['protein', 'metabolite'],
            kfold=5,
            summarization='PCA',
            seed=732,
        )

        adjacency_matrix = smccnet_instance.run()
        print("Adjacency matrix generated using SmCCNet.")

        node_features = pd.concat([
            omics_data[['protein_feature1', 'protein_feature2']], 
            omics_data[['metabolite_feature1', 'metabolite_feature2']]  
        ], axis=1)

        gnn_embedding = GNNEmbedding(
            adjacency_matrix=adjacency_matrix,
            node_features=node_features,
            model_type='GCN', 
            gnn_hidden_dim=64,
            gnn_layer_num=2,
            dropout=True
        )
        embeddings_dict = gnn_embedding.run()
        embeddings_tensor = embeddings_dict['graph']
        embeddings_df = pd.DataFrame(embeddings_tensor.numpy(), index=node_features.index)
        print("GNN embeddings generated.")
        
        # Embeddings can also be saved to a file:
        # output_file = 'output/embeddings.csv'
        # embeddings_df.to_csv(output_file)
        # print(f"Embeddings saved to {output_file}")

        # Step 4: Initialize and run GraphEmbedding
        graph_embedding = GraphEmbedding(
            adjacency_matrix=adjacency_matrix,
            omics_data=omics_data,
            phenotype_data=phenotype_data,
            clinical_data=clinical_data,
            embedding_method='GNNs'
        )
        enhanced_omics_data = graph_embedding.run()
        print("Embeddings integrated into omics data.")

        return enhanced_omics_data

    except Exception as e:
        print(f"An error occurred during the SmCCNet workflow: {e}")
        raise  # re-raise with the original traceback

if __name__ == "__main__":
    try:
        print("Starting SmCCNet and GNNs Workflow...")

        omics_data = pd.DataFrame({
            'protein_feature1': [0.1, 0.2],
            'protein_feature2': [0.3, 0.4],
            'metabolite_feature1': [0.5, 0.6],
            'metabolite_feature2': [0.7, 0.8]
        }, index=['Sample1', 'Sample2'])

        phenotype_data = pd.Series([1, 0], index=['Sample1', 'Sample2'])

        clinical_data = pd.DataFrame({
            'clinical_feature1': [5, 3],
            'clinical_feature2': [7, 2]
        }, index=['Sample1', 'Sample2'])

        enhanced_omics = run_smccnet_workflow(omics_data, phenotype_data, clinical_data)

        print("Enhanced Omics Data:")
        print(enhanced_omics)

        print("SmCCNet Workflow completed successfully.\n")
    except Exception as e:
        print(f"An error occurred during the execution: {e}")
        raise  # re-raise with the original traceback

Upon successful execution, you will find:

- Adjacency Matrix: Generated by SmCCNet, stored as a DataFrame.
- GNN Embeddings: Created using GNNs, stored as embeddings_df.
- Enhanced Omics Data: Subject-level data enriched with embeddings, stored as enhanced_omics_data.

Result Interpretation:

Adjacency Matrix: Represents the constructed network from multi-omics data, indicating the strength and presence of relationships between features.
GNN Embeddings: Numerical representations capturing the structural and feature-based intricacies of the network, facilitating advanced analyses.
Enhanced Omics Data: Combines original omics data with embedding information, providing a richer dataset for downstream tasks like clustering or predictive modeling.
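To make the adjacency matrix easier to read, the strongest relationships can be listed as an edge table. The sketch below assumes a symmetric, feature-labelled DataFrame; the toy values are hand-written for illustration and are not SmCCNet output.

```python
import numpy as np
import pandas as pd

features = [
    "protein_feature1",
    "protein_feature2",
    "metabolite_feature1",
    "metabolite_feature2",
]
adjacency = pd.DataFrame(
    [
        [0.0, 0.8, 0.0, 0.2],
        [0.8, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.3],
        [0.2, 0.0, 0.3, 0.0],
    ],
    index=features,
    columns=features,
)

# Take the upper triangle so each undirected edge is listed once.
rows, cols = np.triu_indices(len(adjacency), k=1)
edges = pd.DataFrame(
    {
        "source": adjacency.index[rows],
        "target": adjacency.columns[cols],
        "weight": adjacency.to_numpy()[rows, cols],
    }
)

# Keep only existing edges and rank them from strongest to weakest.
edges = (
    edges[edges["weight"] > 0]
    .sort_values("weight", ascending=False)
    .reset_index(drop=True)
)
```

The first rows of `edges` then show which feature pairs are most strongly connected in the network.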