Usage examples

Below are a few usage examples for the mvcluster package.

Preparing a custom dataset

  1"""
  2[EN] prepare_custom_dataset.py - Final Version
  3
  4This script prepares heterogeneous multi-view (e.g., multi-omics) datasets
  5for downstream tasks such as clustering or graph-based learning.
  6
  7It performs robust loading, preprocessing, normalization, graph
  8construction, and saving of multiple data views into a unified .mat file
  9format.
 10
 11==============================
 12Main Functionalities
 13==============================
 14
 151. Robust Data Loading
 16----------------------
 17- Loads CSV files using pandas.
 18- Tries alternative encodings (utf-8, latin1, windows-1252) if standard
 19  read fails.
 20- If no valid columns are found, generates random fallback data to avoid
 21  crashing.
 22
 232. View Preprocessing
 24---------------------
 25Each input view (CSV file) undergoes the following steps:
 26- Categorical columns are converted to numerical using factorization.
 27- Missing values are imputed using column-wise medians.
 28- Views with fewer features than `--min_features` are automatically
 29  augmented by duplicating existing columns.
 30- If a view has more than 100 features, variance thresholding is applied
 31  to remove low-variance columns.
 32- Each view is standardized using `StandardScaler`.
 33
 343. Graph Construction
 35---------------------
 36- Constructs a symmetric K-Nearest Neighbors (KNN) graph for each view.
 37- Graphs are binary (1/0 connectivity) and symmetric (A = (A + A.T) / 2).
 38
 394. Label Handling (Optional)
 40----------------------------
 41- If a label file is provided, it is loaded and encoded using
 42  `LabelEncoder`.
 43- Only labels matching the number of samples are retained.
 44
 455. Output Generation
 46---------------------
 47The final data is saved as a `.mat` file and includes:
 48- Feature matrices: X_0, X_1, ..., one per view.
 49- Adjacency matrices: A_0, A_1, ..., one per view.
 50- View names.
 51- Original shape information for each view.
 52- Sample count.
 53- Feature names (limited to selected columns).
 54- Encoded labels (optional).
 55
 56==============================
 57Command Line Arguments
 58==============================
 59--views        : List of CSV files (one per view) [REQUIRED]
 60--labels       : (Optional) Path to CSV file with sample labels
 61--data_name    : Output filename (without extension) [REQUIRED]
 62--k            : Number of neighbors for KNN graph (default: 15)
 63--min_features : Minimum number of features per view (default: 1)
 64--output_dir   : Output directory (default: prepared_datasets)
 65
 66==============================
 67Typical Usage Example
 68==============================
 69python prepare_custom_dataset.py \
 70    --views view1.csv view2.csv view3.csv \
 71    --labels labels.csv \
 72    --data_name my_dataset \
 73    --k 15 \
 74    --min_features 2 \
 75    --output_dir prepared_datasets
 76
 77==============================
 78Error Handling and Recommendations
 79==============================
 80- Views with <2 features may cause downstream errors with dimensionality
 81  reduction (e.g., TruncatedSVD).
 82- Use `--min_features 2` or manually exclude weak views.
 83- Final `.mat` output is compatible with MATLAB and multi-view clustering
 84  frameworks.
 85
 86==============================
 87Output Example
 88==============================
 89View 1/5: transcriptomics
 90transcriptomics: Selected 45/150 features
 91Shape: (30, 45), Features: 45
 92Loaded 3 label classes
 93
 94=== Successfully saved to prepared_datasets/my_dataset.mat ===
 95Summary: 5 views, 30 samples
 96
 97
 98[FR] prepare_custom_dataset.py - Version finale
 99
100Ce script prépare des jeux de données hétérogènes multi-vues (ex : multi-
101omiques) pour des tâches en aval telles que le clustering ou
102l’apprentissage basé sur les graphes.
103
104Il effectue le chargement robuste, le prétraitement, la normalisation,
105la construction de graphes et la sauvegarde des vues dans un fichier
106unique `.mat`.
107
108==============================
109Fonctionnalités principales
110==============================
111
1121. Chargement robuste des données
113----------------------------------
114- Chargement des fichiers CSV avec pandas.
115- Essaie plusieurs encodages alternatifs (utf-8, latin1, windows-1252) si
116  le chargement échoue.
117- Si aucun fichier valide n'est trouvé, des données aléatoires sont
118  générées pour éviter l'arrêt du programme.
119
1202. Prétraitement des vues
121--------------------------
122Chaque vue (fichier CSV) est traitée comme suit :
123- Les colonnes catégorielles sont converties en valeurs numériques via la
124  factorisation.
125- Les valeurs manquantes sont remplacées par la médiane des colonnes.
126- Si une vue contient moins de `--min_features`, elle est augmentée
127  automatiquement.
128- Si une vue contient plus de 100 colonnes, une sélection par variance
129  est appliquée.
130- Chaque vue est normalisée avec `StandardScaler`.
131
1323. Construction de graphes
133---------------------------
134- Un graphe de K plus proches voisins (KNN) est construit pour chaque vue.
135- Les graphes sont binaires (0/1) et symétrisés (A = (A + A.T)/2).
136
1374. Gestion des étiquettes (facultatif)
138---------------------------------------
139- Si un fichier de labels est fourni, il est chargé et encodé avec
140  `LabelEncoder`.
141- Les étiquettes sont conservées uniquement si elles correspondent au
142  nombre d’échantillons.
143
1445. Génération de la sortie
145---------------------------
146Le fichier final au format `.mat` contient :
147- Les matrices de caractéristiques : X_0, X_1, ..., une par vue.
148- Les matrices d’adjacence : A_0, A_1, ..., une par vue.
149- Les noms des vues.
150- Les dimensions d’origine de chaque vue.
151- Le nombre total d’échantillons.
152- Les noms des variables (colonnes sélectionnées).
153- Les étiquettes encodées (si présentes).
154
155==============================
156Arguments en ligne de commande
157==============================
158--views        : Liste de fichiers CSV (une par vue) [OBLIGATOIRE]
159--labels       : (Facultatif) Fichier CSV contenant les labels
160--data_name    : Nom du fichier de sortie (sans extension) [OBLIGATOIRE]
161--k            : Nombre de voisins pour le graphe KNN (défaut : 15)
162--min_features : Nombre minimal de colonnes par vue (défaut : 1)
163--output_dir   : Répertoire de sortie (défaut : prepared_datasets)
164
165==============================
166Exemple d'utilisation
167==============================
168python prepare_custom_dataset.py \
169    --views vue1.csv vue2.csv vue3.csv \
170    --labels labels.csv \
171    --data_name mon_dataset \
172    --k 15 \
173    --min_features 2 \
174    --output_dir prepared_datasets
175
176==============================
177Conseils et gestion des erreurs
178==============================
179- Les vues avec moins de 2 colonnes peuvent provoquer des erreurs avec
180  TruncatedSVD.
181- Utilisez `--min_features 2` ou excluez manuellement ces vues.
182- Le fichier `.mat` final est compatible avec MATLAB et les frameworks
183  de clustering multi-vues.
184
185==============================
186Exemple de sortie
187==============================
188Vue 1/5 : transcriptomics
189transcriptomics : 45/150 variables sélectionnées
190Forme : (30, 45), Variables : 45
1913 classes de labels chargées
192
193=== Sauvegarde réussie vers prepared_datasets/mon_dataset.mat ===
194Résumé : 5 vues, 30 échantillons
195"""


   import argparse
   import numpy as np
   import scipy.io
   import pandas as pd
   import os
   import warnings
   from sklearn.neighbors import kneighbors_graph
   from sklearn.preprocessing import StandardScaler, LabelEncoder
   from sklearn.feature_selection import VarianceThreshold

   # Show each distinct warning only once and keep console output compact
   warnings.filterwarnings('once')
   pd.set_option('display.max_columns', 10)


   def robust_read_file(filepath: str) -> pd.DataFrame:
       """Read data file with multiple fallback strategies."""
       try:
           df = pd.read_csv(filepath, header=0, index_col=None)

           if df.shape[1] == 0:
               encodings = ['utf-8', 'latin1', 'windows-1252']
               for enc in encodings:
                   try:
                       df = pd.read_csv(filepath, encoding=enc)
                       if df.shape[1] > 0:
                           break
                   except Exception:
                       continue

           if df.shape[1] == 0:
               raise ValueError("No columns detected")

           return df

       except Exception as e:
           warnings.warn(f"Failed to read {filepath}: {str(e)}")
           # Fallback: a random single-feature view so the pipeline keeps going
           return pd.DataFrame({'feature': np.random.rand(30)})


   def preprocess_view(df: pd.DataFrame, view_name: str,
                       min_features: int) -> np.ndarray:
       """Preprocess a single view."""
       # Encode categorical columns as integer codes
       cat_cols = df.select_dtypes(exclude=np.number).columns
       for col in cat_cols:
           df[col] = pd.factorize(df[col])[0]

       # Impute missing values with column-wise medians
       if df.isna().any().any():
           df = df.fillna(df.median())

       X = df.values.astype(np.float32)

       if X.shape[1] < min_features:
           warnings.warn(
               f"Augmenting {view_name} from {X.shape[1]} "
               f"to {min_features} features"
           )
           # Duplicate the first column until the minimum width is reached
           X = np.hstack([X] + [X[:, [0]]] * (min_features - X.shape[1]))

       if X.shape[1] > 100:
           selector = VarianceThreshold(threshold=0.1)
           try:
               X = selector.fit_transform(X)
               print(
                   f"{view_name}: Selected {X.shape[1]}/"
                   f"{selector.n_features_in_} features"
               )
           except Exception as e:
               print(f"Feature selection failed for {view_name}: {str(e)}")

       if X.shape[0] > 1:
           X = StandardScaler().fit_transform(X)

       return X


   def save_heterogeneous_data(output_path: str, data: dict):
       """Specialized saver for heterogeneous data."""
       save_data = {}
       for i, (x, a) in enumerate(zip(data['Xs'], data['As'])):
           save_data[f'X_{i}'] = x
           save_data[f'A_{i}'] = a

       save_data.update({
           'view_names': np.array(data['view_names'], dtype=object),
           'n_samples': data['n_samples'],
           'original_shapes': np.array(
               [x.shape for x in data['Xs']], dtype=object
           )
       })

       if 'feature_names' in data:
           # Object array, since views may have different numbers of columns
           feature_names = np.empty(len(data['feature_names']), dtype=object)
           feature_names[:] = data['feature_names']
           save_data['feature_names'] = feature_names

       if 'labels' in data:
           save_data['labels'] = data['labels']

       scipy.io.savemat(output_path, save_data)


   def main():
       parser = argparse.ArgumentParser(
           description="Multi-omics data preprocessor"
       )
       parser.add_argument("--views", nargs="+", required=True,
                           help="Input files")
       parser.add_argument("--labels", help="Label file")
       parser.add_argument("--data_name", required=True, help="Output name")
       parser.add_argument("--k", type=int, default=15,
                           help="k for KNN graph")
       parser.add_argument("--min_features", type=int, default=2,
                           help="Min features")
       parser.add_argument("--output_dir", default="prepared_datasets",
                           help="Output dir")

       args = parser.parse_args()
       os.makedirs(args.output_dir, exist_ok=True)
       output_path = os.path.join(args.output_dir,
                                  f"{args.data_name}.mat")

       view_data = []
       print("\n=== Processing Views ===")

       for i, view_path in enumerate(args.views):
           view_name = os.path.splitext(os.path.basename(view_path))[0]
           print(f"\nView {i + 1}/{len(args.views)}: {view_name}")

           try:
               df = robust_read_file(view_path)
               X = preprocess_view(df, view_name, args.min_features)

               print(f"\n>>> First 10 rows of {view_name} after preprocessing:")
               print(pd.DataFrame(X).head(10))

               A = kneighbors_graph(X, n_neighbors=args.k,
                                    mode='connectivity')
               A = 0.5 * (A + A.T)  # type: ignore # Symmetrize
               A.data[:] = 1        # Binary weights

               view_data.append({
                   'X': X,
                   'A': A,
                   # Note: names are truncated to the selected width; they
                   # are not mapped through the variance selector
                   'name': view_name,
                   'features': df.columns.tolist()[:X.shape[1]]
               })

               print(f"  Shape: {X.shape}, "
                     f"Features: {len(view_data[-1]['features'])}")

           except Exception as e:
               warnings.warn(f"Failed to process {view_name}: {str(e)}")
               continue

       results = {
           'Xs': [vd['X'] for vd in view_data],
           'As': [vd['A'] for vd in view_data],
           'view_names': [vd['name'] for vd in view_data],
           'n_samples': view_data[0]['X'].shape[0] if view_data else 0,
           'feature_names': [vd['features'] for vd in view_data]
       }

       if args.labels and os.path.exists(args.labels):
           try:
               labels = pd.read_csv(args.labels).squeeze()
               if len(labels) == results['n_samples']:  # type: ignore
                   results['labels'] = LabelEncoder().fit_transform(labels)
                   print(f"\nLoaded {len(np.unique(results['labels']))} "
                         "label classes")
           except Exception as e:
               warnings.warn(f"Label loading failed: {str(e)}")

       try:
           save_heterogeneous_data(output_path, results)
           print(f"\n=== Successfully saved to {output_path} ===")
           print(f"Summary: {len(view_data)} views, "
                 f"{results['n_samples']} samples")
       except Exception as e:
           print(f"\n!!! Final save failed: {str(e)}")
           print("Possible solutions:")
           print("1. Install hdf5storage: pip install hdf5storage")
           print("2. Reduce feature dimensions using PCA")
           print("3. Save in a different format (e.g., HDF5)")


   if __name__ == "__main__":
       main()
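
To sanity-check the generated file, here is a minimal sketch that reloads it with scipy and walks the documented key layout; the path follows the usage example above:

   import numpy as np
   import scipy.io

   # Path assumed from the usage example above (--output_dir / --data_name)
   mat = scipy.io.loadmat("prepared_datasets/my_dataset.mat")

   i = 0
   while f"X_{i}" in mat:
       X, A = mat[f"X_{i}"], mat[f"A_{i}"]  # A may load back as sparse
       print(f"View {i}: X {X.shape}, A {A.shape}")
       i += 1

   if "labels" in mat:
       print("Label classes:", np.unique(mat["labels"]).size)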

Comparing methods

 1"""
 2compare_methods.py
 3
 4Compares multiple multiview clustering algorithms
 5on the same dataset using clustering metrics (NMI, ARI, ACC).
 6
 7Steps:
 8    1. Load and preprocess a multi-view dataset.
 9    2. Apply multiple clustering algorithms to generate labels.
10    3. Compute and display evaluation metrics.
11    4. Optionally visualize clusters from each method.
12
13Usage:
14    python compare_methods.py
15
16Dependencies:
17    - mvclustlib.algorithms.*
18    - mvclustlib.utils.metrics
19    - mvclustlib.utils.plot
20"""

Evaluating with metrics

 1"""
 2evaluate_with_metrics.py
 3
 4Computes clustering quality metrics (NMI, ARI, ACC) for a selected multiview
 5clustering algorithm on a benchmark dataset.
 6
 7Steps:
 8    1. Run a clustering method on a dataset.
 9    2. Compare predicted labels against ground truth.
10    3. Compute and print evaluation metrics.
11
12Usage:
13    python evaluate_with_metrics.py
14
15Dependencies:
16    - mvclustlib.algorithms.lmgec
17    - mvclustlib.utils.metrics
18"""

Visualizing clusters

  1"""
  2[EN]
  3This script loads and visualizes multi-view clustering results from custom
  4multi-view datasets stored in .mat files. It supports various common .mat file
  5formats for multi-view data with adjacency and feature matrices, optionally
  6including ground truth cluster labels.
  7
  8Main features and workflow:
  9
 101. Data Loading:
 11   - Supports .mat formats with keys such as 'X_i'/'A_i', 'X1', 'features',
 12     'views', and special cases like 'fea', 'W', and 'gnd'.
 13   - Handles sparse and dense matrices and converts them as needed.
 14   - Returns a list of (adjacency matrix, feature matrix) tuples for each view,
 15     along with optional ground truth labels.
 16
 172. Data Preprocessing:
 18   - Normalizes adjacency matrices and preprocesses feature matrices.
 19   - Supports tf-idf option disabled here and beta parameter usage.
 20   - Converts sparse matrices to dense format where necessary.
 21
 223. Clustering:
 23   - Uses the LMGEC (Localized Multi-View Graph Embedding Clustering) model
 24     for clustering.
 25   - Automatically determines the number of clusters from labels or defaults
 26     to 3 if no labels are provided.
 27   - Embedding dimension is set as clusters + 1.
 28
 294. Visualization:
 30   - Visualizes predicted clusters and, if available, ground truth clusters.
 31   - Uses PCA for dimensionality reduction before plotting.
 32
 335. Command-Line Interface:
 34   - Requires a path to the .mat dataset.
 35   - Optional flag to run without ground truth labels.
 36
 37Dependencies:
 38- mvcluster package (cluster, utils.plot, utils.preprocess modules)
 39- numpy, scipy, scikit-learn, argparse, warnings
 40
 41Usage example:
 42    python visualize_mvclusters.py --data_file path/to/data.mat
 43    python visualize_mvclusters.py --data_file path/to/data.mat --no_labels
 44
 45[FR]
 46Ce script charge et visualise les résultats de clustering multi-vues à partir
 47de jeux de données multi-vues personnalisés au format .mat. Il supporte
 48plusieurs formats .mat communs avec matrices d’adjacence et matrices de
 49caractéristiques, incluant éventuellement des étiquettes de vérité terrain.
 50
 51Fonctionnalités principales et déroulement :
 52
 531. Chargement des données :
 54- Supporte les formats .mat avec clés telles que 'X_i'/'A_i', 'X1', 'features',
 55'views', et cas spéciaux comme 'fea', 'W' et 'gnd'.
 56- Gère les matrices creuses (sparse) et denses en les convertissant si besoin.
 57- Retourne une liste de tuples
 58(matrice d’adjacence, matrice de caractéristiques)
 59pour chaque vue, ainsi que les étiquettes de vérité terrain optionnelles.
 60
 612. Prétraitement des données :
 62- Normalise les matrices d’adjacence et prépare les matrices
 63de caractéristiques.
 64- Supporte l’option tf-idf désactivée ici et l’usage du paramètre beta.
 65- Convertit les matrices creuses en matrices denses si nécessaire.
 66
 673. Clustering :
 68- Utilise le modèle LMGEC (Localized Multi-View Graph Embedding Clustering)
 69pour le clustering.
 70- Détermine automatiquement le nombre de clusters à partir des étiquettes,
 71ou utilise 3 clusters par défaut si aucune étiquette n’est fournie.
 72- La dimension d’embedding est fixée à clusters + 1.
 73
 744. Visualisation :
 75- Visualise les clusters prédits et, si disponibles, les clusters de vérité
 76terrain.
 77- Utilise l’ACP (PCA) pour réduire la dimension avant affichage.
 78
 795. Interface en ligne de commande :
 80- Nécessite le chemin vers le fichier .mat.
 81- Option pour exécuter sans étiquettes de vérité terrain.
 82
 83Dépendances :
 84- Package mvcluster (modules cluster, utils.plot, utils.preprocess)
 85- numpy, scipy, scikit-learn, argparse, warnings
 86
 87Exemples d’utilisation :
 88    python visualize_mvclusters.py --data_file chemin/vers/data.mat
 89    python visualize_mvclusters.py --data_file chemin/vers/data.mat --no_labels
 90"""


   import argparse
   import os
   import sys
   import numpy as np
   import warnings
   from sklearn.preprocessing import StandardScaler
   from scipy.io import loadmat
   from scipy.sparse import issparse, coo_matrix

   # Add the parent directory to the import path
   sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

   try:
       from mvcluster.cluster import LMGEC
       from mvcluster.utils.plot import visualize_clusters
       from mvcluster.utils.preprocess import preprocess_dataset
   except ImportError as e:
       raise ImportError(f"Failed to import required modules: {e}")


   def load_custom_mat(path):
       """
       Load .mat file supporting multiple multiview formats.

       Args:
           path (str): Path to the .mat file

       Returns:
           tuple: (list of (A, X) tuples, labels array or None)

       Raises:
           ValueError: If the file structure is unsupported
       """
       mat = loadmat(path)
       Xs, As = [], []
       # Try to get labels (optional)
       labels = None
       for label_key in ['labels', 'label', 'gt', 'ground_truth']:
           if label_key in mat:
               labels = mat[label_key].squeeze()
               break

       # Try the X_0/A_0 format
       i = 0
       while f"X_{i}" in mat and f"A_{i}" in mat:
           X = mat[f"X_{i}"]
           A = mat[f"A_{i}"].astype(np.float32)
           if issparse(X):
               X = X.toarray()
           if issparse(A):
               A = A.toarray()
           Xs.append(X)
           As.append(A)
           i += 1
       if Xs:
           return list(zip(As, Xs)), labels

       # Try the X1 format (with identity adjacency)
       i = 1
       while f"X{i}" in mat:
           X = mat[f"X{i}"]
           if issparse(X):
               X = X.toarray()
           A = np.eye(X.shape[0], dtype=np.float32)
           Xs.append(X)
           As.append(A)
           i += 1
       if Xs:
           return list(zip(As, Xs)), labels

       # Try the features/views format
       for key in ["features", "views", "data"]:
           if key in mat:
               value = mat[key]
               try:
                   if isinstance(value, coo_matrix):
                       X = value.toarray()
                       A = np.eye(X.shape[0], dtype=np.float32)
                       return [(A, X)], labels
                   elif value.shape == (1,):
                       # Handle cell array format
                       for view in value[0]:
                           X = view.toarray() if issparse(view) else view
                           A = np.eye(X.shape[0], dtype=np.float32)
                           Xs.append(X)
                           As.append(A)
                   else:
                       # Handle matrix directly
                       X = value.toarray() if issparse(value) else value
                       A = np.eye(X.shape[0], dtype=np.float32)
                       Xs.append(X)
                       As.append(A)
                   if Xs:
                       return list(zip(As, Xs)), labels
               except Exception as e:
                   warnings.warn(f"Failed to process key '{key}': {str(e)}")
                   continue

       # Special case for the wiki.mat format with 'fea', 'W', and 'gnd' keys
       if "fea" in mat and "W" in mat:
           X = mat["fea"]
           A = mat["W"].astype(np.float32)
           Xs.append(X)
           As.append(A)
           if "gnd" in mat:
               labels = mat["gnd"].squeeze()
               if labels.ndim != 1:
                   labels = labels.ravel()
               if not isinstance(labels, np.ndarray):
                   labels = np.array(labels)
           return list(zip(As, Xs)), labels

       raise ValueError(
           "Unsupported .mat structure. Expected formats:\n"
           "1. X_0/A_0, X_1/A_1,...\n"
           "2. X1, X2,... (with identity adjacency)\n"
           "3. 'features' or 'views' key with data\n"
           "4. 'fea'/'W' (optionally 'gnd')"
       )


   def main():
       """Main function to run the visualization pipeline."""
       parser = argparse.ArgumentParser(
           description="Visualize multiview clustering results."
       )
       parser.add_argument(
           "--data_file",
           type=str,
           required=True,
           help="Path to the .mat multiview dataset"
       )
       parser.add_argument(
           "--no_labels",
           action="store_true",
           help="Run even if dataset has no ground truth labels"
       )
       args = parser.parse_args()

       # Configuration parameters
       temperature = 1.0
       beta = 1.0
       max_iter = 10
       tolerance = 1e-7

       # Load and preprocess data
       views, labels = load_custom_mat(args.data_file)

       if labels is None and not args.no_labels:
           raise ValueError(
               "Dataset must include 'labels' for visualization. "
               "Use --no_labels to run without ground truth."
           )

       # Process each view
       processed_views = []
       for A, X in views:
           # Convert to dense arrays if sparse
           if issparse(A):
               A = A.toarray()  # type: ignore
           if issparse(X):
               X = X.toarray()

           # Ensure proper dimensions
           A = np.asarray(A, dtype=np.float32)
           X = np.asarray(X, dtype=np.float32)

           if X.ndim == 1:
               X = X.reshape(-1, 1)
           if A.ndim != 2 or A.shape[0] != A.shape[1]:
               A = np.eye(X.shape[0], dtype=np.float32)

           # Preprocess
           norm_adj, feats = preprocess_dataset(A, X, tf_idf=False, beta=int(beta))  # noqa: E501
           if issparse(feats):
               feats = feats.toarray()
           processed_views.append((np.asarray(norm_adj), np.asarray(feats)))

       # Create feature matrices for each view
       Hs = []
       for S, X in processed_views:
           if X.ndim < 2:
               X = X.reshape(-1, 1)
           if S.ndim < 2:
               S = S.reshape(-1, 1)

           # Propagate features over the graph, then center them
           H = StandardScaler(with_std=False).fit_transform(S @ X)
           Hs.append(H)

       # Cluster the data
       k = len(np.unique(labels)) if labels is not None else 3
       model = LMGEC(
           n_clusters=k,
           embedding_dim=k + 1,
           temperature=temperature,
           max_iter=max_iter,
           tolerance=tolerance,
       )
       pred_labels = model.fit_predict(Hs)  # type: ignore

       # Visualize results
       X_concat = np.hstack([X for _, X in processed_views])
       visualize_clusters(
           X_concat, pred_labels, method='pca',
           title='Predicted Clusters (LMGEC)'
       )

       if labels is not None:
           visualize_clusters(
               X_concat, labels, method='pca',
               title='Ground Truth Clusters'
           )


   if __name__ == "__main__":
       # Suppress runtime warnings about imports
       warnings.filterwarnings("ignore", category=RuntimeWarning)
       main()

Tuning hyperparameters

   import argparse
   import itertools
   import os
   from typing import List, Tuple, Optional

   import numpy as np
   import matplotlib.pyplot as plt
   import seaborn as sns
   import pandas as pd

   from sklearn.preprocessing import StandardScaler
   from sklearn.metrics import normalized_mutual_info_score as nmi
   from sklearn.metrics import adjusted_rand_score as ari
   from scipy.io import loadmat

   from mvcluster.cluster.lmgec import LMGEC
   from mvcluster.utils.metrics import clustering_accuracy, clustering_f1_score
   from mvcluster.utils.preprocess import preprocess_dataset


   def load_custom_mat(path: str) -> Tuple[List[Tuple[np.ndarray, np.ndarray]], Optional[np.ndarray]]:  # noqa: E501
       """
       Load various possible .mat file formats with views and labels.

       Returns:
           views: list of (A, X) tuples
           labels: ndarray or None
       """
       from scipy.sparse import issparse

       mat = loadmat(path)
       Xs, As = [], []
       labels = None
       if "labels" in mat:
           labels = mat["labels"].squeeze()
       elif "label" in mat:
           labels = mat["label"].squeeze()
       if labels is not None and labels.ndim != 1:
           labels = labels.ravel()
       if labels is not None and not isinstance(labels, np.ndarray):
           labels = np.array(labels)

       i = 0
       while f"X_{i}" in mat and f"A_{i}" in mat:
           Xs.append(mat[f"X_{i}"])
           As.append(mat[f"A_{i}"].astype(np.float32))
           i += 1
       if Xs:
           return list(zip(As, Xs)), labels

       i = 1
       while f"X{i}" in mat:
           X = mat[f"X{i}"]
           A = np.eye(X.shape[0], dtype=np.float32)
           Xs.append(X)
           As.append(A)
           i += 1
       if Xs:
           return list(zip(As, Xs)), labels

       for key in ("features", "views"):
           if key in mat:
               value = mat[key]

               if issparse(value):
                   # Case: a single sparse matrix (one view)
                   A = np.eye(value.shape[0], dtype=np.float32)
                   return [(A, value)], labels

               if isinstance(value, np.ndarray) and value.ndim == 2:
                   # Case: a single dense matrix (one view)
                   A = np.eye(value.shape[0], dtype=np.float32)
                   return [(A, value)], labels

               try:
                   # Case: several views stored in an array of shape (1, n)
                   raw_views = value[0]
                   for view in raw_views:
                       if issparse(view):
                           view = view.tocsr()
                       A = np.eye(view.shape[0], dtype=np.float32)
                       Xs.append(view)
                       As.append(A)
                   return list(zip(As, Xs)), labels
               except Exception as e:
                   raise ValueError(f"Unsupported format under key '{key}': {e}")

       if "fea" in mat and "W" in mat:
           X = mat["fea"]
           A = mat["W"].astype(np.float32)
           Xs.append(X)
           As.append(A)
           if "gnd" in mat:
               labels = mat["gnd"].squeeze()
               if labels.ndim != 1:
                   labels = labels.ravel()
               if not isinstance(labels, np.ndarray):
                   labels = np.array(labels)
           return list(zip(As, Xs)), labels

       raise ValueError("Unsupported .mat file structure. Expected known keys.")


   def run_once(views, labels, dim, temp, beta, max_iter, tol):
       """
       Run a single LMGEC clustering evaluation and print detailed progress.

       Args:
           views (List[Tuple[np.ndarray, np.ndarray]]): List of (A, X) views.
           labels (np.ndarray): Ground truth cluster labels.
           dim (int): Embedding dimension.
           temp (float): Temperature parameter.
           beta (float): Graph regularization coefficient.
           max_iter (int): Maximum number of iterations.
           tol (float): Tolerance for convergence.

       Returns:
           dict: Dictionary of evaluation metrics.
       """
       if labels is None:
           raise ValueError("Ground truth labels are required.")

       views_proc = []
       print("\n[STEP] Preprocessing the views")
       for idx, (A, X) in enumerate(views):
           A_norm, X_proc = preprocess_dataset(A, X, beta=beta)
           if hasattr(X_proc, "toarray"):
               X_proc = X_proc.toarray()
           print(
               f"  → View {idx + 1}: A ({A.shape}), X ({X.shape}) → "
               f"A_norm ({A_norm.shape}), X_proc ({X_proc.shape})"
           )
           views_proc.append((A_norm, X_proc))

       print("\n[STEP] Computing the embeddings (H = S @ X)")
       Hs = []
       for idx, (S, X) in enumerate(views_proc):
           H = S @ X
           if isinstance(H, np.matrix):
               print(f"  [WARNING] View {idx + 1} is an np.matrix; converting to ndarray")  # noqa: E501
               H = np.asarray(H)
           H_scaled = StandardScaler(with_std=False).fit_transform(H)
           print(
               f"  → H_{idx + 1} = S @ X: {H.shape}, "
               f"after centering: {H_scaled.shape}"
           )
           Hs.append(H_scaled)

       print("\n[STEP] Training the LMGEC model")
       model = LMGEC(
           n_clusters=len(np.unique(labels)),
           embedding_dim=dim,
           temperature=temp,
           max_iter=max_iter,
           tolerance=tol,
       )
       model.fit(Hs)
       pred = model.labels_
       print(f"  → Clustering finished in {len(model.loss_history_)} iterations")

       metrics = {
           "acc": clustering_accuracy(labels, pred),
           "nmi": nmi(labels, pred),
           "ari": ari(labels, pred),
           "f1": clustering_f1_score(labels, pred, average="macro"),
       }
       print(
           f"[SCORE] ACC: {metrics['acc']:.4f}, "
           f"NMI: {metrics['nmi']:.4f}, "
           f"ARI: {metrics['ari']:.4f}, "
           f"F1: {metrics['f1']:.4f}"
       )

       return metrics


   def main(args):
       views, labels = load_custom_mat(args.data_file)
       if labels is None:
           raise ValueError("Labels not found in dataset.")
       if args.n_clusters != len(np.unique(labels)):
           print(
               f"[WARN] --n_clusters ({args.n_clusters}) ≠ number of unique labels ({len(np.unique(labels))})"  # noqa: E501
           )

       temperatures = [0.1, 0.5, 1.0, 2.0, 10.0, 20.0]
       betas = [1.0, 2.0]
       embedding_dims = [3, 4, 5]

       results = []
       for temp, beta, dim in itertools.product(temperatures, betas, embedding_dims):  # noqa: E501
           print("\n" + "=" * 60)
           print(f"[TEST] temperature={temp}, beta={beta}, dim={dim}")
           metrics = run_once(
               views,
               labels,
               dim=dim,
               temp=temp,
               beta=beta,
               max_iter=args.max_iter,
               tol=args.tolerance,
           )
           metrics.update(temperature=temp, beta=beta, embedding_dim=dim)
           results.append(metrics)

       df = pd.DataFrame(results)
       df.to_csv("hyperparam_results.csv", index=False)

       print("\n[TOP CONFIGS BY NMI]")
       print(df.sort_values("nmi", ascending=False).head())

       os.makedirs("plots", exist_ok=True)
       for metric in ("nmi", "ari", "acc", "f1"):
           plt.figure(figsize=(8, 5))
           sns.lineplot(
               data=df,
               x="temperature",
               y=metric,
               hue="embedding_dim",
               style="beta",
               markers=True,
           )
           plt.title(f"{metric.upper()} vs Temperature")
           plt.grid(True)
           plt.tight_layout()
           plt.savefig(f"plots/{metric}_vs_temperature.png")
           plt.close()


   if __name__ == "__main__":
       parser = argparse.ArgumentParser()
       parser.add_argument("--data_file", type=str, required=True)
       parser.add_argument("--n_clusters", type=int, required=True)
       parser.add_argument("--max_iter", type=int, default=50)
       parser.add_argument("--tolerance", type=float, default=1e-7)
       args = parser.parse_args()
       main(args)
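
Once the grid search finishes, hyperparam_results.csv can be mined directly; for instance, a small sketch that reports the best configuration per metric using the columns written above:

   import pandas as pd

   # Assumes hyperparam_results.csv was produced by the grid search above
   df = pd.read_csv("hyperparam_results.csv")

   # Best configuration per metric
   for metric in ("nmi", "ari", "acc", "f1"):
       best = df.loc[df[metric].idxmax()]
       print(f"best {metric.upper()}: {best[metric]:.4f} "
             f"(temperature={best['temperature']}, beta={best['beta']}, "
             f"dim={int(best['embedding_dim'])})")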

Benchmarking on a custom dataset

  1"""
  2[EN]
  3Benchmark the LMGEC clustering algorithm on a custom multi-view dataset
  4stored in .mat format.
  5
  6This script performs the following steps:
  7
  81. Load the multi-view dataset from a .mat file, where data is organized
  9   as pairs of adjacency matrices (A_i) and feature matrices (X_i) for
 10   each view, plus optional ground truth labels.
 11
 122. Preprocess each view by normalizing adjacency matrices and preparing
 13   feature matrices, converting sparse formats to dense if necessary.
 14
 153. Run the LMGEC clustering algorithm multiple times (specified by the
 16   'runs' parameter) with given hyperparameters, fitting the model on
 17   the preprocessed feature representations.
 18
 194. Evaluate clustering performance using metrics including Accuracy,
 20   Normalized Mutual Information (NMI), Adjusted Rand Index (ARI),
 21   F1 score, final loss value, and runtime.
 22
 235. Aggregate and print the average and standard deviation of these metrics
 24   over all runs to assess the algorithm’s stability and performance.
 25
 26Command-line arguments allow flexible configuration of the dataset path,
 27number of clusters, number of runs, and algorithm-specific hyperparameters
 28such as temperature, beta (preprocessing), maximum iterations, and
 29convergence tolerance.
 30
 31The script depends on external modules from the mvcluster package for the
 32LMGEC implementation, metrics, and preprocessing utilities.
 33
 34Usage example:
 35    python benchmark_custom_lmgec.py --data_file path/to/data.mat
 36    --n_clusters 3 --runs 5 --temperature 1.0 --beta 1.0
 37
 38[FR]
 39Évaluation de l'algorithme de clustering LMGEC sur un jeu de données
 40multi-vues personnalisé au format .mat.
 41
 42Ce script réalise les étapes suivantes :
 43
 441. Chargement du jeu de données multi-vues depuis un fichier .mat, où les
 45   données sont organisées en paires de matrices d’adjacence (A_i) et
 46   matrices de caractéristiques (X_i) pour chaque vue, ainsi que les
 47   étiquettes de vérité terrain optionnelles.
 48
 492. Prétraitement de chaque vue en normalisant les matrices d’adjacence et
 50   en préparant les matrices de caractéristiques, en convertissant les
 51   formats creux en denses si nécessaire.
 52
 533. Exécution de l’algorithme de clustering LMGEC plusieurs fois (paramètre
 54   'runs') avec les hyperparamètres spécifiés, en ajustant le modèle sur
 55   les représentations prétraitées.
 56
 574. Évaluation de la performance du clustering à l’aide de métriques telles
 58   que la précision (Accuracy), l’information mutuelle normalisée (NMI),
 59   l’indice de Rand ajusté (ARI), le score F1, la valeur finale de la perte,
 60   et le temps d’exécution.
 61
 625. Agrégation et affichage de la moyenne et de l’écart-type de ces métriques
 63   sur toutes les exécutions pour mesurer la stabilité et l’efficacité de
 64   l’algorithme.
 65
 66Les arguments en ligne de commande permettent de configurer le chemin du jeu
 67de données, le nombre de clusters, le nombre d’exécutions, ainsi que des
 68hyperparamètres spécifiques tels que la température, beta (prétraitement),
 69le nombre maximal d’itérations, et la tolérance de convergence.
 70
 71Le script dépend de modules externes du package mvcluster pour
 72l’implémentation de LMGEC, les métriques et les outils de prétraitement.
 73
 74Exemple d’utilisation :
 75    python benchmark_custom_lmgec.py --data_file chemin/vers/data.mat
 76    --n_clusters 3 --runs 5 --temperature 1.0 --beta 1.0
 77
 78"""


   import argparse
   import time
   import sys
   import os

   sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

   import numpy as np  # noqa: E402
   import scipy.io  # noqa: E402
   from sklearn.preprocessing import StandardScaler  # noqa: E402
   from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score  # noqa: E402, E501

   from mvcluster.cluster.lmgec import LMGEC  # noqa: E402
   from mvcluster.utils.metrics import clustering_accuracy, clustering_f1_score  # noqa: E402, E501
   from mvcluster.utils.preprocess import preprocess_dataset  # noqa: E402


   def load_custom_mat(path):
       """Load .mat file with keys: X_0, A_0, X_1, A_1, ..., labels."""
       mat = scipy.io.loadmat(path)
       Xs, As = [], []
       i = 0
       while f"X_{i}" in mat and f"A_{i}" in mat:
           Xs.append(mat[f"X_{i}"])
           As.append(mat[f"A_{i}"].astype(np.float32))
           i += 1
       labels = mat["labels"].squeeze() if "labels" in mat else None
       return As, Xs, labels


   def run_custom_lmgec_experiment(
       file_path,
       n_clusters,
       beta=1.0,
       temperature=1.0,
       max_iter=10,
       tolerance=1e-7,
       runs=5,
   ):
       As, Xs, labels = load_custom_mat(file_path)
       if labels is None:
           raise ValueError("Ground truth labels are required for metrics.")

       views = list(zip(As, Xs))
       for i, (A, X) in enumerate(views):
           norm_adj, feats = preprocess_dataset(A, X, beta=beta)
           if hasattr(feats, "toarray"):
               feats = feats.toarray()
           views[i] = (norm_adj, feats)

       metrics = {m: [] for m in ["acc", "nmi", "ari", "f1", "loss", "time"]}
       for _ in range(runs):
           start = time.time()
           Hs = [
               StandardScaler(with_std=False).fit_transform(S @ X)
               for S, X in views
           ]

           model = LMGEC(
               n_clusters=n_clusters,
               embedding_dim=n_clusters + 1,
               temperature=temperature,
               max_iter=max_iter,
               tolerance=tolerance,
           )
           model.fit(Hs)

           duration = time.time() - start
           preds = model.labels_

           metrics["time"].append(duration)
           metrics["acc"].append(clustering_accuracy(labels, preds))
           metrics["nmi"].append(
               normalized_mutual_info_score(labels, preds)  # type: ignore
           )
           metrics["ari"].append(adjusted_rand_score(labels, preds))
           metrics["f1"].append(
               clustering_f1_score(labels, preds, average="macro")  # type: ignore
           )
           metrics["loss"].append(model.loss_history_[-1])

       print("\n=== Averaged Metrics over", runs, "runs ===")
       for key in metrics:
           mean = np.mean(metrics[key])
           std = np.std(metrics[key])
           print(f"{key.upper()}: {mean:.4f} ± {std:.4f}")


   if __name__ == "__main__":
       parser = argparse.ArgumentParser(
           description="Benchmark LMGEC on a custom multi-view dataset"
       )
       parser.add_argument(
           "--data_file",
           type=str,
           required=True,
           help="Path to .mat file containing X_i, A_i, labels",
       )
       parser.add_argument(
           "--n_clusters",
           type=int,
           required=True,
           help="Number of clusters in ground truth",
       )
       parser.add_argument(
           "--runs", type=int, default=5, help="Number of runs to average metrics"
       )
       parser.add_argument(
           "--temperature",
           type=float,
           default=1.0,
           help="Temperature parameter for LMGEC",
       )
       parser.add_argument(
           "--beta", type=float, default=1.0,
           help="Beta for graph-feature preprocessing"
       )
       parser.add_argument("--max_iter", type=int, default=10)
       parser.add_argument("--tolerance", type=float, default=1e-7)

       args = parser.parse_args()

       run_custom_lmgec_experiment(
           file_path=args.data_file,
           n_clusters=args.n_clusters,
           beta=args.beta,
           temperature=args.temperature,
           max_iter=args.max_iter,
           tolerance=args.tolerance,
           runs=args.runs,
       )
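
The benchmark can also be driven from Python instead of the CLI by calling the function defined above; the .mat path here is illustrative and mirrors the prepare_custom_dataset.py example:

   # Programmatic use of the benchmark defined above; any file with
   # X_i/A_i/labels keys works.
   run_custom_lmgec_experiment(
       file_path="prepared_datasets/my_dataset.mat",
       n_clusters=3,
       runs=5,
       temperature=1.0,
       beta=1.0,
       max_iter=10,
       tolerance=1e-7,
   )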