Usage examples
Here are a few examples of how to use the mvcluster package.
Prepare a custom dataset
1"""
2[EN] prepare_custom_dataset.py - Final Version
3
4This script prepares heterogeneous multi-view (e.g., multi-omics) datasets
5for downstream tasks such as clustering or graph-based learning.
6
7It performs robust loading, preprocessing, normalization, graph
8construction, and saving of multiple data views into a unified .mat file
9format.
10
11==============================
12Main Functionalities
13==============================
14
151. Robust Data Loading
16----------------------
17- Loads CSV files using pandas.
18- Tries alternative encodings (utf-8, latin1, windows-1252) if standard
19 read fails.
20- If no valid columns are found, generates random fallback data to avoid
21 crashing.
22
232. View Preprocessing
24---------------------
25Each input view (CSV file) undergoes the following steps:
26- Categorical columns are converted to numerical using factorization.
27- Missing values are imputed using column-wise medians.
28- Views with fewer features than `--min_features` are automatically
29 augmented by duplicating existing columns.
30- If a view has more than 100 features, variance thresholding is applied
31 to remove low-variance columns.
32- Each view is standardized using `StandardScaler`.
33
343. Graph Construction
35---------------------
36- Constructs a symmetric K-Nearest Neighbors (KNN) graph for each view.
37- Graphs are binary (1/0 connectivity) and symmetric (A = (A + A.T) / 2).
38
394. Label Handling (Optional)
40----------------------------
41- If a label file is provided, it is loaded and encoded using
42 `LabelEncoder`.
43- Only labels matching the number of samples are retained.
44
455. Output Generation
46---------------------
47The final data is saved as a `.mat` file and includes:
48- Feature matrices: X_0, X_1, ..., one per view.
49- Adjacency matrices: A_0, A_1, ..., one per view.
50- View names.
51- Original shape information for each view.
52- Sample count.
53- Feature names (limited to selected columns).
54- Encoded labels (optional).
55
56==============================
57Command Line Arguments
58==============================
59--views : List of CSV files (one per view) [REQUIRED]
60--labels : (Optional) Path to CSV file with sample labels
61--data_name : Output filename (without extension) [REQUIRED]
62--k : Number of neighbors for KNN graph (default: 15)
63--min_features : Minimum number of features per view (default: 1)
64--output_dir : Output directory (default: prepared_datasets)
65
66==============================
67Typical Usage Example
68==============================
69python prepare_custom_dataset.py \
70 --views view1.csv view2.csv view3.csv \
71 --labels labels.csv \
72 --data_name my_dataset \
73 --k 15 \
74 --min_features 2 \
75 --output_dir prepared_datasets
76
77==============================
78Error Handling and Recommendations
79==============================
80- Views with <2 features may cause downstream errors with dimensionality
81 reduction (e.g., TruncatedSVD).
82- Use `--min_features 2` or manually exclude weak views.
83- Final `.mat` output is compatible with MATLAB and multi-view clustering
84 frameworks.
85
86==============================
87Output Example
88==============================
89View 1/5: transcriptomics
90transcriptomics: Selected 45/150 features
91Shape: (30, 45), Features: 45
92Loaded 3 label classes
93
94=== Successfully saved to prepared_datasets/my_dataset.mat ===
95Summary: 5 views, 30 samples
"""

import argparse
import numpy as np
import scipy.io
import pandas as pd
import os
import warnings
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import VarianceThreshold

# Show each warning only once; widen pandas display for debugging output
warnings.filterwarnings('once')
pd.set_option('display.max_columns', 10)


def robust_read_file(filepath: str) -> pd.DataFrame:
    """Read a data file with multiple fallback strategies."""
    try:
        df = pd.read_csv(filepath, header=0, index_col=None)

        if df.shape[1] == 0:
            encodings = ['utf-8', 'latin1', 'windows-1252']
            for enc in encodings:
                try:
                    df = pd.read_csv(filepath, encoding=enc)
                    if df.shape[1] > 0:
                        break
                except Exception:
                    continue

        if df.shape[1] == 0:
            raise ValueError("No columns detected")

        return df

    except Exception as e:
        warnings.warn(f"Failed to read {filepath}: {str(e)}")
        # Fallback: a random single-feature view so the pipeline keeps going
        return pd.DataFrame({'feature': np.random.rand(30)})


def preprocess_view(df: pd.DataFrame, view_name: str,
                    min_features: int) -> np.ndarray:
    """Preprocess a single view."""
    # Convert categorical columns to integer codes
    cat_cols = df.select_dtypes(exclude=np.number).columns
    for col in cat_cols:
        df[col] = pd.factorize(df[col])[0]

    # Impute missing values with column-wise medians
    if df.isna().any().any():
        df = df.fillna(df.median())

    X = df.values.astype(np.float32)

    # Pad narrow views by duplicating the first column
    if X.shape[1] < min_features:
        warnings.warn(
            f"Augmenting {view_name} from {X.shape[1]} "
            f"to {min_features} features"
        )
        X = np.hstack([X] + [X[:, [0]]] * (min_features - X.shape[1]))

    # Drop low-variance columns from very wide views
    if X.shape[1] > 100:
        selector = VarianceThreshold(threshold=0.1)
        try:
            X = selector.fit_transform(X)
            print(
                f"{view_name}: Selected {X.shape[1]}/"
                f"{selector.n_features_in_} features"
            )
        except Exception as e:
            print(f"Feature selection failed for {view_name}: {str(e)}")

    if X.shape[0] > 1:
        X = StandardScaler().fit_transform(X)

    return X


def save_heterogeneous_data(output_path: str, data: dict):
    """Specialized saver for heterogeneous data."""
    save_data = {}
    for i, (x, a) in enumerate(zip(data['Xs'], data['As'])):
        save_data[f'X_{i}'] = x
        save_data[f'A_{i}'] = a

    save_data.update({
        'view_names': np.array(data['view_names'], dtype=object),
        'n_samples': data['n_samples'],
        'original_shapes': np.array(
            [x.shape for x in data['Xs']], dtype=object
        )
    })

    # Save per-view feature names, as documented in the module docstring
    if 'feature_names' in data:
        save_data['feature_names'] = np.array(
            data['feature_names'], dtype=object
        )

    if 'labels' in data:
        save_data['labels'] = data['labels']

    scipy.io.savemat(output_path, save_data)


def main():
    parser = argparse.ArgumentParser(
        description="Multi-omics data preprocessor"
    )
    parser.add_argument("--views", nargs="+", required=True,
                        help="Input files")
    parser.add_argument("--labels", help="Label file")
    parser.add_argument("--data_name", required=True, help="Output name")
    parser.add_argument("--k", type=int, default=10,
                        help="k for KNN graph")
    parser.add_argument("--min_features", type=int, default=2,
                        help="Min features")
    parser.add_argument("--output_dir", default="prepared_datasets",
                        help="Output dir")

    args = parser.parse_args()
    os.makedirs(args.output_dir, exist_ok=True)
    output_path = os.path.join(args.output_dir,
                               f"{args.data_name}.mat")

    view_data = []
    print("\n=== Processing Views ===")

    for i, view_path in enumerate(args.views):
        view_name = os.path.splitext(os.path.basename(view_path))[0]
        print(f"\nView {i + 1}/{len(args.views)}: {view_name}")

        try:
            df = robust_read_file(view_path)
            X = preprocess_view(df, view_name, args.min_features)

            print(f"\n>>> First 10 rows of {view_name} after preprocessing:")
            print(pd.DataFrame(X).head(10))

            # Symmetric binary KNN graph for this view
            A = kneighbors_graph(X, n_neighbors=args.k,
                                 mode='connectivity')
            A = 0.5 * (A + A.T)  # Symmetrize
            A.data[:] = 1  # Binary weights

            view_data.append({
                'X': X,
                'A': A,
                'name': view_name,
                'features': df.columns.tolist()[:X.shape[1]]
            })

            print(f" Shape: {X.shape}, "
                  f"Features: {len(view_data[-1]['features'])}")

        except Exception as e:
            warnings.warn(f"Failed to process {view_name}: {str(e)}")
            continue

    results = {
        'Xs': [vd['X'] for vd in view_data],
        'As': [vd['A'] for vd in view_data],
        'view_names': [vd['name'] for vd in view_data],
        'n_samples': view_data[0]['X'].shape[0] if view_data else 0,
        'feature_names': [vd['features'] for vd in view_data]
    }

    if args.labels and os.path.exists(args.labels):
        try:
            labels = pd.read_csv(args.labels).squeeze()
            if len(labels) == results['n_samples']:
                results['labels'] = LabelEncoder().fit_transform(labels)
                print(f"\nLoaded {len(np.unique(results['labels']))} "
                      "label classes")
        except Exception as e:
            warnings.warn(f"Label loading failed: {str(e)}")

    try:
        save_heterogeneous_data(output_path, results)
        print(f"\n=== Successfully saved to {output_path} ===")
        print(f"Summary: {len(view_data)} views, "
              f"{results['n_samples']} samples")
    except Exception as e:
        print(f"\n!!! Final save failed: {str(e)}")
        print("Possible solutions:")
        print("1. Install hdf5storage: pip install hdf5storage")
        print("2. Reduce feature dimensions using PCA")
        print("3. Save in a different format (e.g., HDF5)")


if __name__ == "__main__":
    main()
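
Once the script has run, the saved file can be inspected directly with scipy. Below is a minimal sketch, assuming the my_dataset.mat output from the usage example above; the keys follow the output layout documented in the docstring (X_0/A_0 per view, view_names, optional labels).

# Sketch: inspect a file produced by prepare_custom_dataset.py.
import scipy.io

mat = scipy.io.loadmat("prepared_datasets/my_dataset.mat")
n_views = sum(1 for key in mat if key.startswith("X_"))
for i in range(n_views):
    print(f"View {i}: X {mat[f'X_{i}'].shape}, A {mat[f'A_{i}'].shape}")
print("View names:", mat["view_names"])
if "labels" in mat:
    print("First labels:", mat["labels"].squeeze()[:10])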
Compare methods
1"""
2compare_methods.py
3
4Compares multiple multiview clustering algorithms
5on the same dataset using clustering metrics (NMI, ARI, ACC).
6
7Steps:
8 1. Load and preprocess a multi-view dataset.
9 2. Apply multiple clustering algorithms to generate labels.
10 3. Compute and display evaluation metrics.
11 4. Optionally visualize clusters from each method.
12
13Usage:
14 python compare_methods.py
15
16Dependencies:
17 - mvclustlib.algorithms.*
18 - mvclustlib.utils.metrics
19 - mvclustlib.utils.plot
20"""
Evaluate with metrics
1"""
2evaluate_with_metrics.py
3
4Computes clustering quality metrics (NMI, ARI, ACC) for a selected multiview
5clustering algorithm on a benchmark dataset.
6
7Steps:
8 1. Run a clustering method on a dataset.
9 2. Compare predicted labels against ground truth.
10 3. Compute and print evaluation metrics.
11
12Usage:
13 python evaluate_with_metrics.py
14
15Dependencies:
16 - mvclustlib.algorithms.lmgec
17 - mvclustlib.utils.metrics
18"""
Visualize clusters
1"""
2[EN]
3This script loads and visualizes multi-view clustering results from custom
4multi-view datasets stored in .mat files. It supports various common .mat file
5formats for multi-view data with adjacency and feature matrices, optionally
6including ground truth cluster labels.
7
8Main features and workflow:
9
101. Data Loading:
11 - Supports .mat formats with keys such as 'X_i'/'A_i', 'X1', 'features',
12 'views', and special cases like 'fea', 'W', and 'gnd'.
13 - Handles sparse and dense matrices and converts them as needed.
14 - Returns a list of (adjacency matrix, feature matrix) tuples for each view,
15 along with optional ground truth labels.
16
172. Data Preprocessing:
18 - Normalizes adjacency matrices and preprocesses feature matrices.
19 - Supports tf-idf option disabled here and beta parameter usage.
20 - Converts sparse matrices to dense format where necessary.
21
223. Clustering:
23 - Uses the LMGEC (Localized Multi-View Graph Embedding Clustering) model
24 for clustering.
25 - Automatically determines the number of clusters from labels or defaults
26 to 3 if no labels are provided.
27 - Embedding dimension is set as clusters + 1.
28
294. Visualization:
30 - Visualizes predicted clusters and, if available, ground truth clusters.
31 - Uses PCA for dimensionality reduction before plotting.
32
335. Command-Line Interface:
34 - Requires a path to the .mat dataset.
35 - Optional flag to run without ground truth labels.
36
37Dependencies:
38- mvcluster package (cluster, utils.plot, utils.preprocess modules)
39- numpy, scipy, scikit-learn, argparse, warnings
40
41Usage example:
42 python visualize_mvclusters.py --data_file path/to/data.mat
43 python visualize_mvclusters.py --data_file path/to/data.mat --no_labels
44
"""

import argparse
import os
import sys
import numpy as np
import warnings
from sklearn.preprocessing import StandardScaler
from scipy.io import loadmat
from scipy.sparse import issparse, coo_matrix

# Add the parent directory to the import path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

try:
    from mvcluster.cluster import LMGEC
    from mvcluster.utils.plot import visualize_clusters
    from mvcluster.utils.preprocess import preprocess_dataset
except ImportError as e:
    raise ImportError(f"Failed to import required modules: {e}")


def load_custom_mat(path):
    """
    Load a .mat file supporting multiple multi-view formats.

    Args:
        path (str): Path to the .mat file.

    Returns:
        tuple: (list of (A, X) tuples, labels array or None)

    Raises:
        ValueError: If the file structure is unsupported.
    """
    mat = loadmat(path)
    Xs, As = [], []

    # Try to get labels (optional)
    labels = None
    for label_key in ['labels', 'label', 'gt', 'ground_truth']:
        if label_key in mat:
            labels = mat[label_key].squeeze()
            break

    # Try the X_0/A_0 format
    i = 0
    while f"X_{i}" in mat and f"A_{i}" in mat:
        X = mat[f"X_{i}"]
        A = mat[f"A_{i}"].astype(np.float32)
        if issparse(X):
            X = X.toarray()
        if issparse(A):
            A = A.toarray()
        Xs.append(X)
        As.append(A)
        i += 1
    if Xs:
        return list(zip(As, Xs)), labels

    # Try the X1, X2, ... format (with identity adjacency)
    i = 1
    while f"X{i}" in mat:
        X = mat[f"X{i}"]
        if issparse(X):
            X = X.toarray()
        A = np.eye(X.shape[0], dtype=np.float32)
        Xs.append(X)
        As.append(A)
        i += 1
    if Xs:
        return list(zip(As, Xs)), labels

    # Try the features/views format
    for key in ["features", "views", "data"]:
        if key in mat:
            value = mat[key]
            try:
                if isinstance(value, coo_matrix):
                    X = value.toarray()
                    A = np.eye(X.shape[0], dtype=np.float32)
                    return [(A, X)], labels
                elif value.shape == (1,):
                    # Handle cell array format
                    for view in value[0]:
                        X = view.toarray() if issparse(view) else view
                        A = np.eye(X.shape[0], dtype=np.float32)
                        Xs.append(X)
                        As.append(A)
                else:
                    # Handle a matrix directly
                    X = value.toarray() if issparse(value) else value
                    A = np.eye(X.shape[0], dtype=np.float32)
                    Xs.append(X)
                    As.append(A)
                if Xs:
                    return list(zip(As, Xs)), labels
            except Exception as e:
                warnings.warn(f"Failed to process key '{key}': {str(e)}")
                continue

    # Special case for wiki.mat-style files with 'fea', 'W', and 'gnd' keys
    if "fea" in mat and "W" in mat:
        X = mat["fea"]
        A = mat["W"].astype(np.float32)
        Xs.append(X)
        As.append(A)
        if "gnd" in mat:
            labels = np.asarray(mat["gnd"]).ravel()
        return list(zip(As, Xs)), labels

    raise ValueError(
        "Unsupported .mat structure. Expected formats:\n"
        "1. X_0/A_0, X_1/A_1, ...\n"
        "2. X1, X2, ... (with identity adjacency)\n"
        "3. 'features' or 'views' key with data\n"
        "4. 'fea'/'W' (optionally 'gnd')"
    )


def main():
    """Run the visualization pipeline."""
    parser = argparse.ArgumentParser(
        description="Visualize multi-view clustering results."
    )
    parser.add_argument(
        "--data_file",
        type=str,
        required=True,
        help="Path to the .mat multi-view dataset"
    )
    parser.add_argument(
        "--no_labels",
        action="store_true",
        help="Run even if dataset has no ground truth labels"
    )
    args = parser.parse_args()

    # Configuration parameters
    temperature = 1.0
    beta = 1.0
    max_iter = 10
    tolerance = 1e-7

    # Load and preprocess data
    views, labels = load_custom_mat(args.data_file)

    if labels is None and not args.no_labels:
        raise ValueError(
            "Dataset must include 'labels' for visualization. "
            "Use --no_labels to run without ground truth."
        )

    # Process each view
    processed_views = []
    for A, X in views:
        # Convert sparse inputs to dense arrays
        if issparse(A):
            A = A.toarray()
        if issparse(X):
            X = X.toarray()

        # Ensure proper dimensions
        A = np.asarray(A, dtype=np.float32)
        X = np.asarray(X, dtype=np.float32)

        if X.ndim == 1:
            X = X.reshape(-1, 1)
        if A.ndim != 2 or A.shape[0] != A.shape[1]:
            A = np.eye(X.shape[0], dtype=np.float32)

        # Normalize the adjacency matrix and preprocess the features
        norm_adj, feats = preprocess_dataset(A, X, tf_idf=False,
                                             beta=int(beta))
        if issparse(feats):
            feats = feats.toarray()
        processed_views.append((np.asarray(norm_adj), np.asarray(feats)))

    # Build one propagated, centered feature matrix per view (H = S @ X)
    Hs = []
    for S, X in processed_views:
        if X.ndim < 2:
            X = X.reshape(-1, 1)
        if S.ndim < 2:
            S = S.reshape(-1, 1)

        H = StandardScaler(with_std=False).fit_transform(S @ X)
        Hs.append(H)

    # Cluster the data
    k = len(np.unique(labels)) if labels is not None else 3
    model = LMGEC(
        n_clusters=k,
        embedding_dim=k + 1,
        temperature=temperature,
        max_iter=max_iter,
        tolerance=tolerance,
    )
    pred_labels = model.fit_predict(Hs)

    # Visualize results
    X_concat = np.hstack([X for _, X in processed_views])
    visualize_clusters(
        X_concat, pred_labels, method='pca',
        title='Predicted Clusters (LMGEC)'
    )

    if labels is not None:
        visualize_clusters(
            X_concat, labels, method='pca',
            title='Ground Truth Clusters'
        )


if __name__ == "__main__":
    # Suppress runtime warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    main()
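
If mvcluster.utils.plot is not available, a rough stand-in for visualize_clusters can be sketched with scikit-learn and matplotlib, assuming the function simply projects the features to two dimensions with PCA and colors points by cluster label (the real implementation may differ):

# Hypothetical stand-in for mvcluster.utils.plot.visualize_clusters.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def visualize_clusters(X, labels, method='pca', title=''):
    assert method == 'pca'  # only the PCA path is sketched here
    coords = PCA(n_components=2).fit_transform(X)
    plt.figure(figsize=(6, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=15)
    plt.title(title)
    plt.tight_layout()
    plt.show()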
Tune hyperparameters
import argparse
import itertools
import os
from typing import List, Tuple, Optional

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import normalized_mutual_info_score as nmi
from sklearn.metrics import adjusted_rand_score as ari
from scipy.io import loadmat

from mvcluster.cluster.lmgec import LMGEC
from mvcluster.utils.metrics import clustering_accuracy, clustering_f1_score
from mvcluster.utils.preprocess import preprocess_dataset


def load_custom_mat(path: str) -> Tuple[List[Tuple[np.ndarray, np.ndarray]], Optional[np.ndarray]]:  # noqa: E501
    """
    Load various possible .mat file formats with views and labels.

    Returns:
        views: list of (A, X) tuples
        labels: ndarray or None
    """
    from scipy.sparse import issparse

    mat = loadmat(path)
    Xs, As = [], []
    labels = None
    if "labels" in mat:
        labels = mat["labels"].squeeze()
    elif "label" in mat:
        labels = mat["label"].squeeze()
    if labels is not None:
        labels = np.asarray(labels).ravel()

    # Format 1: X_0/A_0, X_1/A_1, ...
    i = 0
    while f"X_{i}" in mat and f"A_{i}" in mat:
        Xs.append(mat[f"X_{i}"])
        As.append(mat[f"A_{i}"].astype(np.float32))
        i += 1
    if Xs:
        return list(zip(As, Xs)), labels

    # Format 2: X1, X2, ... with identity adjacency
    i = 1
    while f"X{i}" in mat:
        X = mat[f"X{i}"]
        A = np.eye(X.shape[0], dtype=np.float32)
        Xs.append(X)
        As.append(A)
        i += 1
    if Xs:
        return list(zip(As, Xs)), labels

    # Format 3: a 'features' or 'views' key
    for key in ("features", "views"):
        if key in mat:
            value = mat[key]

            if issparse(value):
                # Case: a single sparse matrix (one view)
                A = np.eye(value.shape[0], dtype=np.float32)
                return [(A, value)], labels

            if isinstance(value, np.ndarray) and value.ndim == 2:
                # Case: a single dense matrix (one view)
                A = np.eye(value.shape[0], dtype=np.float32)
                return [(A, value)], labels

            try:
                # Case: several views stored in an array of shape (1, n)
                raw_views = value[0]
                for view in raw_views:
                    if issparse(view):
                        view = view.tocsr()
                    A = np.eye(view.shape[0], dtype=np.float32)
                    Xs.append(view)
                    As.append(A)
                return list(zip(As, Xs)), labels
            except Exception as e:
                raise ValueError(f"Unsupported format under key '{key}': {e}")

    # Format 4: wiki.mat-style 'fea'/'W' keys, with optional 'gnd' labels
    if "fea" in mat and "W" in mat:
        X = mat["fea"]
        A = mat["W"].astype(np.float32)
        Xs.append(X)
        As.append(A)
        if "gnd" in mat:
            labels = np.asarray(mat["gnd"]).ravel()
        return list(zip(As, Xs)), labels

    raise ValueError("Unsupported .mat file structure. Expected known keys.")


def run_once(views, labels, dim, temp, beta, max_iter, tol):
    """
    Run a single LMGEC clustering evaluation with detailed output.

    Args:
        views (List[Tuple[np.ndarray, np.ndarray]]): List of (A, X) views.
        labels (np.ndarray): Ground truth cluster labels.
        dim (int): Embedding dimension.
        temp (float): Temperature parameter.
        beta (float): Graph regularization coefficient.
        max_iter (int): Maximum number of iterations.
        tol (float): Tolerance for convergence.

    Returns:
        dict: Dictionary of evaluation metrics.
    """
    if labels is None:
        raise ValueError("Ground truth labels are required.")

    views_proc = []
    print("\n[STEP] Preprocessing the views")
    for idx, (A, X) in enumerate(views):
        A_norm, X_proc = preprocess_dataset(A, X, beta=beta)
        if hasattr(X_proc, "toarray"):
            X_proc = X_proc.toarray()
        print(
            f" → View {idx + 1}: A ({A.shape}), X ({X.shape}) → "
            f"A_norm ({A_norm.shape}), X_proc ({X_proc.shape})"
        )
        views_proc.append((A_norm, X_proc))

    print("\n[STEP] Computing the embeddings (H = S @ X)")
    Hs = []
    for idx, (S, X) in enumerate(views_proc):
        H = S @ X
        if isinstance(H, np.matrix):
            print(f" [WARNING] View {idx + 1} is an np.matrix; converting to ndarray")  # noqa: E501
            H = np.asarray(H)
        H_scaled = StandardScaler(with_std=False).fit_transform(H)
        print(
            f" → H_{idx + 1} = S @ X: {H.shape}, "
            f"after centering: {H_scaled.shape}"
        )
        Hs.append(H_scaled)

    print("\n[STEP] Training the LMGEC model")
    model = LMGEC(
        n_clusters=len(np.unique(labels)),
        embedding_dim=dim,
        temperature=temp,
        max_iter=max_iter,
        tolerance=tol,
    )
    model.fit(Hs)
    pred = model.labels_
    print(f" → Clustering finished in {len(model.loss_history_)} iterations")

    metrics = {
        "acc": clustering_accuracy(labels, pred),
        "nmi": nmi(labels, pred),
        "ari": ari(labels, pred),
        "f1": clustering_f1_score(labels, pred, average="macro"),
    }
    print(
        f"[SCORE] ACC: {metrics['acc']:.4f}, "
        f"NMI: {metrics['nmi']:.4f}, "
        f"ARI: {metrics['ari']:.4f}, "
        f"F1: {metrics['f1']:.4f}"
    )

    return metrics


def main(args):
    views, labels = load_custom_mat(args.data_file)
    if labels is None:
        raise ValueError("Labels not found in dataset.")
    if args.n_clusters != len(np.unique(labels)):
        print(
            f"[WARN] --n_clusters ({args.n_clusters}) != number of "
            f"unique labels ({len(np.unique(labels))})"
        )

    temperatures = [0.1, 0.5, 1.0, 2.0, 10.0, 20.0]
    betas = [1.0, 2.0]
    embedding_dims = [3, 4, 5]

    results = []
    for temp, beta, dim in itertools.product(temperatures, betas, embedding_dims):  # noqa: E501
        print("\n" + "=" * 60)
        print(f"[TEST] temperature={temp}, beta={beta}, dim={dim}")
        metrics = run_once(
            views,
            labels,
            dim=dim,
            temp=temp,
            beta=beta,
            max_iter=args.max_iter,
            tol=args.tolerance,
        )
        metrics.update(temperature=temp, beta=beta, embedding_dim=dim)
        results.append(metrics)

    df = pd.DataFrame(results)
    df.to_csv("hyperparam_results.csv", index=False)

    print("\n[TOP CONFIGS BY NMI]")
    print(df.sort_values("nmi", ascending=False).head())

    os.makedirs("plots", exist_ok=True)
    for metric in ("nmi", "ari", "acc", "f1"):
        plt.figure(figsize=(8, 5))
        sns.lineplot(
            data=df,
            x="temperature",
            y=metric,
            hue="embedding_dim",
            style="beta",
            markers=True,
        )
        plt.title(f"{metric.upper()} vs temperature")
        plt.grid(True)
        plt.tight_layout()
        plt.savefig(f"plots/{metric}_vs_temperature.png")
        plt.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_file", type=str, required=True)
    parser.add_argument("--n_clusters", type=int, required=True)
    parser.add_argument("--max_iter", type=int, default=50)
    parser.add_argument("--tolerance", type=float, default=1e-7)
    args = parser.parse_args()
    main(args)
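
The sweep writes its results to hyperparam_results.csv, which can be summarized beyond the per-metric plots. A small sketch, assuming the CSV produced by the script above:

# Sketch: summarize the sweep output written by the script above.
import pandas as pd

df = pd.read_csv("hyperparam_results.csv")
# Mean NMI per (temperature, embedding_dim), averaged over beta
pivot = df.pivot_table(index="temperature", columns="embedding_dim",
                       values="nmi", aggfunc="mean")
print(pivot.round(3))
best = df.loc[df["nmi"].idxmax()]
print("Best config:", best[["temperature", "beta", "embedding_dim", "nmi"]])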
Custom benchmark
1"""
2[EN]
3Benchmark the LMGEC clustering algorithm on a custom multi-view dataset
4stored in .mat format.
5
6This script performs the following steps:
7
81. Load the multi-view dataset from a .mat file, where data is organized
9 as pairs of adjacency matrices (A_i) and feature matrices (X_i) for
10 each view, plus optional ground truth labels.
11
122. Preprocess each view by normalizing adjacency matrices and preparing
13 feature matrices, converting sparse formats to dense if necessary.
14
153. Run the LMGEC clustering algorithm multiple times (specified by the
16 'runs' parameter) with given hyperparameters, fitting the model on
17 the preprocessed feature representations.
18
194. Evaluate clustering performance using metrics including Accuracy,
20 Normalized Mutual Information (NMI), Adjusted Rand Index (ARI),
21 F1 score, final loss value, and runtime.
22
235. Aggregate and print the average and standard deviation of these metrics
24 over all runs to assess the algorithm’s stability and performance.
25
26Command-line arguments allow flexible configuration of the dataset path,
27number of clusters, number of runs, and algorithm-specific hyperparameters
28such as temperature, beta (preprocessing), maximum iterations, and
29convergence tolerance.
30
31The script depends on external modules from the mvcluster package for the
32LMGEC implementation, metrics, and preprocessing utilities.
33
34Usage example:
35 python benchmark_custom_lmgec.py --data_file path/to/data.mat
36 --n_clusters 3 --runs 5 --temperature 1.0 --beta 1.0
37
"""

import argparse
import time
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import numpy as np  # noqa: E402
import scipy.io  # noqa: E402
from sklearn.preprocessing import StandardScaler  # noqa: E402
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score  # noqa: E402, E501

from mvcluster.cluster.lmgec import LMGEC  # noqa: E402
from mvcluster.utils.metrics import clustering_accuracy, clustering_f1_score  # noqa: E402, E501
from mvcluster.utils.preprocess import preprocess_dataset  # noqa: E402


def load_custom_mat(path):
    """Load a .mat file with keys X_0, A_0, X_1, A_1, ..., labels."""
    mat = scipy.io.loadmat(path)
    Xs, As = [], []
    i = 0
    while f"X_{i}" in mat and f"A_{i}" in mat:
        Xs.append(mat[f"X_{i}"])
        As.append(mat[f"A_{i}"].astype(np.float32))
        i += 1
    labels = mat["labels"].squeeze() if "labels" in mat else None
    return As, Xs, labels


def run_custom_lmgec_experiment(
    file_path,
    n_clusters,
    beta=1.0,
    temperature=1.0,
    max_iter=10,
    tolerance=1e-7,
    runs=5,
):
    As, Xs, labels = load_custom_mat(file_path)
    views = list(zip(As, Xs))
    for i, (A, X) in enumerate(views):
        norm_adj, feats = preprocess_dataset(A, X, beta=beta)
        if hasattr(feats, "toarray"):
            feats = feats.toarray()
        views[i] = (norm_adj, feats)

    metrics = {m: [] for m in ["acc", "nmi", "ari", "f1", "loss", "time"]}
    for _ in range(runs):
        start = time.time()
        # Propagate features through each normalized graph and center them
        Hs = [
            StandardScaler(with_std=False).fit_transform(S @ X)
            for S, X in views
        ]

        model = LMGEC(
            n_clusters=n_clusters,
            embedding_dim=n_clusters + 1,
            temperature=temperature,
            max_iter=max_iter,
            tolerance=tolerance,
        )
        model.fit(Hs)

        duration = time.time() - start
        preds = model.labels_

        metrics["time"].append(duration)
        metrics["acc"].append(clustering_accuracy(labels, preds))
        metrics["nmi"].append(normalized_mutual_info_score(labels, preds))
        metrics["ari"].append(adjusted_rand_score(labels, preds))
        metrics["f1"].append(
            clustering_f1_score(labels, preds, average="macro")
        )
        metrics["loss"].append(model.loss_history_[-1])

    print("\n=== Averaged Metrics over", runs, "runs ===")
    for key in metrics:
        mean = np.mean(metrics[key])
        std = np.std(metrics[key])
        print(f"{key.upper()}: {mean:.4f} ± {std:.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark LMGEC on a custom multi-view dataset"
    )
    parser.add_argument(
        "--data_file",
        type=str,
        required=True,
        help="Path to .mat file containing X_i, A_i, labels",
    )
    parser.add_argument(
        "--n_clusters",
        type=int,
        required=True,
        help="Number of clusters in ground truth",
    )
    parser.add_argument(
        "--runs", type=int, default=5,
        help="Number of runs to average metrics"
    )
    parser.add_argument(
        "--temperature",
        type=float,
        default=1.0,
        help="Temperature parameter for LMGEC",
    )
    parser.add_argument(
        "--beta", type=float, default=1.0,
        help="Beta for graph-feature preprocessing"
    )
    parser.add_argument("--max_iter", type=int, default=10)
    parser.add_argument("--tolerance", type=float, default=1e-7)

    args = parser.parse_args()

    run_custom_lmgec_experiment(
        file_path=args.data_file,
        n_clusters=args.n_clusters,
        beta=args.beta,
        temperature=args.temperature,
        max_iter=args.max_iter,
        tolerance=args.tolerance,
        runs=args.runs,
    )