Average-Link Hierarchical Agglomerative Clustering

This repository contains the Python implementation of the algorithms discussed in the paper "On the Cohesion and Separability of Average-Link for Hierarchical Agglomerative Clustering".

Repository Structure

    get_data.py: Script to retrieve, clean, and normalize the datasets used in the tests.
    metrics.py: Contains functions to compute the metrics defined in the paper.
    main.py: Executes the clustering algorithms on all datasets, computes the metrics, and saves the results in the results folder.
    summarizer.py: Summarizes the results generated by main.py, identifying the best method at each step.
    performance_analyzer.py: Script to analyze the performance of the different methods, providing comparisons based on the computed metrics.


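Prerequisites

Ensure you have the following third-party Python libraries installed:

    pandas
    numpy
    scikit-learn (imported as sklearn)
    matplotlib
    scipy
    ucimlrepo

The scripts also use the following modules, which ship with the Python standard library and need no installation:

    multiprocessing
    gzip
    zipfile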
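
As a quick sanity check before running anything, a small sketch like the one below reports which third-party packages cannot be imported. The helper name missing_packages is hypothetical and not part of this repository; the mapping only pairs each import name with its pip package name.

```python
from importlib.util import find_spec

# Import name -> pip package name (sklearn is distributed as scikit-learn).
THIRD_PARTY = {
    "pandas": "pandas",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "ucimlrepo": "ucimlrepo",
}


def missing_packages(packages=THIRD_PARTY):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All third-party dependencies are available.")
```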


Running the Scripts

Data Preparation:
    Execute get_data.py to download, clean, and normalize the datasets.

Computing Metrics:
    Run main.py to process all datasets, compute the specified metrics, and store the results in the results folder.

Summarizing Results:
    Use summarizer.py to analyze the results from main.py and identify the best method at each step.

Performance Analysis:
    Execute performance_analyzer.py to compare the performance of the different methods based on the computed metrics.
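
The four steps above could be chained from a single driver, as in the sketch below. It assumes that each script runs without command-line arguments and is launched from the repository root; the helper names build_commands and run_pipeline are hypothetical and not part of this repository.

```python
import subprocess
import sys

# Scripts in the order described above.
PIPELINE = ["get_data.py", "main.py", "summarizer.py", "performance_analyzer.py"]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Build one command per script, using the current Python interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in order, stopping at the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)


# Call run_pipeline() from the repository root to execute all four steps.
```

Because check=True raises on a non-zero exit code, a failure in get_data.py stops the run before main.py starts on incomplete data.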


File Descriptions

get_data.py:
    Fetches datasets from various sources.
    Cleans the data to remove any inconsistencies.
    Normalizes the data to ensure uniformity across all datasets.
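
The exact normalization used by get_data.py is not described here. As an illustration only, the sketch below assumes per-feature min-max scaling to the range [0, 1]; the function name min_max_normalize is hypothetical.

```python
import numpy as np


def min_max_normalize(X):
    """Scale each column of X to [0, 1]; constant columns become all zeros."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero for columns where every value is identical.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range
```

Scaling every feature to a common range keeps a single large-valued feature from dominating the distances used during clustering.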

metrics.py:
    Calculates cohesion and separability metrics as defined in the paper.
    Includes additional utility functions for metric computation.
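
The paper's exact metric definitions are not reproduced in this README. To give a feel for what a cohesion and a separability measure look like, the sketch below uses two generic stand-ins: the largest cluster diameter (lower means more cohesive) and the smallest distance between points in different clusters (higher means better separated). Both function names are hypothetical and may differ from what metrics.py provides.

```python
import numpy as np
from scipy.spatial.distance import cdist


def max_diameter(X, labels):
    """Largest pairwise distance between two points in the same cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return max(
        cdist(X[labels == c], X[labels == c]).max() for c in np.unique(labels)
    )


def min_spacing(X, labels):
    """Smallest distance between two points in different clusters.

    Requires at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    distances = cdist(X, X)
    different_cluster = labels[:, None] != labels[None, :]
    return distances[different_cluster].min()
```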

main.py:
    Iterates through all datasets.
    Applies hierarchical agglomerative clustering with the methods mentioned in the paper.
    Computes metrics for each dataset.
    Saves the computed metrics in the results folder for further analysis.
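
The clustering step can be sketched with SciPy as follows. This is not the code in main.py: the list of linkage methods, the Euclidean metric, and the function name cluster_with_methods are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_with_methods(X, k, methods=("average", "single", "complete")):
    """Return {method: labels} after cutting each dendrogram into at most k clusters."""
    X = np.asarray(X, dtype=float)
    results = {}
    for method in methods:
        # Build the full merge tree for this linkage criterion.
        Z = linkage(X, method=method, metric="euclidean")
        # Cut the tree so that no more than k flat clusters remain.
        results[method] = fcluster(Z, t=k, criterion="maxclust")
    return results
```

Since the linkage matrix encodes the whole hierarchy, the same Z can be cut at several values of k without recomputing the merges.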

summarizer.py:
    Loads the results from the results folder.
    Analyzes the metrics.
    Summarizes the performance of different methods, highlighting the best performing method at each step.
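
A minimal sketch of the "best method at each step" selection is shown below. It assumes a long-format table with hypothetical columns dataset, k, method, and value, and it treats a step as one number of clusters k; the actual layout produced by main.py may differ.

```python
import pandas as pd


def best_method_per_step(df, metric="value", lower_is_better=True):
    """Pick the best-scoring method for every (dataset, k) pair."""
    grouped = df.groupby(["dataset", "k"])[metric]
    # idxmin/idxmax return the row label of the best score in each group.
    best_rows = grouped.idxmin() if lower_is_better else grouped.idxmax()
    columns = ["dataset", "k", "method", metric]
    return df.loc[best_rows, columns].reset_index(drop=True)
```

Whether lower or higher is better depends on the metric, so the direction is left as a parameter.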

performance_analyzer.py:
    Runs performance tests to compare the methods.


Results

Results are stored in the results, results_l1, results_l2, and extra_results folders in CSV format for easy access and further analysis.
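
For further analysis, the CSV files can be gathered into a single table with pandas. The sketch below assumes only the folder names listed above; the file names and column layout inside each folder are not assumed, and the helper name load_results is hypothetical.

```python
from pathlib import Path

import pandas as pd

RESULT_DIRS = ["results", "results_l1", "results_l2", "extra_results"]


def load_results(base_dir=".", folders=RESULT_DIRS):
    """Read every CSV in the result folders into one DataFrame."""
    frames = []
    for folder in folders:
        folder_path = Path(base_dir) / folder
        if not folder_path.is_dir():
            continue  # Skip folders that have not been generated yet.
        for csv_path in sorted(folder_path.glob("*.csv")):
            frame = pd.read_csv(csv_path)
            # Record where each row came from.
            frame["source_folder"] = folder
            frame["source_file"] = csv_path.name
            frames.append(frame)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```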

