{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "earned-imaging",
   "metadata": {},
   "source": [
    "# Mallows Hamming\n",
    "### A python package for Mallows Model using Hamming distance with complete rankings \n",
    "&emsp;&emsp; **By Ahmed Boujaada and Ekhine Irurozki**\n",
    "\n",
    "See also\n",
    "> [1] E. Irurozki, B. Calvo, and J. Lozano. “Mallows and generalized Mallows model for matchings”. In: Bernoulli 25.2 (2019), pp. 1160–1188\n",
    "\n",
    "> [2] E. Irurozki, B. Calvo, and J. Lozano. “PerMallows: An R package for mallows and generalized mallows models”. In: Journal of Statistical Software 71 (2019)\n",
    "\n",
    "We present methods for working with the Mallows Model (MM), the best-known distribution for permutations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "accredited-welsh",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import mallows_hamming as mh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "royal-flour",
   "metadata": {},
   "source": [
    "## Hamming Distance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "serial-flashing",
   "metadata": {},
   "source": [
    "Let $\\sigma$ and $\\pi$ be two permutations. Hamming distance between $\\sigma$ and $\\pi$ counts the number of disagreements on both permutations. It's given by\n",
    "\n",
    "$$d(\\sigma, \\pi) = \\sum^n_{i=1} \\mathbb{I}[\\sigma(i)\\neq\\pi(i)]$$\n",
    "\n",
    "where $\\mathbb{I[\\cdot]}$ represents the Indicator function and $n$ is the length of permutations. Hamming distance between two permutations can be computed as follows: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "occasional-least",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "perm1 = np.array([3, 1, 2, 0, 4])\n",
    "perm2 = np.array([3, 1, 4, 2, 0])\n",
    "mh.distance(perm1, perm2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "radio-huntington",
   "metadata": {},
   "source": [
    "If only one permutation is given as input, it will be assumed that the second permutation is the identity permutation, $e = (1, 2, \\dotsc, n)$, $d(\\sigma) = d(\\sigma,e)$. In this case, it counts the number of *fixed points* on $\\sigma$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "heated-joyce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mh.distance(perm1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "designing-latvia",
   "metadata": {},
   "source": [
    "The maximum value of Hamming distance between two permutations of length $n$ is $n$ itself. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "standing-mambo",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n=5\n",
    "mh.dist_at_uniform(n)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "reflected-investigation",
   "metadata": {},
   "source": [
    "This package also allows to sample a permutation at a given Hamming distance where all the possible permutations have the same probability of being generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "pursuant-stable",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([4, 2, 1, 3, 0])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dist=2\n",
    "mh.sample_at_dist(n,dist, sigma0=[4,1,2,3,0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "third-ticket",
   "metadata": {},
   "source": [
    "*Remark:* Hamming distance between permutations is defined in $\\mathbb{N}^{[0,n]}\\backslash\\{1\\}$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "plain-omaha",
   "metadata": {},
   "source": [
    "## Mallows Model (MM) under the Hamming distance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "american-penguin",
   "metadata": {},
   "source": [
    "The probability mass function of a Mallows Model with central permutation $\\sigma_0$ and dispersion parameter $\\theta$ can be computed using the following function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "scheduled-bubble",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.010125860730934329"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sigma = np.array([3,1,2,0,4])\n",
    "sigma_0 = np.array(range(5))\n",
    "theta = 0.1\n",
    "\n",
    "mh.prob(sigma, sigma_0, theta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "partial-method",
   "metadata": {},
   "source": [
    "**Sampling** This package includes a sampler based on the factorization of Hamming  distance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "korean-robinson",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[4., 0., 2., 3., 1.],\n",
       "       [0., 1., 2., 3., 4.],\n",
       "       [2., 0., 1., 3., 4.],\n",
       "       [0., 1., 2., 3., 4.]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mh.sample(m=4,n=5,theta=1.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "configured-gravity",
   "metadata": {},
   "source": [
    "Note that in the package, the sampling functions generates the samples considering $\\sigma_0 = e$, identity permutation by default. But any other central permutation can be given as a parameter.  \n",
    "In practice, we can draw a sample from a MM as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "increasing-paper",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[4., 1., 2., 3., 0.],\n",
       "       [3., 2., 0., 1., 4.],\n",
       "       [0., 2., 4., 3., 1.],\n",
       "       [0., 3., 1., 4., 2.]])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mh.sample(m=4, n=5, theta=0.1, s0=np.array([4,3,2,1,0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "living-toddler",
   "metadata": {},
   "source": [
    "In this package, we can specify also the parameter `phi` instead of `theta`. This functionality holds for most functions. The sampling, for example, is done then as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "statistical-reviewer",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0., 3., 1., 2., 4.],\n",
       "       [2., 0., 3., 1., 4.],\n",
       "       [0., 4., 3., 1., 2.],\n",
       "       [0., 1., 2., 3., 4.]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mh.sample(m=4,n=5,phi=.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "thousand-catalyst",
   "metadata": {},
   "source": [
    "**Expected distance** The expected value of Hamming distance under the MM is given in [1,2] and is computed as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "interested-vector",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.9927710045244083"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "theta_mm = 0.7\n",
    "expected_dist = mh.expected_dist_mm(n, theta_mm)\n",
    "expected_dist"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "thorough-softball",
   "metadata": {},
   "source": [
    "**Learning** Given a sample distributed according to the MM, the maximum likelihood estimator MLE for the median is computed with the Hungarian algorithm in polynomial time. We provide the `median` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "naked-enzyme",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3, 2, 1, 0, 4])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_mm = mh.sample(m=40,n=5,phi=.3, s0=[3,2,1,0,4])\n",
    "mh.median(sample_mm)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
