{
  "Selected_candidate": {
    "pr_number": 219,
    "pr_title": "Fix concat str truncation",
    "pr_body": "Fixes #217.\n\nI also took the opportunity to add two small optimizations, which add up to make `Variable.concat` about 35% faster.\n",
    "issue_id": 217,
    "issue_title": "Strings are truncated when concatenating Datasets.",
    "issue_body": "When concatenating Datasets, a variable's string length is limited to the length in the first of the Datasets being concatenated.\n\n```\n>>> import xray\n>>> first = xray.Dataset({'animal': ('animal', ['horse'])})\n>>> second = xray.Dataset( {'animal': ('animal', ['aardvark_0'])})\n>>> xray.Dataset.concat([first, second], dimension='animal')['animal']\n<xray.DataArray 'animal' (animal: 2)>\narray(['horse', 'aardv'], \n      dtype='|S5')\nCoordinates:\n    animal: Index([u'horse', u'aardv'], dtype='object')\nAttributes:\n    Empty\n```\n\n(Note the `|S5` dtype and the truncated `aardv`)\n\nI think this is the offending line: https://github.com/xray/xray/blob/master/xray/core/variable.py#L623\nMay want to use `dtype=object` for strings to avoid this issue.\n",
    "issue_closed_at": "2014-08-21T05:17:28Z",
    "base_commit": "4a9f283fdb2b4c7588a8ca373e9f3cb9af401bf4",
    "changes": [
      {
        "file": "xray/core/variable.py",
        "type": "function",
        "name": "concat",
        "class_name": "Variable",
        "code": "def concat(cls, variables, dim='concat_dim', indexers=None, length=None,\n               shortcut=False):\n        \"\"\"Concatenate variables along a new or existing dimension.\n\n        Parameters\n        ----------\n        variables : iterable of Array\n            Arrays to stack together. Each variable is expected to have\n            matching dimensions and shape except for along the stacked\n            dimension.\n        dim : str or DataArray, optional\n            Name of the dimension to stack along. This can either be a new\n            dimension name, in which case it is added along axis=0, or an\n            existing dimension name, in which case the location of the\n            dimension is unchanged. Where to insert the new dimension is\n            determined by the first variable.\n        indexers : iterable of indexers, optional\n            Iterable of indexers of the same length as variables which\n            specifies how to assign variables along the given dimension. If\n            not supplied, indexers is inferred from the length of each\n            variable along the dimension, and the variables are stacked in the\n            given order.\n        length : int, optional\n            Length of the new dimension. This is used to allocate the new data\n            array for the stacked variable data before iterating over all\n            items, which is thus more memory efficient and a bit faster. If\n            dimension is provided as a DataArray, length is calculated\n            automatically.\n        shortcut : bool, optional\n            This option is used internally to speed-up groupby operations.\n            If `shortcut` is True, some checks of internal consistency between\n            arrays to concatenate are skipped.\n\n        Returns\n        -------\n        stacked : Variable\n            Concatenated Variable formed by stacking all the supplied variables\n            along the given dimension.\n        \"\"\"\n        if not isinstance(dim, basestring):\n            length = dim.size\n            dim, = dim.dims\n\n        if length is None or indexers is None:\n            # so much for lazy evaluation! we need to look at all the variables\n            # to figure out the indexers and/or dimensions of the stacked\n            # variable\n            variables = list(variables)\n            steps = [var.shape[var.get_axis_num(dim)]\n                     if dim in var.dims else 1\n                     for var in variables]\n            if length is None:\n                length = sum(steps)\n            if indexers is None:\n                indexers = []\n                i = 0\n                for step in steps:\n                    indexers.append(slice(i, i + step))\n                    i += step\n                if i != length:\n                    raise ValueError('actual length of stacked variables '\n                                     'along %s is %r but expected length was '\n                                     '%s' % (dim, i, length))\n\n        # initialize the stacked variable with empty data\n        from . import groupby\n        first_var, variables = groupby.peek_at(variables)\n        if dim in first_var.dims:\n            axis = first_var.get_axis_num(dim)\n            shape = tuple(length if n == axis else s\n                          for n, s in enumerate(first_var.shape))\n            dims = first_var.dims\n        else:\n            axis = 0\n            shape = (length,) + first_var.shape\n            dims = (dim,) + first_var.dims\n\n        concatenated = cls(dims, np.empty(shape, dtype=first_var.dtype))\n        concatenated.attrs.update(first_var.attrs)\n\n        alt_dims = tuple(d for d in dims if d != dim)\n\n        # copy in the data from the variables\n        for var, indexer in zip(variables, indexers):\n            if not shortcut:\n                # do sanity checks & attributes clean-up\n                if dim in var.dims:\n                    # transpose verifies that the dims are equivalent\n                    if var.dims != concatenated.dims:\n                        var = var.transpose(*concatenated.dims)\n                elif var.dims != alt_dims:\n                    raise ValueError('inconsistent dimensions')\n                utils.remove_incompatible_items(concatenated.attrs, var.attrs)\n\n            key = tuple(indexer if n == axis else slice(None)\n                        for n in range(concatenated.ndim))\n            concatenated.values[key] = var.values\n\n        return concatenated"
      },
      {
        "file": "xray/core/variable.py",
        "type": "function",
        "name": "concat",
        "class_name": "Variable",
        "code": "def concat(cls, variables, dim='concat_dim', indexers=None, length=None,\n               shortcut=False):\n        \"\"\"Concatenate variables along a new or existing dimension.\n\n        Parameters\n        ----------\n        variables : iterable of Array\n            Arrays to stack together. Each variable is expected to have\n            matching dimensions and shape except for along the stacked\n            dimension.\n        dim : str or DataArray, optional\n            Name of the dimension to stack along. This can either be a new\n            dimension name, in which case it is added along axis=0, or an\n            existing dimension name, in which case the location of the\n            dimension is unchanged. Where to insert the new dimension is\n            determined by the first variable.\n        indexers : iterable of indexers, optional\n            Iterable of indexers of the same length as variables which\n            specifies how to assign variables along the given dimension. If\n            not supplied, indexers is inferred from the length of each\n            variable along the dimension, and the variables are stacked in the\n            given order.\n        length : int, optional\n            Length of the new dimension. This is used to allocate the new data\n            array for the stacked variable data before iterating over all\n            items, which is thus more memory efficient and a bit faster. If\n            dimension is provided as a DataArray, length is calculated\n            automatically.\n        shortcut : bool, optional\n            This option is used internally to speed-up groupby operations.\n            If `shortcut` is True, some checks of internal consistency between\n            arrays to concatenate are skipped.\n\n        Returns\n        -------\n        stacked : Variable\n            Concatenated Variable formed by stacking all the supplied variables\n            along the given dimension.\n        \"\"\"\n        if not isinstance(dim, basestring):\n            length = dim.size\n            dim, = dim.dims\n\n        if length is None or indexers is None:\n            # so much for lazy evaluation! we need to look at all the variables\n            # to figure out the indexers and/or dimensions of the stacked\n            # variable\n            variables = list(variables)\n            steps = [var.shape[var.get_axis_num(dim)]\n                     if dim in var.dims else 1\n                     for var in variables]\n            if length is None:\n                length = sum(steps)\n            if indexers is None:\n                indexers = []\n                i = 0\n                for step in steps:\n                    indexers.append(slice(i, i + step))\n                    i += step\n                if i != length:\n                    raise ValueError('actual length of stacked variables '\n                                     'along %s is %r but expected length was '\n                                     '%s' % (dim, i, length))\n\n        # initialize the stacked variable with empty data\n        from . import groupby\n        first_var, variables = groupby.peek_at(variables)\n        if dim in first_var.dims:\n            axis = first_var.get_axis_num(dim)\n            shape = tuple(length if n == axis else s\n                          for n, s in enumerate(first_var.shape))\n            dims = first_var.dims\n        else:\n            axis = 0\n            shape = (length,) + first_var.shape\n            dims = (dim,) + first_var.dims\n\n        concatenated = cls(dims, np.empty(shape, dtype=first_var.dtype))\n        concatenated.attrs.update(first_var.attrs)\n\n        alt_dims = tuple(d for d in dims if d != dim)\n\n        # copy in the data from the variables\n        for var, indexer in zip(variables, indexers):\n            if not shortcut:\n                # do sanity checks & attributes clean-up\n                if dim in var.dims:\n                    # transpose verifies that the dims are equivalent\n                    if var.dims != concatenated.dims:\n                        var = var.transpose(*concatenated.dims)\n                elif var.dims != alt_dims:\n                    raise ValueError('inconsistent dimensions')\n                utils.remove_incompatible_items(concatenated.attrs, var.attrs)\n\n            key = tuple(indexer if n == axis else slice(None)\n                        for n in range(concatenated.ndim))\n            concatenated.values[key] = var.values\n\n        return concatenated"
      }
    ]
  },
  "Justification": "Candidate B is the most relevant report: it addresses a bug in concatenating xray Datasets, the same operation the CURRENT bug report concerns (concatenating datasets whose variables differ). Both reports involve the behavior of the `concat` function, although the underlying issues differ. Candidate B's fix changed how data types are handled during concatenation, after strings were truncated to the dtype of the first variable, whereas the CURRENT report is about variables that are missing from some of the user's datasets. The similarity in observable behavior and the relevant files that were fixed make Candidate B useful context for understanding, and potentially fixing, the CURRENT bug.",
  "instance_id": "pydata__xarray-3364",
  "repo": "pydata/xarray",
  "created_at": "2019-10-01T21:15:54Z",
  "problem_statement": "Ignore missing variables when concatenating datasets?\nSeveral users (@raj-kesavan, @richardotis, now myself) have wondered about how to concatenate xray Datasets with different variables.\n\nWith the current `xray.concat`, you need to awkwardly create dummy variables filled with `NaN` in datasets that don't have them (or drop mismatched variables entirely). Neither of these are great options -- `concat` should have an option (the default?) to take care of this for the user.\n\nThis would also be more consistent with `pd.concat`, which takes a more relaxed approach to matching dataframes with different variables (it does an outer join).\n\n",
  "patch": "diff --git a/xarray/core/concat.py b/xarray/core/concat.py\n--- a/xarray/core/concat.py\n+++ b/xarray/core/concat.py\n@@ -312,15 +312,9 @@ def _dataset_concat(\n         to_merge = {var: [] for var in variables_to_merge}\n \n         for ds in datasets:\n-            absent_merge_vars = variables_to_merge - set(ds.variables)\n-            if absent_merge_vars:\n-                raise ValueError(\n-                    \"variables %r are present in some datasets but not others. \"\n-                    % absent_merge_vars\n-                )\n-\n             for var in variables_to_merge:\n-                to_merge[var].append(ds.variables[var])\n+                if var in ds:\n+                    to_merge[var].append(ds.variables[var])\n \n         for var in variables_to_merge:\n             result_vars[var] = unique_variable(\n"
}