{
  "original_problem": {
    "instance_id": "pydata__xarray-4493",
    "repo": "pydata/xarray",
    "created_at": "2020-10-06T22:00:41Z",
    "problem_statement": "DataSet.update causes chunked dask DataArray to evalute its values eagerly \n**What happened**:\r\nUsed `DataSet.update` to update a chunked dask DataArray, but the DataArray is no longer chunked after the update.\r\n\r\n**What you expected to happen**:\r\nThe chunked DataArray should still be chunked after the update\r\n\r\n**Minimal Complete Verifiable Example**:\r\n\r\n```python\r\nfoo = xr.DataArray(np.random.randn(3, 3), dims=(\"x\", \"y\")).chunk()  # foo is chunked\r\nds = xr.Dataset({\"foo\": foo, \"bar\": (\"x\", [1, 2, 3])})  # foo is still chunked here\r\nds  # you can verify that foo is chunked\r\n```\r\n```python\r\nupdate_dict = {\"foo\": ((\"x\", \"y\"), ds.foo[1:, :]), \"bar\": (\"x\", ds.bar[1:])}\r\nupdate_dict[\"foo\"][1]  # foo is still chunked\r\n```\r\n```python\r\nds.update(update_dict)\r\nds  # now foo is no longer chunked\r\n```\r\n\r\n**Environment**:\r\n\r\n<details><summary>Output of <tt>xr.show_versions()</tt></summary>\r\n\r\n```\r\ncommit: None\r\npython: 3.8.3 (default, Jul  2 2020, 11:26:31) \r\n[Clang 10.0.0 ]\r\npython-bits: 64\r\nOS: Darwin\r\nOS-release: 19.6.0\r\nmachine: x86_64\r\nprocessor: i386\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: en_US.UTF-8\r\nLOCALE: en_US.UTF-8\r\nlibhdf5: 1.10.6\r\nlibnetcdf: None\r\n\r\nxarray: 0.16.0\r\npandas: 1.0.5\r\nnumpy: 1.18.5\r\nscipy: 1.5.0\r\nnetCDF4: None\r\npydap: None\r\nh5netcdf: None\r\nh5py: 2.10.0\r\nNio: None\r\nzarr: None\r\ncftime: None\r\nnc_time_axis: None\r\nPseudoNetCDF: None\r\nrasterio: None\r\ncfgrib: None\r\niris: None\r\nbottleneck: None\r\ndask: 2.20.0\r\ndistributed: 2.20.0\r\nmatplotlib: 3.2.2\r\ncartopy: None\r\nseaborn: None\r\nnumbagg: None\r\npint: None\r\nsetuptools: 49.2.0.post20200714\r\npip: 20.1.1\r\nconda: None\r\npytest: 5.4.3\r\nIPython: 7.16.1\r\nsphinx: None\r\n```\r\n\r\n</details>\nDataset constructor with DataArray triggers computation\nIs it intentional that creating a Dataset with a DataArray and dimension names for 
a single variable causes computation of that variable?  In other words, why does ```xr.Dataset(dict(a=('d0', xr.DataArray(da.random.random(10)))))``` cause the dask array to compute?\r\n\r\nA longer example:\r\n\r\n```python\r\nimport dask.array as da\r\nimport xarray as xr\r\nx = da.random.randint(1, 10, size=(100, 25))\r\nds = xr.Dataset(dict(a=xr.DataArray(x, dims=('x', 'y'))))\r\ntype(ds.a.data)\r\ndask.array.core.Array\r\n\r\n# Recreate the dataset with the same array, but also redefine the dimensions\r\nds2 = xr.Dataset(dict(a=(('x', 'y'), ds.a))\r\ntype(ds2.a.data)\r\nnumpy.ndarray\r\n```\r\n\r\n\n",
    "patch": "diff --git a/xarray/core/variable.py b/xarray/core/variable.py\n--- a/xarray/core/variable.py\n+++ b/xarray/core/variable.py\n@@ -120,6 +120,16 @@ def as_variable(obj, name=None) -> \"Union[Variable, IndexVariable]\":\n     if isinstance(obj, Variable):\n         obj = obj.copy(deep=False)\n     elif isinstance(obj, tuple):\n+        if isinstance(obj[1], DataArray):\n+            # TODO: change into TypeError\n+            warnings.warn(\n+                (\n+                    \"Using a DataArray object to construct a variable is\"\n+                    \" ambiguous, please extract the data using the .data property.\"\n+                    \" This will raise a TypeError in 0.19.0.\"\n+                ),\n+                DeprecationWarning,\n+            )\n         try:\n             obj = Variable(*obj)\n         except (TypeError, ValueError) as error:\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_4291",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves NaN handling in resampling, which is unrelated to chunking behavior in updates."
      },
      {
        "idx": 2,
        "id": "similar_2622",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about indexing behavior leading to 0d arrays, not related to chunking or update operations."
      },
      {
        "idx": 3,
        "id": "similar_2908",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue deals with memory management in rolling operations, not with chunking persistence in updates."
      },
      {
        "idx": 4,
        "id": "similar_2994",
        "decision": "Not useful",
        "confidence": "Low",
        "reason": "The issue is about enhancing the drop method, which does not relate to chunking or update behavior."
      },
      {
        "idx": 5,
        "id": "similar_4302",
        "decision": "Not useful",
        "confidence": "Low",
        "reason": "The issue concerns installation completeness, unrelated to chunking or update operations."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "resample function gives 0s instead of NaNs",
        "issue_body": "<!-- Please include a self-contained copy-pastable example that generates the issue if possible.\r\n\r\nPlease be concise with code posted. See guidelines below on how to provide a good bug report:\r\n\r\n- Craft Minimal Bug Reports: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports\r\n- Minimal Complete Verifiable Examples: https://stackoverflow.com/help/mcve\r\n\r\nBug reports that follow these guidelines are easier to diagnose, and so are often handled much more quickly.\r\n-->\r\n\r\n**What happened**:\r\nWhen I use `resample(time='1d').sum(dim='time')` to resample a time series with NaNs, the resampled result gives me 0s instead of NaNs, while NaNs should be the correct answer.\r\n\r\n**What you expected to happen**:\r\n\r\nNaNs should be the correct answer.\r\n\r\n**Minimal Complete Verifiable Example**:\r\n\r\n```python\r\nimport xarray as xr\r\n\r\ndates =  pd.date_range('20200101', '20200601', freq='h')\r\ndata = np.linspace(0, 10, num=len(dates))\r\ndata[0:30*24] = np.nan\r\n\r\nda = xr.DataArray(data, coords=[dates], dims='time')\r\nda.plot()\r\n\r\n# Instead of NaNs, the resampled time series in January 20202 give us 0s, which not right.\r\nda.resample(time='1d', skipna=True).sum(dim='time', skipna=True).plot()\r\n```\r\n\r\n**Anything else we need to know?**:\r\n\r\nDid I misunderstand something here? Thanks!\r\n\r\n\r\n**Environment**:\r\nxarray - '0.15.1' \r\n\r\n<details><summary>Output of <tt>xr.show_versions()</tt></summary>\r\n\r\nxarray - '0.15.1' \r\n\r\n\r\n</details>\r\n",
        "issue_id": 4291,
        "pr_number": 2603,
        "pr_title": "Support HighLevelGraphs",
        "pr_body": "Fixes https://github.com/dask/dask/issues/4291\r\n\r\n - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API\r\n",
        "issue_closed_at": "2020-08-05T16:55:58Z",
        "base_commit": "82789bc6f72a76d69ace4bbabd00601e28e808da"
      },
      "summary": "### Summary:\nThis issue pertains to an unexpected behavior in the `resample` function of the xarray library, where the operation yields zeros instead of the anticipated NaN values when resampling a time series data containing NaNs. \n\n1. **Problem description in general terms**: \n   - The core problem involves the `resample` function's handling of NaN values during aggregation operations (in this case, summation over time intervals). Users expect NaNs to be preserved when summing over intervals that only contain NaN values, but instead, the function returns zeros.\n\n2. **Key symptoms and behaviors observed**:\n   - When executing a time series resampling operation with the `resample(time='1d').sum(dim='time')` method on datasets containing NaN values, the output unexpectedly contains zeros for some intervals instead of NaNs, which is contrary to user expectations.\n\n3. **Affected components or systems**:\n   - The issue specifically affects the time series resampling functionality within the xarray library, particularly when using the `sum` method with the `resample` function. The behavior is observed in the xarray version '0.15.1'.\n\n4. **Potential impact or severity**:\n   - The issue could lead to incorrect data analysis outcomes, especially in scenarios where the distinction between zero and NaN is crucial for interpreting the results. This could mislead users into drawing incorrect conclusions from their data, making the issue of moderate to high severity depending on the application context.\n\n5. **Relevant technical details abstracted for broader understanding**:\n   - The problem arises during the aggregation process in the `resample` function, potentially due to mismanagement of NaN handling within the function's logic. It may involve the `__dask_graph__` method in the xarray core modules (`dataarray.py`, `dataset.py`, `variable.py`), which could be responsible for the incorrect aggregation behavior. 
The issue could be tied to the internal mechanisms of how xarray interfaces with Dask for computational graph construction and execution.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: resample function gives 0s instead of NaNs\n\nBody:\n<!-- Please include a self-contained copy-pastable example that generates the issue if possible.\r\n\r\nPlease be concise with code posted. See guidelines below on how to provide a good bug report:\r\n\r\n- Craft Minimal Bug Reports: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports\r\n- Minimal Complete Verifiable Examples: https://stackoverflow.com/help/mcve\r\n\r\nBug reports that follow these guidelines are easier to diagnose, and so are often handled much more quickly.\r\n-->\r\n\r\n**What happened**:\r\nWhen I use `resample(time='1d').sum(dim='time')` to resample a time series with NaNs, the resampled result gives me 0s instead of NaNs, while NaNs should be the correct answer.\r\n\r\n**What you expected to happen**:\r\n\r\nNaNs should be the correct answer.\r\n\r\n**Minimal Complete Verifiable Example**:\r\n\r\n```python\r\nimport xarray as xr\r\n\r\ndates =  pd.date_range('20200101', '20200601', freq='h')\r\ndata = np.linspace(0, 10, num=len(dates))\r\ndata[0:30*24] = np.nan\r\n\r\nda = xr.DataArray(data, coords=[dates], dims='time')\r\nda.plot()\r\n\r\n# Instead of NaNs, the resampled time series in January 20202 give us 0s, which not right.\r\nda.resample(time='1d', skipna=True).sum(dim='time', skipna=True).plot()\r\n```\r\n\r\n**Anything else we need to know?**:\r\n\r\nDid I misunderstand something here? 
Thanks!\r\n\r\n\r\n**Environment**:\r\nxarray - '0.15.1' \r\n\r\n<details><summary>Output of <tt>xr.show_versions()</tt></summary>\r\n\r\nxarray - '0.15.1' \r\n\r\n\r\n</details>\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nxarray/core/dataarray.py\n  function: DataArray.__dask_graph__\n\nxarray/core/dataset.py\n  function: Dataset.__dask_graph__\n\nxarray/core/variable.py\n  function: Variable.__dask_graph__\n"
    },
    {
      "similar_issue": {
        "issue_title": "Unnecessary copy when indexing to obtain a 0d array",
        "issue_body": "#### Code Sample\r\n```python\r\n>>> import numpy as np\r\n>>> import xarray as xr\r\n>>> da = xr.DataArray(np.arange(3))\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([0, 1, 2])\r\nDimensions without coordinates: dim_0\r\n>>> da[0].values.fill(99)\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([0, 1, 2])\r\nDimensions without coordinates: dim_0\r\n```\r\n#### Problem description\r\nIndexing into xarray objects creates a view of the underlying data if possible. A surprising exception is when all dimensions are indexed out and the resulting object is 0d. Xarray insists on returning a 0d array rather than a scalar, which suggests (at least to me) that this is also a view whenever possible; however, it is always a copy, and modifying it will never affect the original array.\r\n\r\n(The example above is a little contrived, since one could always call `da[0] = 99`. In my actual use case I am indexing into a Dataset in a way that creates views for all variables except the one that happens to collapse to 0d, and thus I'm unable to use the indexed Dataset to modify that variable in the original Dataset.) \r\n\r\nThe copy happens because, internally, the 0d array is created by retrieving a scalar from the underlying numpy array and then wrapping a new array around it. 
However, in numpy a 0d view can be created directly by indexing with `Ellipsis`/`...`, as follows:\r\n```python\r\n>>> import numpy as np\r\n>>> arr = np.arange(3)\r\n>>> arr[0, ...]\r\narray(0)\r\n```\r\nThus, a fix that solves my immediate issues and passes all current tests is to modify the following method:\r\nhttps://github.com/pydata/xarray/blob/778ffc49135d6f97e17b37b48304995fca72f1e0/xarray/core/indexing.py#L1154-L1163\r\nto always append an ellipsis for basic and outer indexing:\r\n```python\r\n    def _indexing_array_and_key(self, key):\r\n        if isinstance(key, OuterIndexer):\r\n            array = self.array\r\n>           key = _outer_to_numpy_indexer(key, self.array.shape) + (Ellipsis,)\r\n        elif isinstance(key, VectorizedIndexer):\r\n            array = nputils.NumpyVIndexAdapter(self.array)\r\n            key = key.tuple\r\n        elif isinstance(key, BasicIndexer):\r\n            array = self.array\r\n>           key = key.tuple + (Ellipsis,)\r\n```\r\nI'm not familiar enough with all the indexing variants in xarray to know if this covers all cases of 0d arrays that are currently copies but could be views. If someone wants to share some insight (e.g., some more advanced test cases), I could try and put together a pull request.\r\n\r\n#### Expected Output\r\n```python\r\n>>> da[0].values.fill(99)\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([99, 1, 2])\r\nDimensions without coordinates: dim_0\r\n```\r\n#### Output of ``xr.show_versions()``\r\n\r\n<details>\r\n/home/daniel/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\r\n  from ._conv import register_converters as _register_converters\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 3.6.5.final.0\r\npython-bits: 64\r\nOS: Linux\r\nOS-release: 4.15.0-42-lowlatency\r\nmachine: x86_64\r\nprocessor: x86_64\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: en_US.UTF-8\r\nLOCALE: en_US.UTF-8\r\n\r\nxarray: 0.11.0\r\npandas: 0.23.0\r\nnumpy: 1.14.3\r\nscipy: 1.1.0\r\nnetCDF4: 1.4.0\r\nh5netcdf: 0.6.2\r\nh5py: 2.7.1\r\nNio: None\r\nzarr: None\r\ncftime: 1.0.0b1\r\nPseudonetCDF: None\r\nrasterio: None\r\niris: None\r\nbottleneck: 1.2.1\r\ncyordereddict: None\r\ndask: 0.17.5\r\ndistributed: 1.21.8\r\nmatplotlib: 2.2.2\r\ncartopy: None\r\nseaborn: 0.8.1\r\nsetuptools: 39.1.0\r\npip: 10.0.1\r\nconda: 4.5.12\r\npytest: 3.5.1\r\nIPython: 6.4.0\r\nsphinx: 1.7.4\r\n</details>\r\n",
        "issue_id": 2622,
        "pr_number": 2625,
        "pr_title": "Get 0d slices of ndarrays directly from indexing",
        "pr_body": " - [x] Closes #2622\r\n - [x] Tests added\r\n - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API\r\n",
        "issue_closed_at": "2018-12-22T22:57:59Z",
        "base_commit": "a15587de419f8a47a875013813186a36fdc04c08"
      },
      "summary": "### Summary:\nThis issue pertains to the behavior of the xarray library when indexing operations result in a 0-dimensional (0d) array. Generally, xarray aims to provide a view of the underlying data when indexing is performed. However, an exception occurs when all dimensions are indexed out, resulting in a 0d array. In such cases, xarray unexpectedly returns a copy of the data instead of a view. This behavior is misleading because it suggests that the 0d array could still be a view, but any modifications to it do not reflect in the original data array.\n\nKey symptoms include the inability to modify the original data through the 0d array obtained by indexing, as demonstrated in the provided code sample. This behavior is contrary to user expectations, especially in scenarios where consistent view-based indexing is assumed across multiple dimensions and variables within a dataset.\n\nThe affected component is the xarray library, specifically its indexing mechanism within the `xarray/core/indexing.py` file. The functions `NumpyIndexingAdapter._indexing_array_and_key` and related indexing functions need adjustments to ensure that 0d arrays are treated as views whenever possible.\n\nThe potential impact is significant for users relying on xarray's indexing to manipulate datasets directly. The inability to modify the original data through indexed 0d arrays could lead to unexpected results and complicate data handling processes.\n\nRelevant technical details include the solution of appending an ellipsis (`...`) to the indexing operation, enabling the creation of a 0d view. This solution aligns with numpy's method of indexing to maintain views, thereby addressing the inconsistency in xarray's current implementation.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Unnecessary copy when indexing to obtain a 0d array\n\nBody:\n#### Code Sample\r\n```python\r\n>>> import numpy as np\r\n>>> import xarray as xr\r\n>>> da = xr.DataArray(np.arange(3))\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([0, 1, 2])\r\nDimensions without coordinates: dim_0\r\n>>> da[0].values.fill(99)\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([0, 1, 2])\r\nDimensions without coordinates: dim_0\r\n```\r\n#### Problem description\r\nIndexing into xarray objects creates a view of the underlying data if possible. A surprising exception is when all dimensions are indexed out and the resulting object is 0d. Xarray insists on returning a 0d array rather than a scalar, which suggests (at least to me) that this is also a view whenever possible; however, it is always a copy, and modifying it will never affect the original array.\r\n\r\n(The example above is a little contrived, since one could always call `da[0] = 99`. In my actual use case I am indexing into a Dataset in a way that creates views for all variables except the one that happens to collapse to 0d, and thus I'm unable to use the indexed Dataset to modify that variable in the original Dataset.) \r\n\r\nThe copy happens because, internally, the 0d array is created by retrieving a scalar from the underlying numpy array and then wrapping a new array around it. 
However, in numpy a 0d view can be created directly by indexing with `Ellipsis`/`...`, as follows:\r\n```python\r\n>>> import numpy as np\r\n>>> arr = np.arange(3)\r\n>>> arr[0, ...]\r\narray(0)\r\n```\r\nThus, a fix that solves my immediate issues and passes all current tests is to modify the following method:\r\nhttps://github.com/pydata/xarray/blob/778ffc49135d6f97e17b37b48304995fca72f1e0/xarray/core/indexing.py#L1154-L1163\r\nto always append an ellipsis for basic and outer indexing:\r\n```python\r\n    def _indexing_array_and_key(self, key):\r\n        if isinstance(key, OuterIndexer):\r\n            array = self.array\r\n>           key = _outer_to_numpy_indexer(key, self.array.shape) + (Ellipsis,)\r\n        elif isinstance(key, VectorizedIndexer):\r\n            array = nputils.NumpyVIndexAdapter(self.array)\r\n            key = key.tuple\r\n        elif isinstance(key, BasicIndexer):\r\n            array = self.array\r\n>           key = key.tuple + (Ellipsis,)\r\n```\r\nI'm not familiar enough with all the indexing variants in xarray to know if this covers all cases of 0d arrays that are currently copies but could be views. If someone wants to share some insight (e.g., some more advanced test cases), I could try and put together a pull request.\r\n\r\n#### Expected Output\r\n```python\r\n>>> da[0].values.fill(99)\r\n>>> da\r\n<xarray.DataArray (dim_0: 3)>\r\narray([99, 1, 2])\r\nDimensions without coordinates: dim_0\r\n```\r\n#### Output of ``xr.show_versions()``\r\n\r\n<details>\r\n/home/daniel/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\r\n  from ._conv import register_converters as _register_converters\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 3.6.5.final.0\r\npython-bits: 64\r\nOS: Linux\r\nOS-release: 4.15.0-42-lowlatency\r\nmachine: x86_64\r\nprocessor: x86_64\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: en_US.UTF-8\r\nLOCALE: en_US.UTF-8\r\n\r\nxarray: 0.11.0\r\npandas: 0.23.0\r\nnumpy: 1.14.3\r\nscipy: 1.1.0\r\nnetCDF4: 1.4.0\r\nh5netcdf: 0.6.2\r\nh5py: 2.7.1\r\nNio: None\r\nzarr: None\r\ncftime: 1.0.0b1\r\nPseudonetCDF: None\r\nrasterio: None\r\niris: None\r\nbottleneck: 1.2.1\r\ncyordereddict: None\r\ndask: 0.17.5\r\ndistributed: 1.21.8\r\nmatplotlib: 2.2.2\r\ncartopy: None\r\nseaborn: 0.8.1\r\nsetuptools: 39.1.0\r\npip: 10.0.1\r\nconda: 4.5.12\r\npytest: 3.5.1\r\nIPython: 6.4.0\r\nsphinx: 1.7.4\r\n</details>\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nxarray/core/indexing.py\n  function: PandasIndexAdapter.__init__\n  function: NumpyIndexingAdapter._indexing_array_and_key\n  function: PandasIndexAdapter.transpose\n"
    },
    {
      "similar_issue": {
        "issue_title": "More efficient rolling with large dask arrays",
        "issue_body": "#### Code Sample\r\n\r\n```python\r\nimport xarray as xr\r\nimport dask.array as da\r\n\r\ndsize=[62,12,100,192,288]\r\narray1=da.random.random(dsize,chunks=(dsize[0],dsize[1],1,dsize[3],int(dsize[4]/2)))\r\narray2=xr.DataArray(array1)\r\nrollingmean=array2.rolling(dim_1=3,center=True).mean()  # <-- this kills all workers\r\n\r\n```\r\n#### Problem description\r\n\r\nI'm working on NCAR's cheyenne with a 36GB netcdf using dask_jobqueue.PBSCluster, and trying to calculate the running-mean along one dimension. Despite having plenty of memory reserved (400GB), I can watch DataArray.rolling blow up the bytes stored in the dashboard until the job hangs and all the workers are killed.  \r\n\r\nThe above snippet reproduces the issue with the same array size and chunksize as what I'm working with. This worker-killing behavior does not occur for arrays that are 100x smaller.  I've found a speedy way to calculate what I need without using rolling, but I thought I should bring this to your attention regardless. 
\r\n\r\nIn case it's relevant, here's how I'm setting up the dask cluster on cheyenne:\r\n```python\r\nfrom dask.distributed import Client\r\nfrom dask_jobqueue import PBSCluster  #version 0.4.1\r\n\r\ncluster=PBSCluster(cores=36, processes=9, memory='109GB', project=myproj, resource_spec='select=1:ncpus=36:mem=109G', queue='regular', walltime='02:00:00')\r\nnumnodes=4\r\nclient = Client(cluster)\r\ncluster.scale(numnodes*9)\r\n\r\n```\r\n\r\n#### Output of ``xr.show_versions()``\r\n\r\n<details>\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 3.7.1 (default, Dec 14 2018, 19:28:38) \r\n[GCC 7.3.0]\r\npython-bits: 64\r\nOS: Linux\r\nOS-release: 3.12.62-60.64.8-default\r\nmachine: x86_64\r\nprocessor: x86_64\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: en_US.UTF-8\r\nLOCALE: en_US.UTF-8\r\nlibhdf5: 1.10.4\r\nlibnetcdf: 4.6.2\r\n\r\nxarray: 0.12.1\r\npandas: 0.24.1\r\nnumpy: 1.15.4\r\nscipy: 1.2.1\r\nnetCDF4: 1.4.2\r\npydap: None\r\nh5netcdf: None\r\nh5py: None\r\nNio: None\r\nzarr: 2.3.1\r\ncftime: 1.0.3.4\r\nnc_time_axis: None\r\nPseudonetCDF: None\r\nrasterio: None\r\ncfgrib: None\r\niris: None\r\nbottleneck: None\r\ndask: 1.1.5\r\ndistributed: 1.26.1\r\nmatplotlib: 3.0.2\r\ncartopy: 0.17.0\r\nseaborn: 0.9.0\r\nsetuptools: 40.6.3\r\npip: 18.1\r\nconda: 4.6.13\r\npytest: None\r\nIPython: 7.3.0\r\nsphinx: None\r\n\r\n</details>\r\n",
        "issue_id": 2908,
        "pr_number": 2934,
        "pr_title": "Docs/more fixes",
        "pr_body": "<!-- Feel free to remove check-list items aren't relevant to your change -->\r\n\r\n - partially addresses #2909 , closes #2901, closes #2908 \r\n",
        "issue_closed_at": "2019-10-04T17:04:37Z",
        "base_commit": "f3c7da6eba987ec67616cd8cb9aec6ea79f0e92c"
      },
      "summary": "### Summary:\n\nThis issue is related to performance and resource management when performing rolling operations on large Dask arrays within an Xarray DataArray. The problem specifically involves excessive memory consumption that leads to worker crashes when calculating a rolling mean on a large multidimensional array using Dask on an HPC system. The described behavior does not occur with significantly smaller arrays, indicating a scalability issue in the handling of large datasets. \n\nKey symptoms include a rapid increase in memory usage when executing the rolling operation, ultimately resulting in the termination of all workers due to resource exhaustion. This was observed despite ample memory being allocated (400GB) and a correctly configured Dask cluster using the `dask_jobqueue.PBSCluster` system on the Cheyenne supercomputer.\n\nThe affected components primarily involve the Dask library's interaction with Xarray, specifically during rolling operations, which can lead to significant memory overhead. The impact is severe in scenarios involving large datasets, as it prevents successful computation and requires users to seek alternative methods to achieve the desired calculations.\n\nRelevant technical details include the specific versions of the libraries involved (e.g., Dask 1.1.5, Xarray 0.12.1) and the environment setup, such as the Dask cluster configuration with PBSCluster on Cheyenne. These details help in understanding the context and potential areas where optimizations or fixes might be applied to resolve the problem. \n\nThe fixed code elements indicate updates to various functions within the Xarray library, which likely address the root cause of the excessive memory usage during post-persistence operations and dataset management, improving the efficiency of rolling operations on large datasets.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: More efficient rolling with large dask arrays\n\nBody:\n#### Code Sample\r\n\r\n```python\r\nimport xarray as xr\r\nimport dask.array as da\r\n\r\ndsize=[62,12,100,192,288]\r\narray1=da.random.random(dsize,chunks=(dsize[0],dsize[1],1,dsize[3],int(dsize[4]/2)))\r\narray2=xr.DataArray(array1)\r\nrollingmean=array2.rolling(dim_1=3,center=True).mean()  # <-- this kills all workers\r\n\r\n```\r\n#### Problem description\r\n\r\nI'm working on NCAR's cheyenne with a 36GB netcdf using dask_jobqueue.PBSCluster, and trying to calculate the running-mean along one dimension. Despite having plenty of memory reserved (400GB), I can watch DataArray.rolling blow up the bytes stored in the dashboard until the job hangs and all the workers are killed.  \r\n\r\nThe above snippet reproduces the issue with the same array size and chunksize as what I'm working with. This worker-killing behavior does not occur for arrays that are 100x smaller.  I've found a speedy way to calculate what I need without using rolling, but I thought I should bring this to your attention regardless. 
\r\n\r\nIn case it's relevant, here's how I'm setting up the dask cluster on cheyenne:\r\n```python\r\nfrom dask.distributed import Client\r\nfrom dask_jobqueue import PBSCluster  #version 0.4.1\r\n\r\ncluster=PBSCluster(cores=36, processes=9, memory='109GB', project=myproj, resource_spec='select=1:ncpus=36:mem=109G', queue='regular', walltime='02:00:00')\r\nnumnodes=4\r\nclient = Client(cluster)\r\ncluster.scale(numnodes*9)\r\n\r\n```\r\n\r\n#### Output of ``xr.show_versions()``\r\n\r\n<details>\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 3.7.1 (default, Dec 14 2018, 19:28:38) \r\n[GCC 7.3.0]\r\npython-bits: 64\r\nOS: Linux\r\nOS-release: 3.12.62-60.64.8-default\r\nmachine: x86_64\r\nprocessor: x86_64\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: en_US.UTF-8\r\nLOCALE: en_US.UTF-8\r\nlibhdf5: 1.10.4\r\nlibnetcdf: 4.6.2\r\n\r\nxarray: 0.12.1\r\npandas: 0.24.1\r\nnumpy: 1.15.4\r\nscipy: 1.2.1\r\nnetCDF4: 1.4.2\r\npydap: None\r\nh5netcdf: None\r\nh5py: None\r\nNio: None\r\nzarr: 2.3.1\r\ncftime: 1.0.3.4\r\nnc_time_axis: None\r\nPseudonetCDF: None\r\nrasterio: None\r\ncfgrib: None\r\niris: None\r\nbottleneck: None\r\ndask: 1.1.5\r\ndistributed: 1.26.1\r\nmatplotlib: 3.0.2\r\ncartopy: 0.17.0\r\nseaborn: 0.9.0\r\nsetuptools: 40.6.3\r\npip: 18.1\r\nconda: 4.6.13\r\npytest: None\r\nIPython: 7.3.0\r\nsphinx: None\r\n\r\n</details>\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nxarray/core/dataset.py\n  function: Dataset.sizes\n  function: Dataset._dask_postpersist\n  function: Dataset.persist\n"
    },
    {
      "similar_issue": {
        "issue_title": "xr.Dataset.drop",
        "issue_body": "Currently, `drop` throws an error if one of the labels doesn't exist. It would be nice to have a parameter in the drop method for optionally ignoring errors like in the pandas.DataFrame.\r\nFrom the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html):\r\n\r\n> errors : {‘ignore’, ‘raise’}, default ‘raise’\r\n>     If ‘ignore’, suppress error and only existing labels are dropped.\r\n",
        "issue_id": 2994,
        "pr_number": 3028,
        "pr_title": "Add \"errors\" keyword argument to drop() and drop_dims() (#2994)",
        "pr_body": "<!-- Feel free to remove check-list items aren't relevant to your change -->\r\n\r\n - [x] Closes #2994 \r\n - [x] Tests added\r\n - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API\r\n\r\nThis addresses #2994 by adding an \"errors\" keyword argument to `Dataset.drop()`, `Dataset.drop_dims()`, and `DataArray.drop()`. \r\n\r\nI stuck with pandas' convention of using either `errors='raise'`, now the default that maintains previous behavior by raising an error if any passed label is not found in the dataset/array, or `errors='ignore'` in which case any missing labels are silently ignored. \r\n\r\nThis seems like a pretty straightforward change; mainly it is just skipping checks for missing labels when `errors == 'ignore'` and passing the errors keyword over to the pandas method when using `index.drop()`. Hopefully there are no subtleties that I've missed. \r\n\r\nI added documentation to the appropriate methods, although I have been struggling to build the docs locally and am unsure if they look right.\r\n\r\nAlso this is my first attempt to contribute to any project, so suggestions and feedback are welcome. ",
        "issue_closed_at": "2019-06-20T15:48:00Z",
        "base_commit": "c2a2a6efcaf2d279c78da4ba3a87ea96afe78be0"
      },
      "summary": "### Summary: This issue is concerned with enhancing the functionality of the `drop` method in the xarray library, specifically within the `xr.Dataset` class. The current implementation of `drop` results in an error if any of the specified labels do not exist, which limits its flexibility compared to similar functions in other libraries like pandas. The request is to introduce an optional parameter that allows users to suppress these errors, enabling only the existing labels to be dropped without interruption.\n\n1. **Problem description in general terms**: The `drop` method in xarray's `xr.Dataset` class lacks the ability to ignore missing labels, causing an error when attempting to drop labels that do not exist. The enhancement sought is similar to pandas' functionality, where an 'errors' parameter can be set to 'ignore' to bypass these errors.\n\n2. **Key symptoms and behaviors observed**: When the `drop` method is called with a list of labels, and one or more of those labels do not exist in the dataset, an error is thrown, interrupting the execution. This behavior restricts the method's utility for users who want to programmatically manage datasets where the presence of labels can vary.\n\n3. **Affected components or systems**: The affected components are primarily within the xarray library, particularly involving the `xr.Dataset.drop` method, as well as related functions such as `DataArray.drop`, `Dataset._assert_all_in_dataset`, and `Dataset.drop_dims`.\n\n4. **Potential impact or severity**: The impact is moderate in terms of user experience, particularly for those who rely on flexible data manipulation capabilities. It does not result in data loss or processing errors but can significantly hinder workflow automation and require additional error-handling code.\n\n5. 
**Relevant technical details abstracted for broader understanding**: The enhancement involves adding an 'errors' parameter to the `drop` methods, allowing them to accept either 'raise' (the default, which preserves the previous behavior) or 'ignore' to control error handling. This feature aligns xarray's functionality with that of pandas, promoting consistency and enhancing usability for developers familiar with both libraries.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: xr.Dataset.drop\n\nBody:\nCurrently, `drop` throws an error if one of the labels doesn't exist. It would be nice to have a parameter in the drop method for optionally ignoring errors like in the pandas.DataFrame.\r\nFrom the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html):\r\n\r\n> errors : {‘ignore’, ‘raise’}, default ‘raise’\r\n>     If ‘ignore’, suppress error and only existing labels are dropped.\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nxarray/core/dataarray.py\n  function: DataArray.transpose\n  function: DataArray.drop\n\nxarray/core/dataset.py\n  function: Dataset._assert_all_in_dataset\n  function: Dataset.drop\n  function: Dataset.drop_dims\n"
    },
    {
      "similar_issue": {
        "issue_title": "Installing from sources does not install everything",
        "issue_body": "<!-- Please include a self-contained copy-pastable example that generates the issue if possible.\r\n\r\nPlease be concise with code posted. See guidelines below on how to provide a good bug report:\r\n\r\n- Craft Minimal Bug Reports: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports\r\n- Minimal Complete Verifiable Examples: https://stackoverflow.com/help/mcve\r\n\r\nBug reports that follow these guidelines are easier to diagnose, and so are often handled much more quickly.\r\n-->\r\n\r\n**What happened**:\r\n\r\nWhen installing from [sources](https://github.com/pydata/xarray/archive/v0.16.0.tar.gz) the package isn't fully installed, e.g. the `core` directory never gets added.\r\n```bash\r\n-rw-r--r-- 1   27K Aug  3 09:43 conventions.py\r\n-rw-r--r-- 1  9.5K Aug  3 09:43 convert.py\r\n-rw-r--r-- 1  2.4K Aug  3 09:43 __init__.py\r\ndrwxr-xr-x 1   274 Aug  3 09:43 __pycache__\r\n-rw-r--r-- 1     0 Aug  3 09:43 py.typed\r\ndrwxr-xr-x 1    14 Aug  3 09:43 static\r\n-rw-r--r-- 1   12K Aug  3 09:43 testing.py\r\ndrwxr-xr-x 1     8 Aug  3 09:43 tests\r\n-rw-r--r-- 1  3.6K Aug  3 09:43 tutorial.py\r\n-rw-r--r-- 1  4.7K Aug  3 09:43 ufuncs.py\r\n```\r\n\r\n**Minimal Complete Verifiable Example**:\r\n\r\n```bash\r\nwget -o xarray-0.16.0.tar.gz https://github.com/pydata/xarray/archive/v0.16.0.tar.gz\r\ntar xvfz xarray-0.16.0.tar.gz\r\ncd xarray-0.16.0\r\n# this is sadly required since the downloaded file does not contain *any* version information\r\n# setuptools_scm reads PKG-INFO as a last resort when trying to determine the version.\r\necho 'Version: 0.16.0' > PKG-INFO\r\npython3 setup.py install --prefix=<>\r\n# or\r\npip3 install -vvv --no-cache-dir --no-deps --no-index --no-build-isolation --compile --prefix=<> .\r\n```\r\n\r\n**Anything else we need to know?**:\r\nI think this is quite self-producing. It is just important that one does not do this on a git repo.\r\n\r\nDo you need anything else from me?",
        "issue_id": 4302,
        "pr_number": 4244,
        "pr_title": "Clarify drop_vars return value.",
        "pr_body": "The previous documentation was not clear about whether the variable\ndropping was \"inplace\" or created a fresh Dataset.\n",
        "issue_closed_at": "2020-08-05T21:01:15Z",
        "base_commit": "8fab5a2449d8368251f96fc2b9d1eaa3040894e6"
      },
      "summary": "### Summary:\n\nThis issue is related to an incomplete installation of a software package when it is installed from source files. Specifically, the problem occurs with the installation of the `xarray` package from its archived source. Despite the installation process appearing to complete without errors, certain essential directories, notably the `core` directory, are not included in the final installation. This omission results in missing functionality that is expected to be present in a fully installed package.\n\n1. **Problem Description in General Terms**:\n   - The installation process from source archives does not include all necessary components of the package, leading to incomplete functionality.\n\n2. **Key Symptoms and Behaviors Observed**:\n   - Key directories and files, such as the `core` directory, are absent post-installation.\n   - The resulting installation is missing critical components required for the package to operate correctly.\n\n3. **Affected Components or Systems**:\n   - The `xarray` package, specifically when installed from its version 0.16.0 archived source file.\n\n4. **Potential Impact or Severity**:\n   - This issue can severely impact users relying on installation from source, particularly those needing the full functionality of the `xarray` package. The incomplete installation could lead to failures in running applications that depend on the missing components.\n\n5. **Relevant Technical Details Abstracted for Broader Understanding**:\n   - The issue is linked to the installation process from a tarball obtained from a source archive, which does not include all necessary files. A workaround requires manually adding versioning information to the `PKG-INFO` file to assist the setup tools in correctly processing the installation. 
However, this does not resolve the underlying issue of missing directories.\n\nOverall, addressing this issue would require ensuring that the source archive includes all necessary directories and files, or modifying the installation process to correctly handle such cases.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Installing from sources does not install everything\n\nBody:\n<!-- Please include a self-contained copy-pastable example that generates the issue if possible.\r\n\r\nPlease be concise with code posted. See guidelines below on how to provide a good bug report:\r\n\r\n- Craft Minimal Bug Reports: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports\r\n- Minimal Complete Verifiable Examples: https://stackoverflow.com/help/mcve\r\n\r\nBug reports that follow these guidelines are easier to diagnose, and so are often handled much more quickly.\r\n-->\r\n\r\n**What happened**:\r\n\r\nWhen installing from [sources](https://github.com/pydata/xarray/archive/v0.16.0.tar.gz) the package isn't fully installed, e.g. 
the `core` directory never gets added.\r\n```bash\r\n-rw-r--r-- 1   27K Aug  3 09:43 conventions.py\r\n-rw-r--r-- 1  9.5K Aug  3 09:43 convert.py\r\n-rw-r--r-- 1  2.4K Aug  3 09:43 __init__.py\r\ndrwxr-xr-x 1   274 Aug  3 09:43 __pycache__\r\n-rw-r--r-- 1     0 Aug  3 09:43 py.typed\r\ndrwxr-xr-x 1    14 Aug  3 09:43 static\r\n-rw-r--r-- 1   12K Aug  3 09:43 testing.py\r\ndrwxr-xr-x 1     8 Aug  3 09:43 tests\r\n-rw-r--r-- 1  3.6K Aug  3 09:43 tutorial.py\r\n-rw-r--r-- 1  4.7K Aug  3 09:43 ufuncs.py\r\n```\r\n\r\n**Minimal Complete Verifiable Example**:\r\n\r\n```bash\r\nwget -o xarray-0.16.0.tar.gz https://github.com/pydata/xarray/archive/v0.16.0.tar.gz\r\ntar xvfz xarray-0.16.0.tar.gz\r\ncd xarray-0.16.0\r\n# this is sadly required since the downloaded file does not contain *any* version information\r\n# setuptools_scm reads PKG-INFO as a last resort when trying to determine the version.\r\necho 'Version: 0.16.0' > PKG-INFO\r\npython3 setup.py install --prefix=<>\r\n# or\r\npip3 install -vvv --no-cache-dir --no-deps --no-index --no-build-isolation --compile --prefix=<> .\r\n```\r\n\r\n**Anything else we need to know?**:\r\nI think this is quite self-producing. It is just important that one does not do this on a git repo.\r\n\r\nDo you need anything else from me?\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nxarray/core/dataarray.py\n  function: DataArray.T\n  function: DataArray.drop_vars\n"
    }
  ]
}