\appendix

\section{Expanded Dataset Information}

In \Cref{sec:dataset} we provided an overview of several dataset statistics.
In this appendix we expand on that with additional plots.
The distribution of image pixel intensities is illustrated in \Cref{fig:spectra}.
The distribution of images collected over time is shown in \Cref{fig:images_over_time}.
The distribution of annotation locations is shown in \Cref{fig:centroid_location_distri}, and the distribution of
  annotation sizes is shown in \Cref{fig:annot_obox_size_dist} and \Cref{fig:annot_area_verts_distri}.


\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{figures/spectra.png}
\caption[]{
    The ``spectra'', or histogram, of the pixel intensities in the dataset.
    The per-channel RGB mean is $[117, 124, 100]$ and the standard deviation is $[61, 59, 63]$.
    This plot was computed on the older 2024-07-03 snapshot.
}
\label{fig:spectra}
\end{figure}


\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{figures/appendix/images_over_time.png}
\caption[]{
    The number of images collected over time.
}
\label{fig:images_over_time}
\end{figure}


\begin{figure}[ht]
\centering
\begin{subfigure}[b]{0.4\textwidth}
 \includegraphics[width=\textwidth]{figures/appendix/polygon_centroid_absolute_distribution.png}
 \caption{Absolute pixel coordinates.}
 \label{fig:centroid_abs}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.4\textwidth}
 \includegraphics[width=\textwidth]{figures/appendix/polygon_centroid_relative_distribution.png}
 \caption{Relative image coordinates.}
 \label{fig:centroid_rel}
\end{subfigure}
\caption{The distribution of annotation centroids in terms of (a) absolute image coordinates and (b) relative image coordinates. The absolute centroid distribution is bimodal because some images are taken in landscape mode and others in portrait mode.}
\label{fig:centroid_location_distri}
\end{figure}


\begin{figure}[ht]
\centering
\begin{subfigure}[b]{0.4\textwidth}
  \includegraphics[width=\textwidth]{figures/appendix/obox_size_distribution_jointplot.png}
  \caption{Linear scale.}
  \label{fig:annot_obox_size_dist_linear}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.4\textwidth}
  \includegraphics[width=\textwidth]{figures/appendix/obox_size_distribution_logscale.png}
  \caption{Log10 scale.}
  \label{fig:annot_obox_size_dist_log}
\end{subfigure}
\caption{The distribution of annotation sizes as measured by an oriented bounding box fit to each polygon. (a) shows this plot on a linear scale and (b) shows this plot on a log scale.}
\label{fig:annot_obox_size_dist}
\end{figure}


\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{figures/appendix/polygon_area_vs_num_verts_jointplot.png}
\caption[]{
    The distribution of polygon areas versus the number of vertices in the polygon boundary.
    The SAM model tends to produce polygons with a higher number of vertices
    than manually drawn ones.  For smaller polygons there are two peaks in the
    number of vertices histograms likely corresponding to pure-manual versus
    AI-assisted annotations.
}
\label{fig:annot_area_verts_distri}
\end{figure}


\section{Expanded Dataset Comparison}

In \Cref{sec:relatedwork} we compared to related work. Here we expand on this
by comparing our analysis plots. Every dataset is converted into the COCO
format and visualized using the same logic. \Cref{fig:compare_allannots}
visualizes the annotations of all datasets. We make similar visualizations 
for other comparable dataset metrics.
\Cref{fig:combo_anns_per_image_histogram_splity} shows the number of annotations per image.
\Cref{fig:combo_image_size_scatter} shows the image sizes in each dataset.
\Cref{fig:combo_obox_size_distribution_logscale} shows the distribution of width and heights of oriented bounding boxes fit to annotation polygons.
\Cref{fig:combo_polygon_area_vs_num_verts_jointplot} shows the area of each polygon versus the number of vertices (which could be used to estimate the likelihood a polygon was generated by AI for our dataset).
\Cref{fig:combo_polygon_centroid_relative_distribution} shows the distribution of centroid positions (relative to the image size).


\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{plots/appendix/dataset_compare/combo_anns_per_image_histogram_splity.png.png}
\caption[]{
    Number of annotations per image in each dataset.
}
\label{fig:combo_anns_per_image_histogram_splity}
\end{figure*}


\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{plots/appendix/dataset_compare/combo_image_size_scatter.png.png}
\caption[]{
    Image size distributions of each dataset. 
    Our dataset has two primary width/height combinations.
}
\label{fig:combo_image_size_scatter}
\end{figure*}


\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{plots/appendix/dataset_compare/combo_obox_size_distribution_logscale.png.png}
\caption[]{
    Oriented bounding box size distributions (log10 scale) of each dataset.
}
\label{fig:combo_obox_size_distribution_logscale}
\end{figure*}

\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{plots/appendix/dataset_compare/combo_polygon_area_vs_num_verts_jointplot_logscale.png.png}
\caption[]{
    Polygon area versus number of vertices (log10 scale) for each dataset.
    Polygons with more vertices are more likely to be AI-generated.
}
\label{fig:combo_polygon_area_vs_num_verts_jointplot}
\end{figure*}

\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{plots/appendix/dataset_compare/combo_polygon_centroid_relative_distribution.png.png}
\caption[]{
    Distribution of polygon centroids (relative to the image size) for each dataset.
    Several patterns are visible in this data. For instance, the outline of a street can be
    seen in CityScapes, and the conveyor belt is visible in Zero Waste. ImageNet
    is more uniform, while ours is approximately Gaussian.
}
\label{fig:combo_polygon_centroid_relative_distribution}
\end{figure*}


\section{VIT-sseg Models}
\label{sec:vit_models}

This section provides more details about the training of VIT-sseg models.

To train VIT-sseg models we use the training, prediction, and evaluation system presented in
  \cite{Greenwell_2024_WACV, crall_geowatch_2024}, which utilizes polygon annotations to train a pixelwise
  binary segmentation model.


In all experiments, we use half-resolution images, which means most images have an effective width $\times$
  height of 2,016 $\times$ 1,512.
We employ a spatial window size of 416 $\times$ 416 for network inputs, which means that multiple windows
  are needed to predict on entire images.
During prediction, we apply a window overlap of 0.3 with feathered stitching to prevent boundary artifacts.

To address the class imbalance in our dataset (where positives are patches containing annotations and
  negatives contain no annotations), we adopt a balanced sampling strategy.
Each ``epoch'' consists of randomly sampling 32,768 patches from the dataset with replacement, ensuring
  roughly equal numbers of positive and negative samples.
We train each network for 163,840 gradient steps.
For data augmentation we use random crops and flips.

Our baseline architecture is a variant \cite{bertasius2021space,Greenwell_2024_WACV} of a vision transformer
  \cite{dosovitskiy_image_2021}.
The model is a 12-layer encoder backbone with 384 channels and 8 attention heads that feeds into a 4-layer
  MLP segmentation head.
It has 25,543,369 parameters and a size of 114.19 MB on disk.
At predict time it uses 1.96 GB of GPU RAM.

We compute loss pixelwise using Focal Loss \cite{ross2017focal} with a small downweighting of pixels towards
  the edge of the window.
Our optimizer is AdamW \cite{loshchilov_decoupled_2018}, and we experiment with varying learning rate,
  weight decay, and perturb scale (implementing the shrink-and-perturb trick~\cite{ash_warm_starting_2020,dohare_loss_2023}).
We employ a OneCycle learning rate scheduler \cite{smith2019super} with a cosine annealing strategy and
  starting fraction of 0.3.
Our effective batch size is 24, with a real batch size of 2 and 12 gradient accumulation steps.
This setup consumes approximately 20 GB of GPU RAM during training.
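
For reference, the focal loss \cite{ross2017focal} for a single pixel has the standard form shown first below.
The symbols $p_t$, $\alpha_t$, and $\gamma$ follow the notation of that work, and their values are not restated in
  this appendix.
The per-pixel weight $w_i$ in the second expression is a hypothetical notation for the edge downweighting, given
  only as a sketch of how such a weight could enter the loss:
\[
  \mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t),
  \qquad
  \mathcal{L} = \frac{\sum_{i} w_i \, \mathrm{FL}(p_{t,i})}{\sum_{i} w_i},
\]
where $p_{t,i}$ is the predicted probability of the true class at pixel $i$, and $w_i$ is assumed to be smaller for
  pixels near the window boundary.
Similarly, the shrink-and-perturb update \cite{ash_warm_starting_2020} rescales the weights $\theta$ and adds
  noise, which can be written as
\[
  \theta \leftarrow \lambda \, \theta + \sigma \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]
where $\lambda$ and $\sigma$ are generic symbols; how the perturb scale reported here maps onto them is left
  unspecified in this sketch.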

\subsection{VIT-sseg Model Experiments}

To establish a baseline, we evaluated 35 training runs where we varied input resolutions, window sizes,
  model depth, and other parameters.
Although this initial search was somewhat ad-hoc, it provided insights into the optimal configuration for
  our model.
Building on the best hyperparameters from this search, we performed a sweep over 7 combinations of learning
  rate, weight decay, and perturb scale (i.e., shrink and perturb
  \cite{ash_warm_starting_2020,dohare_loss_2023}).
Scripts used to reproduce these experiments, as well as a log of the ad-hoc experiments, are available in
  the code repository.
Additionally, trained models are packaged and distributed with information about their training
  configuration.

Note that the test dataset used in this appendix section is an older 30-image version with suffix {\tt d8988f8c},
  which is a subset of the more recent 121-image test set used in the main paper.


\begin{table*}[t]
\caption{
Results for the best-performing models on the validation set across 7 hyperparameter configurations.
The table provides detailed information about each configuration, including:
1) Configuration name (first column): a unique code identifying each training run used in the score scatter and box plots.
2) Varied hyperparameters (next three columns): specific values for learning rate, weight decay, and perturb scale that were used in each run.
3) Validation set performance (AP and AUC scores): metrics evaluating the model's performance on the validation set.
4) Test set performance (AP and AUC scores): metrics evaluating the model's performance on the test set using the same validation-maximizing models.
Note that the top AP score over all models on the test set was 0.65, but it did not correspond to one of these validation runs used for model selection.
Qualitative examples illustrating the performance of the top-scoring validation model listed here are provided in \cref{fig:test_heatmaps_with_best_vali_model}.
}
\label{tab:parameters_and_results}
\centering
\begin{tabular}{llllllll}
\toprule
            \multicolumn{4}{l}{} & \multicolumn{2}{c}{Validation (n=691)} & \multicolumn{2}{c}{Test (n=30)} \\
config name &   lr & weight\_decay & perturb\_scale & AP & AUC & AP & AUC \\
\midrule
        \textcolor[HTML]{623682}{D05} & 1e-4 &   1e-6 &  3e-6 & \textbf{0.7802} & \textbf{0.9943} &          0.5051 &          0.9125 \\
        \textcolor[HTML]{df8020}{D03} & 1e-4 &   1e-5 &  3e-7 &          0.7758 &          0.9707 &          0.4346 &          0.8576 \\
        \textcolor[HTML]{87b787}{D04} & 1e-4 &   1e-7 &  3e-7 &          0.7725 &          0.9818 &          0.4652 &          0.7965 \\
        \textcolor[HTML]{207fdf}{D02} & 1e-4 &   1e-6 &  3e-7 &          0.7621 &          0.9893 & \textbf{0.5167} & \textbf{0.9252} \\
        \textcolor[HTML]{20df20}{D00} & 3e-4 &   3e-6 &  9e-7 &          0.7571 &          0.9737 &          0.4210 &          0.7766 \\
        \textcolor[HTML]{df20df}{D01} & 1e-3 &   1e-5 &  3e-6 &          0.7070 &          0.9913 &          0.4607 &          0.9062 \\
        \textcolor[HTML]{b00403}{D06} & 1e-4 &   1e-6 &  3e-8 &          0.6800 &          0.9773 &          0.4137 &          0.8157 \\
        
\bottomrule
\end{tabular}
\end{table*}

\begin{comment}
    SeeAlso:
    ~/code/shitspotter/experiments/geowatch-experiments/run_pixel_eval_on_vali_pipeline.sh
    python ~/code/shitspotter/dev/poc/estimate_train_resources.py
\end{comment}

For each of the 7 hyperparameter combinations, we trained the model for 163,840 optimizer steps using a
  batch size of 24.
We defined an ``epoch'' as 1,365 steps, at which point we saved a checkpoint, evaluated validation loss, and
  adjusted learning rates.
To conserve disk space, we retained only the top 5 lowest-validation-loss checkpoints (although training
  crashes and restarts sometimes resulted in additional checkpoints, which are included in our evaluation).
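
As a consistency check on these numbers, and assuming that each optimizer step consumes one effective batch of
  24 patches, the step counts relate to the 32,768 patches sampled per epoch as
\[
  \left\lfloor \frac{32{,}768}{24} \right\rfloor = 1{,}365
  \quad \textrm{and} \quad
  163{,}840 \times 24 = 3{,}932{,}160 = 120 \times 32{,}768,
\]
so each training run sees roughly 120 epochs' worth of sampled patches.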

Using the top checkpoints, we predicted heatmaps for each image in the validation set.
We then performed binary classification on each pixel (poop-vs-background) using a threshold.
Next, we rasterized the truth polygons.
The corresponding truth and predicted pixels were accumulated into a confusion matrix, allowing us to
  compute standard metrics such as precision, recall, and false positive rate
  \cite{powers_evaluation_2011} at that threshold.
By sweeping a range of thresholds, we calculated the average precision (AP) and the area under the ROC curve
  (AUC).
We computed all metrics using scikit-learn \cite{scikit-learn}.
Due to the high number of true negative pixels, we preferred AP as the primary measure of model quality.
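
For reference, with $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{FN}$, and $\mathrm{TN}$ denoting the pixel counts in the
  confusion matrix at a given threshold, the standard definitions are
\[
  \textrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
  \qquad
  \textrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
  \qquad
  \textrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.
\]
The summary below assumes the step-wise summation used by scikit-learn's \texttt{average\_precision\_score}, where
  $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold:
\[
  \mathrm{AP} = \sum_{n} (R_n - R_{n-1}) \, P_n .
\]
Neither precision nor recall depends on $\mathrm{TN}$, whereas the false positive rate (and therefore the AUC)
  does, so a large number of true negative pixels can inflate AUC more readily than AP.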
  
The details of the top model for each run, along with relevant hyperparameters, are presented in
  \Cref{tab:parameters_and_results}.
This table also includes the results of the top model on the small held-out test set.

The results show strong performance on the validation set, with a maximum AP of $0.78$.
However, although the test AP for this model is reasonable, it is substantially lower at $0.51$.
To investigate this discrepancy, we turned to qualitative analysis.

Qualitative results for the test, validation, and training sets are presented in
  \cref{fig:test_heatmaps_with_best_vali_model}.
These examples illustrate both success and failure cases.
The test and validation sets show clear responses to objects of interest, but the test set contains images
  of close-up and partially deteriorated poops.
This suggests a bias in the dataset towards ``fresh'' poops taken from some distance.

Notably, the much larger training set also contains errors, indicating more information can be extracted
  from this dataset using hard-negative mining.
There are clear difficult cases caused by sticks, leaves, pine cones, and dark areas on snow.
While compiling these results, we checked over 1,000 images and discovered 14 cases where an
  object had not been annotated. More have likely been missed, but we believe such cases are rare.

Although focal loss was used, the current learning curriculum likely under-weights smaller, distant
  objects.
Our pixelwise evaluation metric is similarly biased against small objects, which is a current limitation of our approach.
Future work that evaluates this dataset at the object-detection level can remedy this.

\begin{figure*}[ht]
\centering
\includegraphics[width=1.0\textwidth]{figures/test_heatmaps_with_best_vali_model}%
\hfill
(a) Test set.
\includegraphics[width=1.0\textwidth]{figures/vali_heatmaps_with_best_vali_model.jpg}%
\hfill
(b) Validation set.
\includegraphics[width=1.0\textwidth]{figures/train_heatmaps_with_best_vali_model.jpg}%
\hfill
(c) Training set.
\caption[]{
    Qualitative results using the top-performing model on the validation set, applied to a selection of images
      from the (a) test, (b) validation, and (c) training sets.
    Success cases are presented on the left, with failure cases increasing towards the right.
    %
    Each figure is organized into three rows:
    %
    Top row:
    Binarized classification map, where true positive pixels are shown in white, false positives in red, false
      negatives in teal, and true negatives in black.
    The threshold for binarization was chosen to maximize the F1 score for each image, showcasing the best
      possible classification of the heatmap.
    Middle row:
    The predicted heatmap, illustrating the model's output before binarization.
    Bottom row:
    The input image, providing context for the prediction.
    %
    The majority of images in the test set exhibit qualitatively good results.
    Failure cases tend to occur with close-up images of older, sometimes partially deteriorated poops.
    These examples were manually selected and ordered to demonstrate dataset
    diversity in addition to representative results.
}
\label{fig:test_heatmaps_with_best_vali_model}
\end{figure*}


In \Cref{tab:parameters_and_results} we only presented the top results.
Here we plot the AP and AUC on the validation set for the top 5 AP-maximizing checkpoints from each of the
  7 training runs.
We also created a box-and-whisker plot for these top 5 results, which serves to assign a color and label to
  each training run.
These plots are shown in \Cref{fig:apauc_scatter}.


\begin{figure}[ht]
\centering
\begin{subfigure}[b]{0.4\textwidth}
 \includegraphics[width=\textwidth]{figures/macro_results_resolved_params.heatmap_pred_fit.trainer.default_root_dir_metrics.heatmap_eval.salient_AP_vs_metrics.heatmap_eval.salient_AUC_PLT02_scatter_nolegend.png}
 \caption{AP and AUC of 35 checkpoints.}
 \label{fig:apauc_scatter_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.4\textwidth}
 \includegraphics[width=\textwidth]{figures/macro_results_resolved_params.heatmap_pred_fit.trainer.default_root_dir_metrics.heatmap_eval.salient_AP_PLT04_box.png}
 \caption{AP of 35 checkpoints.}
 \label{fig:apauc_scatter_b}
\end{subfigure}
\caption{
    (a) Scatterplot of pixelwise average precision (AP) and area under the ROC curve (AUC) for the top
      5 checkpoints of each training run, evaluated on the validation set.
    Points of the same color represent checkpoints from the same training run, which used identical
      hyperparameters.
    (b) Box-and-whisker plot of the AP values across the top 5 checkpoints of each run, evaluated on
      the validation set.
    For each run, corresponding varied hyperparameters and maximum APs are given in
      \Cref{tab:parameters_and_results}.
}
\label{fig:apauc_scatter}
\end{figure}


\begin{table}[t]
\caption[]{
Resources used for training, prediction, and evaluation.
The ``node'' column is the pipeline stage:
``train'' for training, ``pred'' for heatmap prediction, and ``eval'' for pixelwise heatmap evaluation.
The ``resource'' column lists the resource type: time, energy, or emissions.
The ``total'' and ``$\mu$'' columns show the total and mean consumption, and the ``n'' column indicates the
  number of times each stage was run (e.g., across different hyperparameters).
Train rows marked with an asterisk (*) are based on indirect measurements.
}
\label{tab:resources}

  \centering
  \begin{subtable}[b]{\textwidth} % Adjust width as needed
    \caption{Presented experiment resources.}
    \centering
    \begin{tabular}{llllr}
    \toprule
            Node & Resource    &           Total  &           $\mu$ &  n \\
    \midrule
    eval        &        time  & 14.24 hours      & 0.41 hours     &   35 \\
    \rule{0pt}{2ex}%
    pred        &        time  & 11.97 hours      & 0.34 hours     &   35 \\
    pred        &      energy  &  8.76 kWh        & 0.25 kWh       &   35 \\
    pred        &   emissions  &  1.84 \cotwo kg  & 0.05 \cotwo kg &   35 \\
    \rule{0pt}{2ex}%
    train$^{*}$ & time         &  39.22 days      & 5.60 days      &   7 \\
    train$^{*}$ & energy       & 324.75 kWh       & 46.39 kWh      &   7 \\
    train$^{*}$ & emissions    &  68.20 \cotwo kg & 9.74 \cotwo kg &   7 \\
    \bottomrule
    \end{tabular}
  \end{subtable}

  \hfill % Add horizontal space between the subfigures

  \begin{subtable}[b]{\textwidth} % Adjust width as needed
    \caption{All experiment resources.}
    \centering
    \begin{tabular}{llllr}
    \toprule
            Node & Resource &           Total &            $\mu$ &  n \\
    \midrule
    % Note: for presentation simplicity, we are rewriting the following row
    % so num agrees with other rows. The reason the original value had an
    % additional number is because of rerun of one evaluation with different
    % parameters.
    % eval & time & 4 days 14:18:23 & 00:20:07 &  330 \\
    % 5.85 * 399/400 = 5.84
    %eval &        time &    5.85 day &  0.35 hours &  400 \\
    eval        &        time &    5.84 days     &  0.35 hours    &  399 \\
    \rule{0pt}{2ex}%
    pred        &        time &    7.29 days     &  0.44 hours    &  399 \\
    pred        &      energy &  102.83 kWh      &   0.26 kWh     &  399 \\
    pred        &   emissions &  21.6 \cotwo kg  & 0.05 \cotwo kg &  399 \\
    \rule{0pt}{2ex}%
    train$^{*}$ & time        & 158.95 days      &     3.78 days  &   42 \\
    train$^{*}$ & energy      & 1,316.07 kWh     &     31.34 kWh  &   42 \\
    train$^{*}$ & emissions   & 276.37 \cotwo kg & 6.58 \cotwo kg &   42 \\
    \bottomrule
    \end{tabular}
  \end{subtable}
\end{table}


\subsubsection{Resource Usage}

All models were trained on a single machine with an Intel Core i9-11900K CPU and an NVIDIA GeForce RTX 3090 GPU.
At predict time, using one background worker, our models processed 416 $\times$ 416 patches at a rate of
  20.93 Hz with 94\% GPU utilization.

To better understand the energy requirements of our model, particularly for potential deployment on mobile
  devices, we used CodeCarbon \cite{lacoste2019codecarbon} to measure the resource usage during prediction and
  evaluation.
This analysis not only informs practical considerations but also helps us assess our contribution to the
  growing carbon footprint of AI \cite{kirkpatrick_carbon_2023}.
The results for the 7 presented training experiments and the total 42 training experiments are reported in
  \Cref{tab:resources}.

% See: ./scripts/estimate_training_resources.py
Direct measurement of resource usage during training is still under development, but we estimate the
  duration of each training run using indirect methods.
We approximate energy consumption by assuming a constant power draw of 345 W from the 3090 GPU during
  training.
Emissions are estimated using a conversion ratio of 0.21 $\frac{\textrm{kg}\cotwo{}}{\textrm{kWh}}$.
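
As a worked example under these assumptions, the totals for the 7 presented training runs in \Cref{tab:resources}
  follow from the estimated duration of 39.22 days:
\[
  39.22~\textrm{days} \times 24~\frac{\textrm{h}}{\textrm{day}} \times 0.345~\textrm{kW} \approx 324.7~\textrm{kWh},
\]
\[
  324.75~\textrm{kWh} \times 0.21~\frac{\textrm{kg}\cotwo{}}{\textrm{kWh}} \approx 68.20~\textrm{kg}\cotwo{}.
\]
The small difference from the tabulated 324.75 kWh is presumably due to rounding of the reported duration.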
  
Based on the validation set's 691 images, we estimate that predicting on a single image on our desktop
  requires approximately 1.15 seconds and 0.13 Wh of energy.
For context, typical mobile phones have a battery capacity of around 10 Wh and significantly less compute
  power than our desktop setup.
While our models demonstrate the feasibility of training a strong detector from our dataset, they are not
  optimized for the mobile setting.
To deploy our model on mobile devices, we will need to improve its efficiency or explore more efficient
  architectures.
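
To put these figures in perspective, and taking the per-image estimates above at face value, the implied
  average power draw and the number of images that a 10 Wh budget would cover at desktop efficiency are
\[
  \frac{0.13~\textrm{Wh} \times 3600~\textrm{s/h}}{1.15~\textrm{s}} \approx 407~\textrm{W},
  \qquad
  \frac{10~\textrm{Wh}}{0.13~\textrm{Wh}} \approx 77~\textrm{images}.
\]
This is only a rough bound, since a phone would not run the model at the same energy cost per image as our desktop.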



\subsubsection{Dataset Versions}

There are two main versions of the dataset used in this paper. We can specify
these using content-based identifiers. The version from 2024-07-03 has an IPFS CID of: 
\texttt{\seqsplit{bafybeiedwp2zvmdyb2c2axrcl455xfbv2mgdbhgkc3dile4dftiimwth2y}} 
and a BitTorrent magnet of: 
\texttt{\seqsplit{magnet:?xt=urn:btih:ee8d2c87a39ea9bfe48bef7eb4ca12eb68852c49}}.
The version from 2025-04-20 has an IPFS CID of: 
\texttt{\seqsplit{bafybeia2uv3ea3aoz27ytiwbyudrjzblfuen47hm6tyfrjt6dgf6iadta4}} 
and a BitTorrent magnet of: 
\texttt{\seqsplit{magnet:?xt=urn:btih:27a2512ae93298f75544be6d2d629dfb186f86cf}}.
Note: the hash suffix of the magnet URL can be searched on \url{academictorrents.com}.


At the time of writing, the version of the dataset on HuggingFace is the latest, and we use git tags that
  correspond to the release date and the IPFS CID to help identify dataset versions.
However, unlike the content-addressed decentralized methods, these centralized links are not guaranteed to point to the
  expected version of the dataset.
At the time of writing the HuggingFace URL is:
\url{https://huggingface.co/datasets/\redact{<ANONIMIZED_AUTHOR>}/scatspotter} and the Girder URL is:
\url{https://data.\redact{<ANONIMIZED_ORGANIZATION>}.com/?#user/598a19658d777f7d33e9c18b/folder/66b6bc7ef87a980650f41f98}.

%&tr=https%3A%2F%2Facademictorrents.com%2Fannounce.php%3Fpasskey%3D9ffbb169f882f3be1330a48ea87416e7&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
