A log of experiences while developing this project that aren't always obvious
from the git log.


# 2024-08-01

Trying to get BitTorrent to work in the situation where I seed something on one
machine in a network and then download it on another machine on the same
network. I am going to write a Stack Overflow post about this, and I will
draft it here. I'm looking for a way to record the following sort of
difficulties, and this "dev/lab journal" is what I came up with.

https://superuser.com/questions/1851016/a-minimal-example-of-sharing-data-with-bittorrent

.. code:: 

    A minimal example of sharing data with BitTorrent 

    I want to write a minimal, reproducible, and configurable example that
    demonstrates how to create, seed, and download a torrent.

    ### Issue & Question

    This is much harder than I anticipated. The command line tooling is weaker than
    I would have guessed, but it exists, and it should be possible. There are a lot
    of edge cases to consider when writing a networking example, and perhaps my
    difficulty is because I'm hitting one of those. I'm looking for help with
    writing up this example in a way that helps document what some of the issues
    might be.


    ### Setup

    I'm running all of this on a home network behind a simple router. I'm using two
    machines on this network: a Raspberry Pi 4 and my x86-64 desktop. Both are
    running Ubuntu. I'm not sure if being on the same LAN behind the same NAT-ed
    router is going to cause an issue. The seeding machine does have ports 6969 and
    51413 forwarded to it, but the download machine does not. Is that an issue?
    I would think it would still work.

    ### Script

    The following is the MWE I've written so far. Not sure how to make RST look
    nice on SuperUser, so putting it in a code block:

        <code from ~/code/shitspotter/docs/source/manual/minimal_bittorrent_example.rst>
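
For reference, the "create" step of the example can be sketched in plain
Python. This is not the script from the post (which uses command line tools).
It is a minimal sketch of the single-file metainfo layout described in BEP 3,
assuming the whole payload fits in memory. The function names and the tracker
URL in the usage note are hypothetical.

```python
import hashlib


def bencode(obj) -> bytes:
    """Encode ints, strings, bytes, lists, and dicts as bencode (BEP 3)."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(item) for item in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys are byte strings and must be written in sorted order.
        items = [
            (key.encode("utf-8") if isinstance(key, str) else key, value)
            for key, value in obj.items()
        ]
        items.sort(key=lambda pair: pair[0])
        body = b"".join(bencode(key) + bencode(value) for key, value in items)
        return b"d" + body + b"e"
    raise TypeError(f"cannot bencode {type(obj).__name__}")


def make_torrent(data: bytes, name: str, announce: str, piece_length: int = 2 ** 14):
    """Return (metainfo_bytes, infohash_hex) for a single-file torrent."""
    # "pieces" is the concatenation of the 20-byte SHA-1 digest of each piece.
    pieces = b"".join(
        hashlib.sha1(data[start:start + piece_length]).digest()
        for start in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    metainfo = bencode({"announce": announce, "info": info})
    # The infohash that peers and trackers use is the SHA-1 of the bencoded info dict.
    infohash = hashlib.sha1(bencode(info)).hexdigest()
    return metainfo, infohash
```

Calling `make_torrent(payload, "demo.bin", "http://192.168.1.10:6969/announce")`
(a made-up LAN tracker address) returns bytes that can be written to a
`.torrent` file. Seeding and downloading still need a real client.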

# 2024-08-02

Made good progress on writing the paper. Went from 1ish to 4ish pages. The layout is coming together and I think the abstract is compelling.

# 2024-08-03

And my superuser question got closed for being too open ended. Annoying. In any case, I might have figured out the problem. It looks like the downloading machine does connect to the seed machine, but there is a permission issue. I'm hoping that changing the download location fixes it...

Also my "h" key is stuck and I'm currently typing it by copying it into the clipboard and pasting it. Need to clean my keyboard.

# 2024-09-21

Got desk rejected due to going over length, bummer. It was a stupid mistake
where I included the appendix in the main paper instead of the supplementary
materials. Guess I can try for CVPR, but passing review there will be less
likely.

We will increase our chances of getting into CVPR if we include SOTA models.
I think I will try detectron2, as mmdet has not been great to work with lately,
but I might go back there later. When working with detectron2 I'm noticing
some pain points, which I should document. I will put these notes in:

~/code/shitspotter/experiments/detectron2-experiments/README.md


# 2024-09-29


I just realized that the validation results I reported in the paper were
generated from the incorrect data. I used vali_imgs228_20928c8c.kwcoco.zip
instead of vali_imgs691_99b22ad0.kwcoco.zip. I suppose it's good the paper was
rejected, otherwise I would have had to issue a retraction. More time to get it
right I suppose.

# 2024-10-14

Solicited pics on https://www.reddit.com/r/<ANONIMIZED_LOCATION>/comments/1g28ger/<ANONIMIZED_LOCATION>_dog_poop_detection_send_me_your_poop_pics/
Not sure if it will get any hits. Some of the responses were surprising.

# 2024-12-29

A few days ago I got 75 high-quality pictures from someone online. It doesn't
seem related to the <ANONIMIZED_LOCATION> reddit post which got nothing. There are some
challenging cases in the images and they are perfect for testing. Very happy
about that.

I've been unable to update as frequently as I would have liked in the past 2
months. I'm trying to figure out the best way to temporarily remove
identifiable metadata (i.e. GPS location). This is challenging for a few
reasons. GPS turns out to be less accurate than I would like, and surrounding
buildings seem to throw things off by a lot to the point where there isn't a
good line between locations I want to be public and those I want to be private.

What I've done so far is:

* Set up a transcrypt repo that stores secret logic for how I determine which images will have their metadata removed. Some of this can be based on geolocation, but due to GPS errors that is not enough, and we need a different non-public method.
* After running the images through the logic, the ones that are sensitive are marked to have their GPS info removed.
* Given a list of marked images, I scrub the metadata, but I also create a binary diff that, if applied to the scrubbed image, would "rehydrate" it and restore it to its original state.
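
The scrub-and-rehydrate idea in the last bullet can be sketched as follows.
This is not the logic in the transcrypt repo. It is a minimal illustration
using `difflib` from the standard library, and the patch format and function
names are made up for this example. `SequenceMatcher` is slow on large inputs,
so a real pipeline would likely use a dedicated binary diff tool.

```python
from difflib import SequenceMatcher


def make_patch(scrubbed: bytes, original: bytes) -> list:
    """Build a patch that turns `scrubbed` back into `original`."""
    matcher = SequenceMatcher(None, scrubbed, original, autojunk=False)
    patch = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes survive in the scrubbed file, so only store the range.
            patch.append(("copy", i1, i2))
        else:
            # Bytes that were removed or changed by scrubbing are stored verbatim.
            patch.append(("data", original[j1:j2]))
    return patch


def apply_patch(scrubbed: bytes, patch: list) -> bytes:
    """Rehydrate the original bytes from the scrubbed bytes and the patch."""
    out = bytearray()
    for op in patch:
        if op[0] == "copy":
            out += scrubbed[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the `data` entries contain sensitive bytes, so the patch is the part that
needs to be encrypted before it is distributed.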

These diffs are going to be stored in an encrypted format and distributed with
the dataset. I do eventually want to "rehydrate" the stripped metadata, as
noisy GPS locations won't be sensitive forever (and frankly aren't that
sensitive right now, but this is more of a privacy exercise than a critical
component, otherwise I wouldn't even post the encrypted version online). An
interesting way I can think to do this would be to set up a dead man's switch
on Ethereum's EVM. Perhaps using https://sarcophagus.io/

RPI grad has a paper about kill switches: https://arxiv.org/html/2407.10302v1

The idea would be to store the encrypted version of the key on the blockchain
and then after some amount of time passes or some other condition is met the
decryption will automatically take place, and the full dataset will effectively
become public.

I'm currently using the detectron2 model to seed annotations for the next
round. I think it's at the point where it is ultimately less work. I'm also
annotating false positives (e.g. rock / leaf) so the network can be given those
regions explicitly as hard negatives.

# 2024-12-30

Well, I got the scrubbing and release done. The pattern of uploading the entire
dataset to IPFS every time is somewhat unsustainable. It currently needs to
scan 53GB to build the final CID, and it's taking over 30 minutes just to build
the CID. Not the end of the world, but it shouldn't be taking this long, or at
least we should be able to have a system where only the new data needs to be
scanned.

I think what I will do is use the new shitspotter.ipfs logic to build DVC-like
sidecar files and then push the sidecars to a git repo. This will give much
better response times in the future.

However, something to note is that pinning on a new machine where all of the
sub-CIDs have already been pinned is very fast. Err, this might not be correct.

Yeah, the CID it spat out was something that already existed, and the issue was
that it could read in a file with write protections. 


# 2025-02-07


I found a new related dataset:

Poultry fecal imagery dataset for health status prediction: A case of South-West Nigeria
Halleluyah O Aworinde, Segun Adebayo, Akinwale O Akinwunmi, Olufemi M Alabi,
Adebamiji Ayandiji, Aderonke B Sakpere, Abel K Oyebamiji, Oke Olaide,
Ezenma Kizito, Abayomi J Olawuyi
PMID: 37674505 PMCID: PMC10477973 DOI: 10.1016/j.dib.2023.109517

https://pubmed.ncbi.nlm.nih.gov/37674505/

https://universe.roboflow.com/search?q=class:fecal%20matter

https://chatgpt.com/c/67a6a1ed-bd94-8013-b5a8-f19f8f9c9bd2

https://ieee-dataport.org/documents/fecal-microscopy-dataset

https://universe.roboflow.com/thesis-pr4oh/fecal-lbh0j

https://arxiv.org/pdf/1903.10578

https://www.reddit.com/r/poop/

https://www.theverge.com/2019/10/29/20937108/poop-database-ai-training-photo-upload-first-mit


# 2025-03-07

I haven't been updating this project much in the last 2-3 months, but I've
still been taking pictures.

I'm starting an update today, but it's a big bundle of 1199 images from my phone
(of which some are for this dataset). Looks like 440 of them were poop images,
so 145ish new poop images, which should give us over 2k positive examples.

My picture taking pattern has also changed, given that we moved from a place
without a yard to a place with a yard. Now the images are much more bursty as I
don't need to take the dogs out on a leash. So I pick a bunch up later. That
means less diverse time of day, but now there is less of a freshness bias as
I'll wait a few days before picking everything up. I do still go on walks into
the city, so there are still new fresh and not-my-dogs examples coming in.

Been working on the YOLO training. Out of the box it did not work. I had to do
some rewrites of the WongKinYiu YOLO fork
(https://github.com/MultimediaTechLab/YOLO.git). See
https://github.com/MultimediaTechLab/YOLO/pull/150

After I wrote code to visualize the training batches, I realized the
augmentations were all wrong for this dataset: there was no flipping, and
the mosaic and crop/resize probably hurt convergence. I played with exposing
more Lightning CLI features and tuned gradient clipping, lr, batch size, and
simplified the dataset (merged small nearby boxes into a single box) with the
intent of helping YOLO learn better. ChatGPT said YOLO has an unstable loss
surface, but I'd like to see sources that better explain why. I believe it
though. YOLO is a bitch to train. Was in v2, and it seems like it still is in
v9. But when it works, it does work well. But I've noticed gradients like to
explode. Wish I understood if this actually was more likely to happen or not
with YOLO vs other models.
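
The "merge nearby boxes" simplification can be sketched as follows. This is
not the code used for training. It assumes boxes are `(x1, y1, x2, y2)` tuples
in pixels, ignores box size, and the `max_gap` threshold is a hypothetical
parameter.

```python
def boxes_are_near(a, b, max_gap):
    """True if the gap between two (x1, y1, x2, y2) boxes is at most max_gap."""
    # The gap along an axis is 0 when the boxes overlap on that axis.
    gap_x = max(a[0] - b[2], b[0] - a[2], 0)
    gap_y = max(a[1] - b[3], b[1] - a[3], 0)
    return gap_x <= max_gap and gap_y <= max_gap


def merge_nearby_boxes(boxes, max_gap=10):
    """Repeatedly replace any pair of nearby boxes with their enclosing box."""
    merged = [tuple(box) for box in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if boxes_are_near(a, b, max_gap):
                    merged[i] = (
                        min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3]),
                    )
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan because the list was modified
    return merged
```

The scan restarts after every merge, which is quadratic or worse, but that is
fine for the handful of boxes in a typical image.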

Running with the simplified data, a low LR of 0.0003, a batch size of 16*50,
and starting from v9 weights seems to be doing better than I've done on this
dataset before.

# 2025-03-08

The two LRs I tested yesterday did about the same. Today I got AdamW working
(and maybe CosineLR, although it's not clear I got the scheduler right), and it
converges to about the same performance much, much quicker. Getting AP@0.5 of
~64% and AP@0.5:0.95 of 42%. Small recall is about 60%, medium recall is about
80%.
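
One way to sanity check a cosine schedule is to compare the logged learning
rate against the closed form. The sketch below is the standard cosine
annealing curve without warmup or restarts. Whether it matches the scheduler
actually configured in the training run is an assumption, and the function
name is made up.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `base_lr=0.0003`, the curve starts at 0.0003, passes through 0.00015 at
the halfway point, and reaches `min_lr` at the final step.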


# 2025-04-21

Working on getting the dataset ready for another submission. Hugging Face has
impressive dataset distribution infrastructure. It's somewhat decentralized in
that it uses git-lfs, so maybe it is the best of both worlds. In any case, I had
uploaded the 2024-08-09 snapshot to it before, which happens to correspond to
the 2024-07-03 dataset (as I didn't update the data while I was working on that
original submission). Additionally I tagged it with its IPFS CID:
bafybeiedwp2zvmdyb2c2axrcl455xfbv2mgdbhgkc3dile4dftiimwth2y, which
unfortunately won't match the data because it's zipped, and I think the CID had
a few extra files that belong to it. In any case, I'm writing this to document
that all 3 of those identifiers belong to the same dataset.

The updated version of the dataset for the new publication will have a CID of:
bafybeia2uv3ea3aoz27ytiwbyudrjzblfuen47hm6tyfrjt6dgf6iadta4 and its zipped
snapshot date is: 2025-04-21T184031-5.


# 2025-04-26


Hugging Face is so fast. I spoke with colleagues, and the likely reasons are
CDNs and CAKE, which prevents bufferbloat.

Here are notes I took:

CDNs and CAKE.
Cake solves buffer bloat - look into this.
Filling all buffers along TCP path.
https://github.com/huggingface/hf_transfer
https://www.bufferbloat.net/projects/codel/wiki/Cake/


Talks on Data. 
https://www.youtube.com/watch?v=KD5TwLQnq_8
https://www.youtube.com/watch?v=PgFcqRMDqlk

Also we may want to make these points: 

Annotations for data collected in <confirm time range> late 2024-2025 used
trained models to seed initial detections. However, each case is still manually
checked.  For simple cases this worked well, but there were a significant
number of errors that needed to be corrected. We did not delete the false
positive detections. Instead we gave them new category names (e.g. leaf, stick,
grass, shadow). These can serve as explicit hard negatives. We also began
adding notes to some annotations indicating the appearance of the poop (e.g.
old, smashed, wet).
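
The relabeling step described above can be sketched like this. The dictionary
fields (`id`, `category`, `hard_negative`) and the review mapping are
hypothetical and do not reflect the dataset's actual schema.

```python
def apply_review(detections, corrections):
    """Turn model detections into annotations after a manual review pass.

    `detections` is a list of dicts with at least "id" and "category".
    `corrections` maps a detection id to the category a reviewer assigned
    to a false positive (for example "leaf" or "shadow").
    """
    annotations = []
    for det in detections:
        ann = dict(det)  # copy so the model output is left untouched
        if det["id"] in corrections:
            # Keep the false positive, but rename it so it can be used
            # as an explicit hard negative.
            ann["category"] = corrections[det["id"]]
            ann["hard_negative"] = True
        else:
            ann["hard_negative"] = False
        annotations.append(ann)
    return annotations
```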


# 2025-05-03

Spent some time getting the 2025-04-20 version of the dataset into a torrent.
Made some updates to the CLI wrapper around transmission-remote. Was having a
hard time getting new downloads to start, but after some time I realized I
forgot to forward port 51413 to my torrent box running transmission. Seems to
be starting now.

Also going to try forwarding 58334 to my box running deluge, and that seemed to
work.
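
A quick way to check whether a forwarded port accepts TCP connections is
sketched below. The helper name is made up. It only tests TCP, and to verify
the forwarding itself it has to be run from outside the LAN against the
router's public address.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_port_open(public_ip, 51413)` should return True once the
forward to the transmission box is in place, where `public_ip` is whatever
address the router exposes.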
