# Finding Most Influential Sets

R implementation of algorithms for most influential set (MIS) selection using the Dinkelbach algorithm.

## Setup

```r
# Source functions
for (f in list.files("R", "\\.R$", full.names = TRUE)) {
  source(f, local = FALSE)
}
```

Some external packages are needed to reproduce the results; most are optional for basic use:
```r
# Mostly Optional Dependencies
install.packages("Rcpp") # Fast Top-K heap
install.packages(c("gbm", "xgboost", "randomForest")) # PLM options (gradient boosting, random forest)
devtools::install_github("nk027/infuential_sets") # Implementation of greedy algorithm
install.packages("haven") # Read DTA files
install.packages("robustbase") # Stats benchmarks
install.packages("MASS") # Boston housing
install.packages("rmarkdown") # Compile the applications report
install.packages("kableExtra") # Nicer tables
```
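
Since most of these packages are optional, it can help to check which ones are already available before running the scripts. The following is a minimal sketch that uses only base R; the package names are copied from the install commands above, and the GitHub-hosted package is left out because its installed name is not stated here.

```r
# Package names taken from the install.packages() calls above
optional_pkgs <- c(
  "Rcpp", "gbm", "xgboost", "randomForest",
  "haven", "robustbase", "MASS", "rmarkdown", "kableExtra"
)

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(optional_pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- optional_pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed CRAN packages are installed.")
}
```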

## Usage

```r
# Obtain data
N <- 1000L
x <- rnorm(N)
y <- x + rnorm(N)
# Create model
model <- lm(y ~ 0 + x)

# Find most influential set of size k
result <- find_miss(model, k = 10)
# Find sets for sizes 1 to K
results <- find_misses(model, K = 5)
```

### Available Functions

- `find_miss(model, k)`: Find the optimal set of size `k`
- `find_misses(model, K)`: Find the sets for sizes 1 to `K`
- `enumerate_miss(model, k)`: Exhaustive enumeration
- `greedy_miss(model, k)`: Greedy approximation
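
To give a feel for how a Dinkelbach-type search differs from exhaustive enumeration, here is a minimal, self-contained sketch. It is not the code in `R/`, and it rests on several assumptions: a single regressor without intercept (as in the usage example), and influence measured as the decrease in the coefficient after dropping a set. The names `dinkelbach_sketch` and `coef_drop` are hypothetical. In this setting the decrease equals `sum(x[S] * e[S]) / (sum(x^2) - sum(x[S]^2))`, a ratio that Dinkelbach's method maximizes by repeatedly solving a top-`k` selection problem.

```r
# Hypothetical helper: decrease in the no-intercept OLS slope after dropping S
coef_drop <- function(x, y, S) {
  sum(x * y) / sum(x^2) - sum(x[-S] * y[-S]) / sum(x[-S]^2)
}

# Dinkelbach sketch (assumption: y ~ 0 + x, target = largest decrease in slope)
dinkelbach_sketch <- function(x, y, k, tol = 1e-12, max_iter = 100L) {
  sxx <- sum(x^2)
  beta <- sum(x * y) / sxx
  e <- y - beta * x      # full-sample residuals
  num <- x * e           # contribution of each observation to the numerator
  den <- x^2             # amount each observation removes from the denominator
  lambda <- 0
  for (iter in seq_len(max_iter)) {
    # Parametric subproblem: maximize N(S) - lambda * D(S), solved by top-k scores
    S <- order(num + lambda * den, decreasing = TRUE)[seq_len(k)]
    ratio <- sum(num[S]) / (sxx - sum(den[S]))
    if (abs(ratio - lambda) <= tol) break  # fixed point reached
    lambda <- ratio
  }
  list(set = sort(S), change = ratio)
}

# Small example, checked against exhaustive enumeration of all 3-subsets
set.seed(1)
x <- rnorm(12)
y <- x + rnorm(12)
fit <- dinkelbach_sketch(x, y, k = 3)
brute <- max(combn(length(x), 3, function(S) coef_drop(x, y, S)))
isTRUE(all.equal(fit$change, brute))
```

Each iteration only needs a top-`k` selection, whereas the enumeration grows combinatorially in `k`, which is why the exhaustive check is only practical for small examples like this one.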

## Analyses

Empirical analyses are provided in `scripts/`. They rely on data that either ships with **R** packages or is stored in `data/`:

- `applications.Rmd` compiles `applications.html`, which contains results for the additional applications mentioned in the paper
- `microcredit.R` produces the microcredit figures (up to manual touch-ups in Inkscape)
- `simulation.R` produces part of the simulation results

Please note that the residualization strategy and selected simulation results were bespoke at the time of submission and are not yet available in this Supplement.
