% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{tabularx}
\usepackage{makecell}
\usepackage{array}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%\urlstyle{rm}
%
\begin{document}
%
\title{Language Model Morphology Evaluation on Canadian Indigenous Languages}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Duncan Stothers\inst{1,2}\orcidID{0000-0001-6873-851X}}
%
\authorrunning{D. Stothers}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.

\institute{Harvard University \and
University of British Columbia}

\maketitle              % typeset the header of the contribution
%
\begin{abstract}
We present an evaluation layer for Canadian Indigenous NLP. We combine an auditable Inuktut$\rightarrow$English machine translation sanity baseline with morphology-aware probes for Plains Cree (n\^ehiyaw\^ewin) and Ojibwe (Anishinaabemowin). For morphology, we evaluate reinflection from lemma plus feature bundle to surface form and structured analysis from surface form to plus-delimited segmentation and tags. We then ablate prompts, trivial and heuristic baselines, model choice, and a lightweight hybrid fallback strategy.

The results expose a gap in the current literature. On Inuktut MT, an open NLLB baseline remains brittle even after a sweep over six language-code settings: the best configuration reaches BLEU $0.10$ and chrF $8.31$ on WMT20 dev, with $12.62\%$ null outputs and $48.89\%$ punctuation-only outputs, despite numeric recall of $0.88$. On morphology, prompt conditioning sharply improves \emph{analysis} but not \emph{reinflection}. For Cree, analysis (Jaccard overlap) rises from $0.00$ under the original prompt to $0.32$ with two-shot prompting, and the best open model reaches $0.45$. For Ojibwe, analysis rises from $0.00$ to $0.62$. Reinflection is much more resistant to prompting: the best prompt-conditioned language-model scores are $0.17$ exact-match accuracy for Cree and $0.33$ for Ojibwe, while simple heuristics match or outperform these results. A hybrid-lite rescue layer yields selective gains, most notably raising Ojibwe reinflection accuracy to $1.00$ by falling back to simple morphology-aware rules.
\end{abstract}

\keywords{Indigenous NLP \and Canadian Indigenous languages \and Plains Cree \and Ojibwe \and Inuktut \and Computational morphology \and Evaluation}



\section{Introduction}

Canadian Indigenous language technology currently combines two very different forms of progress. On the one hand, some languages now have visible industrial or benchmarked support, especially Inuktut and Inuktitut through the Nunavut Hansard, WMT20, and major platform deployments. On the other hand, several languages have strong linguistic infrastructure, such as analyzers, dictionaries, and corpora, without a correspondingly mature evaluation layer for general-purpose language models. This asymmetry is especially visible in Plains Cree, where open morphology tooling is already substantial, but there are comparatively few low-friction evaluations of what open language models can actually do with inflectional structure. \cite{joanis-etal-2020-nunavut,barrault-etal-2020-findings,knowles-etal-2020-nrc,hernandez-nguyen-2020-ubiqus,moshagen2023giellalt,giellalt-crk,itwewina-dict}

This paper addresses that gap by contributing an \emph{evaluation layer} rather than a new model. The central idea is simple: take task forms that are already canonical in computational morphology, make them runnable under unusually strict reproducibility constraints, and align them with the representations already used in community-facing Indigenous-language infrastructure. The resulting benchmark slices are deliberately small, but they are systematic: they support ablations over prompts, models, baselines, and lightweight hybrid rescue rules, and they produce artifacts that can be rerun on a single workstation without hidden dependencies.

The paper has three linked components. First, we establish an open Inuktut$\rightarrow$English MT sanity baseline on WMT20 dev using NLLB-200 distilled 600M and show that even a standard multilingual open checkpoint can be brittle in ways not obvious from surface support alone. Second, we introduce morphology-aware probes for Plains Cree and Ojibwe using two tasks: reinflection from lemma plus feature bundle to surface form, and structured analysis from surface form to plus-delimited segmentation and tags. Third, we systematically compare prompts, models, heuristics, and a hybrid-lite fallback layer.

This experimental program is motivated by five questions:

\begin{enumerate}
    \item Can an open Inuktut MT baseline be trusted as a default starting point under local, reproducible conditions?
    \item How much of morphology failure in open language models is genuine linguistic weakness, and how much is prompt-format instability?
    \item Are simple heuristics still competitive on low-resource morphology tasks?
    \item Do open model choices matter differently for reinflection and analysis?
    \item Can a lightweight hybrid layer improve utility without abandoning an open and inspectable workflow?
\end{enumerate}

\section{Background and Related Work}

\subsection{Indigenous-language NLP, infrastructure, and process}

A recurring theme in NLP for Indigenous and low-resource languages is that model performance alone is not enough. Global surveys of language representation continue to show severe inequality in data and evaluation, with most languages effectively absent from mainstream pipelines. \cite{joshi2020state} In Indigenous-language work, this broader problem is often expressed more concretely: useful language technology depends on inspectable infrastructure, collaborative process, and whether tools can be rerun and repurposed in community settings rather than only demonstrated in research environments. \cite{bird-2020-decolonising,littell-etal-2018-indigenous,kuhn-etal-2020-indigenous-tech} This perspective is central to our design. Open-only weights, no new installs, local execution, and machine-readable artifacts are not incidental engineering preferences; they are responses to a well-documented infrastructure gap.

\subsection{Canadian Indigenous NLP and the benchmark asymmetry}

Within Canada, the most visible Indigenous-language NLP work has centered on Inuktut and Inuktitut, largely because of the Nunavut Hansard parallel corpus and the WMT20 news translation task. The Hansard established an English--Inuktitut resource that could sustain public evaluation, and WMT20 converted that resource into a shared-task benchmark with multiple system descriptions. \cite{joanis-etal-2020-nunavut,barrault-etal-2020-findings,wmt20task,knowles-etal-2020-nrc,hernandez-nguyen-2020-ubiqus} Later work showed that evaluation design itself matters for this language pair and that character-level metrics correlate well with human judgment in this polysynthetic setting. \cite{knowles-lo-2022-human-eval}

Industrial deployments further increased the visibility of Canadian Indigenous language technology. Google Translate added Inuktut, and Microsoft introduced Inuktitut support in Translator and later publicized neural TTS voices. \cite{google2024inuktut,microsoft2021inuktitut,microsoft2024tts} These deployments matter for access, but they do not provide the kind of open, low-friction benchmark layer that researchers and communities can readily rerun.

\subsection{Multilingual Indigenous NLP beyond sentence-level MT}

Large multilingual efforts such as No Language Left Behind explicitly frame low-resource language support as a response to digital inequity and scale MT to hundreds of languages and tens of thousands of directions. \cite{nllbteam2022nllb,costa-jussa2024nllb200} Parallel work in the Americas has expanded Indigenous-language evaluation through shared tasks that move beyond plain MT to educational-material generation via morphological adaptation and to the evaluation of translation metrics. \cite{mager-etal-2021-americasnlp,de-gibert-etal-2025-americasnlp} Model-centric work such as IndT5 likewise demonstrates that Indigenous-language pretraining can be useful under sparse data. \cite{nagoudi2021indt5}

What remains underdeveloped, especially in the Canadian context, is a morphology-aware evaluation layer for open language models. The current paper addresses exactly that space: not sentence-level MT, not analyzer construction, and not industrial deployment, but a benchmark layer that measures what open LMs can do with linguistically explicit morphology.

\subsection{Computational morphology as the methodological base}

Our task choice is not ad hoc. Morphological reinflection and related structured tasks have been standard in computational morphology for years through the SIGMORPHON shared-task series. \cite{cotterell-etal-2016-sigmorphon,cotterell-etal-2017-sigmorphon,mccarthy-etal-2019-sigmorphon} Those tasks were designed precisely to expose whether models can generalize over inflectional structure rather than simply produce plausible fluent text. Our contribution is to port that benchmark logic into a Canadian Indigenous language setting under unusually strong openness and deployment constraints.

\subsection{The Plains Cree and Ojibwe computational ecosystem}

Plains Cree is not a tooling blank slate. The GiellaLT ecosystem provides finite-state analyzers and generators with shared engineering conventions across many low-resource languages, including Plains Cree. \cite{moshagen2023giellalt,giellalt-crk} The \emph{itw\^{e}wina} dictionary integrates lexical resources with finite-state morphology so that users can search by inflected form and inspect paradigms. \cite{itwewina-dict} Additional work on the Ahenakew-Wolfart Plains Cree corpus, interactive completion, and word-level prediction shows that Cree morphology is already operationalized in corpora and usable software interfaces. \cite{ahenakew-wolfart-corpus,plains-cree-autocomplete,plains-cree-word-prediction}

The same general pattern holds for related Canadian Indigenous languages. OjibweMorph shows how approachable finite-state morphology can support educational and lexicographic applications. \cite{hammerly2025ojibwemorph} Gitksan finite-state work demonstrates that this infrastructure style extends beyond Cree and Ojibwe. \cite{forbes-etal-2021-gitksan} ReadAlong Studio offers a complementary lesson from speech technology: practical, licensed, easy-to-use infrastructure is often what makes language technology actionable. \cite{littell-etal-2022-readalong}

\subsection{LLMs and morphology}

Recent work on LLMs and morphology helps interpret our results. Multilingual Wug-style evaluations show that LLM performance deteriorates as morphological complexity increases. More recent work on compositional generalization argues that even instruction-tuned models remain weak when asked to systematically realize morphological structure over novel or uncommon combinations. \cite{llm-wug-test-2024,morph-compositional-generalization-2025} This makes our observed failure modes---lemma copying, conjunct weakness, person confusion, and format instability---look typical of a broader difficulty class rather than idiosyncratic properties of Cree or Ojibwe.

\section{Evaluation Program and Experimental Design}

\subsection{Experimental program}

The paper reports five linked experiment families:

\begin{enumerate}
    \item \textbf{Inuktut MT sanity baseline.} We evaluate an open NLLB checkpoint on WMT20 dev and sweep six tokenizer language-code settings to test whether configuration alone rescues the baseline.
    \item \textbf{Morphology prompt ablations.} We compare zero-shot, stricter formatting prompts, and one-shot/two-shot variants for reinflection and analysis in Cree and Ojibwe.
    \item \textbf{Heuristic baselines.} We compare open LMs against trivial and lightweight morphology-aware baselines.
    \item \textbf{Cross-model ablation.} We evaluate multiple open instruction-tuned models under the best prompts selected for each language and task.
    \item \textbf{Hybrid-lite rescue.} We apply simple validity checks and selective fallback to the best heuristic baseline when the LM output clearly fails structural requirements.
\end{enumerate}

This design is deliberate. It allows us to separate morphology failure from prompt-format failure, compare learned and rule-based behavior, and test whether a lightweight hybrid layer is already useful before any training or analyzer integration.

\subsection{Tasks}

We evaluate two morphology tasks.

\paragraph{Reinflection.}
Given a lemma $\ell$ and a plus-delimited feature bundle $b$, generate a surface form $\hat{y}$:
\[
(\ell,b)\mapsto\hat{y}.
\]

\paragraph{Analysis.}
Given a surface form $x$, produce a plus-delimited analysis $\hat{a}$:
\[
x\mapsto\hat{a}=m_1+m_2+\cdots+\tau_1+\cdots,
\]
where the output may contain morphemes and feature tags in a single linearization.
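
As a concrete illustration drawn from the qualitative examples reported in this paper, one Plains Cree reinflection item pairs the lemma \textit{m\^icisow} with a conjunct feature bundle:
\[
(\textit{m\^icisow},\ \texttt{V+AI+Cnj+Prs+1Sg})\mapsto\textit{\^e-m\^icisoy\^an}.
\]
For analysis, the linearization places morphemes before tags. The string \texttt{gi+bimose+V+AI+Ind+Prs+3Sg}, which is an LM prediction discussed in the qualitative findings and not a gold analysis, has this shape, with $m_1=\texttt{gi}$, $m_2=\texttt{bimose}$, and tags $\tau_1,\dots,\tau_5$ equal to \texttt{V}, \texttt{AI}, \texttt{Ind}, \texttt{Prs}, and \texttt{3Sg}.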

\begin{table}[t]
\centering
\begin{tabular}{llll}
\toprule
Task & Input & Output & Primary score \\
\midrule
Reinflection & $\ell$, $b$ & surface $\hat{y}$ & exact match, Avg.\ ED \\
Analysis & $x$ & analysis $\hat{a}$ & Jaccard over atom sets \\
\bottomrule
\end{tabular}
\caption{Tasks and scores. Plus-delimited bundles follow GiellaLT conventions. \cite{giellalt-crk,moshagen2023giellalt}}
\label{tab:tasks}
\end{table}

\subsection{Data}

For morphology, we use compact curated diagnostic sets. Each reinflection item is a triple $(\ell,b,y^\star)$ and each analysis item is $(x,\mathcal{A}^\star)$ where $\mathcal{A}^\star$ is a small set of acceptable analyses. The Plains Cree diagnostic set contains $n{=}6$ reinflection items and $n{=}6$ analysis items. The Ojibwe replication set contains $n{=}3$ reinflection items and $n{=}3$ analysis items.

These sets are intentionally diagnostic rather than benchmark-scale. That design is justified by the role they play in the paper: they are \emph{benchmark slices} used to compare prompts, heuristics, models, and hybrid rescue strategies under controlled conditions. Their value lies in interpretability and rerunnability, not in providing a final population-level estimate of morphology competence.

All morphology items are non-sacred and drawn from widely taught paradigms. Feature bundles and segmentation follow GiellaLT-style conventions, and spellings are checked against \emph{itw\^{e}wina} where appropriate. \cite{giellalt-crk,moshagen2023giellalt,itwewina-dict}

For MT, we use the Inuktut$\rightarrow$English WMT20 dev set built from the official SGM files, yielding $5173$ sentence pairs.

\subsection{Models}

The MT sanity baseline uses \texttt{facebook/nllb-200-distilled-600M}. The morphology experiments use open causal instruction-tuned models:
\begin{itemize}
    \item \texttt{TinyLlama/TinyLlama-1.1B-Chat-v1.0}
    \item \texttt{Qwen/Qwen2.5-0.5B-Instruct}
    \item \texttt{Qwen/Qwen2.5-1.5B-Instruct}
\end{itemize}

All decoding is greedy (no sampling, equivalent to temperature $0$).

\subsection{Prompting, baselines, and hybrid-lite}

We use short instruction-style prompts that request single-line outputs. The key ablations compare:
\begin{itemize}
    \item the original zero-shot prompt,
    \item a stricter formatting prompt,
    \item a minimal prompt,
    \item one-shot prompting,
    \item two-shot prompting.
\end{itemize}

We also define trivial and lightweight heuristic baselines. For reinflection, these include lemma copying and simple person or conjunct prefix heuristics. For analysis, these include identity outputs, prefix splitting, and prefix splitting with coarse tags.

Hybrid-lite is a simple decision layer. It accepts the LM output when it is structurally valid and otherwise falls back to the best heuristic baseline for that language-task pair. The point is not to simulate a full analyzer-backed system, but to test whether a small amount of explicit structure already changes the tradeoff.
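
The decision rule can be written compactly. The notation below is introduced only for this sketch: $\hat{y}_{\mathrm{LM}}(u)$ denotes the normalized LM output for an input $u$ (a lemma-bundle pair or a surface form), $h(u)$ denotes the output of the best heuristic baseline for that language-task pair, and $\operatorname{valid}(\cdot)$ stands for the structural validity checks, whose exact form is not specified here:
\[
\hat{y}_{\mathrm{hyb}}(u)=
\begin{cases}
\hat{y}_{\mathrm{LM}}(u) & \text{if } \operatorname{valid}\bigl(\hat{y}_{\mathrm{LM}}(u)\bigr),\\
h(u) & \text{otherwise.}
\end{cases}
\]
Under this rule, hybrid-lite can differ from the raw LM only on items that fail the validity checks, so any gain is bounded by the quality of $h$ on exactly those items.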

\subsection{Normalization, metrics, and artifacts}

Before scoring, all outputs are normalized with Unicode NFC and whitespace cleanup, while preserving diacritics and case. Reinflection is scored with exact match accuracy and average Levenshtein distance. Analysis is scored with Jaccard overlap over plus-delimited atom sets. We also track invalid-output rate, prompt-echo rate, and, for analysis, copy-input rate.
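
For concreteness, the scores can be written as follows. This is a sketch in notation introduced here, under the assumption that scores are averaged uniformly over the $n$ items of a diagnostic set. Here $\operatorname{norm}$ denotes the normalization described above, $\operatorname{lev}$ the Levenshtein distance, and $S(\cdot)$ the set of atoms obtained by splitting an analysis string on \texttt{+}:
\[
\mathrm{Acc}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\bigl[\operatorname{norm}(\hat{y}_i)=\operatorname{norm}(y_i^\star)\bigr],
\]
\[
\mathrm{AvgED}=\frac{1}{n}\sum_{i=1}^{n}\operatorname{lev}\bigl(\operatorname{norm}(\hat{y}_i),\operatorname{norm}(y_i^\star)\bigr),
\]
\[
J(\hat{a},a^\star)=\frac{\lvert S(\hat{a})\cap S(a^\star)\rvert}{\lvert S(\hat{a})\cup S(a^\star)\rvert}.
\]
Because each analysis item admits a small set $\mathcal{A}^\star$ of acceptable analyses, one natural aggregation, stated here as an assumption rather than a specification, is to credit the best-matching one, $\max_{a^\star\in\mathcal{A}^\star}J(\hat{a},a^\star)$, before averaging over items.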

Every experiment writes per-item JSONL files, summary JSON files, and a machine-readable runtime configuration. This is one of the paper's practical contributions: the benchmark layer is not just described; it is emitted as a rerunnable artifact set.

\section{Results}

\subsection{Inuktut MT sanity baseline}

We begin with the Inuktut MT baseline because it situates the broader evaluation problem. The language-code sweep evaluated six tokenizer settings:
\texttt{ike\_Cans}, \texttt{iku\_Cans}, \texttt{iu\_Cans}, \texttt{ike\_Latn}, \texttt{iku\_Latn}, and \texttt{iu\_Latn}. On the subset sweep, all six conditions tied. The best full-run configuration (run ID \texttt{sec11\_mt\_native\_01\_ike\_Cans}) achieved BLEU $0.10$ and chrF $8.31$ on the full WMT20 dev set, with $12.62\%$ null outputs and $48.89\%$ punctuation-only outputs. Numeric recall remained high at $0.88$, but date recall was only $0.01$.

This is a useful result even though the absolute translation quality is poor. It shows that an open multilingual checkpoint can preserve superficial structure, especially digits, while still failing badly at sentence-level translation. In other words, apparent language support is not enough; evaluation must also reveal output pathologies.
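
In absolute terms, and assuming that the null and punctuation-only categories do not overlap, the full-run rates correspond to roughly
\[
0.1262\times 5173\approx 653
\quad\text{and}\quad
0.4889\times 5173\approx 2529
\]
segments, respectively. Together these account for about $3182$ of the $5173$ dev segments ($61.5\%$), leaving only about $1991$ outputs with any non-punctuation content.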

\begin{table}[t]
\centering
\small
\begin{tabular}{lrrrrrrr}
\toprule
Slice & n & BLEU & chrF & Null & Punct. & Num. & Date \\
\midrule
All & 5173 & 0.10 & 8.31 & 0.13 & 0.49 & 0.88 & 0.01 \\
Short ($\leq 5$) & 677 & 0.01 & 9.04 & 0.18 & 0.48 & 0.96 & 0.00 \\
Medium (6--15) & 1700 & 0.08 & 6.42 & 0.06 & 0.61 & 0.93 & 0.09 \\
Long ($>15$) & 2796 & 0.09 & 8.53 & 0.15 & 0.42 & 0.82 & 0.00 \\
Numeric ref & 1518 & 0.24 & 11.43 & 0.02 & 0.03 & 0.88 & 0.01 \\
Date ref & 77 & 0.00 & 10.10 & 0.03 & 0.05 & 0.83 & 0.01 \\
\bottomrule
\end{tabular}
\caption{Slice-level diagnostics for the best open Inuktut$\rightarrow$English MT configuration. Null and Punct.\ are the rates of null and punctuation-only outputs; Num.\ and Date are numeric and date recall.}
\label{tab:mt_slices}
\end{table}

Figure~\ref{fig:mt_slices} makes the same pattern visible at a glance. Numeric-heavy segments are relatively more stable, while date-heavy and ordinary sentence slices remain brittle. The middle panel is especially revealing: output failure is not a marginal artifact but a central part of the baseline's behavior.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{sec25.png}
\caption{Inuktut MT slice diagnostics for the best full-run configuration. Left: chrF by slice. Middle: null-output and punctuation-only rates. Right: numeric and date recall. The figure shows that the open baseline preserves digits more readily than sentence content and date structure, and frequently degenerates into null or punctuation-only outputs.}
\label{fig:mt_slices}
\end{figure}

\subsection{Morphology condition landscape}

Figure~\ref{fig:morph_heatmaps} summarizes the full morphology condition space across languages, tasks, prompts, models, baselines, and hybrid-lite. Each panel is column-normalized within task so that clustering highlights \emph{relative} condition behavior rather than raw metric scale. Two broad patterns stand out.

First, reinflection is much less responsive to prompting than analysis. Across both Cree and Ojibwe, many reinflection LM conditions cluster near weak or moderate performance, while heuristic and hybrid-lite rows remain comparatively strong. Second, analysis is markedly more prompt- and model-sensitive. Two-shot prompting and the stronger analysis models occupy visibly better parts of the condition space, especially for Ojibwe.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{sec24.png}
\caption{Clustered heatmaps of morphology conditions across Cree and Ojibwe. Each panel summarizes prompt variants, model ablations, baselines, and hybrid-lite under column-normalized metrics. Reinflection panels remain relatively cool except where heuristics dominate, while analysis panels show clear gains from prompt conditioning and, for some settings, model choice.}
\label{fig:morph_heatmaps}
\end{figure}

\subsection{Prompt sensitivity: analysis is recoverable, reinflection is not}

The first major morphology result is that prompt design matters much more for analysis than for reinflection.

For Cree reinflection, the original and strict prompts tie at accuracy $0.17$ with average edit distance $3.17$, and neither one-shot nor two-shot prompting improves accuracy. By contrast, Cree analysis improves from $0.00$ under the original prompt to $0.21$ with one-shot and $0.32$ with two-shot prompting.

Ojibwe shows an even sharper split. Reinflection under the best prompts remains at only $0.33$ accuracy with average edit distance $1.33$, while analysis rises from $0.00$ under the original prompt to $0.53$ with one-shot and $0.62$ with two-shot prompting.
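
Because the diagnostic sets are small, these scores map directly onto item counts. With $n=6$ Cree items and $n=3$ Ojibwe items, and assuming uniform averaging over items,
\[
\tfrac{1}{6}\approx 0.17,\qquad \tfrac{19}{6}\approx 3.17,\qquad \tfrac{1}{3}\approx 0.33,\qquad \tfrac{4}{3}\approx 1.33 .
\]
That is, the reported reinflection accuracies correspond to one correct item per language, with total edit distances of $19$ (Cree) and $4$ (Ojibwe) edit operations. A single additional correct item would move accuracy by about $0.17$ for Cree and $0.33$ for Ojibwe, which is the resolution at which the reinflection comparisons should be read.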

This difference is important for interpreting the literature gap. The benchmark is not merely showing that small open LMs fail. It shows \emph{how} they fail. For structured analysis, a substantial portion of the initial error was due to output-format instability and prompt echo. For reinflection, however, prompting does far less. Once the model is asked to realize a specific inflectional bundle as a single surface form, the bottleneck appears to be morphological realization rather than surface formatting.

\subsection{Prompt-conditioned LMs versus best heuristics}

Table~\ref{tab:cross_prompt_vs_baseline} compares the best prompt-conditioned LM result for each language-task pair to the strongest trivial or heuristic baseline.

\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{4pt}
\renewcommand{\arraystretch}{1.1}
\begin{tabularx}{\linewidth}{
l l
>{\raggedright\arraybackslash}X
l
>{\raggedright\arraybackslash}X
l
}
\toprule
Language & Task & Best LM prompt & LM score & Best baseline & Baseline score \\
\midrule
Cree
& reinflection
& strict
& Acc=0.17, ED=3.17
& \makecell[l]{conjunct\_or\_\\person\_prefix}
& Acc=0.17, ED=1.50 \\
Cree
& analysis
& two\_shot
& Jacc=0.32
& \makecell[l]{prefix\_split\_with\_\\coarse\_tags}
& Jacc=0.24 \\
Ojibwe
& reinflection
& strict
& Acc=0.33, ED=1.33
& \makecell[l]{person\_prefix\_only}
& Acc=1.00, ED=0.00 \\
Ojibwe
& analysis
& two\_shot
& Jacc=0.62
& \makecell[l]{prefix\_split\_with\_\\coarse\_tags}
& Jacc=0.34 \\
\bottomrule
\end{tabularx}
\caption{Best prompt-conditioned LM results versus best trivial or heuristic baselines.}
\label{tab:cross_prompt_vs_baseline}
\end{table}

Two points are especially notable.

First, reinflection remains an area where heuristics are surprisingly competitive. In Cree, the best heuristic baseline matches the LM on accuracy and substantially improves average edit distance. In Ojibwe, a simple person-prefix baseline reaches $1.00$ accuracy, clearly outperforming the LM. This suggests that for highly regular, low-resource inflectional patterns, minimal explicit morphology can still dominate small general-purpose LMs.

Second, the story is different for analysis. Here the best prompt-conditioned LM already exceeds the best heuristic in both languages, most clearly in Ojibwe ($0.62$ versus $0.34$). This indicates that structured analysis is a task where small open LMs can provide real added value once prompt formatting is controlled.

\subsection{Model ablation}

Table~\ref{tab:cross_best_models} reports the best-performing open model for each language and task under the selected prompt conditions.

\begin{table}[t]
\centering
\small
\begin{tabularx}{\textwidth}{l l X l c c}
\toprule
Language & Task & Best model & Main score & Invalid & Echo \\
\midrule
Cree & reinflection & tinyllama\_1.1b\_chat & Acc=0.17, ED=3.17 & 0.00 & 0.00 \\
Cree & analysis & qwen2.5\_1.5b\_instruct & Jacc=0.45 & 0.00 & 0.00 \\
Ojibwe & reinflection & tinyllama\_1.1b\_chat & Acc=0.33, ED=1.33 & 0.00 & 0.00 \\
Ojibwe & analysis & tinyllama\_1.1b\_chat & Jacc=0.62 & 0.00 & 0.00 \\
\bottomrule
\end{tabularx}
\caption{Best-performing open models by language and task under the selected prompt variants.}
\label{tab:cross_best_models}
\end{table}

The ablation shows that model choice matters, but not uniformly. TinyLlama is the strongest reinflection model in both languages, while Qwen 2.5 1.5B is the strongest Cree analysis model. More importantly, bigger is not automatically better. Qwen 2.5 1.5B achieves the best Cree analysis score ($0.45$) but fails catastrophically on reinflection, yielding $100\%$ invalid outputs for Cree and effectively unusable outputs for Ojibwe. This is precisely the kind of behavior that existing evaluations rarely surface: the evaluation layer distinguishes model families by \emph{task type} and \emph{output stability}, not just aggregate score.

\subsection{Hybrid-lite: selective gains, not universal rescue}

Table~\ref{tab:hybrid_lite} compares raw LM outputs, hybrid-lite outputs, and the best heuristic baseline.

\begin{table}[t]
\centering
\small
\begin{tabularx}{\textwidth}{l l X X X}
\toprule
Language & Task & Raw LM & Hybrid-lite & Best heuristic \\
\midrule
Cree & reinflection & Acc=0.17, ED=3.17 & Acc=0.17, ED=1.50 & Acc=0.17, ED=1.50 \\
Cree & analysis & Jacc=0.45 & Jacc=0.45 & Jacc=0.24 \\
Ojibwe & reinflection & Acc=0.33, ED=1.33 & Acc=1.00, ED=0.00 & Acc=1.00, ED=0.00 \\
Ojibwe & analysis & Jacc=0.62 & Jacc=0.62 & Jacc=0.34 \\
\bottomrule
\end{tabularx}
\caption{Hybrid-lite comparison using the best model per language-task pair. The hybrid condition applies simple validity checks and falls back to the best heuristic baseline when appropriate.}
\label{tab:hybrid_lite}
\end{table}

The hybrid layer is useful precisely because it is selective. It helps where explicit morphology is already a strong fallback option, as in Ojibwe reinflection, but it does not improve cases where the raw LM already outperforms the heuristic, as in Cree and Ojibwe analysis. This is a more useful operational result than a blanket claim that hybridization always helps. It shows where simple rule-based rescue is worth keeping in the loop and where it is unnecessary.

\subsection{Qualitative findings}

The qualitative examples reinforce the quantitative story.

\paragraph{Cree reinflection.}
The dominant failure is lemma copying. For the conjunct forms \texttt{V+AI+Cnj+Prs+1Sg} and \texttt{V+AI+Cnj+Prs+2Sg}, the raw LM outputs \textit{m\^icisow} instead of the gold \textit{\^e-m\^icisoy\^an} and \textit{\^e-m\^icisoyan}, respectively. Hybrid-lite and the heuristic do not improve accuracy here, but they reduce edit distance by inserting the conjunct prefix.

\paragraph{Cree analysis.}
The strongest raw Cree analysis output is not perfect, but it is structurally meaningful. For \textit{m\^icisow}, the best model outputs \texttt{ni+m\^icisow+V+AI+Ind+Prs+4Sg}, which is wrong in person marking yet still overlaps heavily with the gold analysis. This is exactly the kind of behavior a morphology-aware benchmark should expose: the model is not simply random, but it is systematically misaligned at the feature level.

\paragraph{Ojibwe reinflection.}
Ojibwe reinforces the heuristic point. For the gold forms \textit{nibimose} and \textit{gibimose}, the raw LM often returns the bare lemma \textit{bimose}. The hybrid and heuristic layers restore the correct person prefix and achieve exact match.

\paragraph{Ojibwe analysis.}
The strongest raw analysis outputs are strikingly better than the Cree ones. For \textit{bimose}, the best raw LM output is \texttt{gi+bimose+V+AI+Ind+Prs+3Sg}, which is wrong in prefix/person but structurally close enough to score Jaccard $0.86$ against the gold. This is not perfect morphology, but it is useful evidence that structured analysis is more tractable for small open LMs than reinflection.
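
The score can be reconstructed at the atom level under one assumption about the gold analysis, which is not reproduced here. The output contains seven atoms, $\{\texttt{gi},\texttt{bimose},\texttt{V},\texttt{AI},\texttt{Ind},\texttt{Prs},\texttt{3Sg}\}$. If the gold analysis shares six of them and contributes no additional atom (for example, if it differs only by lacking the \texttt{gi} prefix), then
\[
J=\frac{6}{7}\approx 0.86,
\]
which matches the reported value. A single spurious atom therefore costs about $0.14$ on an analysis of this length.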

\section{Discussion}

\subsection{What the experimental program reveals}

Across the whole experimental program, three high-level conclusions emerge.

\paragraph{1. Analysis is more recoverable than reinflection.}
This is the most robust result in our experiments. In both Cree and Ojibwe, prompt conditioning moves analysis from effectively broken to partially or strongly usable. The same is not true for reinflection, which remains weak across prompts and often loses to heuristics.

\paragraph{2. Heuristics still matter.}
The benchmark shows this concretely rather than abstractly. In Cree reinflection, a simple heuristic matches the LM on accuracy and beats it on edit distance. In Ojibwe reinflection, a trivial person-prefix heuristic is perfect on the current set. This does not diminish the value of LM-based evaluation; it sharpens it. The benchmark can tell us when learned behavior is actually adding something and when explicit morphology is still the better tool.

\paragraph{3. Model size is not a monotonic predictor of usefulness.}
Qwen 2.5 1.5B is the strongest Cree analysis model, yet catastrophically unstable for reinflection. TinyLlama is weaker on Cree analysis but more stable overall and the best model for Ojibwe analysis. This is exactly the sort of task-specific benchmark finding that broad sentence-level evaluation often hides.

\subsection{Why the small benchmark slices are justified}

The morphology slices are intentionally diagnostic. That is not an accidental weakness; it is part of the design logic. The purpose of this paper is to establish an evaluation position and show that it is scientifically and operationally useful. For that purpose, diagnostic slices are enough. They support:
\begin{itemize}
    \item prompt ablations,
    \item baseline comparisons,
    \item model comparisons,
    \item hybrid rescue evaluation,
    \item cross-language replication.
\end{itemize}

All of those comparisons produce stable and interpretable patterns. The contribution is therefore more like a benchmark note or evaluation systems paper than a final leaderboard paper. Future work should scale the slices, but the present scale is already sufficient to make the benchmark layer useful.

\subsection{Practical implications}

The benchmark supports a simple deployment lesson. If the goal is community-facing tooling under low-resource conditions, analysis and reinflection should not be treated symmetrically. For analysis, prompt-conditioned small LMs may already be worth integrating into experimental tools, especially when outputs are inspected or post-checked. For reinflection, however, lightweight heuristics remain highly competitive and sometimes dominant. A practical system for today would therefore likely be hybrid by default: LM-assisted for structured analysis, rule-anchored for inflection-heavy generation, and explicitly benchmarked before deployment.

\section{Conclusion}

We introduced a reproducible evaluation layer for Canadian Indigenous NLP that combines an Inuktut MT sanity baseline with morphology-aware probes for Cree and Ojibwe. The contribution is not a new model or a new analyzer. It is a benchmark position that bridges three literatures that are often disconnected in practice: public Indigenous-language MT benchmarks, mature finite-state morphology infrastructure, and the emerging use of general-purpose open LMs.

The results expose a consistent pattern. For Inuktut, a standard open multilingual MT baseline remains brittle even after configuration sweeps, underscoring the need for explicit local evaluation. For Cree and Ojibwe, prompt-conditioned LMs can recover structured analysis, but reinflection remains much harder and is often rivaled or beaten by simple heuristics. Model choice matters, but not uniformly, and lightweight hybrid rescue can help selectively.

These findings justify the benchmark layer as a contribution in its own right. It makes failure modes visible, makes prompt and model comparisons concrete, and gives communities and researchers an auditable scaffold they can expand with larger datasets, analyzer-backed checks, and more languages. In a field where strong linguistic tools already exist but easy-to-rerun LM evaluations are still scarce, that is a meaningful step forward.

\begin{credits}
\subsubsection{\ackname}
This study was supported by funding from iCORD at the University of British Columbia and by NVIDIA through the NVIDIA Academic Grants Program.
\end{credits}

\bibliographystyle{splncs04}
\bibliography{bibliography}

\end{document}