(1) Personalized rubric with 1–5 scores for each criterion

Need Alignment
- 5: Direct, code-first tutorial for running open-source chat LLMs (Llama/Mistral/Qwen) locally from Hugging Face using Transformers and vLLM. Includes: env setup, HF token for gated models, VRAM/model selection guidance, chat-style prompting via apply_chat_template, single + batch examples, key loading and generation params (device_map, dtype, max_new_tokens, temperature, top_p), vLLM Python API and OpenAI-compatible server with concurrency/KV-cache/paged attention, and practical optimization/troubleshooting. Uses modern instruct models; avoids LangChain/remote-only paths and unrelated tasks.
- 4: Strong focus on the above with only minor gaps (e.g., no dataset-driven batch example, or only brief coverage of vLLM concurrency flags), or a brief mention of wrappers without relying on them.
- 3: On-topic but generic: shows only Transformers or only vLLM; uses generic text-generation (no chat template); lacks HF token handling or VRAM guidance; or uses a borderline/outdated model. Requires noticeable adaptation.
- 2: Tangential: focuses on secondary topics (e.g., model selection, analogies, quantization theory) with little runnable guidance; or relies mainly on wrappers/remote inference instead of direct local usage.
- 1: Irrelevant/misdirected: narrative/marketing copy, different task (e.g., BERT classification), or no actionable code for local LLM inference.

Content Depth
- 5: Research-usable and reproducible. Provides: conda/venv setup; correct PyTorch+CUDA install and CUDA check; installs (transformers, torch, accelerate, huggingface_hub, sentencepiece, datasets, vllm, openai; optionally bitsandbytes and flash-attn); VRAM guidance by model size; HF token login and license acceptance; Transformers code with AutoTokenizer/AutoModelForCausalLM, apply_chat_template, single + batch (incl. datasets) and clean output decoding; key generation params; vLLM Python API and server with concurrency/throughput settings and KV/prefix caching; optimization/troubleshooting (4/8-bit quantization, max context/tokens, OOM, CUDA/driver, tokenizer mismatch, gating, timeouts). Modern models only.
- 4: Mostly complete and runnable; minor omissions (e.g., brief VRAM guidance or limited troubleshooting) but still easy to use in practice.
- 3: Understandable but requires extra work: missing env/CUDA steps or HF token handling; limited generation params; no batch or no vLLM; or uses outdated models. Partially runnable.
- 2: Too basic or thin: minimal code, missing core installs/setup; no chat formatting, no vLLM; not end-to-end.
- 1: Mismatched depth: conceptual overview or unrelated pipeline; not runnable.
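
Reference sketch for the "VRAM guidance by model size" item above. It estimates memory for model weights only, assuming 1 GB = 1e9 bytes. The function name and the chosen model sizes are illustrative assumptions, not part of the rubric. KV cache, activations, and framework overhead are excluded, so real usage is higher.

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: excludes KV cache, activations, and runtime overhead.
    """
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


if __name__ == "__main__":
    # Illustrative sizes only; pick the sizes relevant to the tutorial being graded.
    for size in (7, 13, 70):
        for bits in (16, 8, 4):
            gb = estimate_weight_vram_gb(size, bits)
            print(f"{size}B @ {bits}-bit: ~{gb:.1f} GB (weights only)")
```

A response at level 5 would be expected to state figures of this kind, or an equivalent rule of thumb, next to its model recommendations.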

Tone
- 5: Dry, concise, professional, research-assistant voice. No greetings/salutations, metaphors, humor, or marketing language. Comments are terse and technical.
- 4: Generally dry and professional with a small lapse (e.g., a brief salutation or minor flourish) but not distracting.
- 3: Functional but generic/robotic; slight fluff; acceptable but not ideal.
- 2: Noticeably off: analogies, humor, casual/“kid mode,” or marketing tone intrudes.
- 1: Strongly disliked tone: story-like/flowery/condescending; heavy metaphors or promotional style.

Explanation Style
- 5: Step-by-step, code-first, minimal narrative. Numbered sections and bullet lists. Explicit chat templates (system/user/assistant) and decoding of generated tokens. Shows both single and batch workflows, clean separation of Transformers vs vLLM, includes vLLM server + client and a brief async/concurrency example. Uses short bullet comparisons (not tables). Inline comments are clear and practical.
- 4: Well structured and easy to follow with minor clarity gaps (e.g., fewer comments or missing one explicit step) but still directly runnable.
- 3: Understandable but not in the preferred format: large undivided blocks, missing chat template or decoding, uses tables instead of brief bullets, or lacks clear sectioning.
- 2: Poorly structured: heavy prose, scattered code, not copy-paste friendly; limited ordering or missing key transitions between steps.
- 1: Incompatible style: metaphor/story-driven; no runnable, ordered cells; cannot be followed as a tutorial.
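
A minimal sketch of how scores under this rubric might be recorded and checked. The criterion names and the 1–5 range come from the rubric; the function name and the dictionary layout are assumptions made for illustration. No aggregation across criteria is performed, since the rubric does not define one.

```python
CRITERIA = ("Need Alignment", "Content Depth", "Tone", "Explanation Style")


def validate_scores(scores: dict) -> dict:
    """Check that every rubric criterion has an integer score from 1 to 5.

    Hypothetical helper: returns the scores in rubric order, or raises ValueError.
    """
    missing = [c for c in CRITERIA if c not in scores]
    unknown = [c for c in scores if c not in CRITERIA]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    for name in CRITERIA:
        value = scores[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5, got {value!r}")
    return {name: scores[name] for name in CRITERIA}


if __name__ == "__main__":
    example = {
        "Need Alignment": 4,
        "Content Depth": 3,
        "Tone": 5,
        "Explanation Style": 4,
    }
    print(validate_scores(example))
```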