Training Pipeline
End-to-end flow from raw source material to a deployed model serving inference via Ollama.
Pipeline Stages
raw sources
↓
scrub_training_data.py # Sanitize, strip DRM/paths/infra names
↓
merge_training_data.py # Deduplicate, validate, combine
↓
prepare_for_training.py # Split train/eval, format for trainer
↓
SCP to GPU node # Transfer to training hardware
↓
pipeline.py # QLoRA fine-tuning (on GPU node)
↓
merge adapters # Merge LoRA weights into base model
↓
GGUF quantization # llama.cpp convert to Q4_K_M / Q5_K_M
↓
Ollama deployment # Create modelfile, deploy to inference nodes
Stage 1: Scrubbing
scrub_training_data.py is a hard gate. Every source file passes through it before entering the corpus.
python scripts/scrub_training_data.py \
--input data/raw/new-source.jsonl \
--output data/scrubbed/new-source.jsonl
The scrubber removes:
- Manning DRM watermarks at all known truncation lengths
- Obsidian artifacts:
#tags,[[links]], YAML frontmatter - SSH patterns:
user@hoststrings, connection URIs - Local paths:
/home/,/Users/, absolute filesystem references - Infrastructure names: Real node names and IPs replaced with generics
Scrubbing is idempotent. Running it twice produces identical output.
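The idempotence property can be illustrated with a minimal sketch. The patterns below are hypothetical stand-ins for the real rule set in scrub_training_data.py; the point is that every replacement is a fixed point, so a second pass changes nothing.

```python
import re

# Hypothetical scrub rules; the real script's patterns differ.
RULES = [
    (re.compile(r"#[\w/-]+"), ""),                      # Obsidian #tags
    (re.compile(r"\[\[[^\]]+\]\]"), ""),                # Obsidian [[links]]
    (re.compile(r"\b[\w.-]+@[\w.-]+\b"), "user@host"),  # SSH user@host strings
    (re.compile(r"(/home|/Users)/[\w./-]+"), "/path/redacted"),  # local paths
]

def scrub(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# Idempotency: scrubbing already-scrubbed text is a no-op.
once = scrub("ssh admin@gpu-node-1 ls /home/alice/notes [[secret]] #infra")
assert scrub(once) == once
```

Each replacement target is either deleted outright or rewritten to a generic token that the same pattern maps back to itself, which is what makes repeated runs produce identical output.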
Stage 2: Merge
merge_training_data.py combines all scrubbed source files into the unified corpus.
python scripts/merge_training_data.py \
--sources data/scrubbed/ \
--output data/corpus/combined.jsonl
Operations performed:
- JSON structure validation (every pair must have `instruction` and `output`)
- Deduplication by instruction hash
- Source tracking metadata appended to each pair
- Final pair count and domain distribution logged
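The merge operations above can be sketched in a few lines. This is an illustration of the validate/dedupe/tag flow, not the actual merge_training_data.py implementation; the metadata field name `source` is an assumption.

```python
import hashlib

def instruction_hash(pair):
    # Dedupe key: hash of the instruction text. The real script may
    # normalize whitespace or case before hashing.
    return hashlib.sha256(pair["instruction"].encode("utf-8")).hexdigest()

def merge(sources):
    seen, corpus = set(), []
    for source_name, pairs in sources.items():
        for pair in pairs:
            # Structure validation: both required keys must be present.
            if "instruction" not in pair or "output" not in pair:
                raise ValueError(f"malformed pair in {source_name}")
            h = instruction_hash(pair)
            if h in seen:
                continue  # drop duplicate instruction
            seen.add(h)
            # Source-tracking metadata appended to each surviving pair.
            corpus.append({**pair, "source": source_name})
    return corpus

merged = merge({
    "a.jsonl": [{"instruction": "What is nmap?", "output": "A scanner."}],
    "b.jsonl": [{"instruction": "What is nmap?", "output": "Duplicate."}],
})
assert len(merged) == 1  # first occurrence wins
```

Hashing the instruction (rather than the whole pair) means two sources offering different answers to the same question still collapse to one pair, with the earlier source winning.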
Stage 3: Prepare
prepare_for_training.py converts the merged corpus into the format expected by the training script.
python scripts/prepare_for_training.py \
--input data/corpus/combined.jsonl \
--output data/train-ready/ \
--eval-split 0.05 \
--color-filter red,orange,yellow # Optional: for Shinobit
This stage:
- Splits into train and eval sets (default 95/5)
- Applies optional color filtering for model variants
- Converts to Alpaca format with chat template wrapping
- Produces `train.jsonl` and `eval.jsonl`
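The split-and-filter logic can be sketched as follows. This is illustrative only: the `color` field name and the fixed shuffle seed are assumptions, not details confirmed by prepare_for_training.py.

```python
import random

def prepare(pairs, eval_split=0.05, color_filter=None, seed=42):
    # Optional color filtering for model variants (assumes each pair
    # carries a "color" field; the real field name may differ).
    if color_filter is not None:
        pairs = [p for p in pairs if p.get("color") in color_filter]
    rng = random.Random(seed)  # fixed seed for reproducible splits
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

pairs = [{"instruction": str(i), "output": "x", "color": "red"}
         for i in range(100)]
train, eval_set = prepare(pairs, eval_split=0.05, color_filter={"red"})
assert (len(train), len(eval_set)) == (95, 5)
```

Filtering before splitting keeps the 95/5 ratio meaningful for the variant's actual corpus rather than the full combined one.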
Alpaca Format
The training format wraps each pair in the Alpaca instruction template:
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
The `input` field is included when present; most pairs have no input, so the `### Input:` block is usually omitted.
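A minimal template-wrapping sketch, assuming the `### Input:` block is emitted only for pairs that carry a non-empty input:

```python
def to_alpaca(pair):
    # Wrap one instruction/output pair in the Alpaca template.
    text = f"### Instruction:\n{pair['instruction']}\n\n"
    if pair.get("input"):
        # Input block included only when the pair has a non-empty input.
        text += f"### Input:\n{pair['input']}\n\n"
    text += f"### Response:\n{pair['output']}"
    return text

sample = to_alpaca({"instruction": "List open ports.", "output": "80, 443"})
assert "### Input:" not in sample
```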
Stage 4: Transfer
Training data is transferred to the GPU node via SCP:
scp -r data/train-ready/ gpu-node-1:~/blackrainbow/data/
Training never runs on the development machine. Data flows one direction: dev machine to GPU node.
Stage 5: QLoRA Fine-Tuning
pipeline.py on the GPU node orchestrates the training run.
python pipeline.py \
--config configs/blackrainbow-base.yaml \
--data ~/blackrainbow/data/train-ready/ \
--output ~/blackrainbow/output/v08/
Key training parameters (from config):
model:
base: Qwen/Qwen2.5-7B-Instruct
lora:
r: 64
alpha: 128
dropout: 0.05
target_modules: all
training:
epochs: 3
batch_size: 1
gradient_accumulation: 8
learning_rate: 2e-4
scheduler: cosine
max_seq_length: 4096
bf16: true
eval:
eval_steps: 500
save_steps: 500
Output: LoRA adapter weights, training logs, eval metrics.
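Note that `batch_size: 1` with `gradient_accumulation: 8` gives an effective batch size of 8: the optimizer steps once per 8 micro-batches. A quick sanity check (the corpus size used below is illustrative, not stated anywhere in this document):

```python
# Effective batch size under gradient accumulation.
batch_size = 1
gradient_accumulation = 8
effective_batch = batch_size * gradient_accumulation  # 8

# Optimizer steps per epoch for a hypothetical corpus of N pairs.
n_pairs = 20_000  # illustrative corpus size
steps_per_epoch = n_pairs // effective_batch  # 2500
```

This is the usual trick for fitting large effective batches on a single GPU when QLoRA memory headroom only allows one sequence at a time.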
Stage 6: Merge Adapters
After training completes, the LoRA adapter is merged back into the base model:
python scripts/merge_lora.py \
--base Qwen/Qwen2.5-7B-Instruct \
--adapter ~/blackrainbow/output/v08/checkpoint-final/ \
--output ~/blackrainbow/output/v08/merged/
This produces a full-weight model directory suitable for quantization.
Stage 7: GGUF Quantization
The merged model is quantized using llama.cpp:
python llama.cpp/convert_hf_to_gguf.py \
~/blackrainbow/output/v08/merged/ \
--outfile blackrainbow-v08.f16.gguf
llama.cpp/llama-quantize \
blackrainbow-v08.f16.gguf \
blackrainbow-v08.Q5_K_M.gguf Q5_K_M
llama.cpp/llama-quantize \
blackrainbow-v08.f16.gguf \
blackrainbow-v08.Q4_K_M.gguf Q4_K_M
Two quantization levels are produced:
- Q5_K_M (~5.1GB): Primary inference, higher fidelity
- Q4_K_M (~4.4GB): Fast inference, acceptable quality tradeoff
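The file sizes follow roughly from parameter count times average bits per weight. K-quant formats mix bit widths across tensors, so the averages below are approximations (assuming Qwen2.5-7B's ~7.6B parameters), not exact GGUF sizes:

```python
def approx_gb(n_params, bits_per_weight):
    # Rough model file size: params * bits / 8 bits-per-byte, in GB.
    return n_params * bits_per_weight / 8 / 1e9

q5 = approx_gb(7.6e9, 5.5)  # Q5_K_M averages ~5.5 bits/weight -> ~5.2 GB
q4 = approx_gb(7.6e9, 4.8)  # Q4_K_M averages ~4.8 bits/weight -> ~4.6 GB
```

The estimates land close to the observed ~5.1GB and ~4.4GB; the remainder is per-format overhead and non-quantized tensors.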
Stage 8: Ollama Deployment
Create the Ollama modelfile and register:
ollama create blackrainbow-v08 -f deploy/Modelfile.blackrainbow
Modelfile contents:
FROM ./blackrainbow-v08.Q5_K_M.gguf
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """You are BlackRainbow, a security assurance domain expert.
Provide precise, actionable responses for penetration testing,
red team operations, and security analysis."""
Verify deployment:
ollama run blackrainbow-v08 "Enumerate attack surface for a host running Apache 2.4.49 on port 80"
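Deployment can also be verified programmatically through Ollama's HTTP API (default port 11434). A sketch using only the standard library; the final `urlopen` call is left commented so the snippet is safe to read anywhere, and should be run on a node where the model is actually deployed:

```python
import json
import urllib.request

# Non-streaming generate request against the local Ollama HTTP API.
payload = json.dumps({
    "model": "blackrainbow-v08",
    "prompt": ("Enumerate attack surface for a host "
               "running Apache 2.4.49 on port 80"),
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# On the inference node:
# response = json.load(urllib.request.urlopen(req))
# print(response["response"])
```

With `"stream": False`, Ollama returns a single JSON object whose `response` field holds the full completion, which is convenient for smoke tests in CI.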
Model Variant Configs
Each model variant has its own training config in configs/:
| Config | Model | Color Filter |
|---|---|---|
| blackrainbow-base.yaml | BlackRainbow | None (all colors) |
| shinobit.yaml | Shinobit | red, orange, yellow |
| onibit.yaml | Onibit | blue, grey |
| immortal-blade.yaml | Immortal Blade | red, blue |
The only difference between configs is the color filter applied in Stage 3. Base model, hyperparameters, and training infrastructure are identical across all variants.
Key Scripts
| Script | Location | Purpose |
|---|---|---|
| scrub_training_data.py | scripts/ | Sanitize raw sources |
| merge_training_data.py | scripts/ | Combine and deduplicate |
| prepare_for_training.py | scripts/ | Format for trainer, split, filter |
| pipeline.py | scripts/ | QLoRA training orchestration |
| merge_lora.py | scripts/ | Merge adapter into base |
| eval_model.py | scripts/ | Run eval prompts against model |
| benchmark.py | scripts/ | Compare model versions |
| inference.py | scripts/ | Interactive inference testing |