Why I'm Building a Cybersecurity LLM from Scratch
Most language models aren't built for security work. I decided to change that by training one from scratch — no pretrained weights, no wrappers, every component written by hand. Here's why, and what I've learned so far.
The Problem with Generic Models
If you've ever asked GPT-4 or Claude to help you analyze a CVE, write a Burp Suite plugin, or reason through a CTF challenge, you've probably noticed something: the answers are decent but surface-level. The models know about security, but they don't think like security researchers.
That's not a criticism — these are general-purpose models trained on everything from cooking recipes to quantum physics. Security is a fraction of their training data. They lack the deep pattern recognition that comes from being immersed in vulnerability descriptions, exploit chains, and attack methodologies.
I wanted a model that speaks security natively. One that understands what a use-after-free looks like, why ASLR matters, and how a SQL injection escalates from data leak to full RCE. Not a model that can define these terms — one that can reason about them.
Why from Scratch?
The obvious path would be to fine-tune an existing model. Take Llama or Mistral, throw security data at it, done. But I didn't want to do that, for a few reasons:
1. Understanding. I wanted to understand every layer of the transformer — not just how to call model.generate(), but how attention actually works, how loss propagates, and how learning rate schedules affect convergence. Building from scratch forces that understanding.
2. Control. When you fine-tune someone else's model, you inherit their architectural decisions. Their tokenizer. Their biases. Starting from zero means every decision is intentional — I chose 6 layers, 8 attention heads, and 512 embedding dimensions because I tested alternatives, not because someone else decided.
3. The learning is the point. I'm a CS student. The goal isn't just to have a model at the end — it's to deeply understand how language models work. There's no better way than building one yourself.
The Architecture
GhostLM is a decoder-only transformer — the same family as GPT. The current version, ghost-tiny, has about 14.5 million parameters. That's tiny by modern standards (GPT-4 is rumored to be over a trillion), but it's enough to learn meaningful patterns in security text.
Every component is hand-written in PyTorch: the multi-head causal self-attention, the pre-norm transformer blocks, the cosine learning rate scheduler with linear warmup, weight-tied output projections. No transformers library, no nn.TransformerDecoder — just raw tensor operations and matrix math.
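The GhostLM source isn't reproduced here, but a minimal sketch of what a hand-written pre-norm decoder block with causal self-attention looks like in raw PyTorch might be as follows. The dimensions (512 embedding dims, 8 heads) come from the post; all class and variable names are my own, not GhostLM's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention from raw tensor ops (hypothetical names)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, T, C) -> (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # scaled dot-product scores, masked so each token only attends backwards
        att = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        att = att.masked_fill(mask, float("-inf"))
        att = F.softmax(att, dim=-1)                 # learned weighted averages
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class Block(nn.Module):
    """Pre-norm transformer block: LayerNorm sits *before* attention and MLP."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around normed attention
        x = x + self.mlp(self.ln2(x))   # residual around normed MLP
        return x
```

Stacking six of these blocks, adding token and position embeddings, and tying the output projection to the embedding matrix gets you the shape of a ghost-tiny-style model.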
The Training Data
This is where it gets interesting. GhostLM is trained on a curated corpus of cybersecurity text:
- CVE descriptions — thousands of vulnerability entries from the National Vulnerability Database, covering everything from buffer overflows to authentication bypasses
- CTF writeups — detailed walkthroughs of Capture The Flag challenges, showing real problem-solving methodology
- Security research — papers, blog posts, and technical analyses from the security community
The data pipeline handles deduplication, quality filtering, and tokenization using a GPT-2 BPE tokenizer extended with security-specific tokens.
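The post doesn't show the pipeline code, but the deduplication and quality-filtering stage could be sketched like this — exact dedup via content hashing plus a minimum-length filter. The function name and thresholds are illustrative assumptions, not GhostLM's actual implementation.

```python
import hashlib

def dedupe_and_filter(docs, min_len=200):
    """Drop near-empty fragments and exact duplicates (hypothetical thresholds)."""
    seen = set()
    kept = []
    for text in docs:
        t = " ".join(text.split())          # normalize whitespace
        if len(t) < min_len:
            continue                        # quality filter: too short to be useful
        h = hashlib.sha256(t.lower().encode()).hexdigest()
        if h in seen:
            continue                        # exact duplicate already kept
        seen.add(h)
        kept.append(t)
    return kept
```

Real pipelines typically add fuzzy dedup (e.g. MinHash) on top of this, since CVE feeds and mirrored writeups contain many near-duplicates that exact hashing misses.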
Where Things Stand
GhostLM has completed its 10,000-step Phase 1 training on CPU. The loss curves looked healthy throughout — training loss dropped from ~4.5 to ~2.25, and validation loss trended down to ~2.75. Phase 2 (100K steps on Mac Mini M4) is up next.
Is it good? Not yet. The perplexity is still high compared to GPT-2 (expected — GPT-2 has 8x the parameters and was trained on orders of magnitude more data). But the model is learning. It's picking up security vocabulary, sentence structure, and basic patterns in how vulnerabilities are described.
Phase 2 will push training to 100,000 steps on better hardware (Apple Silicon M4), with a larger context window and more data. That's where I expect real capability to emerge.
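The cosine learning rate schedule with linear warmup mentioned earlier is simple enough to write by hand. Here's a sketch; the warmup length and learning rate bounds are placeholder values, not GhostLM's actual hyperparameters.

```python
import math

def lr_at(step, max_steps, warmup_steps=1000, max_lr=3e-4, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr (illustrative values)."""
    if step < warmup_steps:
        # linear ramp from ~0 up to max_lr over the warmup period
        return max_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids large, destabilizing updates while the randomly initialized weights are still nonsense; the cosine tail lets the model settle into a minimum rather than bouncing around it.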
What I've Learned So Far
Building a language model from scratch teaches you things that no tutorial or course can:
- Attention is everything, literally. The self-attention mechanism is beautiful in its simplicity — it's just learned weighted averages — but the emergent behavior from stacking these layers is genuinely surprising.
- Data quality beats data quantity. Early experiments with noisy, unfiltered security text produced garbage. Curating the dataset carefully made more difference than any hyperparameter change.
- Training on CPU is pain. But it forces you to be efficient with your architecture and deliberate with your experiments. Every training run costs hours, so you learn to think before you run.
- Small models can still be useful. You don't need billions of parameters to build something that understands a specific domain. A 14.5M parameter model trained on focused data can develop real expertise.
What's Next
The roadmap is clear: benchmark the Phase 1 model, then scale up with Phase 2 training on better hardware. Long-term, I want GhostLM to be a tool that security researchers actually use — for CVE triage, for understanding exploit chains, for learning about vulnerabilities in a way that's more interactive than reading documentation.
The code is fully open source. If you're interested in the intersection of AI and cybersecurity, or if you just want to see how a transformer is built from scratch, check out the GhostLM repository.
This is the first in a series of posts about building GhostLM. Next up: the technical deep dive into the attention mechanism and why I chose pre-norm over post-norm transformer blocks.