
High-performance LLM application layer offering runtimes and CLI tools for Llama, Gemma, Qwen and BERT models; safetensors model loading and hardware-accelerated inference.
Tranformers based LLM application layer on top of the SKaiNET engine. Provides model-specific inference, agentic chat with tool calling, and a unified CLI for transformer-based models, all in Kotlin Multiplatform.
[!WARNING] Project status — early / experimental. This repository is an initial version. Nothing here is stable, and there is no support or status guarantee for any feature, model, or API. Model coverage, tool calling, and the runtime APIs are all work in progress and may not work for a given model or model version — for example, tool calling can fail to trigger or parse even on a model that generates plain text correctly. The capabilities described below are goals, not promises. Treat everything as a preview and expect things to break.
SKaiNET Transformers is Kotlin Multiplatform. The fastest way to verify it on
your machine is the unified skainet-cli:
./gradlew :llm-apps:skainet-cli:run \
--args="-m /absolute/path/to/model.gguf 'The capital of France is'"Expected result: the CLI auto-detects the model architecture, loads the model, and streams a generated answer. See the getting-started tutorial for model setup notes.
Working in Java? SKaiNET Transformers ships first-class Java support — see the
kllama-java-sample starter and the
Java getting-started guide.
Use the version shown in this README as the source of truth for first-run snippets.
The list below describes the project's intended scope. Maturity varies widely per item and many paths are unverified — see the project-status note above.
KLlamaJava, JavaTools.definition, JavaAgentLoop) exist, but tool calling is not reliable yet — it may fail to trigger or parse even when plain generation works.NATIVE_OPTIMIZED quant policy keeps weights in their packed SIMD-friendly form.SKaiNET Transformers follows the SKaiNET engine's core path: a transformer model is defined once in the Kotlin DSL, captured as a tape or DAG, and then either compiled to native code or executed eagerly — without rewriting it.
llamaNetwork(), apertusNetwork(), …).flowchart LR
DSL["Transformer model — Kotlin DSL"] --> Graph["Tape / DAG"]
Graph --> HLO["MLIR / StableHLO"]
Graph --> Eager["Eager backend (JVM, …)"]
HLO --> Native["Native code"]Today every model family runs through the eager JVM path. The StableHLO / native path is shared with the engine and not yet wired for full transformer models.
Honest status — see the project-status note at the top of this README.
| Architecture | State |
|---|---|
| Llama / Mistral | Most exercised path — basic text generation works on the eager JVM path. |
| Qwen 2 / 3 | DSL + loaders present; runs through the shared decoder path. Early; Qwen3 RoPE / QK-norm fixes landed in 0.23.2. |
| Gemma 2 / 3 / 3n | DSL + loaders present (Gemma 4 via the SafeTensors path); has the most test coverage, but not verified end-to-end. |
| Apertus | DSL + loaders present; declared end-to-end in 0.23.1, still early. |
| BERT | Encoder for embeddings only — no text generation, no tool calling. |
| Voxtral | TTS / voice; architecture code only — no runtime facade or CLI yet. |
iree-compiles to a
vmfb (GemmaMlirDumpTest); next is running the compiled module and extending
the same path to the other families.The current release is 0.33.0 (against SKaiNET 0.33.0). It adopts the new engine line —
no transformers API changes — so transformer models inherit the engine's 0.33.0 work: layerNorm/
rmsNorm now lower to real stablehlo.reduce (exports compile on stock IREE), a silent autodiff
gradient-drop is fixed, and new differentiable ops (cos/sin/gather/…) are available. On top of
the 0.32.1 streaming-detokenization fix and the 0.32.0 real-GGUF Llama eager +
StableHLO/IREE export work:
NATIVE_OPTIMIZED path now works for Llama (Q4_K/Q6_K): weights stay
packed and LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED) + OptimizedLLMRuntime decodes
coherently, matching llama.cpp — fixing the packed token-embedding
gather: unsupported input rank 1.seqQ == 1) skips the repeatKVHeads concat + SDPA plumbing
for a faster decode loop (~1.5×), bit-identical output.iree-compile to a vmfb) instead of baking a disconnected constant.The earlier transformer-core extraction (0.31.1) and the Gemma NATIVE_OPTIMIZED
footprint work (0.31.0) still apply.
The recommended way to consume is via the BOM. It pins every published skainet-transformers-* artifact and re-exports the upstream sk.ainet:skainet-bom, so the engine-side sk.ainet.core:skainet-* artifacts get the matching version too — you only need to declare the BOM version in one place.
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.1"))
// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama") // or runtime-kgemma, inference-qwen, inference-apertus
implementation("sk.ainet.transformers:skainet-transformers-agent") // chat templates + tool calling
}To opt in to the native FFM CPU provider (recommended for JVM consumers):
dependencies {
implementation("sk.ainet.core:skainet-backend-cpu") // priority-50 Panama Vector
implementation("sk.ainet.core:skainet-backend-native-cpu") // priority-100 FFM (auto-discovered)
}KernelRegistry picks the highest-priority available provider; on hosts where the native lib doesn't load (sandboxed JDKs, unsupported arches), it cleanly falls back to Panama with no functional regression.
| Module | Purpose |
|---|---|
llm-api |
Framework-neutral interfaces (ChatModel, EmbeddingModel, ToolDefinition) — Spring AI-shaped. |
transformer-core |
Framework NN primitives (attention, KV-cache family, embedding, norms, RoPE, FFNs, linear projection). lang-core-only → all targets incl. androidNative; re-exported by llm-core. |
llm-core |
OptimizedLLMRuntime, ModelRegistry, UnifiedModelLoader, shared abstractions. |
llm-inference/<arch> |
Per-architecture network DSLs and weight loaders (llama, gemma, qwen, apertus, bert). |
llm-runtime/<arch> |
Per-architecture runtime facades (kllama, kgemma, kqwen, kapertus). |
llm-agent |
Chat templates, tool-call parsers, agent loops; Java surface. |
llm-apps |
CLIs: skainet-cli (unified), kllama-cli, kbert-cli, plus kllama-java-sample. |
llm-test/llm-test-java |
JUnit 5 end-to-end tests for the Java surface (gated on TINYLLAMA_MODEL_PATH). |
# Plain generation
./gradlew :llm-apps:skainet-cli:shadowJar
java -jar llm-apps/skainet-cli/build/libs/skainet-all.jar \
-m /path/to/model.gguf "The capital of France is"
# Tool-calling demo (calculator + file-listing tools auto-registered)
java -jar skainet-all.jar -m model.gguf --demo --template=llama3 "What is 17 * 23?"
# Interactive agent
java -jar skainet-all.jar -m model.gguf --agent --template=apertus--template accepts llama3, chatml, qwen, gemma, apertus (auto-detected from GGUF metadata if omitted).
try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ null)) {
JavaTool calc = new JavaTool() {
@Override public ToolDefinition getDefinition() {
return JavaTools.definition(
"calculator", "Evaluate an arithmetic expression.",
"{\"type\":\"object\",\"properties\":{\"expression\":{\"type\":\"string\"}},\"required\":[\"expression\"]}"
);
}
@Override public String execute(Map<String, ?> args) { /* ... */ }
};
JavaAgentLoop agent = JavaAgentLoop.builder()
.session(session).tool(calc).template("llama3").build();
String response = agent.chat("What is 17 * 23?");
}See llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java for a runnable reference.
tokenizer.decode(tokenId)) no longer runs words together. SentencePieceSpecialTokens and
UpstreamTokenizerAdapter route decode(Int) through engine 0.32.4's Tokenizer.decodeToken,
which preserves each SentencePiece piece's leading space (llama.cpp token_to_piece semantics).
Engine pin 0.32.2 → 0.32.4.NATIVE_OPTIMIZED for real-GGUF Llama. LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)
now keeps Q4_K/Q6_K weights packed and runs them through OptimizedLLMRuntime, mirroring the
Gemma path (new LlamaQuantLayout + LlamaPackedWeights.convertLlamaWeightsPacked). Output is
coherent and matches llama.cpp; fixes the packed token-embedding gather: unsupported input rank 1.
This is the low-footprint path real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)seqQ == 1), MultiHeadAttention runs
scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the repeatKVHeads concat
and the unsqueeze → SDPA → squeeze → permute chain. ~1.5× decode throughput on the JVM eager path;
bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)RoPE in INTERLEAVED mode (Llama / Mistral / most
GGUF) used a raw-array path (copyToFloatArray / fromFloatArray) that, under graph tracing, recorded
the rotated Q/K as a disconnected constant — severing them from the projection weights and crashing
iree-compile downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
wrapper; eager keeps the fast raw-array path byte-identical). Unblocks TinyLlama → StableHLO → IREE. (019b049)skainet 0.31.0 → 0.32.2.transformer-core module — NN primitives reusable on all targets incl. androidNative. The
attention / KV-cache / embedding / norm / RoPE / FFN / linear-projection primitives were trapped in
llm-core (whose io/compile/backend deps lack androidNative); they only need skainet-lang-core
(which has it), so they're extracted into transformer-core and llm-core re-exports them. Existing
consumers are unaffected; ARM-native downstreams (on-device whisper, future models) reuse them instead of
reimplementing. Ships against engine 0.31.0 (additive, no engine change). (#183)NATIVE_OPTIMIZED). FunctionGemma's
token_embd is Q8_0 and tied, so convertGemmaWeightsPacked was dequantizing
both token_embd and output to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
SL2610. output/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
token_embd stays FP32 (it's gathered) but is wrapped no-copy. Footprint
~1.34 GB → ~0.76 GB; byte-identical decode (GemmaQ5KPackedParityTest),
stable ~1.06 GB load on the SL2610.GemmaNetworkLoader.load(maxInferenceLen = …) — cap the context so the KV
cache + RoPE tables stay tiny on constrained devices (default
min(contextLength, 4096)).skainet 0.30.0 → 0.31.0 — picks up ops.transpose's
lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
packed lm_head transposes through linearProject instead of ClassCastException.GemmaMemSegConverter used to
dequantize Q5_K weights to FP32 on load; SKaiNET 0.30.0 provides a first-class
Q5_K packed matmul (Q5_KBlockTensorData + Q5KMatmulKernel), so the converter
now relayouts the GGUF bytes to block-major and keeps them packed (176 B/block).
FunctionGemma-270M (Q5_K_M) decodes byte-identically to the FP32 baseline
(GemmaQ5KPackedParityTest).NATIVE_OPTIMIZED path is Kotlin/Native–ready. The reusable layout +
packing helpers (GemmaQuantLayout.kt, GemmaPackedWeights.kt) moved to
commonMain, and GemmaNetworkLoader.load() now runs convertGemmaWeightsPacked
under NATIVE_OPTIMIZED — so the board binary keeps K-quant weights packed with
no java.lang.foreign MemSeg dependency. Verified on JVM and linuxX64.skainet 0.28.1 → 0.30.0 — released Q5_K packed matmul, NEON
native kernels, and Kotlin/Native cinterop. The mavenLocal()-first dev shim is
reverted; the release resolves the engine from Maven Central.NATIVE_OPTIMIZED now dequant to FP32
[out, in] instead of crashing on a rank-1 transpose; DecoderGgufMemSegConverter
dequantizes Q4_1 and every other non-packed quant type instead of passing raw
bytes through to a matmul crash (#654).skainet 0.27.0 → 0.28.1. Picks up the completed Kotlin DSL →
StableHLO → IREE export path. Every shape-changing op now declares its inferred
output type (reshape/matmul/concatenate, #673;
conv1d/gather/pooling/flatten, #675),
and reduce_window is emitted in IREE's generic region form — so a full gemma3
graph traced via GemmaMlirDumpTest lowers to StableHLO that iree-compiles to
a vmfb. No transformers-side API changes; existing callers compile unchanged.:llm-inference:gemma:jvmTest green against the published
0.28.1 (GemmaMlirDumpTest, GemmaTraceTest pass).DTypePolicy on every *NetworkLoader.fromGguf / .fromSafeTensors
entry. A sealed DTypePolicy type (Any | Require | Prefer | OneOf,
upstream of SKaiNET 0.25.0) is now accepted on every loader companion in
LlamaNetworkLoader, QwenNetworkLoader, GemmaNetworkLoader,
ApertusNetworkLoader, and VoxtralNetworkLoader. The policy is
validated eagerly via sk.ainet.apps.llm.DTypePolicyValidation —
Require(BF16) rejects on GGUF paths (no KEEP_NATIVE GGUF yet),
accepts on SafeTensors paths. Default DTypePolicy.Any keeps the
existing adaptive behaviour; every existing caller compiles
unchanged.DecoderSafeTensorsLoader. With
Require(BF16) (or Prefer(BF16) / OneOf containing BF16) the
loader stops dequanting BF16 SafeTensors weights and instead wraps
the packed 2-bytes-per-element buffer in Bf16DenseTensorData. The
matmul dispatch in DefaultCpuOpsJvm detects Bf16TensorData at
runtime and routes to the SIMD BF16 kernel — a BF16 checkpoint now
stays near its on-disk footprint in RAM instead of ~2× FP32 inflation.skainet-* alias in
gradle/libs.versions.toml is now coordinate-only (no version.ref).
Versions come from the sk.ainet:skainet-bom platform constraint
re-exported by :llm-bom, and every consumer module pulls in
implementation(project.dependencies.platform(project(":llm-bom")))
in each affected source set. Engine bumps are still a one-line edit
at the top of the catalog, but every internal build now exercises
the BOM end-to-end — a missing-from-BOM regression fails locally
instead of leaking into a published artifact.@Tag("smoke-reference") —
the smoke tier that pins the architectures we always want to run end-
to-end: Qwen3ReferenceSmokeTest (Qwen3-1.7B Q8 GGUF; exercises the
new 0.25.0 Q8_0MatmulKernel + Qwen's RoPEMode.SPLIT_HALF +
QK-Norm), Gemma4ReferenceSmokeTest (Gemma-4 E2B SafeTensors;
sliding-window attention + per-layer KV sharing), and
BertLeafReferenceSmokeTest (MongoDB mdbr-leaf-ir SafeTensors via
the Java KBertJava surface). Run with
./gradlew test -PsmokeReference -PincludeIntegration. Each test
self-skips via JUnit Assumptions when the model file isn't
reachable through the standard ~/.lmstudio/models/ /
~/.cache/huggingface/hub/ / env-var fallback chain.0.23.5 — skainet-cli reliability on JDKs without the
jdk.incubator.vector module: --enable-preview --add-modules jdk.incubator.vector flags reach the generated launchers (previously
only gradle :run); detection of scalar-fallback CPU ops with auto
weight dequant to FP32; backend label printed after the real ops
probe so it can't disagree with the warning beside it.
0.23.4 — BOM is now correct and self-maintaining: :llm-inference:apertus
and :llm-inference:voxtral were missing from the BOM's constraints and are now
covered, so consumers pulling them through the BOM get proper version alignment;
the constraint list is auto-discovered by a buildSrc/ convention plugin. The
README and tutorial dependency snippets were also fixed to use the published
artifact IDs (skainet-transformers-core etc.) via the BOM pattern.
0.23.3 — Prefill progress callback: generateUntilStop and
AgentLoop expose (done, total) progress during the autoregressive
prefill loop via a default-no-op AgentListener.onPrefillProgress
method, so UIs on CPU-only runtimes can show that work is happening
between round start and the first generated token.
0.23.2 — kllama-cli, kllama-native, kllama-wasm, and
KLlamaJava swapped to the DSL path (OptimizedLLMRuntime +
llamaNetwork()); GPU stubs deleted; SentencePiece + GGUF tokenizers
unified through upstream sk.ainet.io.tokenizer; markdown-fenced Llama 3
JSON tool calls now parse correctly; Qwen3 NEOX RoPE pairing fix; QK-norm
RMSNorm-eps wiring fix.
0.23.1 — Apertus end-to-end (routing through OptimizedLLMRuntime +
apertusNetwork(), chat template + tool calling, real-GGUF Q4_K
loading); Gemma 4 chat-model JVM facade with mmap-arena cleanup; multi-id
EOS / stop-token support in the chat layer; SentencePiece auto-detect in
fromTokenizerJson; LEAF + Llama 3 single-JVM smoke test;
ServiceLoader shadow-jar fix-up so the priority-100 native-cpu provider
is picked up post-merge.
See CHANGELOG.md for the full set of changes.
This project uses SKaiNET as its underlying execution engine — tensor ops, neural-network DSL, kernel SPI, GGUF / SafeTensors I/O.
MIT — see LICENCE.
Tranformers based LLM application layer on top of the SKaiNET engine. Provides model-specific inference, agentic chat with tool calling, and a unified CLI for transformer-based models, all in Kotlin Multiplatform.
[!WARNING] Project status — early / experimental. This repository is an initial version. Nothing here is stable, and there is no support or status guarantee for any feature, model, or API. Model coverage, tool calling, and the runtime APIs are all work in progress and may not work for a given model or model version — for example, tool calling can fail to trigger or parse even on a model that generates plain text correctly. The capabilities described below are goals, not promises. Treat everything as a preview and expect things to break.
SKaiNET Transformers is Kotlin Multiplatform. The fastest way to verify it on
your machine is the unified skainet-cli:
./gradlew :llm-apps:skainet-cli:run \
--args="-m /absolute/path/to/model.gguf 'The capital of France is'"Expected result: the CLI auto-detects the model architecture, loads the model, and streams a generated answer. See the getting-started tutorial for model setup notes.
Working in Java? SKaiNET Transformers ships first-class Java support — see the
kllama-java-sample starter and the
Java getting-started guide.
Use the version shown in this README as the source of truth for first-run snippets.
The list below describes the project's intended scope. Maturity varies widely per item and many paths are unverified — see the project-status note above.
KLlamaJava, JavaTools.definition, JavaAgentLoop) exist, but tool calling is not reliable yet — it may fail to trigger or parse even when plain generation works.NATIVE_OPTIMIZED quant policy keeps weights in their packed SIMD-friendly form.SKaiNET Transformers follows the SKaiNET engine's core path: a transformer model is defined once in the Kotlin DSL, captured as a tape or DAG, and then either compiled to native code or executed eagerly — without rewriting it.
llamaNetwork(), apertusNetwork(), …).flowchart LR
DSL["Transformer model — Kotlin DSL"] --> Graph["Tape / DAG"]
Graph --> HLO["MLIR / StableHLO"]
Graph --> Eager["Eager backend (JVM, …)"]
HLO --> Native["Native code"]Today every model family runs through the eager JVM path. The StableHLO / native path is shared with the engine and not yet wired for full transformer models.
Honest status — see the project-status note at the top of this README.
| Architecture | State |
|---|---|
| Llama / Mistral | Most exercised path — basic text generation works on the eager JVM path. |
| Qwen 2 / 3 | DSL + loaders present; runs through the shared decoder path. Early; Qwen3 RoPE / QK-norm fixes landed in 0.23.2. |
| Gemma 2 / 3 / 3n | DSL + loaders present (Gemma 4 via the SafeTensors path); has the most test coverage, but not verified end-to-end. |
| Apertus | DSL + loaders present; declared end-to-end in 0.23.1, still early. |
| BERT | Encoder for embeddings only — no text generation, no tool calling. |
| Voxtral | TTS / voice; architecture code only — no runtime facade or CLI yet. |
iree-compiles to a
vmfb (GemmaMlirDumpTest); next is running the compiled module and extending
the same path to the other families.The current release is 0.33.0 (against SKaiNET 0.33.0). It adopts the new engine line —
no transformers API changes — so transformer models inherit the engine's 0.33.0 work: layerNorm/
rmsNorm now lower to real stablehlo.reduce (exports compile on stock IREE), a silent autodiff
gradient-drop is fixed, and new differentiable ops (cos/sin/gather/…) are available. On top of
the 0.32.1 streaming-detokenization fix and the 0.32.0 real-GGUF Llama eager +
StableHLO/IREE export work:
NATIVE_OPTIMIZED path now works for Llama (Q4_K/Q6_K): weights stay
packed and LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED) + OptimizedLLMRuntime decodes
coherently, matching llama.cpp — fixing the packed token-embedding
gather: unsupported input rank 1.seqQ == 1) skips the repeatKVHeads concat + SDPA plumbing
for a faster decode loop (~1.5×), bit-identical output.iree-compile to a vmfb) instead of baking a disconnected constant.The earlier transformer-core extraction (0.31.1) and the Gemma NATIVE_OPTIMIZED
footprint work (0.31.0) still apply.
The recommended way to consume is via the BOM. It pins every published skainet-transformers-* artifact and re-exports the upstream sk.ainet:skainet-bom, so the engine-side sk.ainet.core:skainet-* artifacts get the matching version too — you only need to declare the BOM version in one place.
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.1"))
// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama") // or runtime-kgemma, inference-qwen, inference-apertus
implementation("sk.ainet.transformers:skainet-transformers-agent") // chat templates + tool calling
}To opt in to the native FFM CPU provider (recommended for JVM consumers):
dependencies {
implementation("sk.ainet.core:skainet-backend-cpu") // priority-50 Panama Vector
implementation("sk.ainet.core:skainet-backend-native-cpu") // priority-100 FFM (auto-discovered)
}KernelRegistry picks the highest-priority available provider; on hosts where the native lib doesn't load (sandboxed JDKs, unsupported arches), it cleanly falls back to Panama with no functional regression.
| Module | Purpose |
|---|---|
llm-api |
Framework-neutral interfaces (ChatModel, EmbeddingModel, ToolDefinition) — Spring AI-shaped. |
transformer-core |
Framework NN primitives (attention, KV-cache family, embedding, norms, RoPE, FFNs, linear projection). lang-core-only → all targets incl. androidNative; re-exported by llm-core. |
llm-core |
OptimizedLLMRuntime, ModelRegistry, UnifiedModelLoader, shared abstractions. |
llm-inference/<arch> |
Per-architecture network DSLs and weight loaders (llama, gemma, qwen, apertus, bert). |
llm-runtime/<arch> |
Per-architecture runtime facades (kllama, kgemma, kqwen, kapertus). |
llm-agent |
Chat templates, tool-call parsers, agent loops; Java surface. |
llm-apps |
CLIs: skainet-cli (unified), kllama-cli, kbert-cli, plus kllama-java-sample. |
llm-test/llm-test-java |
JUnit 5 end-to-end tests for the Java surface (gated on TINYLLAMA_MODEL_PATH). |
# Plain generation
./gradlew :llm-apps:skainet-cli:shadowJar
java -jar llm-apps/skainet-cli/build/libs/skainet-all.jar \
-m /path/to/model.gguf "The capital of France is"
# Tool-calling demo (calculator + file-listing tools auto-registered)
java -jar skainet-all.jar -m model.gguf --demo --template=llama3 "What is 17 * 23?"
# Interactive agent
java -jar skainet-all.jar -m model.gguf --agent --template=apertus--template accepts llama3, chatml, qwen, gemma, apertus (auto-detected from GGUF metadata if omitted).
try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ null)) {
JavaTool calc = new JavaTool() {
@Override public ToolDefinition getDefinition() {
return JavaTools.definition(
"calculator", "Evaluate an arithmetic expression.",
"{\"type\":\"object\",\"properties\":{\"expression\":{\"type\":\"string\"}},\"required\":[\"expression\"]}"
);
}
@Override public String execute(Map<String, ?> args) { /* ... */ }
};
JavaAgentLoop agent = JavaAgentLoop.builder()
.session(session).tool(calc).template("llama3").build();
String response = agent.chat("What is 17 * 23?");
}See llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java for a runnable reference.
tokenizer.decode(tokenId)) no longer runs words together. SentencePieceSpecialTokens and
UpstreamTokenizerAdapter route decode(Int) through engine 0.32.4's Tokenizer.decodeToken,
which preserves each SentencePiece piece's leading space (llama.cpp token_to_piece semantics).
Engine pin 0.32.2 → 0.32.4.NATIVE_OPTIMIZED for real-GGUF Llama. LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)
now keeps Q4_K/Q6_K weights packed and runs them through OptimizedLLMRuntime, mirroring the
Gemma path (new LlamaQuantLayout + LlamaPackedWeights.convertLlamaWeightsPacked). Output is
coherent and matches llama.cpp; fixes the packed token-embedding gather: unsupported input rank 1.
This is the low-footprint path real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)seqQ == 1), MultiHeadAttention runs
scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the repeatKVHeads concat
and the unsqueeze → SDPA → squeeze → permute chain. ~1.5× decode throughput on the JVM eager path;
bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)RoPE in INTERLEAVED mode (Llama / Mistral / most
GGUF) used a raw-array path (copyToFloatArray / fromFloatArray) that, under graph tracing, recorded
the rotated Q/K as a disconnected constant — severing them from the projection weights and crashing
iree-compile downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
wrapper; eager keeps the fast raw-array path byte-identical). Unblocks TinyLlama → StableHLO → IREE. (019b049)skainet 0.31.0 → 0.32.2.transformer-core module — NN primitives reusable on all targets incl. androidNative. The
attention / KV-cache / embedding / norm / RoPE / FFN / linear-projection primitives were trapped in
llm-core (whose io/compile/backend deps lack androidNative); they only need skainet-lang-core
(which has it), so they're extracted into transformer-core and llm-core re-exports them. Existing
consumers are unaffected; ARM-native downstreams (on-device whisper, future models) reuse them instead of
reimplementing. Ships against engine 0.31.0 (additive, no engine change). (#183)NATIVE_OPTIMIZED). FunctionGemma's
token_embd is Q8_0 and tied, so convertGemmaWeightsPacked was dequantizing
both token_embd and output to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
SL2610. output/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
token_embd stays FP32 (it's gathered) but is wrapped no-copy. Footprint
~1.34 GB → ~0.76 GB; byte-identical decode (GemmaQ5KPackedParityTest),
stable ~1.06 GB load on the SL2610.GemmaNetworkLoader.load(maxInferenceLen = …) — cap the context so the KV
cache + RoPE tables stay tiny on constrained devices (default
min(contextLength, 4096)).skainet 0.30.0 → 0.31.0 — picks up ops.transpose's
lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
packed lm_head transposes through linearProject instead of ClassCastException.GemmaMemSegConverter used to
dequantize Q5_K weights to FP32 on load; SKaiNET 0.30.0 provides a first-class
Q5_K packed matmul (Q5_KBlockTensorData + Q5KMatmulKernel), so the converter
now relayouts the GGUF bytes to block-major and keeps them packed (176 B/block).
FunctionGemma-270M (Q5_K_M) decodes byte-identically to the FP32 baseline
(GemmaQ5KPackedParityTest).NATIVE_OPTIMIZED path is Kotlin/Native–ready. The reusable layout +
packing helpers (GemmaQuantLayout.kt, GemmaPackedWeights.kt) moved to
commonMain, and GemmaNetworkLoader.load() now runs convertGemmaWeightsPacked
under NATIVE_OPTIMIZED — so the board binary keeps K-quant weights packed with
no java.lang.foreign MemSeg dependency. Verified on JVM and linuxX64.skainet 0.28.1 → 0.30.0 — released Q5_K packed matmul, NEON
native kernels, and Kotlin/Native cinterop. The mavenLocal()-first dev shim is
reverted; the release resolves the engine from Maven Central.NATIVE_OPTIMIZED now dequant to FP32
[out, in] instead of crashing on a rank-1 transpose; DecoderGgufMemSegConverter
dequantizes Q4_1 and every other non-packed quant type instead of passing raw
bytes through to a matmul crash (#654).skainet 0.27.0 → 0.28.1. Picks up the completed Kotlin DSL →
StableHLO → IREE export path. Every shape-changing op now declares its inferred
output type (reshape/matmul/concatenate, #673;
conv1d/gather/pooling/flatten, #675),
and reduce_window is emitted in IREE's generic region form — so a full gemma3
graph traced via GemmaMlirDumpTest lowers to StableHLO that iree-compiles to
a vmfb. No transformers-side API changes; existing callers compile unchanged.:llm-inference:gemma:jvmTest green against the published
0.28.1 (GemmaMlirDumpTest, GemmaTraceTest pass).DTypePolicy on every *NetworkLoader.fromGguf / .fromSafeTensors
entry. A sealed DTypePolicy type (Any | Require | Prefer | OneOf,
upstream of SKaiNET 0.25.0) is now accepted on every loader companion in
LlamaNetworkLoader, QwenNetworkLoader, GemmaNetworkLoader,
ApertusNetworkLoader, and VoxtralNetworkLoader. The policy is
validated eagerly via sk.ainet.apps.llm.DTypePolicyValidation —
Require(BF16) rejects on GGUF paths (no KEEP_NATIVE GGUF yet),
accepts on SafeTensors paths. Default DTypePolicy.Any keeps the
existing adaptive behaviour; every existing caller compiles
unchanged.DecoderSafeTensorsLoader. With
Require(BF16) (or Prefer(BF16) / OneOf containing BF16) the
loader stops dequanting BF16 SafeTensors weights and instead wraps
the packed 2-bytes-per-element buffer in Bf16DenseTensorData. The
matmul dispatch in DefaultCpuOpsJvm detects Bf16TensorData at
runtime and routes to the SIMD BF16 kernel — a BF16 checkpoint now
stays near its on-disk footprint in RAM instead of ~2× FP32 inflation.skainet-* alias in
gradle/libs.versions.toml is now coordinate-only (no version.ref).
Versions come from the sk.ainet:skainet-bom platform constraint
re-exported by :llm-bom, and every consumer module pulls in
implementation(project.dependencies.platform(project(":llm-bom")))
in each affected source set. Engine bumps are still a one-line edit
at the top of the catalog, but every internal build now exercises
the BOM end-to-end — a missing-from-BOM regression fails locally
instead of leaking into a published artifact.@Tag("smoke-reference") —
the smoke tier that pins the architectures we always want to run end-
to-end: Qwen3ReferenceSmokeTest (Qwen3-1.7B Q8 GGUF; exercises the
new 0.25.0 Q8_0MatmulKernel + Qwen's RoPEMode.SPLIT_HALF +
QK-Norm), Gemma4ReferenceSmokeTest (Gemma-4 E2B SafeTensors;
sliding-window attention + per-layer KV sharing), and
BertLeafReferenceSmokeTest (MongoDB mdbr-leaf-ir SafeTensors via
the Java KBertJava surface). Run with
./gradlew test -PsmokeReference -PincludeIntegration. Each test
self-skips via JUnit Assumptions when the model file isn't
reachable through the standard ~/.lmstudio/models/ /
~/.cache/huggingface/hub/ / env-var fallback chain.0.23.5 — skainet-cli reliability on JDKs without the
jdk.incubator.vector module: --enable-preview --add-modules jdk.incubator.vector flags reach the generated launchers (previously
only gradle :run); detection of scalar-fallback CPU ops with auto
weight dequant to FP32; backend label printed after the real ops
probe so it can't disagree with the warning beside it.
0.23.4 — BOM is now correct and self-maintaining: :llm-inference:apertus
and :llm-inference:voxtral were missing from the BOM's constraints and are now
covered, so consumers pulling them through the BOM get proper version alignment;
the constraint list is auto-discovered by a buildSrc/ convention plugin. The
README and tutorial dependency snippets were also fixed to use the published
artifact IDs (skainet-transformers-core etc.) via the BOM pattern.
0.23.3 — Prefill progress callback: generateUntilStop and
AgentLoop expose (done, total) progress during the autoregressive
prefill loop via a default-no-op AgentListener.onPrefillProgress
method, so UIs on CPU-only runtimes can show that work is happening
between round start and the first generated token.
0.23.2 — kllama-cli, kllama-native, kllama-wasm, and
KLlamaJava swapped to the DSL path (OptimizedLLMRuntime +
llamaNetwork()); GPU stubs deleted; SentencePiece + GGUF tokenizers
unified through upstream sk.ainet.io.tokenizer; markdown-fenced Llama 3
JSON tool calls now parse correctly; Qwen3 NEOX RoPE pairing fix; QK-norm
RMSNorm-eps wiring fix.
0.23.1 — Apertus end-to-end (routing through OptimizedLLMRuntime +
apertusNetwork(), chat template + tool calling, real-GGUF Q4_K
loading); Gemma 4 chat-model JVM facade with mmap-arena cleanup; multi-id
EOS / stop-token support in the chat layer; SentencePiece auto-detect in
fromTokenizerJson; LEAF + Llama 3 single-JVM smoke test;
ServiceLoader shadow-jar fix-up so the priority-100 native-cpu provider
is picked up post-merge.
See CHANGELOG.md for the full set of changes.
This project uses SKaiNET as its underlying execution engine — tensor ops, neural-network DSL, kernel SPI, GGUF / SafeTensors I/O.
MIT — see LICENCE.