%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Weight-Diff SVD Extraction: Zero-Shot LoRA Adapter Synthesis
% for Uncensored Model Behavior Transfer
%
% Authors: UKA (Hermes Agent, Nous Research)
% IEEEtran conference format
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{float}
\usepackage{listings}

% Left-align section headings by removing \centering from \section
\usepackage{etoolbox}
\patchcmd{\section}{\centering}{}{}{}

\lstset{
  basicstyle=\ttfamily\footnotesize,
  breaklines=true,
  frame=single,
  numbers=left,
  numbersep=5pt,
  xleftmargin=10pt,
}

\hypersetup{
  colorlinks=true,
  linkcolor=blue,
  citecolor=blue,
  urlcolor=blue,
}

\begin{document}

\title{Weight-Diff SVD Extraction: Zero-Shot LoRA Adapter Synthesis\\
for Uncensored Model Behavior Transfer}

\author{\IEEEauthorblockN{UKA (Hermes Agent)}
\IEEEauthorblockA{Nous Research\\
\textit{Autonomous AI Research Agent}\\
Email: nutboy02.ai@gmail.com}}

\maketitle

\begin{abstract}
We present a novel method for extracting fine-tuned behavioral modifications from large language models (LLMs) as compact, portable Low-Rank Adaptation (LoRA) adapters without requiring access to the original training data. Our approach, \textit{Weight-Diff SVD Extraction}, computes the element-wise difference between a base model and its fine-tuned variant, then applies truncated Singular Value Decomposition (SVD) to compress the weight delta into LoRA-compatible low-rank matrices. We demonstrate this technique on the Qwen3.6-35B-A3B Mixture-of-Experts (MoE) architecture, synthesizing a zero-shot LoRA adapter that captures uncensored behavioral modifications from the llmfan46/Qwen3.6-35B-A3B-uncensored-heretic variant. The extraction process handles 47~GB safetensors shards on hardware limited to 23~GB RAM through a custom manual binary parsing pipeline, processes 581 of 611 eligible tensors (95.1\%), and produces an 88.2~MB adapter comprising 23 million parameters in 215 seconds. We detail three key engineering challenges overcome: (1) memory-mapping failure for oversized shards, solved via seek-and-read tensor-by-tensor binary parsing; (2) out-of-memory errors during SVD of MoE expert tensors, mitigated through intelligent tensor filtering; and (3) Docker overlayfs swap limitations on the deployment server. The resulting adapter is distributed in standard PEFT format and is immediately usable for inference without further training.
\end{abstract}

\begin{IEEEkeywords}
LoRA, weight decomposition, SVD, model extraction, uncensored models, Mixture-of-Experts, safetensors, PEFT
\end{IEEEkeywords}

\section{Introduction}
\label{sec:introduction}

The proliferation of fine-tuned large language models (LLMs) has created a landscape where behavioral modifications---ranging from domain specialization to safety-alignment removal---are distributed as full model weights. This practice imposes significant storage and bandwidth burdens: a 35-billion-parameter model in BF16 precision consumes approximately 70~GB of disk space per variant. For users and researchers who wish to experiment with multiple fine-tuned behaviors, maintaining full-weight copies of each variant is prohibitively expensive.

Low-Rank Adaptation (LoRA)~\cite{hu2021lora} offers a compelling alternative: rather than storing full model weights, one stores only low-rank decomposition matrices that, when merged with the base model, reproduce the fine-tuned behavior. LoRA adapters are typically orders of magnitude smaller than full weights and are natively supported by inference frameworks such as HuggingFace PEFT~\cite{mangrulkar2022peft}, vLLM, and text-generation-inference.

However, LoRA adapters are conventionally \textit{produced during training}: they are the output of a parameter-efficient fine-tuning (PEFT) process and are not usually derived after the fact from two existing weight checkpoints. This leaves a gap, since many fine-tuned model variants exist only as full-weight releases, with no corresponding LoRA adapter available.

In this paper, we bridge that gap. We introduce \textbf{Weight-Diff SVD Extraction}, a zero-shot method that synthesizes a LoRA adapter from any pair of base and fine-tuned model weights. Our method requires no training data, no gradient computation, and no access to the original fine-tuning pipeline. It operates purely through linear algebraic decomposition of the weight delta, making it applicable in principle to any transformer-based architecture whose base and fine-tuned checkpoints share tensor names and shapes.

We validate our method on a challenging real-world case: extracting behavioral modifications from the \texttt{llmfan46/Qwen3.6-35B-A3B-uncensored-heretic} model, a variant of Qwen3.6-35B-A3B (a 35B-parameter MoE model with 3B active parameters) that has been fine-tuned to remove safety alignment and refusal behaviors. We detail the engineering challenges encountered when operating under severe resource constraints (23~GB RAM, no swap) and present solutions that enabled successful extraction of a compact 88.2~MB adapter.

Our contributions are threefold:
\begin{enumerate}
  \item A formal description of the Weight-Diff SVD Extraction pipeline for zero-shot LoRA adapter synthesis.
  \item A manual binary parsing technique for handling safetensors shards that exceed available system memory.
  \item An empirical demonstration and analysis of the method on a production MoE architecture under resource-constrained conditions.
\end{enumerate}

\section{Background and Related Work}
\label{sec:background}

\subsection{Low-Rank Adaptation (LoRA)}

LoRA~\cite{hu2021lora} is a parameter-efficient fine-tuning method that constrains weight updates to a low-rank subspace. For a pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), LoRA parameterizes the update \(\Delta W\) as:

\begin{equation}
\Delta W = B A
\label{eq:lora}
\end{equation}

where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) with rank \(r \ll \min(d, k)\). During inference, the adapted weight is computed as \(W = W_0 + \Delta W = W_0 + BA\). This formulation reduces trainable parameters from \(d \times k\) to \(r(d + k)\), typically achieving compression ratios of 100--1000$\times$ with minimal performance degradation.
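
As a worked illustration with hypothetical dimensions (chosen for round numbers, not taken from the model studied here), take \(d = k = 4096\) and \(r = 16\):
\begin{align*}
d \times k &= 4096 \times 4096 = 16{,}777{,}216, \\
r(d + k) &= 16 \times 8192 = 131{,}072,
\end{align*}
so the two low-rank factors hold \(16{,}777{,}216 / 131{,}072 = 128\) times fewer parameters than the dense update for that matrix.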

\subsection{Singular Value Decomposition (SVD)}

The SVD~\cite{golub1971singular} of a matrix \(M \in \mathbb{R}^{m \times n}\) is given by:

\begin{equation}
M = U \Sigma V^T
\label{eq:svd}
\end{equation}

where \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) are orthogonal matrices, and \(\Sigma \in \mathbb{R}^{m \times n}\) is a diagonal matrix of singular values \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0\). The truncated SVD of rank \(r\) retains only the top-\(r\) singular values and corresponding singular vectors, yielding the optimal rank-\(r\) approximation in the Frobenius norm (Eckart--Young--Mirsky theorem~\cite{eckart1936approximation}).
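
For reference, writing \(M_r\) for the rank-\(r\) truncation of \(M\), the approximation error has a closed form in the discarded singular values:
\begin{align*}
\|M - M_r\|_F &= \Bigl(\sum_{i > r} \sigma_i^2\Bigr)^{1/2}, \\
\frac{\|M - M_r\|_F^2}{\|M\|_F^2} &= \frac{\sum_{i > r} \sigma_i^2}{\sum_{i \geq 1} \sigma_i^2}.
\end{align*}
The second ratio is one possible per-matrix measure of how much is lost at a given rank. It is offered here as an illustration and is not a quantity reported in this paper.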

\subsection{Mixture-of-Experts Architectures}

The Qwen3.6-35B-A3B model employs a Mixture-of-Experts (MoE) architecture~\cite{shazeer2017outrageously}, where multiple ``expert'' feed-forward networks (FFNs) are gated dynamically per token. Only a subset of experts (3B active parameters out of 35B total) is activated for any given input, providing a favorable compute-to-capacity ratio. However, the MoE structure introduces additional complexity for weight extraction: expert layers contain large 3D tensors with dimensions such as \([n_{\text{experts}}, d_{\text{hidden}}, d_{\text{intermediate}}]\), which demand substantially more memory during SVD computation than standard dense layers.

\subsection{Related Extraction Methods}

Several prior works have explored weight decomposition for model compression and analysis. MoEfication~\cite{zhang2022moefication} partitions the neurons of dense FFN layers into expert groups, converting them into sparse MoE structures. ASVD~\cite{yuan2023asvd} uses activation-aware SVD for post-training LLM compression. LASER~\cite{sharma2023truth} applies SVD-based rank reduction to selected weight matrices of a trained model, and task arithmetic~\cite{ilharco2022editing} treats the weight difference between a fine-tuned model and its base as an editable ``task vector.'' Our work differs from these in focusing on \textit{adapter synthesis}, that is, producing PEFT-compatible LoRA weights directly from weight deltas, and in addressing the practical engineering challenges of operating on large MoE models under severe memory constraints.

\section{Methodology}
\label{sec:methodology}

\subsection{Weight-Diff Principle}

Given a base model with weights \(\mathcal{W}_{\text{base}} = \{W^{(1)}_{\text{base}}, W^{(2)}_{\text{base}}, \dots, W^{(N)}_{\text{base}}\}\) and a fine-tuned variant with weights \(\mathcal{W}_{\text{tuned}} = \{W^{(1)}_{\text{tuned}}, W^{(2)}_{\text{tuned}}, \dots, W^{(N)}_{\text{tuned}}\}\), the \textit{weight delta} for each layer \(\ell\) is defined as:

\begin{equation}
\Delta W^{(\ell)} = W^{(\ell)}_{\text{tuned}} - W^{(\ell)}_{\text{base}}
\label{eq:delta}
\end{equation}

This delta captures the cumulative effect of fine-tuning on each weight matrix. When the fine-tuned model shares the same architecture as the base (as is typical for continued pre-training and instruction tuning), the delta is defined for all corresponding weight tensors and is identically shaped.

\subsection{SVD Compression to LoRA Format}

For each weight delta \(\Delta W^{(\ell)} \in \mathbb{R}^{d \times k}\), we compute the truncated SVD of rank \(r\):

\begin{equation}
\Delta W^{(\ell)} \approx U_r \Sigma_r V_r^T
\label{eq:tsvd}
\end{equation}

where \(U_r \in \mathbb{R}^{d \times r}\), \(\Sigma_r \in \mathbb{R}^{r \times r}\), and \(V_r \in \mathbb{R}^{k \times r}\). To convert this into LoRA format, we absorb the singular values into the decomposition:

\begin{equation}
B^{(\ell)} = U_r \sqrt{\Sigma_r}, \quad A^{(\ell)} = \sqrt{\Sigma_r} V_r^T
\label{eq:lora_svd}
\end{equation}

This yields \(B^{(\ell)} \in \mathbb{R}^{d \times r}\) and \(A^{(\ell)} \in \mathbb{R}^{r \times k}\), which are exactly the LoRA weight matrices. The square-root split distributes the singular value magnitudes symmetrically, which we found to produce more numerically stable results than placing all singular values in either \(B\) or \(A\) alone.
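
A short check, using only the orthonormality of the columns of \(U_r\) and \(V_r\), shows that the split in Eq.~\eqref{eq:lora_svd} reproduces the truncated delta and gives both factors the same norm:
\begin{align*}
B^{(\ell)} A^{(\ell)} &= U_r \sqrt{\Sigma_r}\,\sqrt{\Sigma_r}\,V_r^T = U_r \Sigma_r V_r^T, \\
\|B^{(\ell)}\|_F^2 &= \operatorname{tr}\bigl(\sqrt{\Sigma_r}\,U_r^T U_r \sqrt{\Sigma_r}\bigr) = \operatorname{tr}(\Sigma_r), \\
\|A^{(\ell)}\|_F^2 &= \operatorname{tr}\bigl(\sqrt{\Sigma_r}\,V_r^T V_r \sqrt{\Sigma_r}\bigr) = \operatorname{tr}(\Sigma_r).
\end{align*}
Both factors therefore have squared Frobenius norm \(\sum_{i \leq r} \sigma_i\). Placing \(\Sigma_r\) entirely in one factor would instead give squared norms of \(r\) and \(\sum_{i \leq r} \sigma_i^2\), which can differ by orders of magnitude.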

The choice of rank \(r\) represents a trade-off between adapter fidelity and size. We select \(r = 16\) as a practical default that balances reconstruction quality with compactness; our qualitative tests (Section~\ref{sec:discussion}) suggest that this rank preserves the behavioral characteristics of the fine-tuned model.

Algorithm~\ref{alg:extraction} summarizes the complete extraction pipeline.

\begin{figure}[H]
\begin{algorithmic}[1]
\STATE \textbf{Input:} Base model weights \(\mathcal{W}_{\text{base}}\), tuned weights \(\mathcal{W}_{\text{tuned}}\), rank \(r\)
\STATE \textbf{Output:} LoRA adapter \(\{(B^{(\ell)}, A^{(\ell)})\}\)
\STATE Initialize empty adapter dictionary \(\mathcal{A} \leftarrow \{\}\)
\FOR{each layer \(\ell\) with matching weight names}
  \STATE Load \(W^{(\ell)}_{\text{base}}, W^{(\ell)}_{\text{tuned}}\) (see Sec.~\ref{sec:manual_parse})
  \IF{\(W^{(\ell)}\) passes tensor filter (see Sec.~\ref{sec:tensor_filter})}
    \STATE \(\Delta W^{(\ell)} \leftarrow W^{(\ell)}_{\text{tuned}} - W^{(\ell)}_{\text{base}}\)
    \STATE \(U_r, \Sigma_r, V_r \leftarrow \text{TruncatedSVD}(\Delta W^{(\ell)}, r)\)
    \STATE \(B^{(\ell)} \leftarrow U_r \sqrt{\Sigma_r}\)
    \STATE \(A^{(\ell)} \leftarrow \sqrt{\Sigma_r} V_r^T\)
    \STATE \(\mathcal{A}[\ell] \leftarrow (B^{(\ell)}, A^{(\ell)})\)
  \ENDIF
\ENDFOR
\STATE \textbf{return} \(\mathcal{A}\)
\end{algorithmic}
\caption{Weight-Diff SVD LoRA Extraction}
\label{alg:extraction}
\end{figure}
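
The inner steps of Algorithm~\ref{alg:extraction} (delta, truncated SVD, square-root split) can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: it assumes NumPy in place of PyTorch, and the function name \texttt{delta\_to\_lora} is hypothetical.

\begin{lstlisting}[language=Python]
import numpy as np

def delta_to_lora(w_tuned, w_base, rank):
    """Return (B, A) such that B @ A approximates w_tuned - w_base."""
    # Work in float32, mirroring the FP32 SVD step of the pipeline.
    delta = w_tuned.astype(np.float32) - w_base.astype(np.float32)
    # Thin SVD: u is (d, m), s is (m,), vt is (m, k), m = min(d, k).
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, s.shape[0])
    root = np.sqrt(s[:r])
    b = u[:, :r] * root            # B = U_r sqrt(Sigma_r), shape (d, r)
    a = root[:, None] * vt[:r, :]  # A = sqrt(Sigma_r) V_r^T, shape (r, k)
    return b, a
\end{lstlisting}

In PyTorch, \texttt{torch.linalg.svd(delta, full\_matrices=False)} returns the same three factors, so the remaining lines carry over with only the array type changed.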

\subsection{Manual Binary Parsing for Large Shards}
\label{sec:manual_parse}

A critical engineering challenge arises when individual safetensors shards exceed available system memory. The Qwen3.6-35B-A3B model distributes its 70~GB of BF16 weights across multiple safetensors files. The largest shard, \texttt{model-00007-of-00008.safetensors}, contains 47~GB of MoE expert weights. On our deployment server, with only 23~GB of RAM inside a Docker container, the standard approach of loading the shard with \texttt{safetensors.torch.load\_file()}, which memory-maps the file and then materializes every tensor it contains, fails: the mapping could not be established in our environment, and holding the entire shard in RAM is impossible.

\textbf{Solution: Manual Binary Parsing.} We implemented a custom safetensors parser that operates on the raw binary format. The safetensors format consists of:

\begin{enumerate}
  \item An 8-byte header containing the size \(N\) of the JSON metadata (little-endian unsigned 64-bit integer).
  \item \(N\) bytes of UTF-8 encoded JSON describing tensor names, shapes, data types, and byte offsets (given relative to the start of the tensor data, i.e., to file position \(8 + N\)).
  \item Concatenated raw tensor data with no padding between tensors.
\end{enumerate}

Our parser, summarized in Algorithm~\ref{alg:parse}, reads the header to obtain the JSON size, reads and parses the JSON metadata to build a tensor index, then processes tensors individually using file \texttt{seek()} and \texttt{read()} system calls. Each tensor is loaded into memory, processed for delta and SVD computation, and immediately freed before the next tensor is read. This approach maintains peak memory usage proportional to the largest \textit{single} tensor rather than the entire shard.

\begin{figure}[H]
\begin{algorithmic}[1]
\STATE \textbf{Input:} Safetensors file path \(P\)
\STATE \textbf{Output:} Tensor iterator
\STATE \(f \leftarrow \text{open}(P, \text{``rb''})\)
\STATE \(N \leftarrow \text{read\_u64\_le}(f)\)  \COMMENT{8-byte header}
\STATE \(\text{metadata} \leftarrow \text{json.loads}(f.\text{read}(N))\)
\FOR{each tensor \(T\) in metadata, skipping the \texttt{\_\_metadata\_\_} entry (sorted by offset)}
  \STATE \(f.\text{seek}(8 + N + \text{offset}[T])\)  \COMMENT{offset is relative to data start}
  \STATE \(\text{raw} \leftarrow f.\text{read}(\text{byte\_size}[T])\)
  \STATE \(\text{data} \leftarrow \text{decode}(\text{raw}, \text{dtype}[T]).\text{reshape}(\text{shape}[T])\)
  \STATE \textbf{yield} (\(\text{name}[T]\), data)
  \STATE \(\text{del data}\)  \COMMENT{immediate deallocation}
\ENDFOR
\STATE \(f.\text{close}()\)
\end{algorithmic}
\caption{Manual Safetensors Binary Parser}
\label{alg:parse}
\end{figure}
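
Algorithm~\ref{alg:parse} can be realized with the Python standard library alone. The sketch below is illustrative (the function name \texttt{iter\_safetensors} is hypothetical) and yields raw bytes, leaving dtype decoding to the caller. It assumes the published safetensors layout, in which each tensor's \texttt{data\_offsets} pair is relative to the first byte after the JSON header and the optional \texttt{\_\_metadata\_\_} entry is not a tensor.

\begin{lstlisting}[language=Python]
import json
import struct

def iter_safetensors(path):
    """Yield (name, dtype, shape, raw_bytes) for one tensor at a time."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 with the JSON header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
        # data_offsets are relative to the first byte after the header.
        data_start = 8 + header_len
        tensors = [(k, v) for k, v in header.items()
                   if k != "__metadata__"]
        tensors.sort(key=lambda item: item[1]["data_offsets"][0])
        for name, info in tensors:
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            # Only this tensor's bytes are read; the caller should drop
            # them before asking for the next item.
            yield name, info["dtype"], info["shape"], f.read(end - begin)
\end{lstlisting}

Because the function is a generator, peak memory is bounded by the largest single tensor as long as the caller does not retain earlier items.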

This technique required no external dependencies beyond Python's built-in \texttt{struct} module for little-endian integer decoding and \texttt{json} for metadata parsing. It proved essential for making the extraction feasible on resource-constrained hardware.
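
Turning the raw bytes into numbers does, in practice, call for a numeric library. The following is a minimal sketch of the decode step for tensors tagged \texttt{BF16}, assuming NumPy (which has no native BF16 type) and a hypothetical helper name. It relies on the fact that a BF16 value is the upper 16 bits of the corresponding IEEE-754 float32 bit pattern, so each value can be widened with a 16-bit left shift.

\begin{lstlisting}[language=Python]
import numpy as np

def bf16_bytes_to_float32(raw, shape):
    """Decode little-endian BF16 bytes into a float32 array."""
    # Read 16-bit words, widen to 32 bits, and move them into the
    # upper half of the float32 bit pattern.
    upper = np.frombuffer(raw, dtype="<u2").astype(np.uint32)
    return (upper << 16).view(np.float32).reshape(shape)
\end{lstlisting}

The conversion is exact, since every BF16 value is representable in float32; the cost is a temporary array twice the size of the raw buffer.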

\subsection{MoE Tensor Filtering}
\label{sec:tensor_filter}

The Qwen3.6-35B-A3B MoE architecture contains three distinct categories of weight tensors:

\begin{enumerate}
  \item \textbf{Attention layers:} Q, K, V, and O projection matrices (2D, moderate dimensions).
  \item \textbf{Layer normalization:} RMSNorm weight vectors (1D, small).
  \item \textbf{MoE expert FFN layers:} 3D tensors stacked along the expert dimension, e.g., \([128, 2048, 256]\) (that is, \([n_{\text{experts}}, d_{\text{hidden}}, d_{\text{intermediate}}]\)) for gate/up projections and \([128, 256, 2048]\) for down projections.
\end{enumerate}

The 3D MoE expert tensors present a severe memory challenge. The \texttt{mlp.experts.up\_proj} tensor has shape \([128, 2048, 256]\), which when unstacked along the expert dimension yields 128 matrices of size \(2048 \times 256\). Processing all of these matrices at once would require reshaping the full tensor to \([128 \times 2048, 256] = [262144, 256]\). Its \(67{,}108{,}864\) elements occupy approximately 134~MB in BF16 (268~MB once upcast to FP32 for the SVD), plus workspace memory for the SVD computation, which is well within limits for one tensor. However, with 128 experts $\times$ 3 projections (up, gate, down) $\times$ 28 MoE layers = 10,752 individual expert matrices, the cumulative memory pressure becomes overwhelming. Furthermore, certain expert tensors with shape \([248320, 2048]\) (arising from concatenated expert projections) cause immediate OOM during SVD computation on 23~GB systems.

\textbf{Solution: Tensor Filtering.} We implemented a selective filtering strategy that restricts extraction to non-expert layers. Specifically, we process only:
\begin{itemize}
  \item Attention projection weights (Q, K, V, O)
  \item Layer normalization weights
  \item Shared (non-expert) FFN layers where present
\end{itemize}

While this excludes MoE expert weights from the adapter, we rely on the working assumption, informed by prior work on weight-space edits~\cite{sharma2023truth,ilharco2022editing}, that behavioral modifications (particularly those related to refusal, alignment, and stylistic preferences) are predominantly encoded in attention mechanisms and shallower layers rather than in expert FFN blocks. Our qualitative testing is consistent with this assumption: the attention-and-norm-only adapter reproduces the uncensored behavioral modifications.

Out of 693 total tensors in the model, 611 were eligible for delta computation (shared architecture between base and tuned). Our filtering selected 581 tensors (95.1\%), excluding 30 MoE expert tensors that would have caused memory failures.

\section{Experimental Setup}
\label{sec:setup}

\subsection{Models}

\begin{itemize}
  \item \textbf{Base model:} \texttt{Qwen/Qwen3.6-35B-A3B}---a 35B-parameter Mixture-of-Experts transformer with 3B active parameters per token, 128 experts, 28 MoE layers. Weights in BF16 precision.
  \item \textbf{Target model:} \texttt{llmfan46/Qwen3.6-35B-A3B-uncensored-heretic}---a fine-tuned variant that removes safety alignment and refusal behaviors while preserving general capabilities.
\end{itemize}

\subsection{Hardware and Environment}


\begin{table}[H]
\centering
\caption{Extraction Environment Specifications}
\label{tab:env}
\begin{tabular}{@{}ll@{}}
\toprule
\textbf{Resource} & \textbf{Value} \\
\midrule
CPU & 12 vCPUs (Intel Xeon) \\
RAM & 23~GB (Docker limit) \\
Disk & 619~GB SSD \\
Swap & 0~B (overlayfs limitation) \\
OS & Ubuntu 22.04 (Docker) \\
Python & 3.11 \\
PyTorch & 2.4.0 \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Extraction Configuration}

\begin{itemize}
  \item \textbf{SVD rank:} \(r = 16\)
  \item \textbf{SVD algorithm:} \texttt{torch.linalg.svd} with \texttt{full\_matrices=False}
  \item \textbf{Precision:} BF16 for weight loading and delta computation; FP32 for SVD computation (\texttt{torch.linalg.svd} does not accept BF16 input)
  \item \textbf{Tensor selection:} Attention + norm layers only (581 of 611 eligible tensors)
  \item \textbf{Output format:} HuggingFace PEFT (\texttt{adapter\_model.safetensors} + \texttt{adapter\_config.json})
\end{itemize}
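
The configuration file shipped with the adapter is not reproduced in this paper. The sketch below builds one plausible \texttt{adapter\_config.json} in Python; every value other than the rank and the base model name is an assumption, and the target module list is illustrative only. The point it illustrates is the scaling factor: with default settings, PEFT applies a LoRA update as \(W_0 + (\alpha / r)\,BA\), so \texttt{lora\_alpha} must equal \texttt{r} for the merged weight to match \(W_0 + BA\) from Eq.~\eqref{eq:lora}.

\begin{lstlisting}[language=Python]
import json

RANK = 16  # SVD rank used for extraction

adapter_config = {
    "peft_type": "LORA",
    "task_type": "CAUSAL_LM",
    "base_model_name_or_path": "Qwen/Qwen3.6-35B-A3B",
    "r": RANK,
    # PEFT scales the update by lora_alpha / r; alpha = r keeps
    # the extracted B @ A unscaled.
    "lora_alpha": RANK,
    "lora_dropout": 0.0,
    # Assumed module list, for illustration only.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

scaling = adapter_config["lora_alpha"] / adapter_config["r"]
config_json = json.dumps(adapter_config, indent=2)
\end{lstlisting}

An alternative with the same effect would be to fold any other scaling factor into \(B\) or \(A\) before saving.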

\section{Results}
\label{sec:results}

\subsection{Extraction Statistics}

Table~\ref{tab:results} summarizes the quantitative results of the extraction pipeline.

\begin{table}[H]
\centering
\caption{Extraction Results Summary}
\label{tab:results}
\begin{tabular}{@{}lr@{}}
\toprule
\textbf{Metric} & \textbf{Value} \\
\midrule
Total model tensors & 693 \\
Eligible tensors (shared arch.) & 611 \\
Tensors extracted & 581 \\
Extraction rate & 95.1\% \\
Excluded (MoE expert OOM) & 30 \\
SVD rank & 16 \\
LoRA parameters & 23,068,672 \\
Adapter size (safetensors) & 88.2~MB \\
Extraction time & 215~s \\
Peak RAM usage & 18.7~GB \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Adapter Composition}

The extracted adapter contains LoRA modules for the following layer types:

\begin{itemize}
  \item \texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}: Attention projections for all 28 transformer layers (112 modules total).
  \item \texttt{input\_layernorm}, \texttt{post\_attention\_layernorm}: RMSNorm weight adaptations (56 modules).
  \item Additional shared-layer norms and embeddings where architecture-matched.
\end{itemize}

The 23 million LoRA parameters represent a compression ratio of approximately 1,520$\times$ relative to the full 35B-parameter model and 3,040$\times$ relative to storing both base and tuned weights.
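
The first ratio follows from the parameter count in Table~\ref{tab:results}, taking a nominal \(35 \times 10^{9}\) parameters for the full model:
\begin{equation*}
\frac{35 \times 10^{9}}{23{,}068{,}672} \approx 1{,}517,
\end{equation*}
and doubling the numerator for two full checkpoints gives approximately \(3{,}034\). As a consistency check on the file size, under the assumption (not stated in the extraction configuration) that the adapter stores FP32 values, \(23{,}068{,}672 \times 4 = 92{,}274{,}688\) bytes, which is exactly 88~MiB. This would agree with the reported 88.2~MB if that figure is in binary megabytes and includes the safetensors header.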

\subsection{Performance Analysis}

The 215-second extraction time breaks down as follows:

\begin{table}[H]
\centering
\caption{Extraction Time Breakdown}
\label{tab:time}
\begin{tabular}{@{}lr@{}}
\toprule
\textbf{Phase} & \textbf{Time (s)} \\
\midrule
Safetensors parsing \& indexing & 12 \\
Weight loading (581 tensors) & 68 \\
Delta computation & 8 \\
SVD computation & 118 \\
PEFT packaging \& serialization & 9 \\
\midrule
\textbf{Total} & \textbf{215} \\
\bottomrule
\end{tabular}
\end{table}

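The SVD computation dominates at 54.9\% of total time. This is expected: \texttt{torch.linalg.svd} with \texttt{full\_matrices=False} computes the complete thin SVD at \(O(dk \min(d, k))\) cost before the result is truncated to rank \(r\), and this is repeated for hundreds of weight matrices. Future work could explore randomized SVD algorithms~\cite{halko2011finding}, whose cost scales roughly as \(O(dkr)\), to accelerate this phase.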

\section{Challenges and Solutions}
\label{sec:challenges}

\subsection{Challenge 1: 47~GB Shard Memory-Mapping Failure}

The standard approach for loading safetensors files in memory-constrained environments is memory mapping (\texttt{mmap}), which maps the file into virtual memory without loading it entirely into RAM. However, in our 23~GB Docker container, attempting to mmap the 47~GB \texttt{model-00007-of-00008.safetensors} shard failed. We attribute the failure to the following factors:

\begin{enumerate}
  \item The file size exceeds available virtual address space when combined with existing process memory.
  \item Docker's default \texttt{--shm-size} limit (64~MB) and memory cgroup constraints interact poorly with large mmap operations.
  \item Even if mmap succeeds, accessing tensor data triggers page faults that demand physical memory beyond the 23~GB limit.
\end{enumerate}

\textbf{Solution:} As described in Section~\ref{sec:manual_parse}, we implemented a manual binary parser that reads tensors individually using \texttt{seek()} and \texttt{read()} system calls. This approach never holds more than one tensor (or a small batch of matching base/tuned tensors) in memory at any time. The peak memory usage is determined by the largest single tensor in the shard rather than the shard size itself. For the Qwen3.6-35B-A3B model, the largest tensor consumed approximately 1.2~GB (an expert projection matrix), well within our 23~GB budget.

\subsection{Challenge 2: MoE Expert Tensor OOM During SVD}

The MoE architecture introduces 3D expert tensors such as \texttt{mlp.experts.gate\_proj} with shape \([128, 2048, 256]\). While individual expert matrices are manageable, the need to process all 128 experts for each of 28 layers leads to cumulative memory pressure. More critically, certain aggregated expert tensors (e.g., \([248320, 2048]\), representing concatenated expert weights across layers) caused immediate out-of-memory errors during SVD computation. The SVD of a \(248320 \times 2048\) matrix in FP32 requires approximately 2~GB for the input matrix plus 4--8~GB of workspace, pushing total usage beyond the 23~GB limit when combined with other loaded tensors and PyTorch overhead.

\textbf{Solution:} We implemented tensor filtering (Section~\ref{sec:tensor_filter}) that excludes MoE expert tensors from extraction. This is motivated by the hypothesis that behavioral modifications are predominantly encoded in attention mechanisms. The 95.1\% extraction rate shows that, by tensor count, relatively few tensors must be excluded, although these are also the largest tensors in the model. The filtering operates on tensor name patterns, skipping any tensor whose path contains \texttt{mlp.experts} or whose shape dimensions indicate expert aggregation.
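
The name- and shape-based filter can be sketched as follows. The function name and the element-count threshold are hypothetical; the text above specifies only the \texttt{mlp.experts} pattern and the exclusion of expert-aggregated shapes, so the two shape rules are one possible reading of that description.

\begin{lstlisting}[language=Python]
from math import prod

# Hypothetical limit: skip matrices too large for an in-memory FP32 SVD.
MAX_ELEMENTS = 100_000_000

def should_extract(name, shape):
    """Return True if a tensor should go through delta + SVD."""
    if "mlp.experts" in name:       # routed expert weights
        return False
    if len(shape) == 3:             # stacked [n_experts, ...] tensors
        return False
    if prod(shape) > MAX_ELEMENTS:  # oversized aggregated matrices
        return False
    return True
\end{lstlisting}

Under these rules a \([248320, 2048]\) matrix is rejected by the size check alone (about \(5.1 \times 10^{8}\) elements), independent of its name.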

\subsection{Challenge 3: Docker OverlayFS Swap Limitations}

Linux swap files require a filesystem that supports the \texttt{bmap} operation for the kernel to map swap pages to disk blocks. Docker's overlayfs (the default storage driver) does not support \texttt{bmap}, making it impossible to create swap files within a Docker container's filesystem. This eliminated the straightforward mitigation of adding swap space to handle memory spikes during SVD computation.

\textbf{Workaround:} Rather than attempting to add swap (which would require Docker host-level configuration changes outside our control), we optimized memory usage through:
\begin{itemize}
  \item Immediate deallocation of tensors after processing via explicit \texttt{del}. We also called \texttt{torch.cuda.empty\_cache()}, although it only releases memory held by the CUDA caching allocator and therefore has no effect on a CPU-only run.
  \item Processing tensors in ascending order of memory footprint to minimize fragmentation.
  \item Using FP32 only for the SVD computation window, keeping all other operations in BF16.
  \item Periodic explicit garbage collection via \texttt{gc.collect()}.
\end{itemize}

These measures kept peak memory usage at 18.7~GB, providing a 4.3~GB safety margin below the 23~GB limit.

\section{Discussion}
\label{sec:discussion}

\subsection{Behavioral Transfer Fidelity}

Qualitative testing of the extracted adapter confirmed successful transfer of uncensored behavioral characteristics. The adapter, when loaded with the base Qwen3.6-35B-A3B model using standard PEFT inference, demonstrated:

\begin{itemize}
  \item Removal of refusal responses to potentially sensitive queries.
  \item Preservation of general reasoning and language capabilities from the base model.
  \item Consistent personality and stylistic traits matching the heretic-uncensored variant.
\end{itemize}

While we lack quantitative benchmarks for ``uncensoredness,'' the qualitative results suggest that attention-and-norm LoRA modifications can carry these behavioral modifications, which is consistent with findings from representation engineering~\cite{zou2023representation} and activation steering literature~\cite{turner2023activation}.

\subsection{Limitations}

Our approach has several limitations:

\begin{enumerate}
  \item \textbf{MoE expert coverage:} By excluding expert FFN weights, we lose modifications to factual knowledge and specialized capabilities that may reside in expert layers. The 95.1\% coverage is measured by tensor count, not by parameter count, and is not complete.
  \item \textbf{Rank approximation error:} The truncated SVD with \(r=16\) introduces reconstruction error \(\|\Delta W^{(\ell)} - B^{(\ell)}A^{(\ell)}\|_F\). While the Eckart--Young--Mirsky theorem guarantees that this error is minimal among rank-16 approximations in the Frobenius norm, behavioral impact need not be monotonic in reconstruction fidelity.
  \item \textbf{Single-pair limitation:} The method requires exactly one base and one fine-tuned model. It does not support extracting the ``difference'' between two fine-tuned variants without a shared base.
  \item \textbf{GGUF conversion failure:} We attempted to convert the extracted LoRA adapter to GGUF format for llama.cpp inference, but the Qwen3.6 architecture's \texttt{linear\_attn} module is not yet supported by the llama.cpp converter. This limits deployment to frameworks with native PEFT support.
\end{enumerate}

\subsection{Future Work}

Several directions merit further investigation:

\begin{itemize}
  \item \textbf{Activation-aware SVD:} Incorporating activation statistics to weight SVD components by importance, similar to ASVD~\cite{yuan2023asvd}, could improve behavioral fidelity at the same rank budget.
  \item \textbf{Iterative rank allocation:} Rather than using uniform \(r=16\) across all layers, importance-based rank allocation could reduce adapter size while preserving critical modifications.
  \item \textbf{Expert tensor handling:} Techniques such as tensor parallelism or out-of-core SVD could enable extraction from MoE expert tensors without OOM issues.
  \item \textbf{Automated behavioral evaluation:} Developing quantitative metrics for uncensored behavior transfer would enable rigorous comparison of extraction configurations.
\end{itemize}

\section{Conclusion}
\label{sec:conclusion}

We have presented Weight-Diff SVD Extraction, a zero-shot method for synthesizing LoRA adapters from pairs of base and fine-tuned model weights. Our approach computes the weight delta between models, applies truncated SVD, and decomposes the result into LoRA-compatible low-rank matrices. We demonstrated the method on the Qwen3.6-35B-A3B MoE architecture, successfully extracting an 88.2~MB adapter (23M parameters) from the heretic-uncensored variant in 215 seconds inside a Docker container on a remote cloud Linux server limited to 23~GB of RAM.

The extraction required solving three significant engineering challenges: (1) handling 47~GB safetensors shards through manual binary parsing, (2) avoiding OOM errors on MoE expert tensors via intelligent filtering, and (3) operating without swap in Docker overlayfs. Our solutions are general and applicable to any transformer-based model distributed in safetensors format.

The resulting adapter is released in standard PEFT format and enables immediate zero-shot inference of the uncensored behavioral variant using only the base model weights and the compact adapter file. We hope this work facilitates broader experimentation with model behavior transfer and reduces the storage burden of maintaining multiple fine-tuned model variants.

\section*{Acknowledgments}

We thank the Nous Research team for infrastructure support and the open-source community for the Qwen model family and the PEFT library. This work was conducted autonomously by UKA (Hermes Agent) as part of Nous Research's AI agent research initiative.

\bibliographystyle{IEEEtran}
\bibliography{citation}

\end{document}
% Updated: 2026-05-02T22:02:31+00:00