Intro to unseeable biology
Math doesn’t give me enough understanding. I prefer looking at bytes in memory and at objects and mechanisms in the real world. Only after I have gathered enough observations to “understand” how they work does the math start to make sense to me, as a very limited representation.
With this type of mind, papers about attention are elusive for me. They focus on a single formula and expect me to imagine in my head how the matrix multiplication recursively works to make the whole transformer smart. Over time I sort of develop the feeling that I “understand something”, but it could just as well be hallucination/confabulation in my head. The worst part: I cannot explain, reference, or even discuss these associations in my head. This post is my attempt to put them into a word sequence. Because, finally, I may have found a path to making sense of how transformers work, and of what I really need.
I think the world of LLMs, transformers, and related neural network knowledge is missing a video that could earn comments like:
This 7-minute animation sums up 500+ pages of textbook
Or
We should feel extremely lucky to have access to this huge amount of knowledge, all in just 7 min. The best thing I’ve seen in a while
In the world of biology, such a video exists. I post it below.
I remember this video by the phrase “Animations of unseeable biology”, which leads to the TED talk by the author (whatever that means). By watching it, you will understand how our DNA works. In just 7 minutes.
Grokking “adenine, guanine” vs “softmax, cross-entropy”
There is a direct analogy with my missing “Animation of invisible computer science”. If you try to read a book about DNA, you will be swamped with cryptic words like “adenine” and “guanine”, and with even more cryptic combinations of these words into sequences like “purine nitrogenous bases that form crucial base pairs with pyrimidines”. In papers about (digital) neural networks there are “softmax”, “cross-entropy” and similar cryptic phrases that do make a lot of sense after 5 years of study.
You kind of load these cryptic words into your brain, then boil them for a prolonged period of time, and as the brain becomes smaller, it has to find a place for them to optimize energy waste and energy spending. If you’ve read good books and proper papers, you may have been lucky enough to find an optimal structure for the information, one that now fits your brain perfectly and doesn’t waste extra space. You may even gain the ability to serialize (“serialize”, eh) that structure into a sequence of words and pass it to another human, so that they can reconstruct and adopt it to save their own space. Aha! But it could also be that when the brain runs out of space and time, it just throws bits away, and now we have a “belief” that probably cannot be explained (serialized?) anymore.
diffs/patches as a path to explain attention (in 3D)
So how do you find that optimal structure that can explain to another person how a neural network should work to solve some specific problem? I have no choice: I have to start with the problem. And the problem that made me wake up yesterday morning with this urge to optimize my brain is “LLMs are unable to make sense of diffs and patches”.
OMG. Now I have to explain the hardest part. It all started with a review comment by some LLM used by GitLab CI: Upgrade deprecated magic `pages:` job to `pages.publish` syntax (!229493) · Merge requests · GitLab.org / GitLab · GitLab. I have no idea how GitLab sends the LLM data about the code change, but I suspect it is the usual plain-text diff.
```
diff --git a/.gitlab/ci/pages.gitlab-ci.yml b/.gitlab/ci/pages.gitlab-ci.yml
index bb86ea80b505a318d4f6b6f3619065946969845e..55e8ca19993c5876efc4b20034f58db5695ba140 100644
--- a/.gitlab/ci/pages.gitlab-ci.yml
+++ b/.gitlab/ci/pages.gitlab-ci.yml
@@ -1,8 +1,11 @@
+# This CI job is responsible for the contents of development
+# support web site at https://gitlab-org.gitlab.io/
+
 .compress-public: &compress-public
   - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|json\|css\|svg\|xml\)$' -exec gzip -f -k {} \;
   - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|json\|css\|svg\|xml\)$' -exec brotli -f -k {} \;
 
-pages:
+upload-pages:
   extends:
     - .default-retry
     - .pages:rules
@@ -34,7 +37,6 @@ pages:
     - mv $GLCI_PREDICTIVE_RSPEC_PACKED_TESTS_MAPPING_ALT_PATH.gz public/$GLCI_PREDICTIVE_RSPEC_PACKED_TESTS_MAPPING_ALT_PATH.gz || true
     - mv $GLCI_PREDICTIVE_FRONTEND_FIXTURES_MAPPING_PATH public/$GLCI_PREDICTIVE_FRONTEND_FIXTURES_MAPPING_PATH || true
     - *compress-public
-  artifacts:
-    paths:
-      - public
+  pages:
+    publish: public
     expire_in: 31d
```
It is a unified diff, showing which lines were added to and removed from some source file (here the file is .gitlab/ci/pages.gitlab-ci.yml). The problem is that the format is line-oriented, while LLMs are word-sequence-oriented. They predict the next word; their attention is placed on the sequence of words (correct me if I’m wrong), so they operate on one-dimensional sequences. 1D sequences. This sentence is 1D: each symbol in it has only one coordinate, its offset from the start of the sentence. But if I place a line break…
Now it is no longer a 1D line/sentence. Now it is 2D text. Each symbol still has a coordinate counted from the beginning of the line (the column), but also a line/row number. If you look at the diff above, the cryptic string `@@ -1,8 +1,11 @@` means “the hunk below covers 8 lines starting at line 1 of the source file (lines 1-8), and they become 11 lines starting at line 1 of the target file (lines 1-11)”. The number of lines in a file can stay the same while the file itself grows (endlessly), because the lines themselves can get longer.
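To make the header semantics concrete, here is a tiny sketch of decoding hunk headers in Python (my own illustration, not anything GitLab runs; the function name is mine):

```
import re

# '@@ -1,8 +1,11 @@' -> 8 lines starting at line 1 become 11 lines starting at line 1.
# A count defaults to 1 when omitted, e.g. '@@ -5 +5 @@'.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    m = HUNK_HEADER.match(line)
    if not m:
        raise ValueError(f"not a hunk header: {line!r}")
    src_start, src_count = int(m.group(1)), int(m.group(2) or 1)
    dst_start, dst_count = int(m.group(3)), int(m.group(4) or 1)
    return src_start, src_count, dst_start, dst_count

print(parse_hunk_header("@@ -1,8 +1,11 @@"))          # (1, 8, 1, 11)
print(parse_hunk_header("@@ -34,7 +37,6 @@ pages:"))  # (34, 7, 37, 6)
```

Four small integers, and suddenly every hunk body has exact 2D coordinates in both the source and the target file.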
No matter how big the transformer’s window is (I have a “belief” that there is a window of words attention can reach), if the model is not aware of the 2D structure of the text, it will eventually run out of that window. But if it goes “line by line”, it may be able to “punch wormholes” to the different places where the same word or concept is used, without relying on them sitting next to each other. As we humans do with our smol bran. When I start studying code, I take one variable at a time and follow how it is (supposed to be) used. LLMs, it seems, try to eat the whole file first to get all connections at once. Yes, they can do this, but for 2D things like diffs they may fail miserably. And if they don’t get “lines and columns”, they probably don’t grok indentation-based languages like Python very well either.
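A hedged sketch of what “being aware of 2D structure” could mean at the input level: instead of a single 1D offset, give every symbol a (line, column) pair. The function name is mine, purely for illustration:

```
def coords_2d(text):
    """Yield (char, line, column) for each character, all 0-based."""
    line = col = 0
    for ch in text:
        yield ch, line, col
        if ch == "\n":
            line, col = line + 1, 0
        else:
            col += 1

# The same character stream, seen in 1D vs 2D:
sample = "+  pages:\n+    publish: public"
for offset, (ch, line, col) in enumerate(coords_2d(sample)):
    if ch == "p":
        print(f"1D offset {offset:2d} -> 2D ({line}, {col})")
```

In 1D, the two occurrences of “p” on the second line are just “somewhere far to the right”; in 2D, their columns carry meaning (indentation), which is exactly what diffs and Python rely on.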
Pinpointing the problem: LLMs don’t get diffs
The problem is this piece of the diff, and the LLM’s comment on it:
```
+  pages:
+    publish: public
     expire_in: 31d
```
Check the indentation of `expire_in` - it should be at the same level as `publish` under the `pages:` key
In a diff there is a column of + and - signs that shows which lines were added and which were removed. For us humans this is obvious, because we can see x and y. And we can see that `publish` aligns with `expire_in`. The leading space on the last line is not “extra indentation”: it is the space in the marker column, the one that says the line was neither added nor removed, only unmodified context. It does not participate in indentation. But the LLM doesn’t get it.
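Here is what the correct reading looks like, as a hedged sketch (names are mine): the first column of every hunk-body line is a marker, not indentation, so it must be split off before comparing indentation levels.

```
def split_hunk_line(line):
    """Split a hunk body line into (marker, content).

    The marker lives in column 0: ' ' = unchanged, '-' = removed, '+' = added.
    Only `content` carries the real YAML indentation.
    """
    return (line[0], line[1:]) if line else (" ", "")

hunk = [
    "+  pages:",
    "+    publish: public",
    "     expire_in: 31d",
]
for raw in hunk:
    marker, content = split_hunk_line(raw)
    indent = len(content) - len(content.lstrip(" "))
    print(f"{marker!r} indent={indent} {content.strip()}")
# 'publish' and 'expire_in' both end up at indent 4 -- they are aligned,
# so the review comment above was wrong.
```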
Teaching LLMs or manually hacking weights for 2D/3D navigation
If an LLM reads the above explanation about diffs, what is the chance it will now understand all other diffs properly?
How many parameters would be required to store this knowledge, and how reliably could the model then use it?
If diffs are 2D, then making connections between files is a 3D task. So how do we teach it that? I am pretty sure the LLMs behind GitLab are services well over 100B parameters. The Python code that parses unified diffs is about 10 kB max. In machine-code “weights” it would probably be 1000 bytes max. So what minimal “transformer” architecture do I need, and which weights should I put into it, so that it can successfully navigate unified diffs and apply them to content, spotting mistakes along the way?
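To back the size claim with something concrete, here is a minimal sketch of a single-file unified-diff applier (my own toy, well under 10 kB; it assumes a well-formed diff and raises when context lines don’t match the source, which is the “spotting mistakes” part):

```
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def apply_unified_diff(source_lines, diff_lines):
    """Apply a unified diff for one file; both arguments are lists of lines
    without trailing newlines. Returns the patched list of lines."""
    result, src = [], 0  # src: 0-based cursor into source_lines
    lines = iter(diff_lines)
    for line in lines:
        m = HUNK_HEADER.match(line)
        if not m:
            continue  # skip 'diff --git', 'index', '---', '+++' headers
        start = int(m.group(1)) - 1
        old_left, new_left = int(m.group(2) or 1), int(m.group(4) or 1)
        result.extend(source_lines[src:start])  # untouched lines before the hunk
        src = start
        while old_left > 0 or new_left > 0:
            body = next(lines, None)
            if body is None:
                raise ValueError("diff ended in the middle of a hunk")
            marker, content = (body[0], body[1:]) if body else (" ", "")
            if marker == "\\":       # '\ No newline at end of file'
                continue
            if marker == " ":        # context: must match the source
                if source_lines[src] != content:
                    raise ValueError(f"context mismatch at source line {src + 1}")
                result.append(content)
                src += 1
                old_left -= 1
                new_left -= 1
            elif marker == "-":      # removed: must match the source
                if source_lines[src] != content:
                    raise ValueError(f"removed line mismatch at source line {src + 1}")
                src += 1
                old_left -= 1
            elif marker == "+":      # added: goes only to the result
                result.append(content)
                new_left -= 1
            else:
                raise ValueError(f"unexpected hunk line: {body!r}")
    result.extend(source_lines[src:])  # tail after the last hunk
    return result
```

A few dozen lines of deterministic code, versus 100B+ parameters that still misread the marker column.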
This handcraft won’t be machine learning anymore, but understanding how to map these 2D things onto weights and filters would help to finally get some “extra small text coding model” with decent performance. Maybe it could then serve as a module for larger networks. Like an “attachable brain piece” that helps them get diffs.
That would be awesome! I “believe”.
(I have to stop for now, because the time and size limits of my brain have been reached, but the idea of transplanting a piece of weight space with proper 2D multiplication logic between models still looks exciting to explore.)