HN Reader

Show HN: I modeled the Voynich Manuscript with SBERT to test for structure


I built this project as a way to learn more about NLP by applying it to something weird and unsolved.

The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?

I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption — I call it out directly in the repo — but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred POS-like roles based on position and frequency, and built a Markov transition matrix to visualize cluster-to-cluster flow.
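A minimal sketch of the two ends of that pipeline, suffix stripping and the cluster-to-cluster transition matrix, under stated assumptions. The suffix list contains only the two endings named in the post (the full list lives in the repo and is not reproduced here), the input words are toy examples, and the SBERT + KMeans step is replaced by a stand-in that gives each distinct root its own label. It is not the author's implementation.

```python
from collections import Counter, defaultdict

# Assumption: only the two endings named in the post; the real list is longer.
SUFFIXES = ("aiin", "dy")


def strip_suffix(word, suffixes=SUFFIXES):
    """Remove the longest matching suffix, keeping at least one root character."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def transition_matrix(labels):
    """Row-normalised counts of labels[i] -> labels[i + 1]."""
    counts = defaultdict(Counter)
    for current, following in zip(labels, labels[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Toy input, for illustration only.
words = ["qokaiin", "chedy", "daiin", "shedy", "qokaiin", "chedy"]
roots = [strip_suffix(w) for w in words]

# Stand-in for SBERT + KMeans: one label per distinct root, in order of appearance.
label_of = {root: i for i, root in enumerate(dict.fromkeys(roots))}
labels = [label_of[r] for r in roots]
matrix = transition_matrix(labels)
```

Each row of `matrix` sums to 1, so it can be read as "given a word from cluster X, how often does cluster Y follow". With real cluster labels from KMeans, the same function applies unchanged.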

It’s not translation. It’s not decryption. It’s structural modeling — and it revealed some surprisingly consistent syntax across the manuscript, especially when broken out by section (Botanical, Biological, etc.).

GitHub repo: https://github.com/brianmg/voynich-nlp-analysis

Write-up: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...

I’m new to the NLP space, so I’m sure there are things I got wrong — but I’d love feedback from people who’ve worked with structured language modeling or weird edge cases like this.

I see that you're looking for clusters within PCA projections -- You should look for deeper structure with hot new dimensionality reduction algorithms, like PaCMAP or LocalMAP!

I've been working on a project related to a sensemaking tool called Pol.is [1], reprojecting its wiki survey data with these algorithms instead of PCA, and it's amazing how much new insight they uncover!

https://patcon.github.io/polislike-opinion-map-painting/

Painted groups: https://t.co/734qNlMdeh

(Sorry, only really works on desktop)

[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...

8 months ago by patcon

A point of note is that the text embedding model used here is paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-mult...), which is about 4 years old. In the NLP world, that's effectively ancient: thanks to general LLM improvements, even small embedding models have become dramatically more robust, both in how much information they represent and in how distinct their embeddings are. Even modern text embedding models not explicitly trained for multilingual support still do extremely well on that type of data, so they may work better for the Voynich Manuscript, whose language is essentially unknown.

The traditional NLP techniques of stripping suffixes and POS identification may actually harm embedding quality rather than improve it, since they remove relevant contextual data from the global embedding.

8 months ago by minimaxir

(I know nothing about NLP)

Does it make sense to check the process with a control group?

E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?
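One cheap control can be sketched without any human-written pseudo-language: shuffle the token order and check whether the measured structure disappears. Note that this swaps in a shuffle baseline, which is a different control from the one the comment proposes, and the statistic used here (conditional entropy of the next token given the current one) is an assumption chosen for illustration, not something the project computes.

```python
import math
import random
from collections import Counter


def conditional_entropy(tokens):
    """H(next | current) in bits, estimated from adjacent token pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    total = sum(pairs.values())
    entropy = 0.0
    for (current, _), n in pairs.items():
        entropy -= (n / total) * math.log2(n / firsts[current])
    return entropy


def shuffled_baseline(tokens, trials=200, seed=0):
    """Mean conditional entropy after destroying the original order."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        copy = list(tokens)
        rng.shuffle(copy)
        scores.append(conditional_entropy(copy))
    return sum(scores) / len(scores)


# Toy sequence with perfectly predictable order, for illustration only.
structured = ["a", "b", "c"] * 20
observed = conditional_entropy(structured)
baseline = shuffled_baseline(structured)
```

If the observed value for real cluster labels sits close to the shuffled baseline, the "structure" is mostly a frequency effect. A human-written or generated gibberish text would be passed through the same two functions as a second control.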

8 months ago by thih9

I had a look at the manuscript for a while and found it suspicious how tightly packed the writing was against the illustrations on some pages. In common language words and letters vary in width, so when you approach the end of the line when writing, you naturally insert a break to begin a new word and avoid overrun. The manuscript is missing these kinds of breaks - I saw many places where it looked like whatever letter might squeeze in had been written at the end of the line.

I wanted to do an analysis of what letters occur just before/after a line break to see if there is a difference from the rest of the text, but couldn't find a transcribed version.
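Given any line-by-line transliteration, that comparison is a few lines of counting. The sketch below assumes one manuscript line per string with space-separated words, and the sample lines are invented for illustration rather than taken from a real transcription.

```python
from collections import Counter


def final_char_distribution(words):
    """Relative frequency of the last character of each word."""
    counts = Counter(word[-1] for word in words if word)
    total = sum(counts.values())
    return {char: n / total for char, n in counts.items()}


def line_end_vs_rest(lines):
    """Compare last characters of line-final words with those of all other words."""
    line_final, others = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue
        line_final.append(words[-1])
        others.extend(words[:-1])
    return final_char_distribution(line_final), final_char_distribution(others)


# Invented toy lines, not real manuscript text.
lines = [
    "daiin chedy qokam",
    "shedy daiin otam",
    "chedy qokaiin dam",
]
at_line_end, elsewhere = line_end_vs_rest(lines)
```

A large gap between the two distributions on real data would support the idea that line ends are treated specially. The same split works for the first character after a line break by taking `words[0]` instead of `words[-1]`.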

My completely amateur take is that it's an elaborate piece of art or hoax.

8 months ago by cedws

UMAP or t-SNE would be nice, even if PCA already shows nice separation.

Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis.

8 months ago by tetris11

This is very interesting. You should post a link to https://www.voynich.ninja/index.php

I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, so as prefixes are quite short, you've lost roughly half the information before commencing your analysis.

You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones closer to meaningless gibberish). Gordon Rugg, Torsten Timm, and I have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...

8 months ago by DonaldFisk

Maybe I missed it in the README, but how did you do the initial encoding for the "words"? For example, if you have "okeeodair" as a word, where do you map that back to the original symbols?

8 months ago by Avicebron

I’ve found this to be one of the most interesting hypotheses: http://voynichproject.org/

The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.

I’ve also come across accounts that it might be a Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.

8 months ago by us-merul

Given that it dates from the 15th century, the obvious reason to encrypt text would have been to avoid religious persecution during "The Inquisition" (and other religion-motivated violence of that time). So it would be interesting to run the same NLP against the Gospels and look for correlations with that. You'd want to first do a 'word'-based comparison, and then a 'character'-based comparison. I mean compare the graphs from the Bible to the graphs from Voynich.

Also, there might be some characters that are in there just to confuse. For example, that bizarre capital "P"-like thing that has multiple variations sometimes seems to appear far too often to represent real language, so it might be just an obfuscator that's removed prior to decryption. There may be other characters that are abnormally "frequent", and maybe they're unused dummy characters too. But I realize the "too many Ps" problem is also consistent with pure fiction.

8 months ago by quantadev

What I'd expect from a handwritten book like that, if it is just gibberish and not a cypher of any sort, is that the style, the calligraphy, the words used, even the letters themselves should evolve from page 1 to the last page. Pages could have been reordered, of course, but it should still be noticeable.

Unless the author had already written tens of books exactly like that before, which didn't survive, of course.

I don't think it's a very novel idea, but I wonder if there's an analysis of patterns like that. I haven't seen mentions of page-to-page consistency anywhere.
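One way such a drift check could look, as a rough sketch: compare each page's character distribution with the first page using Jensen-Shannon divergence. The choice of characters as the unit, the choice of the first page as the reference, and the toy pages are all assumptions for illustration. Handwriting style itself would need image analysis, which this does not touch.

```python
import math
from collections import Counter


def char_distribution(text):
    """Relative frequency of each non-space character."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def drift_from_first_page(pages):
    """Divergence of every page from the first one, in page order."""
    first = char_distribution(pages[0])
    return [js_divergence(first, char_distribution(page)) for page in pages]


# Invented toy pages, not real manuscript text.
pages = ["daiin chedy daiin", "chedy daiin daiin", "okal okal otal"]
drift = drift_from_first_page(pages)
```

A steady upward trend in `drift` across the manuscript would suggest gradual evolution, while a flat or step-shaped curve would point to consistent usage or to distinct sections.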

8 months ago by codesnik

My favorite part of this thread is like a dozen different people replying that it's already been deciphered and none of them posted the same one.

8 months ago by empath75

> Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork.

I'd argue that these are just the camps that non-traditional, amateur analysis efforts fall into. I've only briefly skimmed Voynich work, but my impression is that, traditionally, more academic analyses rely on a combination of linguistic and cryptological analysis. This does happen to be informed by some statistical analysis, but goes way beyond that.

For example, as I recall, the strongest argument that Voynichese probably isn't just an alternative alphabet for a well-known language relies on comparing Voynichese to the general patterns for how writing systems map symbols to sounds. That permits the development of more specific hypotheses about how it could possibly function, including how likely it is to be an alphabet or abjad, and hypotheses about which characters could plausibly represent more than one sound, possible digraphs, etc. All of that work casts severe doubt on the likelihood of it representing a language from the area, because it just can't plausibly represent a language with the kinds of phonological inventories we see in the language families that existed in that place and time.

There's also been some pretty interesting work on identifying individual scribes based on a confluence of factors including, but not limited to, analysis of the text itself. Some of the inferred scribes exclusively wrote in the A language (oh yeah, Voynichese seems to contain two distinct "languages"), some exclusively wrote in the B language, and I think they've even hypothesized that there's one who actually used both languages.

There isn't a lot of popular awareness of this work because it's not terribly sexy to anyone but a linguistics nerd. But I'd guess that any attempt to poke at the Voynich manuscript that isn't informed by it is operating at a severe disadvantage. You want to be standing on the shoulders of the tallest giants, not the ones with the best social media presence.

8 months ago by bunderbunder

Confirm or deny my suspicion: your post and your comments in this thread are substantially written by ChatGPT?

8 months ago by gwillen

This is hands-down the nerdiest and coolest deep-dive into the Voynich I’ve seen.

8 months ago by pawanjswal

Would analysis of a similar body of text in a known language yield similar patterns? Put another way, could running this type of analysis on different types of text help us understand what this script describes?

8 months ago by user32489318

Really cool work here. Have you considered applying these same techniques to the Rohonc Codex? As far as I know, it's the only other book similar to the Voynich Manuscript.

8 months ago by frozenseven

> "New multispectral analysis of Voynich manuscript reveals hidden details"

https://arstechnica.com/science/2024/09/new-multispectral-an...

but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol

8 months ago by ck2

Another great natural mystery that machine learning could tackle is earthquake prediction. Surely you could find some patterns by modeling historical data.

8 months ago by bdbenton5255

Although I skimmed the methodology out of curiosity, what really drew my eye was the transcription of the manuscript in the repository. This led me down a rabbit hole ending here [1], about historic efforts to transcribe or transliterate the manuscript.

[1] https://www.voynich.nu/transcr.html

8 months ago by andrewla

How expensive is a "brute force" approach to decoding it? I mean, how about mapping each unknown word to a known word in a known language and improving this mapping until a 'high score' is reached?
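A back-of-the-envelope answer for the exhaustive version: if V distinct unknown words must each map to a different word out of D known words, there are D! / (D - V)! candidate mappings. The sketch below computes the base-10 logarithm of that count. Both vocabulary sizes are hypothetical placeholders, not measurements of the manuscript or of any real dictionary.

```python
import math


def log10_injective_mappings(unknown_words, known_words):
    """log10 of the number of one-to-one word mappings, D! / (D - V)!."""
    if unknown_words > known_words:
        raise ValueError("need at least as many known words as unknown ones")
    # lgamma(n + 1) == ln(n!), which avoids building astronomically large integers.
    log_count = math.lgamma(known_words + 1) - math.lgamma(known_words - unknown_words + 1)
    return log_count / math.log(10)


# Hypothetical sizes, chosen only to show the order of magnitude.
digits = log10_injective_mappings(unknown_words=1000, known_words=20000)
```

Even with these modest placeholder sizes the count has more than 4,000 digits, so enumerating mappings is out of reach. Only the iterative "improve until the score stops rising" variant from the question is practical, and it then depends entirely on how good the scoring function is.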

8 months ago by marcodiego

Sorry if I missed it, but what about keeping the suffixes and trying some fine-tuning on the source, then clustering sentences, or at least pages, which given the medium should be consistent-ish?

8 months ago by gthompson512

I thought it was Old Turkish?

https://www.youtube.com/watch?v=p6keMgLmFEk&t=1s

8 months ago by bpiroman

Voynich is one of my favorite unsolved puzzles. This approach looks fascinating, so thanks for sharing your work here!

8 months ago by thearn4

I feel like we are missing an important point: the transformer model used here was trained on known languages. This means it cannot extract meaningful embeddings from a text in an unknown language. Are the plots just noise, then?

8 months ago by theRealEros

The link to the write-up seems broken, can you write the correct one?

8 months ago by GTP

TIL about the Voynich manuscript. Fascinating. Thank you.

8 months ago by rossant

There's no need to do any of this. It's fake, it's a forgery.

8 months ago by mach5

https://m.xkcd.com/593/

8 months ago by AStonesThrow

I strongly believe the manuscript is undecipherable in the sense that it's all gibberish. I can't prove it, but at this point I think it's more likely than not to be a hoax.

8 months ago by glimshe

This looks very interesting - nice work!

I have no background in NLP or linguistics, but I do have a question about this:

> I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants

This seems to imply stripping the right-hand edges of words, with the assumption that the text was written left to right? Or did you try both possibilities?
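One way to avoid committing to a direction up front is to measure both word edges and see where the repetition actually concentrates. This is a sketch under assumptions: the fixed n-gram length, the "share of the most common edge n-gram" score, and the toy word list are all illustrative choices. In a transliteration, "right-hand edge" simply means the last characters in transcription order.

```python
from collections import Counter


def edge_ngram_counts(words, n=2):
    """Count the n-character strings found at the start and at the end of words."""
    starts = Counter(w[:n] for w in words if len(w) > n)
    ends = Counter(w[-n:] for w in words if len(w) > n)
    return starts, ends


def concentration(counter):
    """Share of words covered by the single most common edge n-gram."""
    total = sum(counter.values())
    return max(counter.values()) / total if total else 0.0


# Toy word list, for illustration only.
words = ["chedy", "shedy", "okedy", "qokedy", "daiin", "otedy"]
starts, ends = edge_ngram_counts(words)
start_score = concentration(starts)
end_score = concentration(ends)
```

If one edge is far more concentrated than the other on real data, that edge is the more plausible place for affix-like material, whichever way the script was physically written.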

Once again, nice work.

8 months ago by andyjohnson0

The best work on Voynich has been done by Emma Smith, Coons and Patrick Feaster, about loops and QOKEDAR and CHOLDAIIN cycles. Here's a good presentation: https://www.youtube.com/watch?v=SCWJzTX6y9M Zattera and Roe have also done good work on the "slot alphabet". That so many are making progress in the same direction is quite encouraging!

https://www.voynich.ninja/thread-4327-post-60796.html#pid607... is the main forum discussing precisely this. I quite liked this explanation of the apparent structure: https://www.voynich.ninja/thread-4286.html

> RU SSUK UKIA UK SSIAKRAINE IARAIN RA AINE RUK UKRU KRIA UKUSSIA IARUK RUSSUK RUSSAINE RUAINERU RUKIA

That is, there may be 2 "word types" with different statistical properties, as Feaster's video above describes (perhaps, e.g., 2 different cyphers used "randomly" next to each other). Figuring out how to imitate the MS's statistical properties would let us determine the cypher system and make steps towards determining its language etc., so most credible work has gone in this direction over the last 10+ years.

This site is a great introduction/deep dive: https://www.voynich.nu/

8 months ago by veqq

In short, the manuscript looks like a genuine text, not like a random bunch of characters pretending to be a text.

<quote>

Key Findings

* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group

* Cluster 3 has high diversity and flexible positioning — likely a root content class

* Transition matrix shows strong internal structure, far from random

* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)

Hypothesis

The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.

</quote>
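The per-cluster properties named in the quoted findings (frequency, diversity, line-start behaviour) can be computed with plain counting. The sketch below is an assumption about what those terms mean: diversity is taken as distinct roots divided by tokens, and line-start rate as the share of a cluster's tokens that open a line. The input format and the toy data are hypothetical, and the cluster ids 8 and 3 are reused only to echo the quote.

```python
from collections import defaultdict


def cluster_profile(lines):
    """lines: list of lines, each a list of (root, cluster_id) pairs."""
    tokens = defaultdict(int)
    roots = defaultdict(set)
    line_starts = defaultdict(int)
    for line in lines:
        for position, (root, cluster) in enumerate(line):
            tokens[cluster] += 1
            roots[cluster].add(root)
            if position == 0:
                line_starts[cluster] += 1
    return {
        cluster: {
            "frequency": count,
            "diversity": len(roots[cluster]) / count,
            "line_start_rate": line_starts[cluster] / count,
        }
        for cluster, count in tokens.items()
    }


# Invented toy data: one repeated root in cluster 8, varied roots in cluster 3.
lines = [
    [("qo", 8), ("che", 3), ("qo", 8), ("she", 3)],
    [("qo", 8), ("ota", 3)],
]
profile = cluster_profile(lines)
```

In this toy input, cluster 8 shows the "function word" pattern (one root, often line-initial) and cluster 3 the "content" pattern (every token a different root, never line-initial). Whether the real clusters separate this cleanly is exactly what the controls suggested elsewhere in the thread would test.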

8 months ago by nine_k

Wasn't it already deciphered, though?

https://www.researchgate.net/publication/368991190_The_Voyni...

8 months ago by ablanton