Project 44: Cracking the Voynich Code

Students: Peter Roush and Bryce Shi
Supervisors: Derek Abbott, Maryam Ebrahimpour and Brian Ng
Project 44: Cracking the Voynich Code
Final Seminar

Outline
The Voynich Manuscript: Objectives, Background and Motivation
Analysing the Manuscript: Techniques, Research, and Testing
The Information Learnt: Results and Analysis
Project Management: Team Roles, Milestones, and Budgeting
Conclusions
Slide 2 of 44

Background, Motivation, and Objectives The Voynich Manuscript

A Brief History
Voynich Manuscript
  Found in an Italian castle by Wilfrid Voynich, a book collector
  Pages and some references dated to the 15th century
  Author or authors unknown
  Language unknown
  Pictures have been inconclusively matched to plants in Europe and South America
Electronic Transcriptions
  At least two different languages or dialects
  Hard to separate letters into a fixed alphabet
  Interlinear Transcription File
Slide 4 of 44

Current Theories
Early Language or Writing System
  Early Welsh (Tim Ackerson)
  Romanised Manchu Chinese (Zbigniew Banasik)
Code
  Fake cipher related to Arabic numerals (D'Imperio)
  Cipher by Roger Bacon (William Newbold)
  Cipher by Antonio Averlino (Nick Pelling)
  Certain pages are key to unlocking the mystery (Mark Sullivan)
Hoax
  Written to scam money out of Rudolf II (Raphael Mnishovsky)
  Written by Voynich for money and fame
Slide 5 of 44

Voynich Manuscript
  Part 1 (Herbal): 129 pages
  Part 2 (Astronomical): 12 pages
  Part 3 (Biological): 20 pages
  Part 4 (Cosmological): 20 pages
  Part 5 (Pharmaceutical): 18 pages
  Part 6 (Recipes): 25 pages
Detailed chemical analysis can be found at Yale: http://beinecke.library.yale.edu/sites/default/files/voynich_analysis.pdf
Pictures reproduced from Beinecke Library under the free public domain licence
Slide 6 of 44

Characters (EVA Alphabet)
Humanist minuscule writing (left)
Picture from http://www.afternight.com/runes/a-voynich.gif and dictionarytoday.tumblr.com
Slide 7 of 44

Objectives
  Develop data mining techniques for the unknown language/code in the Voynich Manuscript.
  Compare linguistic features of the Voynich Manuscript and other languages.
  Determine whether the language in the Voynich Manuscript is real, a code, or a hoax.
  Develop a code base and documentation to aid future projects.
Slide 8 of 44

Research, Methods and Tools Analysing The Manuscript

Electronic Transcriptions Slide 10 of 44

Testing Methodology
Used Takahashi Transcription and EVA alphabet for all tests
Handwritten text files for basic verification
10 Comparison Texts of similar length in selected languages
  English (3 Texts)
  Latin
  Italian
  Hungarian
  Hebrew (Without vowel accents)
  Chinese (Simplified Characters)
  Chinese (Pinyin)
Slide 11 of 44

The UN Universal Declaration of Human Rights (UDHR)
  Available in 382 translated languages
  Allows greater selection of comparison languages
  Translations contain an average of 1800 word tokens
Picture reproduced from: www.boes.org (Public Domain)
Slide 12 of 44

Collocations
A collocation is a word combination that occurs more often than would be expected by chance:
  Strong Tea
  Friendly Footing
  Saucer of Milk
  Scotland Yard
Collocations indicate names and expressions in a language, and don't translate well into other languages.
Slide 13 of 44
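One common way to make "more often than would be expected by chance" concrete is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The slides do not say which measure the project used, so PMI, the function name `collocations` and the `min_count` threshold below are assumptions, and the sample sentence is made up. A minimal sketch:

```python
import math
from collections import Counter


def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Higher scores mean the pair occurs together more often than the
    individual word frequencies alone would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring (most collocation-like) pairs first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample text, for illustration only
sample = ("strong tea is strong tea but the tea is not the milk "
          "and the milk is not strong").split()
ranked = collocations(sample)
for pair, score in ranked:
    print(pair, round(score, 2))
```

Raising `min_count` matters on short texts: PMI inflates the score of pairs built from rare words, which is a known weakness of the measure.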

TF-IDF
TF: Term Frequency
  Proportional to the number of times a word is used in a document or section
IDF: Inverse Document Frequency
  Inversely proportional to the number of documents or sections in which a word appears
TF-IDF scores provide a way to find words relevant to a section, while ignoring words that are common across all sections.
Slide 14 of 44
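The two definitions combine into a single score per word and section. TF-IDF has several variants and the slides do not say which one was used, so the sketch below assumes the plain form: relative frequency multiplied by the natural log of (sections / sections containing the word). The function name `tf_idf` and the toy word lists are illustrative only.

```python
import math
from collections import Counter


def tf_idf(sections):
    """Return {section_name: {word: score}} for a dict of tokenised sections.

    TF  = count of the word in the section / section length
    IDF = log(number of sections / number of sections containing the word)
    """
    n_sections = len(sections)
    doc_freq = Counter()
    for tokens in sections.values():
        doc_freq.update(set(tokens))  # count each word once per section

    scores = {}
    for name, tokens in sections.items():
        counts = Counter(tokens)
        scores[name] = {
            word: (count / len(tokens)) * math.log(n_sections / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy input: the word lists are invented for illustration only.
sections = {
    "herbal": "daiin chor daiin cthor".split(),
    "biological": "qol shedy daiin qol".split(),
    "recipes": "qokeedy daiin lchedy qokeedy".split(),
}
scores = tf_idf(sections)
for name, words in scores.items():
    best = max(words, key=words.get)
    print(name, best, round(words[best], 3))
```

A word that appears in every section (here `daiin`) gets an IDF of log(1) = 0 and so scores zero everywhere, which is the "ignoring common words" behaviour the slide describes.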

Word Recurrence Interval (WRI)
WRI is defined as the number of words between successive occurrences of a keyword
Example with the keyword "I" (on the slide, the 11 words between the first two occurrences are numbered 1 to 11):
  I have six locks on my door all in a row. When I go out, I lock every other one. I figure no matter how long somebody stands there picking the locks, they are always locking three.
Word recurrence intervals: {0, 11, 2, 4}
Slide 15 of 44
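The definition can be checked against the slide's own example. In the sketch below the function name, the `include_leading` flag and the simple English tokeniser are assumptions. The leading 0 in the slide's result is read here as the number of words before the first occurrence of the keyword, which the slides do not state explicitly.

```python
import re


def word_recurrence_intervals(text, keyword, include_leading=False):
    """Return the number of words between successive occurrences of keyword.

    If include_leading is True, the count of words before the first
    occurrence is prepended (an assumed reading of the slide's leading 0).
    """
    # Simple tokeniser for the English example; an EVA transcription
    # would need its own word-splitting rule.
    tokens = re.findall(r"[A-Za-z']+", text)
    positions = [i for i, tok in enumerate(tokens) if tok == keyword]
    gaps = [later - earlier - 1
            for earlier, later in zip(positions, positions[1:])]
    if include_leading and positions:
        gaps.insert(0, positions[0])
    return gaps


quote = ("I have six locks on my door all in a row. When I go out, "
         "I lock every other one. I figure no matter how long somebody "
         "stands there picking the locks, they are always locking three.")

intervals = word_recurrence_intervals(quote, "I")
with_leading = word_recurrence_intervals(quote, "I", include_leading=True)
print(intervals)     # [11, 2, 4]
print(with_leading)  # [0, 11, 2, 4]
```

The match is case-sensitive, so "in" and other words containing the letter are not counted as the keyword.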

Support Vector Machine (SVM)
[Figure: training data from two classes and an unknown point to be classified]
  SVM is a binary classifier
  Defines a decision boundary from a set of training data which is split into two distinct classes
  Assigns new testing data to one of those classes based on the decision boundary
  Can be used for authorship detection
Picture modified from: Martin Law, 3/1/11, http://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf
Reference: Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated Authorship Attribution Using Advanced Signal Classification Techniques. PLoS ONE 8(2): e54998. doi:10.1371/journal.pone.0054998
Slide 16 of 44
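To illustrate the train-then-assign idea, here is a deliberately small linear SVM trained by subgradient descent on the hinge loss, written in plain Python. It is a sketch of the general technique, not the project's implementation (the Team Roles slide lists the SVM phase under MATLAB code). The function names, the hyperparameters and the two-feature toy data are all hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 or -1. Returns (w, b).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: step towards the sample
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only apply regularisation
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical data: two numeric features per text, two known classes.
samples = [(0, 0), (0, 1), (1, 0), (4, 4), (4, 5), (5, 4)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(samples, labels)
print(predict(w, b, (0.5, 0.5)))  # unknown text near the first class
print(predict(w, b, (4.5, 4.5)))  # unknown text near the second class
```

Because an SVM is binary, comparing one text against several candidate languages needs a scheme such as one-versus-rest on top of this. The slides do not say which scheme was used.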

Language Investigations (Herbal Book)
Language and grammar were lax at times
  Repeated letters skipped
  Words abbreviated with symbols
  Position-dependent letters
  Two different interchangeable versions of the letter s
Different authors, different substitutions
  Separate authors would substitute words with their own symbols
Penmanship questionable
  Words sometimes written as one word, sometimes split apart
  Words continued on different lines
  Occasionally an indicator would show that a word had been split
Slide 17 of 44

Results and Analysis Information Learnt

Section          Currier Language  Pages  Tokens  Words  Words per Page  Full Alphabet Length  Common Alphabet Length
Cosmological     Unknown              20    3008   1521             150                    27                      24
Biological       B                    20    6917   1549             346                    21                      18
Herbal A         A                    97    7956   2492              82                    32                      21
Herbal B         B                    32    3442   1349             108                    23                      20
Recipes          B                    25   11417   3328             457                    29                      19
Pharma           A                    18    2573   1139             143                    21                      19
Zodiac           Unknown              12    1331    808             111                    20                      19
Unclassified     Unknown              12    1276    708             106                    28                      24
Missing                               20       0      0               0                     0                       0
Full Manuscript                      256   37945   8105             161                    47                      21
Slide 19 of 44

Common Letter Combinations Slide 20 of 44

Word and Illustration Relationships Slide 21 of 44

Words and Illustration Relationships
Astrological  Biological  Cosmological  Pharma    Recipes  Herbal
osar          qol         v             daiin     qokeedy  Daiin
oteody        qolkeedy    ytaiin        okeol     qokaiin  chor
oteotey       qokedy      k             ctheol    lchedy   cthor
eody          qokain      {&169}        olchor    lkaiin   ctho
okalar        shedy       {&171}        qoor      lkain    qotchor
okeodaly      lchedy      x             shockhey  qokain   qotchy
Slide 22 of 44

Word Lengths and Frequency Slide 23 of 44

UDHR and Word Lengths
Text     Tolerance  Match   UDHR Match (Peak Length)
Voynich  10%        45.45%  Arabic, Standard (2)
Voynich  15%        54.54%  Arabic, Standard (2)
Voynich  25%        63.63%  Malay (Arabic) (4)
Voynich  40%        72.72%  Hebrew (4), Malay (Arabic) (4), Guarayu (5), Arabic (Standard) (2)
Voynich  50%        81.81%  Arabic (Standard) (2), Hausa (Niger) (2), Hausa (Nigeria) (2)

Voynich word length distribution:
Length     1      2      3      4       5       6       7       8      9      10     11
Frequency  4.13%  8.52%  9.45%  17.01%  23.95%  18.84%  11.12%  4.49%  1.68%  0.52%  0.14%
Slide 24 of 44
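The match values on this slide are all multiples of 1/11 (45.45% = 5/11, 54.54% = 6/11, and so on up to 81.81% = 9/11). That suggests, though the slides do not state it, that a match is the share of the 11 word-length bins in which a comparison text falls within the given tolerance of the Voynich value. The sketch below implements that assumed rule. The function names and the `candidate` distribution are hypothetical; only the `VOYNICH` figures come from the slide.

```python
# Voynich word-length frequencies (percent) for lengths 1 to 11, from the slide
VOYNICH = [4.13, 8.52, 9.45, 17.01, 23.95, 18.84, 11.12, 4.49, 1.68, 0.52, 0.14]


def tolerance_match(reference, candidate, tolerance):
    """Percentage of word-length bins where candidate lies within
    `tolerance` (a fraction, e.g. 0.10) of the reference value."""
    hits = sum(
        1 for ref, cand in zip(reference, candidate)
        if abs(cand - ref) <= tolerance * ref
    )
    return 100.0 * hits / len(reference)


def peak_length(distribution):
    """Word length (1-based) with the highest frequency."""
    return distribution.index(max(distribution)) + 1


# Hypothetical comparison distribution, invented for illustration only
candidate = [4.0, 9.0, 12.0, 17.5, 20.0, 19.0, 11.0, 5.5, 1.7, 0.9, 0.1]

print(round(tolerance_match(VOYNICH, candidate, 0.10), 2))
print(peak_length(VOYNICH))
```

Under this rule a relative tolerance is very strict for the rare long words (10% of 0.14% is a tiny window). That may be one reason the match only climbs slowly as the tolerance is widened.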

WRI and Rank Plot Slide 25 of 44

UDHR and WRI
Name               Tolerance  Match  UDHR Match                       Comments
Voynich Herbal A   10%        17%    Bosnian (Latin)                  f15r - f22v
Voynich Herbal A   10%        12%    Jola-Fonyi                       f3r - f10v
Voynich Biology B  10%        3%     Hmong (Southern Qiandong), Aceh  f83r - f85r1
Voynich Recipe B   10%        22%    Bosnian (Latin), Mapudungun      f113r - f114r
Herbal Book        10%        8%     Hmong, Southern Qiandong         16th Century

Comparison text of ~1500 words
Average UDHR text length is ~1800 words
Top 100 data points
Slide 26 of 44

Word Frequency and Zipf's Law Slide 27 of 44

Word Entropy Slide 28 of 44

Collocations Slide 29 of 44

Word Structure Slide 30 of 44

Punctuation Slide 31 of 44

Punctuation Slide 32 of 44

Support Vector Machine (SVM)

Comparison against normal languages:
Language           Comparisons           Group
Voynich Takahashi  Normalised Frequency  Hebrew
Voynich Takahashi  σ WRI                 Russian
Normal languages compared: Chinese, English Sherlock Holmes, Hebrew, Hungarian, Italian, Latin, PinYin, Russian

Comparison between Voynich sections:
Language                    Comparisons  Group
Voynich Takahashi Herbal A  Frequency    Zodiac
Voynich Takahashi Herbal A  WRI          Pharmaceutical
Voynich languages compared: Biological, Cosmological, Herbal A, Herbal B, Pharmaceutical, Recipes, Unknown, Zodiacs
Slide 33 of 44

Multiple Discriminant Analysis (MDA)
[Figure: MDA on frequency, with groups labelled A and B]
Slide 34 of 44

Multiple Discriminant Analysis (MDA)
[Figure: MDA on WRI, with groups labelled A and B]
Slide 35 of 44

Risk Management, Budgeting, Timeframes and Approach Project Management

Risk Management and Budget
No.  Risk                                                                   Likelihood      Consequence  Risk Level
1    Not understanding the project correctly and the processes required     Almost Certain  Moderate     Very High
2    Inaccurate allocation of time and resources to a particular area       Likely          Major        Very High
3    Health issues due to long periods of time sitting and working at a PC  Likely          Moderate     High
4    Files and working copies lost                                          Rare            Major        Medium
5    UofA Electrical Engineering server down for unknown reasons            Unlikely        Moderate     Medium
6    Not being able to solve the Voynich Manuscript code                    Almost Certain  Negligible   Medium

Budget: $396.46 (spent on 3 books, printing and lamination)
Slide 37 of 44

Final Approach
  Phase 1: Characterise the text
  Phase 2: Associate pictures with word frequency
  Phase 3: WRI vs Rank Plots
  Phase 4: Other ideas
  Phase 5: SVM and MDA Authorship Techniques
Slide 38 of 44

Team Roles
Peter
  Python Code
  Phase 2
  Phase 4
  Compilation of testing material
  Research as necessary
Bryce
  MATLAB Code
  Phase 3
  Phase 5
  Analysis of known 15th Century Text
  Research as necessary
Slide 39 of 44

Project Progress Slide 40 of 44

Interpretation of Results Conclusion

Conclusions
  The writing and language in the Voynich Manuscript appear to have evolved over time, making analysis difficult.
  There is a relationship between language and section, but this may not have anything to do with the illustrations.
  Based on characteristics such as word length distribution and WRI, the text appears similar to languages such as Hebrew and Latin.
  The text may contain punctuation, based on line characteristics.
  Word order is weak, indicating a lack of phrases and proper nouns, or perhaps the characteristics of a code.
Slide 42 of 44

Future Pathways
  Expand research into word/illustration relationship
  Test the effect of modified alphabets
  Expand research into authorship if possible
  Hidden Markov Model classification of text
  Develop a rule-based grammar for the Manuscript if possible
  Test characteristics against transcriptions of known 15th century codes
Slide 43 of 44

Questions? Reproduced under the Creative Commons Attribution-Non Commercial 2.5 licence from xkcd Slide 44 of 44