Model Description: Memory

A data structure is required that stores the concepts, features, attributes and skills extracted from the language analysis, as well as a container that lists the words and the syntactic information of the processed words.



Lexicon

A lexicon is essentially a catalogue of the words of a given language.


The system will save every semantic word. Semantic words are nouns (concepts and attributes), verbs (skills) and adjectives (features).

The remaining word types, such as adverbs, modal verbs, conjunctions, interjections or punctuation, carry only syntactic and grammatical meaning.


This structure is a list of every lemma in the dictionary, together with the word types it can take.

As a word can be of several types (e.g. animal is a noun or an adjective depending on its position inside a sentence), the structure allows storing more than one type.

If the word has the noun type, the system will create a new frame and save its reference to speed up later searches.
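As an illustration, a minimal C++ sketch of a lexicon entry could look like the following (the names and types are assumptions made for this example, not the actual declarations of the project):

    #include <string>
    #include <vector>

    enum class WordType { Noun, Verb, Adjective, Adverb };

    // Hypothetical lexicon entry: one lemma with all the word types it can take.
    struct LexiconEntry {
        std::string lemma;            // lemma of the word
        std::vector<WordType> types;  // a word can be of several types
        int frameRef = -1;            // reference to its frame when it is a noun
    };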




Dictionary

The set of valid words extracted from the sentences, each with a reference to the corresponding lexicon entry.


Only words with semantic meaning (nouns, verbs and adjectives) are stored in the dictionary, plus adverbs for other purposes.

The system considers that a word doesn't exist if it doesn't appear in the dictionary.


The system handles only general concepts, but thanks to the dictionary it is capable of linking every known word to its concept.

Therefore, when you want to refer to a concept, you can do it with any word that mentions it.

E.g. if you would like to ask for a "table" you can mention "Table", "Tables" ... not only the concept "table" (keep in mind that people sometimes do not know the lemma of a word).


In fact, the dictionary is not strictly necessary for the purpose of this project, but it helps a lot when interacting with it; it's not the same to ask "do cats have four legs?" as "do cat have 4 leg?".

This structure will be a tree, as searches are faster than in a list.


The nodes of the tree contain the word, a lexicon reference to its corresponding lemma, the number of times the word has been referenced, and its source;

this supports maintenance, statistics, memory purges, and even technical optimizations (to speed up searches, placing the most used words first is usually a good technique).
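A minimal sketch of a node of this tree, assuming a plain binary search tree and illustrative field names:

    #include <string>

    // Hypothetical dictionary node: a word with its lemma reference and usage data.
    struct DictionaryNode {
        std::string word;        // surface form, e.g. "cats"
        int lexiconRef;          // reference to the corresponding lemma in the lexicon
        int referenceCount = 0;  // times this word has been referenced (stats, purge)
        int source = 0;          // source that introduced the word
        DictionaryNode* left = nullptr;
        DictionaryNode* right = nullptr;
    };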



Frames

Based on the idea of frames described by Marvin Minsky and developed by Roger Schank.


The following data structure has been chosen to represent the general concepts, their characteristics and relations:


Concept: lemma of a common noun.

Every noun identified in each analyzed sentence will have a frame.


As what matters is the concept, not the word that describes it, every word representing the same concept will be treated as the same.


Words: Cats, cat, CAT, cats → concept: cat (its lemma)


o Parents: inheritance relation.

It's a list of lexicon references of nouns, each with its corresponding tendency and source.

This implies that the characteristics of the parent are taken as the concept's own.

With this list the system can categorize the concepts and also establish hierarchies.

E.g. cats are animals → animal is a parent of cat

Assuming animal is alive, the system can then decide that cat is also alive.

Relations of the processed internal language with the ISA code are inserted in this list.


o Features: characteristics that describe the concept.

It's a list of lexicon references of adjectives, each with its corresponding tendency and source.

E.g. cats are nice → nice is a descriptive feature of cat

Relations of the processed internal language with the IS code are inserted in this list.


o Attributes: properties or components that define the concept.

It's a list of lexicon references of nouns, each with its corresponding tendency and source.

Attributes can also be enumerated, so there is also a list of quantities associated with every attribute.

E.g.: cats have 4 legs → leg is a part of cat / "leg of cat" → 4

Relations of the processed internal language with the HAVE code are inserted in this list.


o Skills: abilities that the concept possesses.

It's a list of action verbs (neither modal nor auxiliary) describing what the concept can do.

E.g. cats run → cat can run

It also stores the relation with the concept that receives the action (interactions).

E.g. the cat jumps that fence and the stone → cat can jump fence*1&1/stone*1&1

Relations of the processed internal language with the CAN code are inserted in this list.


o Affected Actions: actions that can be applied to the concept.

In other words, actions that can affect the concept.

It's a list of action verbs (neither modal nor auxiliary verbs).

E.g.: I jumped the cat → cat can be jumped

Relations of the processed internal language with the CANBE code are inserted in this list.

Note: due to the language relation processing, negative relations cannot exist in this list.



Notes:

     - Each item of every list in the frame is a lexicon reference with tendency and source


     - Although adverbs don't have semantic value, they have the function of changing or qualifying the meaning of the word they reference, such as adjectives or verbs.

       For future purposes, adverbs will be saved along with the feature or skill they reference.


Each "noun" type defined lexicon entry, it will have a frame associated.



Weight

Each frame has an associated "weight": a number calculated from the relations the frame has. This number determines how important the frame is within the whole memory. It's a useful measure to determine the validity of its content or to disambiguate the sense of a concept.


The idea is to give more importance to frames with a large number of relations and different sources than to frames with fewer relations but a large tendency.

Relevance is also given to the dispersion of the relations across the characteristics.

E.g. Given the following frames:

     A = parents: parent1, parent2, parent3 / features: (none) / attributes: (none) / ...

     B = parents: parent1 / features: feature1 / attributes: attribute1 / ...

→ then frame B is better than frame A


The applied formula is:

                 $$ weight = \operatorname{round}\left( \frac{100}{16} \sum_{l=1}^{6} \frac{rel_{l} \cdot dsrc_{l} \cdot \left( 1 + \sqrt{\log_{10}(tnd_{l})} \right)}{6} \right) $$


where, for each of the six characteristic lists (parents, features, attributes, skills, interactions and affected actions):

  • rel = number of relations
  • dsrc = number of different sources in each of the relations of the characteristic list
  • tnd = sum of all the tendencies of each relation in the list (omitting the sign)

There is a final adjustment of the formula (the * 100 / 16 factor) so that frames with only one relation, tendency = 1 and a unique origin obtain the value 1.


Examples:

Frame A (the most basic scenario, only 1 relation with tendency = 1 and unique source)

     pars: a/1/1 → 1 * 1 * (1 + sqrt(log(1))) = 1 * 1 * (1 + 0) = 1

     feats: (none) → 0

     attrs: (none) → 0

     skills: (none) → 0

     inters: (none) → 0

     affs: (none) → 0

     ==> 1/6 + 0/6 + 0/6 + 0/6 + 0/6 + 0/6 = 0.166666666 → round(0.166666666 * 100 / 16) = 1


Frame B

     pars: a/1/1, b/-2/2, c/10/3      note: the sign of the tendency is omitted

     feats: d/3/1

     attrs:(none)

     skills: e/4/4, f/2/2

     inters: {f*1*8,g*-1*8}, {}      note: interactions are counted as independent relations

     affs: (none)

     ==> (3*2*(1+1.055))/6 +      note: [2 different origins among the three relations]

             (1*1*(1+0.69))/6 +

             0/6 +

             (2 * 2 * (1 + 0.88))/6 +

             (2 * 1 * (1 + 0.54))/6 +      note: [origin 8 + origin 8 = 1 different origin] / sqrt(log(abs(1) + abs(-1))) = 0.54

             0/6

     ==> (2.055 + 0.28166 + 0 + 1.2533 + 0.51333 + 0) * 100 / 16 ==> round(25.64583333) = 26

This is interpreted as: the second frame is 26 times "stronger" than the first one.
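As a cross-check of the two examples above, a small C++ sketch of the weight computation (assuming a base-10 logarithm, which matches the intermediate values shown):

    #include <cmath>
    #include <vector>

    // Stats of one characteristic list: relations, distinct sources,
    // and sum of the absolute values of the tendencies.
    struct ListStats { int rel; int dsrc; double tnd; };

    // Weight of a frame from its six lists (parents, features, attributes,
    // skills, interactions, affected actions).
    int weight(const std::vector<ListStats>& lists) {
        double sum = 0.0;
        for (const ListStats& l : lists) {
            if (l.rel == 0) continue;  // empty lists contribute 0
            sum += l.rel * l.dsrc * (1.0 + std::sqrt(std::log10(l.tnd))) / 6.0;
        }
        return (int)std::lround(sum * 100.0 / 16.0);  // final adjustment
    }

    // Frame A: {{1,1,1}, {0,0,0}, {0,0,0}, {0,0,0}, {0,0,0}, {0,0,0}} -> 1
    // Frame B: {{3,2,13}, {1,1,3}, {0,0,0}, {2,2,6}, {2,1,2}, {0,0,0}} -> 26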



Sets

Basically, it's a structure used to facilitate the search of concepts when a group question is made.


Every element will have a list of lexicon references for each characteristic: parent, feature, attribute, skill or affected action.

An element is added to the corresponding list when any characteristic of a frame sets its tendency greater than zero, and it's removed otherwise.

There are also corresponding lists for the negative relations, which behave inversely to what is described above.

These lists provide a considerable performance gain when the memory has to be explored searching for which concepts do not have some characteristic, especially in object guessing.


Every entry in the lexicon has a corresponding entry in the sets (at the same position in both lists).

In fact, the sets structure could be merged into the lexicon structure (it would be a single structure), but it was decided to split them to clarify the conceptualization.
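A minimal sketch of one sets entry, under the assumption that each list stores frame references with positive tendency for the corresponding characteristic (names are illustrative):

    #include <vector>

    // Hypothetical sets entry, aligned by position with the lexicon entry.
    struct SetsEntry {
        std::vector<int> parentOf;     // concepts that have this lemma as parent
        std::vector<int> featureOf;    // ... as feature
        std::vector<int> attributeOf;  // ... as attribute
        std::vector<int> skillOf;      // ... as skill
        std::vector<int> affectedBy;   // ... as affected action
        // plus the mirror lists for negative relations (inverse behavior)
    };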



Information Retrieval

Checking the content of the memory or retrieving knowledge is done by answering questions (also indirectly by @order show term).


For the system, answering questions is the same as checking whether the element (the "characteristic") is in the corresponding list (the "key") of the "concept" (frame or set, depending on whether the question type is "affirmative" or "group"). The answer is directly related to the value of the tendency of that element.


For affirmative questions (searches are applied over the frames), the system returns:

For group questions (searches are run throughout the sets):

Remember that the system returns Misunderstand if there is any grammar problem when the question is formulated.


An affirmative question provides a key (which list to use: parent, features, ...), the concept (the initial frame to search) and the characteristic (the element to search for in the list; the answer depends on its tendency). Depending on the key, the characteristic will be searched in one type of list or another:

key         characteristic type   list to search
ISA         noun                  parent
IS          adjective             feature
(verb BE)   noun & adjective      ambiguous ISAIS
HAVE        noun                  attribute
CAN         verb                  skill
CANBE       verb                  affected actions


Take the following graphical schema of an example of memory content:

Every rectangle represents a frame

   the concept is highlighted in bold

   <> means a parent relation (shown with the blue arrows)

   () features

   [] attributes (boxes in red represent the related frame)

   {} skills

   // affected actions

   NOT implies negative tendency in the association


Note: yellow is both a noun and an adjective, and bear is both a noun and a verb.


Let's see some simple examples that will help clarify how the system performs the searches:


   - "is(key=be) dog(concept) black(characteristic)?" As the frame dog has black with positive tendency in its features (key is IS) list, then the answer is Yes.


   - The question "is mouse live?" will be Unknown as it does not exist live in the list of features of the mouse frame.


   - And finally "is tiger mammal?" will be answered as No.


   - Keep in mind that when you ask using the verb "BE" (keys ISA and IS) about a word that is both a noun and an adjective, there is no way to determine whether you are asking about a parent or about a feature.

[in the example, in the frame lion, yellow works as an adjective (features list) and as a noun (parents list)]


In this scenario, the system uses the ambiguous ISAIS strategy, which consists in retrieving the response of asking by parent and by feature and then deducing the answer (a sketch follows the example below):

   - With this memory scenario, if you ask the question "is lion yellow?" the answer will be Unknown.

[yellow has negative tendency in the parents list (No) + yellow has positive tendency in the features list (Yes)]
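A possible sketch of that deduction; only the contradictory case shown in the example (Yes by feature + No by parent → Unknown) is taken from the text, the remaining cases are assumptions:

    enum class Answer { Yes, No, Unknown };

    // Hypothetical ISAIS deduction: ask by parent (ISA) and by feature (IS),
    // then combine both partial answers into a single one.
    Answer deduceIsaIs(Answer byParent, Answer byFeature) {
        if (byParent == byFeature) return byParent;         // both lists agree
        if (byParent == Answer::Unknown) return byFeature;  // only one list knows
        if (byFeature == Answer::Unknown) return byParent;
        return Answer::Unknown;                             // Yes vs No: contradiction
    }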


   - Attribute properties (boxes in red in the figure) are also frames, and the system deals with them in the same way as with regular frames.

can legs of cats hit? Yes (because hit has positive tendency in the skills list (key = CAN) of the frame cat%leg)


   - In the case of asking about an attribute using a number (numbered attributes), as for example "does pets have 4 legs?",

    it will return Yes if the characteristic leg exists as an attribute (key = HAVE) of the frame pet and the indicated number is in the numbered list of that attribute. It answers No otherwise.


Note: if you show the term "bore", it is interesting how the model mixed the facts learned from "born" (action/verb) and "bear" (animal/noun). Check "test.cpp" in the source code.


Also take into account the self-identity scenario.



DEEP SEARCH


Searches can also be done in deep mode (order @mode deepsearch), which means using the inheritance property; in other words, assuming the characteristics of the parents (and also grandparents) as the concept's own. This means that if the element is not found in the frame, it will also be searched in the frames referenced in its parents list.


Considerations:


   - Search first in depth through the parents of the initial node (concept); see the sketch after these considerations.

Example: can cat born?

o born is not in cat.

o get its parents: pet and feline.

o check the first one, pet; as the element is not there, expand its parents: animal and multicellular.

o check and expand the next one, animal; after that check multicellular, then feline, and finally the characteristic is found in its parent mammal.


   - Do not analyze or expand nodes that have already been processed.

For example, from pet you can reach multicellular directly or through animal; multicellular would be checked only once.


   - There is no depth limit.

A characteristic could be searched across the entire memory if the concepts are related by parent associations.

For example, the characteristics of multicellular can be inherited by lion, which implies 4 degrees of kinship.


   - If the question is about a skill (key = CAN), the characteristic will also be searched in the attributes.

For example, "can pet hit?" Hit does not exist in its skills list, but pet has leg as an attribute, and leg has hit in its skills list, therefore Yes (the system considers that "pets can hit").


Some examples of deep searches:


   by parent:

   by feature:

   by attribute:

   by skill:

   by affected actions:


Through group questions you can explore the same "knowledge space" but in reverse, moving around the sets.

As the lists of the sets are populated only with elements with positive tendency, answering a group question reduces to returning the corresponding list.

If deep search is active, the concepts referenced in the parents lists are also added, and so on; with no limit, but without adding duplicated elements.

If the question provides a concept (it is optional), the elements of the results that don't have that concept as a parent will be removed (filtered).


For example (using the same data and knowledge as shown in the example figure above):


   by parent:

   by feature:

   by attribute:

   by skill:

   by affected actions:



OBJECT GUESSING


After modifying the group question graph to allow multiple conditions, the system allows more complex searches.

Therefore the memory can be queried to discover which concepts fulfil a set of conditions.

This mechanism can be very useful in disambiguation tasks.


Exact search

@mode deepsearch on

@mode guessing OFF

what is feline or pet or bird and have legs? dog, cat

        Concepts have to fulfill every condition to be returned as results in the answer.


        With an OR logical operator, the results of both sets are appended (removing the repeated ones).

        With an AND operator, a set intersection is applied (keeping only those elements which are present in both lists; see the sketch below).



        [1 OR 2 → cat, tiger, lion, dog] [OR 3 → cat, tiger, lion, dog, eagle, pigeon] AND 4 → cat and dog are the only concepts in the example dataset which fulfill every condition
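The two operators map naturally to set operations; a minimal C++ sketch (using std::set to keep the results duplicate-free, which is an implementation choice of this example):

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>

    using Concepts = std::set<std::string>;

    // OR: append the results of both sets, removing repeated elements.
    Concepts orOperator(const Concepts& a, const Concepts& b) {
        Concepts out;
        std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                       std::inserter(out, out.begin()));
        return out;
    }

    // AND: set intersection, keeping only elements present in both lists.
    Concepts andOperator(const Concepts& a, const Concepts& b) {
        Concepts out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::inserter(out, out.begin()));
        return out;
    }

    // ((feline OR pet) OR bird) AND have-legs -> {cat, dog}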


Approximation search

@mode deepsearch on

@mode guessing ON

what animal is wild and have legs and is live or white and can hit and not eat and is a pet?

dog(87), cat(62), bear(50), pet(50), mammal(37), bird(37), eagle(37), pigeon(37), elephant(37), mouse(37), lion(37), feline(25)

          Every concept returned in the answer has an associated fulfilment percentage


conditions               dog  cat  bear  pet  mammal  bird  eagle  pigeon  elephant  mouse  lion  feline
1 - parent animal         y    y    y     y    y       y     y      y       y         y      y     y
2 - feature wild          y    y    y     y    y       y     y      y       y         y      y     y
3 - attribute leg         y    y    y     y    -       -     n      -       -         -      -     -
4 - feature live          y    n    -     -    y       y     n      y       y         y      y     n
5 - feature white         y    -    -     -    -       -     -      -       -         -      -     -
6 - skill hit             y    y    y     y    -       -     -      -       -         -      -     -
7 - negative skill eat    -    -    y     -    -       -     n      -       -         -      -     y
8 - parent pet            y    y    -     -    -       -     -      -       -         -      -     -
percentage                87   62   50    50   37      37    37     37      37        37     37    25

          The percentage is (cf / nc) * 100 (decimals are truncated; see the sketch at the end of this section)

              - cf is the number of conditions fulfilled by the concept

              - nc is the total number of conditions the question has

          E.g. for "dog": 7/8 = 0.875; * 100 = 87.5; = 87% success rate, or the probability that this concept is the searched one


          In this case, there is no difference between logical operators

             - Which pet is black AND white? dog(100), cat(66)

             - Which pet is black OR white? dog(100), cat(66)


          The results can be managed using the guessing threshold and max results orders.


When the value of the threshold is 100, the results obtained through approximation search are the same as those obtained using exact search.

But exact search is a much faster method, since it is not always necessary to run a memory search for every condition.
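For reference, the fulfilment percentage reduces to a one-line computation (integer division truncates the decimals, as in the "dog" example above):

    // cf = conditions fulfilled by the concept, nc = total number of conditions.
    int fulfilmentPercentage(int cf, int nc) {
        return cf * 100 / nc;  // e.g. 7 * 100 / 8 = 87
    }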




NUMBERED ATTRIBUTES QUESTIONS


Using the kind of question described here, you can ask about the values (numbers) of attributes of any concept.


Using the example dataset schema shown above:


The alternative is to check the frame content, or to ask for every number using the numbered affirmative questions.

For example: does a cat have four legs? Yes; have cats got 5 legs? No; have...

The search is made by applying the following algorithm:


Let's see some examples using the following example data set:


scenario 1

- how many legs does the person have? 2,4 → positive tendency relation with values in the numbered attribute
- how many limb does the person have? Any → positive tendency relation, but no values declared
- how many arm does the person have? 1 → explicit value
- how many wings does the person have? None → negative tendency relation
- how many hands does the person have? None → no relation


scenario 2

- how many claws does the person have? 10 → from the attribute "limb". The "claws of wings" are not taken into account because the attribute "wing" has negative tendency,

neither are the "claws of mammal", because the relation has been found in the attributes, so the parents are not explored

- how many fingers does the person have? 1-5 → 1 from the attribute "arm" (when the search has more than one value, an empty numbered list takes the value of one),

2,3,4 from "claw of limb", and 5 from the attribute "leg" (also 3, but duplicate elements are removed from the answer)

- how many toes does the person have? Any → from the attribute "claw of limb of person"


scenario 3

- how many eyes does the person have? 6 → from the parent "mammal" (as the relation has been found, the parents and sub-attributes of this concept are not expanded, so the concept "animal" is not checked, and the values of "eye of animal" are not added)
- how many necks does the person have? 1 → from the parent "mammal"
- how many fur does the person have? Any → explicit value
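A sketch of the basic decision rules illustrated by scenario 1 above (the structure and names are assumptions; the attribute search itself, through sub-attributes and parents, is omitted):

    #include <string>
    #include <vector>

    struct NumberedAttribute {
        int tendency;             // signed tendency of the HAVE relation
        std::vector<int> values;  // declared quantities, may be empty
    };

    // Answer to "how many <attribute> does <concept> have?" for a direct relation.
    std::string howMany(const NumberedAttribute* attribute) {
        if (attribute == nullptr) return "None";      // no relation found
        if (attribute->tendency < 0) return "None";   // negative tendency relation
        if (attribute->values.empty()) return "Any";  // positive relation, no values
        std::string out;
        for (int v : attribute->values)               // list the declared values
            out += (out.empty() ? "" : ",") + std::to_string(v);
        return out;
    }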




INTERACTIONS


Using the kind of question described here and here, you can ask about the relations between concepts.


The searches performed to answer this kind of questions are the same as those described for the frame/set searches (taking into account the deep search and the filters).


In brief:


Let's see some examples using the following example data set:


* deep search off
* all filters off
- can cats jump a fence? Yes → positive interaction tendency
- can cats jump sky? No → negative interaction tendency
- can cats jump the forest? No → no interaction found
- can cats run a forest? No → no skill found
- can cats fly sky? No → negative skill
- can cats jump dog? Yes → by attribute inheritance
- can cats eat animals? No → no skill found
* deep search on
- can cats eat animals? Yes → by parent inheritance
- can cats fly sky? No → by parent inheritance it would be Yes, but this skill relation is explicitly negative in the frame of cat
- can cats blow sky? Yes → by attribute inheritance of the parent inheritance


- what does cat eat? animal → by mammal
- what does cat run? None → no frame has the relation
- what can jump wall? mammal, cat → cat comes from the parent inheritance
* tendency filter = 2
- what does cat jump? sky, wall → by mammal; cat's own jump relation has 1 as tendency, so it's filtered
* tendency filter = 3
- what can jump wall? mammal → the parent relation has 2 as tendency, so it's filtered
* tendency filter off
* multiple source filter on
- what does cat jump? None → all the relations have been created with only 1 source, so any active source filter will purge every relation
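A sketch of how a skill and its interactions could be stored, following the textual encoding used earlier ("jump fence*1&1/stone*1&1"); the assumption here is that each interaction carries the target concept plus a tendency and a source:

    #include <vector>

    // Hypothetical interaction: the concept that receives the action.
    struct Interaction {
        int targetRef;  // lexicon reference of the affected concept (e.g. fence)
        int tendency;   // signed strength, e.g. the "1" in fence*1&1
        int source;     // origin, e.g. the "&1" in fence*1&1
    };

    struct Skill {
        int verbRef;                            // action verb (e.g. jump)
        int tendency;                           // tendency of the skill itself
        std::vector<Interaction> interactions;  // fence*1&1, stone*1&1, ...
    };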




FILTERS


As the system is fed with English sentences, it may happen that, after processing a lot of them, some are misinterpreted or do not have enough different mentions to be considered valid.

E.g.: "the blue cat" is mentioned 3 times in only one text, but "the black cat" thousands of times and "the white cat" by 6 different sources.

        So if you ask what color a cat is, the system will answer blue, black, and white.


For this reason, a mechanism has been created to specify search criteria that discard those weak relations without removing them from the memory.

All the searches can then be tuned to discard those relations that are not strong enough to be considered true.
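A minimal sketch of such a search-time filter: the relation stays in memory, and it is only the predicate that decides whether a search may use it (the threshold semantics follow the interaction examples above; names are illustrative):

    #include <cstdlib>

    struct WeightedRelation {
        int tendency;     // signed strength of the relation
        int sourceCount;  // number of different sources that mentioned it
    };

    // True if the relation is strong enough for the current search criteria.
    bool passesFilters(const WeightedRelation& r,
                       int tendencyFilter,         // 0 = filter off
                       bool multipleSourceFilter) {
        if (tendencyFilter > 0 && std::abs(r.tendency) < tendencyFilter)
            return false;  // tendency below the threshold: discarded
        if (multipleSourceFilter && r.sourceCount < 2)
            return false;  // relations built from a single source: discarded
        return true;       // the relation itself is never removed from memory
    }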


Let's see some examples using the following example data set: