NlpSaftToken

GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken


Description

A document token marks a span of bytes in the document text as a token or word. Next available index: 16.

Attributes List

This module has the following attributes (case-insensitive ascending order):

Attributes

  1. breakLevel (type: String.t, default: nil)
    -
  2. breakSkippedText (type: boolean(), default: nil)
    - Whether the break skipped over non-tag text (excluding script/style).
  3. category (type: String.t, default: nil)
    - Coarse-grained word category for token. See README.categories for category inventory.
  4. end (type: integer(), default: nil)
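    - Index of the last byte of the token in document.text (inclusive). See start for how [start, end] define the byte range.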
  5. head (type: integer(), default: nil)
    - Head of this token in the dependency tree: the id of the token which has an arc going to this one. If it is the root token of a sentence, then it is set to -1.
  6. info (type: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet, default: nil)
    - Annotation for this token.
  7. label (type: String.t, default: nil)
    - Label for dependency relation between this token and its head. See README.labels for label inventory.
  8. lemma (type: String.t, default: nil)
    - Word lemma. This is only filled if the lemma is different from the word form.
  9. morph (type: GoogleApi.ContentWarehouse.V1.Model.NlpSaftMorphology, default: nil)
    - Morphology information.
  10. scriptCode (type: String.t, default: nil)
    - A string representation (typically four letters, sometimes longer) of the token's Unicode script code, based on BCP 47/CLDR, capitalized according to ISO 15924. See i18n/identifiers/scriptcode.h for details.
  11. start (type: integer(), default: nil)
    - [start, end] describes the inclusive byte range of the UTF-8 encoded token in document.text. end gives the index of the last byte, which may be a UTF-8 continuation byte, so the length in bytes is end - start + 1. The begin/end options let the goldmine AnnotationsFinder locate the offsets of SAFT tokens. start is inclusive by default and end is marked.
  12. tag (type: String.t, default: nil)
    - Part-of-speech tag for token. See README.tags for tag inventory.
  13. tagConfidence (type: number(), default: nil)
    - Confidence score for the tag prediction -- should be interpreted as a probability estimate that the tag is correct.
  14. textProperties (type: integer(), default: nil)
    -
  15. word (type: String.t, default: nil)
    - Token word form. This may not be identical to the original text. For example, in goldmine annotation we do UTF-8 normalization and punctuation normalization. The punctuation normalization includes inferring the directionality of straight double quotes -- that is, we map " to an open quote (``) or a close quote (''), and sometimes we get it wrong. SAFT processing in other contexts (such as queries in qrewrite) involves different normalizations.
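
Two conventions from the attribute list can be sketched in code: the inclusive [start, end] byte range, and the lemma field that is only filled when it differs from word. The following is a minimal sketch under stated assumptions, not part of the generated client: TokenSketch, surface/2, and lemma/1 are hypothetical names, plain maps stand in for the NlpSaftToken struct, and the sample text and offsets are made up for illustration.

```elixir
defmodule TokenSketch do
  # Hypothetical helper module for illustration; not part of the generated client.

  # Returns the bytes of `text` covered by the token. Both `start` and `end`
  # are inclusive byte offsets, so the length in bytes is end - start + 1.
  def surface(text, %{start: s, end: e})
      when is_binary(text) and is_integer(s) and is_integer(e) and e >= s do
    binary_part(text, s, e - s + 1)
  end

  # `lemma` is only filled when it differs from the word form, so fall back
  # to `word` when it is nil or absent.
  def lemma(token), do: Map.get(token, :lemma) || Map.get(token, :word)
end

# The "ë" is written as an escape so that it is exactly two UTF-8 bytes.
text = "Zo\u00EB runs"

# Made-up tokens: plain maps standing in for NlpSaftToken structs.
first = %{start: 0, end: 3, word: "Zo\u00EB", lemma: nil}
second = %{start: 5, end: 8, word: "runs", lemma: "run"}

IO.inspect(TokenSketch.surface(text, first))
IO.inspect(TokenSketch.surface(text, second))
IO.inspect(TokenSketch.lemma(first))
IO.inspect(TokenSketch.lemma(second))
```

The sketch uses binary_part/3 rather than String.slice/3 because the offsets count bytes, not characters: the first token spans four bytes but only three characters.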

Type

@type t() :: %GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken{
  breakLevel: String.t() | nil,
  breakSkippedText: boolean() | nil,
  category: String.t() | nil,
  end: integer() | nil,
  head: integer() | nil,
  info: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet.t() | nil,
  label: String.t() | nil,
  lemma: String.t() | nil,
  morph: GoogleApi.ContentWarehouse.V1.Model.NlpSaftMorphology.t() | nil,
  scriptCode: String.t() | nil,
  start: integer() | nil,
  tag: String.t() | nil,
  tagConfidence: number() | nil,
  textProperties: integer() | nil,
  word: String.t() | nil
}
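
The head and label fields in the type above encode a dependency tree, with head set to -1 for the root token of a sentence. The following is a minimal sketch of how such a tree might be traversed. It assumes that a token's id is its zero-based position in the token list, which this page does not state. DependencySketch, roots/1, and children/1 are hypothetical names, plain maps stand in for the struct, and the label values are invented rather than taken from README.labels.

```elixir
defmodule DependencySketch do
  # Hypothetical helper module for illustration; not part of the generated client.
  # Assumption: a token's id is its zero-based index in the token list.

  # Indices of tokens whose head is -1, i.e. sentence roots.
  def roots(tokens) do
    tokens
    |> Enum.with_index()
    |> Enum.filter(fn {token, _index} -> token.head == -1 end)
    |> Enum.map(fn {_token, index} -> index end)
  end

  # Map from a head index to the indices of the tokens attached to it.
  def children(tokens) do
    tokens
    |> Enum.with_index()
    |> Enum.reject(fn {token, _index} -> token.head == -1 end)
    |> Enum.group_by(
      fn {token, _index} -> token.head end,
      fn {_token, index} -> index end
    )
  end
end

# Made-up sentence; the labels are placeholders, not the real label inventory.
tokens = [
  %{word: "Dogs", head: 1, label: "nsubj"},
  %{word: "bark", head: -1, label: "ROOT"},
  %{word: "loudly", head: 1, label: "advmod"}
]

IO.inspect(DependencySketch.roots(tokens))
IO.inspect(DependencySketch.children(tokens))
```

Here the only root is the token at index 1 ("bark"), and the other two tokens are grouped under it as its dependents.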

Function

@spec decode(struct(), keyword()) :: struct()

Data sourced from HexDocs: GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken