GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken
Description
A document token marks a span of bytes in the document text as a token or word. Next available index: 16.
Attributes List
This module has the following attributes (case-insensitive ascending order):
- breakLevel (type: String.t, default: nil)
- breakSkippedText (type: boolean(), default: nil) - Whether the break skipped over non-tag text (excluding script/style).
- category (type: String.t, default: nil) - Coarse-grained word category for token. See README.categories for category inventory.
- end (type: integer(), default: nil)
- head (type: integer(), default: nil) - Head of this token in the dependency tree: the id of the token which has an arc going to this one. If it is the root token of a sentence, then it is set to -1.
- info (type: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet, default: nil) - Annotation for this token.
- label (type: String.t, default: nil) - Label for dependency relation between this token and its head. See README.labels for label inventory.
- lemma (type: String.t, default: nil) - Word lemma. This is only filled if the lemma is different from the word form.
- morph (type: GoogleApi.ContentWarehouse.V1.Model.NlpSaftMorphology, default: nil) - Morphology information.
- scriptCode (type: String.t, default: nil) - A string representation (typically four letters, sometimes longer) of the token's Unicode script code, based on BCP 47/CLDR, capitalized according to ISO 15924. See i18n/identifiers/scriptcode.h for details.
- start (type: integer(), default: nil) - [start, end] describe the inclusive byte range of the UTF-8 encoded token in document.text. End gives the index of the last byte, which may be a UTF-8 continuation byte, and the length in bytes is end - start + 1. begin/end options are for goldmine AnnotationsFinder to locate the offsets of saft tokens. Start is inclusive by default and end is marked.
- tag (type: String.t, default: nil) - Part-of-speech tag for token. See README.tags for tag inventory.
- tagConfidence (type: number(), default: nil) - Confidence score for the tag prediction -- should be interpreted as a probability estimate that the tag is correct.
- textProperties (type: integer(), default: nil)
- word (type: String.t, default: nil) - Token word form. This may not be identical to the original. For example, in goldmine annotation we do UTF-8 normalization and punctuation normalization. The punctuation normalization includes inferring the directionality of straight doublequotes -- that is, we map " to open quote (``) or close quote (''), and sometimes we get it wrong. SAFT processing in other contexts (such as queries in qrewrite) involves different normalizations.
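Because start and end are inclusive byte offsets rather than character offsets, recovering a token's original text means slicing bytes. The following is a minimal sketch, not part of the generated client: the TokenSpan module and the sample text are hypothetical, and tokens are represented as plain maps with the same start and end keys so the example runs without the library.

```elixir
defmodule TokenSpan do
  # Hypothetical helper, not part of the generated client.

  # Returns the raw bytes of `text` covered by the token's inclusive
  # [start, end] byte range, or nil when either offset is missing.
  def slice(text, %{start: s, end: e})
      when is_binary(text) and is_integer(s) and is_integer(e) and e >= s do
    # Length in bytes is end - start + 1, as stated in the attribute description.
    binary_part(text, s, e - s + 1)
  end

  def slice(_text, _token), do: nil
end

# Illustrative document text; "ï" and "é" each occupy two bytes in UTF-8,
# so byte offsets and character offsets diverge after the third character.
text = "naïve café"

first = %{start: 0, end: 5}
second = %{start: 7, end: 11}

IO.inspect(TokenSpan.slice(text, first))
IO.inspect(TokenSpan.slice(text, second))
```

The first call yields "naïve" and the second "café". String.slice/3 would be the wrong tool here because it counts graphemes, not bytes. binary_part/3 raises ArgumentError when the range falls outside the text, which a caller may want to handle. The slice may also differ from the word attribute, since word can be normalized.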
Type
@type t() :: %GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken{
breakLevel: String.t() | nil,
breakSkippedText: boolean() | nil,
category: String.t() | nil,
end: integer() | nil,
head: integer() | nil,
info: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet.t() | nil,
label: String.t() | nil,
lemma: String.t() | nil,
morph: GoogleApi.ContentWarehouse.V1.Model.NlpSaftMorphology.t() | nil,
scriptCode: String.t() | nil,
start: integer() | nil,
tag: String.t() | nil,
tagConfidence: number() | nil,
textProperties: integer() | nil,
word: String.t() | nil
}
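The head field is an integer token id, with -1 marking the sentence root, so a dependency tree can be rebuilt from a flat token list. The sketch below is illustrative only: the DepTree module is hypothetical, tokens are plain maps rather than the struct, the token id is assumed to be the token's position in the list, and the label values are made up (the real inventory lives in README.labels).

```elixir
defmodule DepTree do
  # Hypothetical helper, not part of the generated client.
  # Assumption: a token's id is its zero-based position in `tokens`.

  # Position of the first root token (head == -1), or nil if there is none.
  def root_index(tokens) do
    Enum.find_index(tokens, fn t -> t.head == -1 end)
  end

  # Map from each head position to the positions of its dependents.
  # Root tokens and tokens with no head set are skipped.
  def children(tokens) do
    tokens
    |> Enum.with_index()
    |> Enum.reject(fn {t, _i} -> t.head in [nil, -1] end)
    |> Enum.group_by(fn {t, _i} -> t.head end, fn {_t, i} -> i end)
  end
end

# Illustrative sentence; labels are placeholders, not taken from README.labels.
tokens = [
  %{word: "Dogs", head: 1, label: "nsubj"},
  %{word: "bark", head: -1, label: "ROOT"},
  %{word: "loudly", head: 1, label: "advmod"}
]

IO.inspect(DepTree.root_index(tokens))
IO.inspect(DepTree.children(tokens))
```

Here the root is at position 1 and the children map is %{1 => [0, 2]}. If token ids in real data are not list positions, the grouping key would need to be translated accordingly.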
Function
@spec decode(struct(), keyword()) :: struct()
Data sourced from HexDocs: GoogleApi.ContentWarehouse.V1.Model.NlpSaftToken