TokenProperties
is used to encapsulate information about the characters occurring in a token (for example, upper and lower case).
At its core it is a bitset with inline member functions for convenient access. Each compliant tokenizer has to fill it in and store it with each token.
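The idea of a bitset with inline accessors can be illustrated with a minimal, self-contained sketch. It is not the UIMA implementation: the class name `TokenPropsSketch` and the bit positions are assumptions made for illustration, since the values of the real `UIMA_TOKEN_PROP_*` constants are not given in this reference.

```cpp
#include <cstdint>

// Hypothetical stand-in for uima::TokenProperties; not the UIMA source.
class TokenPropsSketch {
public:
    TokenPropsSketch() : bits_(0) {}
    explicit TokenPropsSketch(std::uint32_t raw) : bits_(raw) {}

    bool hasLeadingUpper() const { return test(kLeadingUpper); }
    void setLeadingUpper(bool on = true) { assign(kLeadingUpper, on); }
    bool hasTrailingUpper() const { return test(kTrailingUpper); }
    void setTrailingUpper(bool on = true) { assign(kTrailingUpper, on); }
    // Upper case anywhere in the token: either of the two upper bits is set.
    bool hasUpper() const { return test(kLeadingUpper | kTrailingUpper); }
    bool hasLower() const { return test(kLower); }
    void setLower(bool on = true) { assign(kLower, on); }

    void reset() { bits_ = 0; }
    unsigned long to_ulong() const { return bits_; }

private:
    // Assumed bit positions, chosen only for this example.
    enum : std::uint32_t {
        kLeadingUpper = 1u << 0,
        kTrailingUpper = 1u << 1,
        kLower = 1u << 2
    };
    bool test(std::uint32_t mask) const { return (bits_ & mask) != 0; }
    void assign(std::uint32_t mask, bool on) {
        bits_ = on ? (bits_ | mask) : (bits_ & ~mask);
    }
    std::uint32_t bits_;
};
```

A tokenizer following this pattern would set the relevant bits while scanning a token and keep the resulting object next to the token, so later stages can query the properties without rescanning the text.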
Public Member Functions | |
Constructors | |
TokenProperties (void) | |
Constructs an object, initializing all bit values to zero. | |
TokenProperties (const icu::UnicodeString &ustrInputString) | |
Constructs an object from an icu::UnicodeString, computing the bit values for the string. | |
TokenProperties (const UnicodeStringRef &ulstrInputString) | |
Constructs an object from a UnicodeStringRef, computing the bit values for the string. | |
TokenProperties (const UChar *cpucCurrent, const UChar *cpucEnd) | |
Constructs an object from two pointers delimiting a string, computing the bit values for that string. | |
TokenProperties (WORD32 w32Val) | |
Initializes the bits to the value of w32Val. | |
Properties | |
bool | hasLeadingUpper (void) const |
true if the first char in the token is upper case | |
void | setLeadingUpper (bool bSetOn=true) |
sets the hasLeadingUpper() property to bSetOn | |
bool | hasTrailingUpper (void) const |
true if some char after the first char in the token is upper case | |
void | setTrailingUpper (bool bSetOn=true) |
sets the hasTrailingUpper() property to bSetOn | |
bool | hasUpper (void) const |
true if the token has upper case chars (leading or trailing) | |
bool | hasLower (void) const |
true if the token has lower case chars | |
void | setLower (bool bSetOn=true) |
sets the hasLower() property to bSetOn | |
bool | hasNumeric (void) const |
true if the token has numeric chars | |
void | setNumeric (bool bSetOn=true) |
sets the hasNumeric() property to bSetOn | |
bool | hasSpecial (void) const |
true if the token has special chars (e.g. hyphen, period etc.) | |
void | setSpecial (bool bSetOn=true) |
sets the hasSpecial() property to bSetOn | |
Miscellaneous | |
bool | isPlainWord () const |
true if not hasSpecial() and not hasNumeric() | |
bool | isAllUppercaseWord (void) const |
true if only hasUpper() | |
bool | isAllLowercaseWord (void) const |
true if only hasLower() | |
bool | isInitialUppercaseWord (void) const |
true if only hasLeadingUpper() and hasLower() | |
bool | isPlainNumber () const |
true if hasNumeric() && !(hasLower() || hasUpper()). Note: this might have a decimal point and sign. | |
bool | isPureNumber () const |
unlike isPlainNumber(), this only allows digits (no sign or decimal point) | |
bool | isPureSpecial () const |
true if hasSpecial() && !(hasLower() || hasUpper() || hasNumeric()). Note: this might have a decimal point and sign. | |
void | reset (void) |
Resets all bits in *this. | |
void | initFromString (const UChar *cpucCurrent, const UChar *cpucEnd) |
Resets all bits and reinitializes from the string. | |
std::string | to_string (void) const |
Returns an object of type string, N characters long, where N is the number of bits in the underlying bitset. | |
unsigned long | to_ulong (void) const |
Returns the integral value corresponding to the bits in *this. |
uima::TokenProperties::TokenProperties (void) [inline]
Constructs an object, initializing all bit values to zero.
uima::TokenProperties::TokenProperties (const icu::UnicodeString &ustrInputString)
Constructs an object from an icu::UnicodeString, computing the bit values for the string.
uima::TokenProperties::TokenProperties (const UnicodeStringRef &ulstrInputString)
Constructs an object from a UnicodeStringRef, computing the bit values for the string.
uima::TokenProperties::TokenProperties (const UChar *cpucCurrent, const UChar *cpucEnd)
Constructs an object from two pointers delimiting a string, computing the bit values for that string.
Note: cpucEnd points beyond the end of the string
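The half-open range convention can be sketched as follows. This is an illustration under stated assumptions, not the UIMA code: the function `classifyRange`, the flag values, and the ASCII-only classification (with every non-alphanumeric character counted as special) are hypothetical, and `char16_t` stands in for ICU's `UChar`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed flag values; the real UIMA_TOKEN_PROP_* values are not shown here.
enum : std::uint32_t {
    kLeadingUpper = 1u << 0,
    kTrailingUpper = 1u << 1,
    kLower = 1u << 2,
    kNumeric = 1u << 3,
    kSpecial = 1u << 4
};

// Computes property bits for the characters in [begin, end).
// 'end' points one past the last character and is never dereferenced.
std::uint32_t classifyRange(const char16_t* begin, const char16_t* end) {
    assert(begin <= end);
    std::uint32_t bits = 0;
    for (const char16_t* p = begin; p != end; ++p) {
        const char16_t c = *p;
        if (c >= u'A' && c <= u'Z') {
            // Only the very first character counts as "leading".
            bits |= (p == begin) ? kLeadingUpper : kTrailingUpper;
        } else if (c >= u'a' && c <= u'z') {
            bits |= kLower;
        } else if (c >= u'0' && c <= u'9') {
            bits |= kNumeric;
        } else {
            bits |= kSpecial;  // assumption: everything else is "special"
        }
    }
    return bits;
}
```

For a buffer `tok` holding the seven characters of `iPhone4`, the call would be `classifyRange(tok, tok + 7)`; passing the same pointer twice describes an empty token and leaves all bits at zero.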
uima::TokenProperties::TokenProperties (WORD32 w32Val) [inline]
Initializes the bits to the value of w32Val.
bool uima::TokenProperties::hasLeadingUpper (void) const [inline]
true if the first char in the token is upper case
void uima::TokenProperties::setLeadingUpper (bool bSetOn = true) [inline]
sets the hasLeadingUpper() property to bSetOn
bool uima::TokenProperties::hasTrailingUpper (void) const [inline]
true if some char after the first char in the token is upper case
References UIMA_TOKEN_PROP_TRAILING_UPPER.
void uima::TokenProperties::setTrailingUpper (bool bSetOn = true) [inline]
sets the hasTrailingUpper() property to bSetOn
bool uima::TokenProperties::hasUpper (void) const [inline]
true if the token has upper case chars (leading or trailing)
References UIMA_TOKEN_PROP_LEADING_UPPER and UIMA_TOKEN_PROP_TRAILING_UPPER.
bool uima::TokenProperties::hasLower (void) const [inline]
true if the token has lower case chars
void uima::TokenProperties::setLower (bool bSetOn = true) [inline]
sets the hasLower() property to bSetOn
bool uima::TokenProperties::hasNumeric (void) const [inline]
true if the token has numeric chars
void uima::TokenProperties::setNumeric (bool bSetOn = true) [inline]
sets the hasNumeric() property to bSetOn
bool uima::TokenProperties::hasSpecial (void) const [inline]
true if the token has special chars (e.g. hyphen, period etc.)
void uima::TokenProperties::setSpecial (bool bSetOn = true) [inline]
sets the hasSpecial() property to bSetOn
bool uima::TokenProperties::isPlainWord (void) const [inline]
true if not hasSpecial() and not hasNumeric()
References UIMA_TOKEN_PROP_LEADING_UPPER, UIMA_TOKEN_PROP_LOWER, and UIMA_TOKEN_PROP_TRAILING_UPPER.
bool uima::TokenProperties::isAllUppercaseWord (void) const [inline]
true if only hasUpper()
References UIMA_TOKEN_PROP_LEADING_UPPER and UIMA_TOKEN_PROP_TRAILING_UPPER.
bool uima::TokenProperties::isAllLowercaseWord (void) const [inline]
true if only hasLower()
bool uima::TokenProperties::isInitialUppercaseWord (void) const [inline]
true if only hasLeadingUpper() and hasLower()
References UIMA_TOKEN_PROP_LEADING_UPPER and UIMA_TOKEN_PROP_LOWER.
bool uima::TokenProperties::isPlainNumber () const [inline]
true if hasNumeric() && !(hasLower() || hasUpper()). Note: this might have a decimal point and sign.
References UIMA_TOKEN_PROP_NUMERIC and UIMA_TOKEN_PROP_SPECIAL.
bool uima::TokenProperties::isPureNumber () const [inline]
unlike isPlainNumber(), this only allows digits (no sign or decimal point)
References UIMA_TOKEN_PROP_NUMERIC.
bool uima::TokenProperties::isPureSpecial () const [inline]
true if hasSpecial() && !(hasLower() || hasUpper() || hasNumeric()). Note: this might have a decimal point and sign.
References UIMA_TOKEN_PROP_SPECIAL.
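The derived predicates above all reduce to mask tests on the same bits. The sketch below shows one possible reading of the descriptions as free functions over a raw 32-bit value. The flag values and the helper masks are assumptions, and `isInitialUppercaseWord` is read here as "leading upper plus lower", following the constants named in its reference list; the actual UIMA definitions may differ in edge cases such as empty tokens.

```cpp
#include <cstdint>

// Assumed flag values; the real UIMA_TOKEN_PROP_* values are not shown here.
enum : std::uint32_t {
    kLeadingUpper = 1u << 0,
    kTrailingUpper = 1u << 1,
    kLower = 1u << 2,
    kNumeric = 1u << 3,
    kSpecial = 1u << 4
};

// Hypothetical helper masks for this example.
const std::uint32_t kUpperMask = kLeadingUpper | kTrailingUpper;
const std::uint32_t kLetterMask = kUpperMask | kLower;

// No special and no numeric characters.
inline bool isPlainWord(std::uint32_t b) { return (b & (kNumeric | kSpecial)) == 0; }
// At least one upper bit set and nothing outside the upper bits.
inline bool isAllUppercaseWord(std::uint32_t b) {
    return (b & kUpperMask) != 0 && (b & ~kUpperMask) == 0;
}
// Exactly the lower bit.
inline bool isAllLowercaseWord(std::uint32_t b) { return b == kLower; }
// Assumed reading: exactly leading upper plus lower, as in "Hello".
inline bool isInitialUppercaseWord(std::uint32_t b) { return b == (kLeadingUpper | kLower); }
// Numeric, no letters; special characters such as sign or point are allowed.
inline bool isPlainNumber(std::uint32_t b) {
    return (b & kNumeric) != 0 && (b & kLetterMask) == 0;
}
// Digits only.
inline bool isPureNumber(std::uint32_t b) { return b == kNumeric; }
// Special characters only.
inline bool isPureSpecial(std::uint32_t b) { return b == kSpecial; }
```

Under these assumptions a token such as `-3.5` (numeric and special bits set) is a plain number but not a pure number, while `42` satisfies both.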
void uima::TokenProperties::reset (void) [inline]
Resets all bits in *this.
void uima::TokenProperties::initFromString (const UChar *cpucCurrent, const UChar *cpucEnd)
Resets all bits and reinitializes from the string.
std::string uima::TokenProperties::to_string (void) const
Returns an object of type string, N characters long, where N is the number of bits in the underlying bitset.
Each position in the new string is initialized with a character ('0' for zero and '1' for one), representing the value stored in the corresponding bit position of *this. Character position N - 1 corresponds to bit position 0. Subsequent decreasing character positions correspond to increasing bit positions.
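The ordering described here matches that of `std::bitset::to_string`, so it can be illustrated with the standard library alone. The helper `lowBitsToString` and the width of 8 bits are assumptions chosen for readability; the actual N used by TokenProperties is not stated in this reference.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: renders the low 8 bits of a raw value, most
// significant bit first, so the last character corresponds to bit 0.
std::string lowBitsToString(unsigned long raw) {
    return std::bitset<8>(raw).to_string();
}
```

With this helper, a raw value of 5 (bits 0 and 2 set) yields `00000101`: the character at position N - 1 = 7 is bit 0, and the character at position 5 is bit 2.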
unsigned long uima::TokenProperties::to_ulong (void) const [inline]
Returns the integral value corresponding to the bits in *this.