Archives.org Latin Toolkit Documentation
Classes
class archives_org_latin_toolkit.Metadata(csv_file)
Bases: object
Metadata object for a file
Parameters: csv_file (str) – Path to the CSV file to parse
class archives_org_latin_toolkit.Text(file, metadata=None, lowercase=False)
Bases: object
Text reading object for archive_org
clean
Clean version of the text: whitespace normalized, newlines removed, hyphenated words rejoined, punctuation and numbers removed.
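The cleaning steps listed for clean can be sketched with a few regular expressions. This is only an illustration of the described steps, not the toolkit's actual code; the function name clean_text and the order of the steps are assumptions.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical sketch of the cleaning steps described for `clean`."""
    # Rejoin words hyphenated across a line break: "vi-\nrumque" -> "virumque"
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Replace the remaining newlines with spaces
    text = text.replace("\n", " ")
    # Replace punctuation, digits and underscores with spaces
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Arma vi-\nrumque cano,\nTroiae 123 qui."))
```

Dehyphenation runs first so that the hyphen is still present when the line break is examined; removing punctuation earlier would leave the two halves of the word separated.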
composed
find_embedding(*strings, window=50, ignore_center=False, memory_efficient=True)
Check whether the given strings are in the file
Parameters:
- strings – Strings as multiple arguments
- window – Number of lines to retrieve
- ignore_center – Remove the word found from the embedding
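To make the effect of ignore_center concrete, here is a minimal sketch that works on a plain list of words. It assumes window counts words on each side of the match, as the __window__ helper describes; the function find_windows is hypothetical and is not part of the toolkit.

```python
def find_windows(words, *strings, window=2, ignore_center=False):
    """Hypothetical sketch: yield the context around each matching word."""
    targets = set(strings)
    for i, word in enumerate(words):
        if word not in targets:
            continue
        # Up to `window` words before and after the match
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # With ignore_center, the matched word itself is left out
        center = [] if ignore_center else [word]
        yield left + center + right


words = "arma virumque cano troiae qui primus ab oris".split()
print(list(find_windows(words, "troiae")))
print(list(find_windows(words, "troiae", ignore_center=True)))
```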
has_strings(*strings)
Check whether the given strings are in the file
Parameters: strings – Strings as multiple arguments
Returns: True if found
Return type: bool
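The description does not say whether one matching string is enough or whether all of them must be present. The sketch below assumes that any single match suffices and exposes the alternative through a hypothetical require_all flag; neither the function name nor the flag comes from the toolkit.

```python
def contains_strings(text, *strings, require_all=False):
    """Hypothetical sketch of a substring check over several strings."""
    # Assumption: by default a single match is enough
    check = all if require_all else any
    return check(s in text for s in strings)


text = "arma virumque cano"
print(contains_strings(text, "cano", "troiae"))
print(contains_strings(text, "cano", "troiae", require_all=True))
```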
name
random_embedding(grab, window=50, avoid=None, memory_efficient=True, _taken=None, _generator=True)
Search for random sentences in the text, optionally avoiding certain words
Returns: Generator with random texts
Note: Right now, newly found windows are not added to _taken, which is problematic
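The concern in the note can be illustrated with a small sketch that does record every window it returns, so later draws never overlap earlier ones. This is not the toolkit's implementation: the function random_windows, its seed parameter and the word-based window are all assumptions made for the example.

```python
import random


def random_windows(words, grab, window=2, avoid=(), seed=None):
    """Hypothetical sketch: yield `grab` random, non-overlapping windows."""
    rng = random.Random(seed)
    avoid = set(avoid)
    taken = set()  # indices already covered by a returned window
    candidates = list(range(len(words)))
    rng.shuffle(candidates)
    for i in candidates:
        if grab == 0:
            return
        start, end = max(0, i - window), min(len(words), i + window + 1)
        span = range(start, end)
        if taken.intersection(span):
            continue  # overlaps a window that was already returned
        chunk = words[start:end]
        if avoid.intersection(chunk):
            continue  # contains a word the caller wants to avoid
        taken.update(span)  # record the new window before yielding it
        grab -= 1
        yield chunk


words = [f"w{n}" for n in range(40)]
for chunk in random_windows(words, 3, avoid={"w10"}, seed=0):
    print(chunk)
```

Updating taken before each yield is the step the note describes as missing; without it, two returned windows could share words.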
raw
class archives_org_latin_toolkit.Repo(directory, metadata=None, lowercase=False)
Bases: object
Repo reading object for archive_org
find(*strings, multiprocess=None, memory_efficient=True)
Find files that contain the given strings
Returns: Files matching the strings
Return type: generator
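Because find returns a generator, matches can be consumed one file at a time. The sketch below shows that pattern over a directory of text files. It is an assumption-laden illustration: the function find_files, the .txt extension and the "any string matches" rule are not taken from the toolkit.

```python
from pathlib import Path


def find_files(directory, *strings):
    """Hypothetical sketch: lazily yield files containing any target string."""
    # Assumption: texts are stored as .txt files directly in `directory`
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="ignore")
        if any(s in content for s in strings):
            yield path
        # `content` is rebound on the next iteration, so only one file's
        # text is held in memory at a time
```

Iterating with for path in find_files("texts", "cano") reads each file only when the loop reaches it, which is in the spirit of what the memory_efficient parameter name suggests.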
get(identifier)
Get the Text object given its identifier
Parameters: identifier (str) – Filename or identifier
Returns: Text object
Return type: Text
metadata
Helpers
archives_org_latin_toolkit.period(x)
Parse a period in metadata. If there are multiple dates, return the mean
Parameters: x (str) – Value to parse
Returns: Parsed numeral
Return type: int
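The averaging behaviour can be sketched as follows. The format of the metadata value is not documented here, so this example assumes that multiple dates are separated by semicolons and that BCE dates are written as negative integers; the function parse_period and its sep parameter are hypothetical.

```python
from statistics import mean


def parse_period(value: str, sep: str = ";") -> int:
    """Hypothetical sketch: return the mean of the dates found in `value`."""
    # Assumption: dates are integers separated by `sep`
    dates = [int(part) for part in value.split(sep) if part.strip()]
    # int() truncates toward zero when the mean is not a whole number
    return int(mean(dates))


print(parse_period("100;200"))
print(parse_period("-50"))
```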
archives_org_latin_toolkit.bce(x)
Format a BCE string
Parameters: x (str) – Value to parse
Returns: Parsed numeral
Return type: str
archives_org_latin_toolkit.__window__(array, window, i)
Compute embedding using i
Parameters:
- strings –
- window – Number of words to take on the left, then on the right [ len(result) = (2*window)+1 ]
- i – Index of the word
- memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns: List of words
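The length formula above follows directly from slicing window words on each side of index i. A minimal sketch, assuming the window is simply clipped at the edges of the array (how the toolkit handles the boundaries is not stated, and window_around is a hypothetical name):

```python
def window_around(array, window, i):
    """Hypothetical sketch: `window` items left of i, item i, `window` items right."""
    # Clip the start at 0 so a negative index does not wrap around
    start = max(0, i - window)
    # Slicing past the end of a list is safe in Python
    return array[start:i + window + 1]


words = list(range(10))
print(window_around(words, 2, 5))  # full window: (2 * 2) + 1 items
print(window_around(words, 2, 0))  # clipped at the start of the array
```

Away from the edges the result has (2*window)+1 items; near either end it is shorter under this clipping assumption.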