Archives.org Latin Toolkit Documentation

Classes

class archives_org_latin_toolkit.Metadata(csv_file)

Bases: object

Metadata object for a file

Parameters:	csv_file (str) – Path to the CSV file to parse

getDate(identifier)

Get the date of a text given its identifier

Parameters:	identifier (str) – Filename or identifier
Returns:	Date of composition
Return type:	int

class archives_org_latin_toolkit.Text(file, metadata=None, lowercase=False)

Bases: object

Text reading object for archive_org

Parameters:	file (str) – File path
	metadata (Metadata) – Metadata registry
	lowercase (bool) – Clean Text will be in lowercase
Variables:	name – Name of the file
	composed – Date of composition

clean

Clean version of the text: whitespace normalized, line breaks removed, words dehyphenated, punctuation and numbers removed.
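The cleaning steps listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the toolkit's implementation: clean_text is a hypothetical name, and the exact rules (for example, what counts as punctuation) are guesses.

```python
import re


def clean_text(raw, lowercase=False):
    """Hypothetical sketch of the cleaning steps; not the toolkit's own code."""
    # Rejoin words hyphenated across a line break: "om-\nnis" -> "omnis"
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Replace the remaining line breaks with spaces
    text = text.replace("\n", " ")
    # Replace punctuation, digits and underscores with spaces
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text
```

Dehyphenation runs first so that a word split across two lines is rejoined before the line breaks are turned into spaces.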

cleanUp()

Clean textual information and free RAM

composed

find_embedding(*strings, window=50, ignore_center=False, memory_efficient=True)

Find the given strings in the file and retrieve the embedding (surrounding window) of each match

Parameters:	strings – Strings as multiple arguments
	window – Number of lines to retrieve
	ignore_center – Remove the word found from the embedding
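To make the window and ignore_center parameters concrete, here is a minimal sketch that works on a list of tokens. The function name find_windows is hypothetical, and treating the window as a number of words on each side of the match is an assumption; the toolkit's method may behave differently.

```python
def find_windows(words, *strings, window=2, ignore_center=False):
    """Hypothetical sketch: yield the context around each matching token."""
    targets = set(strings)
    for i, word in enumerate(words):
        if word not in targets:
            continue
        # Up to `window` tokens on each side; clamped at the start of the list
        left = words[max(0, i - window):i]
        right = words[i + 1:i + window + 1]
        # With ignore_center=True the matched token itself is left out
        center = [] if ignore_center else [word]
        yield left + center + right
```

For example, list(find_windows(words, "cano", window=1, ignore_center=True)) keeps the neighbours of "cano" and drops "cano" itself.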

has_strings(*strings)

Check whether the given strings are in the file

Parameters:	strings – Strings as multiple arguments
Returns:	If found, return True
Return type:	bool

name

random_embedding(grab, window=50, avoid=None, memory_efficient=True, _taken=None, _generator=True)

Search for random sentences in the text. Can avoid certain words

Parameters:	grab (int) – Number of random sequences to retrieve
	window (int) – Number of lines to retrieve
	avoid – List of lemmas NOT to be included in the random sample
	_taken – Used internally to check that the same element is not sampled again
	_generator – If set to True, returns the window and its index in the text
Returns:	Generator with random texts

Note

Right now, newly found windows are not added to _taken, which is problematic
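The sampling idea, including the bookkeeping the note says is missing, can be sketched as follows. Everything here is an assumption made for illustration: random_windows is a hypothetical name, the text is taken to be a list of tokens, and the seed parameter exists only to make the sketch reproducible.

```python
import random


def random_windows(words, grab, window=2, avoid=None, seed=None):
    """Hypothetical sketch: yield up to `grab` random (index, window) pairs."""
    rng = random.Random(seed)
    avoid = set(avoid or [])
    taken = set()  # indexes already sampled, so none is returned twice
    candidates = list(range(len(words)))
    rng.shuffle(candidates)
    for i in candidates:
        if len(taken) >= grab:
            return
        span = words[max(0, i - window):i + window + 1]
        # Skip any window that contains a word to avoid
        if avoid.intersection(span):
            continue
        taken.add(i)
        yield i, span
```

Shuffling the candidate indexes once and walking through them guarantees that no index is drawn twice, which is one simple way to get the effect that _taken is meant to provide.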

raw

class archives_org_latin_toolkit.Repo(directory, metadata=None, lowercase=False)

Bases: object

Repo reading object for archive_org

Parameters:	directory (str) – Path to the directory
	metadata (Metadata) – Metadata registry
	lowercase (bool) – Clean Text will be in lowercase

find(*strings, multiprocess=None, memory_efficient=True)

Find files that contain the given strings

Parameters:	strings – Strings as multiple arguments
	multiprocess (int) – Number of processes to spawn
	memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns:	Files that match the strings
Return type:	generator
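A generator-based search over a directory might look like the sketch below. It is a single-process illustration only: find_files is a hypothetical name, matching a file when any of the strings occurs is an assumption, and the files are assumed to be UTF-8 text.

```python
from pathlib import Path


def find_files(directory, *strings):
    """Hypothetical sketch: lazily yield names of files containing any string."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # The content is only held for one iteration, which keeps memory use low
        content = path.read_text(encoding="utf-8")
        if any(s in content for s in strings):
            yield path.name
```

Because the function is a generator, a caller can stop after the first match without reading the rest of the directory.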

get(identifier)

Get the Text object given its identifier

Parameters:	identifier (str) – Filename or identifier
Returns:	Text object
Return type:	Text

metadata

Helpers

archives_org_latin_toolkit.period(x)

Parse a period in metadata. If there are multiple dates, returns the mean

Parameters:	x (str) – Value to parse
Returns:	Parsed numeral
Return type:	int
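The "mean of multiple dates" behaviour can be illustrated with a short sketch. The input format is not described here, so the separators (";" or ",") are purely an assumption, and parse_period is a hypothetical name rather than the helper itself.

```python
import re


def parse_period(value):
    """Hypothetical sketch: parse one or more years and return their mean."""
    # Assumption: years are whole numbers separated by ";" or ","
    years = [int(part) for part in re.split(r"[;,]", value) if part.strip()]
    # Round the mean so that the result is an int
    return round(sum(years) / len(years))
```

With this convention a single year is returned unchanged, and a negative number stands for a BCE year.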

archives_org_latin_toolkit.bce(x)

Format a BCE string

Parameters:	x (str) – Value to parse
Returns:	Parsed numeral
Return type:	str

archives_org_latin_toolkit.__window__(array, window, i)

Compute the embedding around index i

Parameters:	strings –
	window – Number of words to take on the left, then on the right [ len(result) = (2 * window) + 1 ]
	i – Index of the word
	memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns:	List of words
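The length relation stated above, len(result) = (2 * window) + 1, can be checked with a small sketch. window_around is a hypothetical stand-in for the helper, and clamping at the edges of the list (which yields shorter results there) is an assumption.

```python
def window_around(array, window, i):
    """Hypothetical sketch: items within `window` positions of index i."""
    # Clamp the start so a negative index never wraps around the list
    start = max(0, i - window)
    return array[start:i + window + 1]
```

Away from the edges the result holds `window` items on each side plus the item at index i; near the start or end of the list it is shorter.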

archives_org_latin_toolkit.__find_multiprocess__(args)

Find files that contain the given strings

Parameters:	args – Tuple whose first element is the strings as a list and whose second element is a list of file objects
Returns:	Files that match the strings
Return type:	list