Archives.org Latin Toolkit Documentation

Classes

class archives_org_latin_toolkit.Metadata(csv_file)

Metadata object for a file

Parameters:	csv_file (str) – Path to the CSV file to parse

getDate(identifier)

Get the date of a text given its identifier

Parameters:	identifier (str) – Filename or identifier
Returns:	Date of composition
Return type:	int

class archives_org_latin_toolkit.Text(file, metadata=None, lowercase=False)

Text reading object for archive_org

Parameters:
	file (str) – File path
	metadata (Metadata) – Metadata registry
	lowercase (bool) – Clean Text will be in lowercase
Variables:
	name – Name of the file
	composed – Date of composition

clean: Clean version of the text: whitespace normalized, newlines removed, words dehyphenated, punctuation and numbers removed.
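
The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: the function name `clean_text`, the order of the steps, and the use of ASCII punctuation only are all assumptions.

```python
import re
import string


def clean_text(raw, lowercase=False):
    """Hypothetical stand-in for the cleaning described above."""
    # Dehyphenate: rejoin words split across a line break ("vi-\nrumque" -> "virumque")
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Replace digits with a space
    text = re.sub(r"\d+", " ", text)
    # Replace ASCII punctuation with spaces (assumption: other punctuation is left alone)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Normalize whitespace; this also removes any remaining newlines
    text = " ".join(text.split())
    return text.lower() if lowercase else text


cleaned = clean_text("Arma vi-\nrumque cano, 12 Troiae\nqui primus.")
```

Dehyphenation runs first so that a word broken across a line is rejoined before the remaining hyphens are treated as punctuation.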

find_embedding(*strings, window=50, ignore_center=False, memory_efficient=True)

Find the given strings in the file and retrieve their embeddings (the context surrounding each match)

Parameters:
	strings – Strings as multiple arguments
	window – Number of lines to retrieve
	ignore_center – Remove the word found from the embedding
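
To make the role of window and ignore_center concrete, here is a minimal sketch of an embedding lookup over a list of words. It is not the toolkit's implementation: the name `find_embeddings` is hypothetical, and it assumes the window is counted in words on each side of an exact match.

```python
def find_embeddings(words, target, window=2, ignore_center=False):
    """Yield the context around each occurrence of `target` in `words`."""
    for i, word in enumerate(words):
        if word != target:
            continue
        # Up to `window` words on each side, clipped at the edges of the text
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # With ignore_center=True the matched word itself is left out
        center = [] if ignore_center else [word]
        yield left + center + right


words = "arma virumque cano troiae qui primus ab oris".split()
with_center = list(find_embeddings(words, "troiae", window=2))
without_center = list(find_embeddings(words, "troiae", window=2, ignore_center=True))
```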

has_strings(*strings)

Check whether the given strings are in the file

Parameters:	strings – Strings as multiple arguments
Returns:	If found, return True
Return type:	bool

class archives_org_latin_toolkit.Repo(directory, metadata=None, lowercase=False)

Repo reading object for archive_org

Parameters:
	directory (str) – Directory path
	metadata (Metadata) – Metadata registry
	lowercase (bool) – Clean Text will be in lowercase

find(*strings, multiprocess=None, memory_efficient=True)

Find files that contain the given strings

Parameters:
	strings – Strings as multiple arguments
	multiprocess (int) – Number of processes to spawn
	memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns:	Files that match the strings
Return type:	generator
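
The idea of returning a generator while dropping file content can be illustrated with a small single-process sketch. It does not use the toolkit: `find_files` is a hypothetical name, and requiring every string to be present (rather than any one of them) is an assumption.

```python
import os
import tempfile


def find_files(directory, *strings):
    """Lazily yield names of files in `directory` whose text contains every string."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            content = handle.read()
        if all(s in content for s in strings):
            # Only the name is yielded; the content is not kept around
            yield name


# Usage on a throwaway directory holding two tiny files
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.txt", "arma virumque cano"), ("b.txt", "gallia est omnis divisa")]
    for name, body in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    matches = list(find_files(tmp, "arma", "cano"))
```

Because results are produced one at a time, a caller can stop early without reading the rest of the directory.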

get(identifier)

Get the Text object given its identifier

Parameters:	identifier (str) – Filename or identifier
Returns:	Text object
Return type:	Text

archives_org_latin_toolkit.period(x)

Parse a period in metadata. If there are multiple dates, returns their mean.
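
As a rough illustration of averaging multiple dates, the sketch below extracts every run of digits from a string and returns the mean. The actual metadata format is not documented here, so the input shape (for example "1100-1200"), the name `parse_period`, and truncation to an integer are all assumptions; BCE dates are not handled.

```python
import re
from statistics import mean


def parse_period(value):
    """Return the mean of all years found in `value`, truncated to an int."""
    # Assumption: years appear as plain runs of digits in the metadata string
    years = [int(y) for y in re.findall(r"\d+", value)]
    if not years:
        raise ValueError(f"no date found in {value!r}")
    return int(mean(years))


single = parse_period("1450")
averaged = parse_period("1100-1200")
```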

archives_org_latin_toolkit.bce(x)

Format a BCE string

archives_org_latin_toolkit.__window__(array, window, i)

Compute embedding using i

Parameters:
	strings –
	window – Number of words to take on the left, then on the right [len(result) = (2 * window) + 1]
	i – Index of the word
	memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns:	List of words
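
The length relation above, len(result) = (2 * window) + 1, can be checked with a short slicing sketch. This is not the toolkit's code: `window_slice` is a hypothetical name, and clipping at the start of the array is an assumption about how edges are handled.

```python
def window_slice(array, window, i):
    """Return up to `window` items left of index i, the item at i, and up to `window` items to its right."""
    # Clamp the start so a negative index does not wrap around to the end of the list
    start = max(0, i - window)
    return array[start:i + window + 1]


items = list(range(10))
middle = window_slice(items, 2, 5)
edge = window_slice(items, 2, 0)
```

Away from the edges the result has exactly (2 * window) + 1 items; near either end it is shorter because there are fewer neighbours to take.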

archives_org_latin_toolkit.__find_multiprocess__(args)

Find files that contain the given strings

Parameters:	args – Tuple whose first element is a list of strings and whose second element is a list of file objects
Returns:	Files that match the strings
Return type:	list