Archives.org Latin Toolkit Documentation

Classes

class archives_org_latin_toolkit.Metadata(csv_file)

Bases: object

Metadata object for a file

Parameters:csv_file (str) – Path to the CSV file to parse
getDate(identifier)

Get the date of a text given its identifier

Parameters:identifier (str) – Filename or identifier
Returns:Date of composition
Return type:int
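
As a rough illustration of the lookup described above, the following sketch builds a small registry from CSV data and returns an integer date for an identifier. The class name MetadataSketch, the column names "identifier" and "date", and the in-memory sample data are all assumptions made for this example; the layout of the CSV file the toolkit actually expects is not described here.

```python
import csv
import io


class MetadataSketch:
    """Hypothetical stand-in for a CSV-backed metadata registry."""

    def __init__(self, csv_stream):
        # Assumed columns: "identifier" and "date" (not taken from the toolkit)
        reader = csv.DictReader(csv_stream)
        self._dates = {row["identifier"]: int(row["date"]) for row in reader}

    def get_date(self, identifier):
        """Return the date of composition as an int."""
        return self._dates[identifier]


# Illustrative data only
sample = io.StringIO("identifier,date\ntext_a.txt,-50\ntext_b.txt,420\n")
registry = MetadataSketch(sample)
print(registry.get_date("text_b.txt"))
```

Keeping the dates in a dictionary keyed by identifier makes each lookup a constant-time operation once the file has been parsed.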
class archives_org_latin_toolkit.Text(file, metadata=None, lowercase=False)

Bases: object

Text reading object for archive_org

Parameters:
  • file (str) – File path
  • metadata (Metadata) – Metadata registry
  • lowercase (bool) – Clean Text will be in lowercase
Variables:
  • name – Name of the file
  • composed – Date of composition
clean

Clean version of the text: spaces normalized, new lines removed, words dehyphenated, punctuation and numbers removed.
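
The cleaning steps listed above could be sketched roughly as follows. This is not the toolkit's implementation: the function name clean_sketch, the regular expressions, and the order of the steps are assumptions chosen to illustrate the idea.

```python
import re


def clean_sketch(raw, lowercase=False):
    """Hypothetical cleaning pipeline; the order of steps is an assumption."""
    # Dehyphenize: join words split by a hyphen at a line break
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Remove the remaining new lines
    text = text.replace("\n", " ")
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", " ", text)
    # Normalize runs of whitespace to a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text


print(clean_sketch("Gallia est om-\nnis divisa, in partes 3 tres."))
```

Dehyphenation runs first in this sketch so that a word broken across two lines is rejoined before new lines are turned into spaces.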

cleanUp()

Clean textual information and free RAM

composed
find_embedding(*strings, window=50, ignore_center=False, memory_efficient=True)

Find the given strings in the file and retrieve the text surrounding each match (its embedding)

Parameters:
  • strings – Strings as multiple arguments
  • window – Number of lines to retrieve
  • ignore_center – Remove the word found from the embedding
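
To make the role of window and ignore_center more concrete, here is a minimal sketch of retrieving the context around each match. It assumes whitespace tokenization, exact single-word matches, and a window counted in words on each side; the function name find_embedding_sketch is hypothetical and the toolkit may behave differently on all of these points.

```python
def find_embedding_sketch(text, *strings, window=50, ignore_center=False):
    """Yield the words surrounding each occurrence of the given strings.

    Assumptions: the text is split on whitespace, matches are exact
    single-word matches, and windows are clipped at the text boundaries.
    """
    words = text.split()
    targets = set(strings)
    for i, word in enumerate(words):
        if word in targets:
            left = words[max(i - window, 0):i]
            right = words[i + 1:i + 1 + window]
            if ignore_center:
                # Leave the matched word out of its own context
                yield left + right
            else:
                yield left + [word] + right


sentence = "arma virumque cano troiae qui primus ab oris"
for context in find_embedding_sketch(sentence, "cano", window=2):
    print(context)
```

Dropping the matched word, as ignore_center does here, can be useful when the context is later used to characterize that word.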
has_strings(*strings)

Check if given string is in the file

Parameters:strings – Strings as multiple arguments
Returns:True if found
Return type:bool
name
raw
class archives_org_latin_toolkit.Repo(directory, metadata=None, lowercase=False)

Bases: object

Repo reading object for archive_org

Parameters:
  • directory (str) – Directory path
  • metadata (Metadata) – Metadata registry
  • lowercase (bool) – Clean Text will be in lowercase
find(*strings, multiprocess=None, memory_efficient=True)

Find files that contain the given strings

Parameters:
  • strings – Strings as multiple arguments
  • multiprocess (int) – Number of processes to spawn
  • memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
Returns:Files that match the strings
Return type:generator
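
The generator return type and the memory_efficient option suggest a lazy scan in which file contents are not all held in memory at once. The sketch below illustrates that pattern with plain files on disk. It is not the toolkit's code: find_sketch is a hypothetical name, it yields paths rather than Text objects, and it assumes that a file matches when any one of the strings occurs in it.

```python
import os
import tempfile


def find_sketch(directory, *strings):
    """Lazily yield paths of files that contain at least one given string.

    Assumption: "contains" means any of the strings matches; the toolkit
    may instead require all of them.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            content = handle.read()
        # Only the current file's content is referenced here, so earlier
        # contents can be reclaimed as the scan moves on
        if any(s in content for s in strings):
            yield path


# Illustrative usage with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "arma virumque cano"), ("b.txt", "gallia est omnis")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    matches = [os.path.basename(p) for p in find_sketch(tmp, "cano")]
    print(matches)
```

Because the function is a generator, callers can stop iterating after the first hit without reading the rest of the directory.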

get(identifier)

Get the Text object given its identifier

Parameters:identifier (str) – Filename or identifier
Returns:Text object
Return type:Text

Helpers

archives_org_latin_toolkit.period(x)

Parse a period in metadata. If there are multiple dates, return the mean

Parameters:x (str) – Value to parse
Returns:Parsed numeral
Return type:int
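
A possible reading of "return the mean" is sketched below. The input format is an assumption: the example treats a period as unsigned years separated by non-digit characters (for instance "100-200"), which may not match the metadata the toolkit actually parses. The name period_sketch is hypothetical.

```python
import re


def period_sketch(value):
    """Parse a period string and return the mean of the years found.

    Assumption: years are unsigned integers separated by any non-digit
    characters, e.g. "100-200"; the toolkit's real format may differ.
    """
    years = [int(match) for match in re.findall(r"\d+", value)]
    if not years:
        raise ValueError("no year found in %r" % value)
    # Integer mean (rounded down), since the documented return type is int
    return sum(years) // len(years)


print(period_sketch("100-200"))
print(period_sketch("350"))
```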
archives_org_latin_toolkit.bce(x)

Format a BCE string

Parameters:x (str) – Value to parse
Returns:Parsed numeral
Return type:str
archives_org_latin_toolkit.__window__(array, window, i)

Compute the embedding (the window of words) around index i

Parameters:
  • array – Sequence of words to take the window from
  • window – Number of words to take left, then right [ len(result) = (2*window)+1 ]
  • i – Index of the word
  • memory_efficient (bool) – Drop the content of files to avoid filling the ram with unused content
Returns:List of words
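
The length formula len(result) = (2*window)+1 can be checked with a small slicing sketch. How the toolkit behaves near the start or end of the array is not stated; the version below assumes the window is simply clipped there, so the result is shorter at the boundaries. The name window_sketch is hypothetical.

```python
def window_sketch(array, window, i):
    """Return up to `window` items on each side of index i, plus array[i].

    Assumption: the slice is clipped at both ends of the array, so the
    result is shorter than (2*window)+1 near the boundaries.
    """
    start = max(i - window, 0)
    end = min(i + window + 1, len(array))
    return array[start:end]


words = ["a", "b", "c", "d", "e", "f", "g"]
print(window_sketch(words, 2, 3))
print(window_sketch(words, 2, 0))
```

The explicit max(..., 0) matters: a negative start index in a Python slice would count from the end of the list and return the wrong words.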

archives_org_latin_toolkit.__find_multiprocess__(args)

Find files that contain the given strings

Parameters:args – Tuple whose first element is the list of strings and whose second element is a list of file objects
Returns:Files that match the strings
Return type:list
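
Taking one tuple as the only argument fits the way multiprocessing.Pool.map calls a worker, which passes exactly one item per call. The sketch below shows that calling convention with the built-in map standing in for a process pool. The worker name find_worker_sketch is hypothetical, each "file" is a plain (name, content) pair rather than a toolkit object, and a file is assumed to match when any of the strings occurs in it.

```python
def find_worker_sketch(args):
    """Return the names of the files that contain any of the strings.

    `args` is a single tuple (strings, files). Here each "file" is a
    hypothetical (name, content) pair rather than a toolkit Text object.
    """
    strings, files = args
    return [name for name, content in files if any(s in content for s in strings)]


files = [("a.txt", "arma virumque cano"), ("b.txt", "gallia est omnis")]
# Split the files into chunks, one task per chunk
tasks = [(["cano", "omnis"], files[:1]), (["cano", "omnis"], files[1:])]
# The built-in map stands in for a process pool's map here
results = list(map(find_worker_sketch, tasks))
print(results)
```

Returning a list rather than a generator keeps the result easy to send back from a worker process, since generators cannot be pickled.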