This extension lets you spot-check extracted data against the source PDFs, which is useful when developing an OCR or document analysis pipeline with Python and pandas. It provides a custom editor that shows a record on the left half and the source PDF on the right half.
Features
Take random samples from a Python script and show them in a custom editor:
Run the command Open with Spot Check while your Python script is open.
Requirements
Python >= 3.8
Extension Settings
This extension contributes the following settings:
spot-check.pythonInterpreterPath: path to the Python interpreter this extension uses to run your Python script. Defaults to python.
spot-check.pythonPaths: array of paths to add to PYTHONPATH during script execution. Any of these paths may use the variable workspaceFolder.
spot-check.cwd: current working directory during script execution. The variable workspaceFolder may be used here as well.
Python library reference
print_samples
Print random samples from a dataframe. This function does nothing unless the first shell argument to the script is "printSamples" (which is how the extension invokes the script), so you don't need to comment out and later restore the call when running the script for a different purpose.
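The gating on the first shell argument can be pictured with a minimal sketch like the one below. It is not the library's actual implementation; the helper name invoked_by_extension is hypothetical and only illustrates the check described above.

```python
import sys


def invoked_by_extension(argv=None):
    """Return True when the first shell argument is "printSamples".

    Hypothetical helper: it mirrors the check described above and is
    not part of the library's documented API.
    """
    args = sys.argv if argv is None else argv
    # args[0] is the script path, so the first shell argument is args[1].
    return len(args) > 1 and args[1] == "printSamples"


# Run by the extension: the sampling step would be active.
print(invoked_by_extension(["pipeline.py", "printSamples"]))
# Run by hand without arguments: the sampling step would be skipped.
print(invoked_by_extension(["pipeline.py"]))
```

Under this assumption, running the script directly from a terminal leaves the rest of the pipeline unaffected by the sampling call.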
Arguments:
data (pandas.DataFrame): the data to sample
resolve_source_path (func(pandas.Series) -> str): given a row from the data, this function must return the absolute path to the source PDF file.
resolve_pageno (func(pandas.Series) -> int): optional. Given a row from the data, this function must return the page number of the PDF.
number_of_samples (int): number of samples to produce with each invocation. Defaults to 100.
sort (bool): sort the samples according to the original row order. Defaults to True.
exit_on_success (bool): exit the script after this function prints samples successfully. This prevents any code after the call from running, which reduces side effects. Defaults to True.
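A call using the arguments above might look like the following sketch. Several names are assumptions made for illustration only: the import path spot_check, the file extracted.csv, the column names file_name and page, and the PDF directory. Only print_samples and its keyword arguments come from the reference above.

```python
from pathlib import Path

# Hypothetical directory that holds the source PDFs.
PDF_DIR = Path.home() / "invoices"


def resolve_source_path(row):
    """Return the absolute path of the PDF a row was extracted from."""
    # "file_name" is an assumed column name.
    return str(PDF_DIR / row["file_name"])


def resolve_pageno(row):
    """Return the page of the PDF that the row came from."""
    # "page" is an assumed column name.
    return int(row["page"])


def main():
    import pandas as pd
    # ASSUMPTION: the real import path of the library may differ.
    from spot_check import print_samples

    data = pd.read_csv("extracted.csv")  # hypothetical pipeline output
    print_samples(
        data=data,
        resolve_source_path=resolve_source_path,
        resolve_pageno=resolve_pageno,
        number_of_samples=20,
    )
```

Call main() at the end of your script. With exit_on_success left at its default of True, anything placed after that call would not run when the extension invokes the script.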