Background
I'm using fuzz.partial_ratio_alignment() to align each paragraph of a novel to a timestamped transcription from the novel's corresponding audiobook. This will allow me to know how far into the book I've "read" whilst listening to its audiobook.
Approach
For each paragraph in the novel, I take a substring of ~128 characters to query against the full audio transcription (around 200,000 characters).
Here's the general idea, in code:
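from rapidfuzz import fuzz

for paragraph in charlesDickensNovel:
    # Align the first ~128 characters of each paragraph against the whole transcription
    alignment = fuzz.partial_ratio_alignment(
        paragraph[0:128],
        charlesDickensFullAudioTranscription
    )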
Problem
Most of the lines are successfully aligned within milliseconds (or nanoseconds), but perhaps 1% of these queries hang for several seconds.
I suspect that a shorter query would align faster, but I need to query at least about 128 characters to reduce false positives. There's also no way around the fact that I have to match against the whole audio transcription.
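To pin down which queries fall into the slow ~1%, each call can be timed individually. The following is only a minimal sketch: `find_slow_queries` is a hypothetical helper name, and `func` stands in for whatever one-argument callable wraps the alignment call.

```python
import time


def find_slow_queries(func, queries, threshold=0.1):
    """Return (index, seconds) for each query whose call took longer than `threshold` seconds."""
    slow = []
    for index, query in enumerate(queries):
        start = time.perf_counter()
        func(query)  # result is discarded; only the duration matters here
        elapsed = time.perf_counter() - start
        if elapsed > threshold:
            slow.append((index, elapsed))
    return slow
```

For the loop above, `func` could be something like `lambda p: fuzz.partial_ratio_alignment(p[0:128], charlesDickensFullAudioTranscription)`, with `queries` being the list of paragraphs.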
Requested solution
I actually don't need to match up every line; just a large proportion would be good enough for me to build a progress map. So I'd be happy to cancel any match that takes longer than 100 milliseconds to execute and just move on to the next one. Would it be possible to add an optional time limit to each of the RapidFuzz APIs? Something like this timeout parameter:
fuzz.partial_ratio_alignment(
    s1=line[0:128],
    s2=charlesDickensFullAudioTranscription,
    timeout=0.1
)
If the timeout is reached, it could simply return None without explanation, or (if it feels more Pythonic) raise an exception.
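In the meantime, the "return None on timeout" behaviour can be approximated from outside the library by running each call in a child process and terminating it when the time limit passes. This is only a rough sketch under stated assumptions: `call_with_timeout` and `_worker` are hypothetical names (not part of RapidFuzz), it relies on the POSIX-only `fork` start method, and the result has to be picklable to travel back through the pipe.

```python
import multiprocessing as mp


def _worker(conn, func, args):
    # Runs in the child process: compute the result and send it to the parent.
    try:
        conn.send(func(*args))
    finally:
        conn.close()


def call_with_timeout(func, args=(), timeout=0.1):
    """Return func(*args), or None if no result arrives within `timeout` seconds."""
    # Assumption: POSIX "fork", so the child inherits func and args without pickling.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe(duplex=False)  # (receive end, send end)
    proc = ctx.Process(target=_worker, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # parent keeps only the receiving end
    try:
        if parent_conn.poll(timeout):
            return parent_conn.recv()
        return None  # timed out
    except EOFError:
        return None  # child exited without sending a result (e.g. it raised)
    finally:
        if proc.is_alive():
            proc.terminate()  # cancel the still-running call
        proc.join()
        parent_conn.close()
```

With this, the query above would become `call_with_timeout(fuzz.partial_ratio_alignment, (line[0:128], charlesDickensFullAudioTranscription), timeout=0.1)`. If the returned alignment object turns out not to be picklable, a small wrapper function that returns a plain tuple would be needed. Forking a process per query also has its own cost, which is part of why a built-in timeout would be preferable.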