Fix UnicodeEncodeError on Windows when video titles contain emoji #34
Open
drlee91 wants to merge 1 commit into
… Windows

On Windows, Python's stdout/stderr default to the locale code page (typically cp1252). When a video title contains emoji or any other character outside that code page, printing the report raises UnicodeEncodeError and watch.py exits 1 after all the work (download, frames, transcript) has already succeeded, so the frame paths are never printed.

Reconfigure both streams to UTF-8 with errors='replace' at startup. No-op on macOS/Linux where the streams are UTF-8 already; guarded for exotic streams that don't support reconfigure().

Repro (any video with emoji in the title, e.g. on a German locale):

python scripts/watch.py https://www.youtube.com/watch?v=pl3n9o_ZR9M
-> UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f92f'

Complements bradautomates#4, which fixed the same class of issue for config files.
Independent confirmation on a separate machine (Windows 11, German locale). One addition worth noting: it covers a second trigger beyond emoji titles. After …
Problem
On Windows, Python's stdout/stderr default to the locale code page (typically cp1252). When a video title contains emoji or any character outside that code page, printing the report raises UnicodeEncodeError and watch.py exits 1, after the download, frame extraction, and transcription have already succeeded. The frame paths are never printed, so the whole run is lost.

Emoji in titles are common on YouTube, so on a default Windows setup this breaks a large share of videos.
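The failure mode can be reproduced on any platform by writing through a text stream that uses cp1252, which is roughly what print() does on an affected Windows console. The sketch below is illustrative only; the helper name write_like_console is hypothetical and does not appear in watch.py.

```python
import io


def write_like_console(text: str, encoding: str, errors: str = "strict") -> bytes:
    """Write text through a text stream with the given encoding; return the raw bytes.

    Hypothetical helper: it stands in for a console stream whose encoding
    is the locale code page instead of UTF-8.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)   # raises UnicodeEncodeError for unencodable characters
    stream.flush()
    data = raw.getvalue()
    stream.detach()      # keep the wrapper from closing the buffer behind us
    return data


if __name__ == "__main__":
    title = "Anfängerin überrascht ALLE! 🤯"
    try:
        write_like_console(title, "cp1252")
    except UnicodeEncodeError as exc:
        # Same class of error as in the report: the 'charmap' codec cannot encode the emoji.
        print("strict cp1252 failed:", exc.reason)
    # With errors="replace" the unencodable character is substituted, not fatal.
    print(write_like_console(title, "cp1252", errors="replace"))
```

Note that "ä" and "ü" are inside cp1252 and encode without trouble; only the emoji falls outside the code page.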
Repro (Windows, e.g. German locale / cp1252)
(The title is Anfängerin überrascht ALLE! 🤯 Angeln lernen von A bis Z 🎣; the 🤯 kills the report.)

Fix
Reconfigure sys.stdout/sys.stderr to UTF-8 with errors="replace" at startup of the entry point. This is a no-op on macOS/Linux (streams are UTF-8 already) and is guarded with try/except for exotic streams that don't support reconfigure().

Since watch.py imports all other modules into the same process, this covers the whole pipeline.

Verification
On Windows 11 (cp1252 console):
python -c "print('🤯')" still crashes in the same shell, confirming the environment itself reproduces the bug and the fix is what resolves it.

Complements #4, which fixed the same class of issue for config file I/O.
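For readers without the diff at hand, the guarded reconfiguration described under Fix might look roughly like the sketch below. It is not copied from the PR: the function name force_utf8_streams, the streams parameter, and the exact set of caught exceptions are assumptions made for illustration. It relies on TextIOWrapper.reconfigure(), which accepts encoding and errors from Python 3.7 onward.

```python
import sys


def force_utf8_streams(streams=None) -> int:
    """Switch text streams to UTF-8 with errors='replace'; return how many changed.

    Hypothetical helper, not the PR's actual code. Defaults to
    sys.stdout and sys.stderr when no streams are passed.
    """
    if streams is None:
        streams = (sys.stdout, sys.stderr)
    changed = 0
    for stream in streams:
        try:
            stream.reconfigure(encoding="utf-8", errors="replace")
            changed += 1
        except (AttributeError, ValueError, OSError):
            # The stream is None, was replaced by an object without
            # reconfigure(), or refuses to be reconfigured: leave it alone.
            pass
    return changed


if __name__ == "__main__":
    # Call once at the top of the entry point, before anything is printed.
    force_utf8_streams()
    print("Anfängerin überrascht ALLE! 🤯")
```

With errors="replace", a character that still cannot be encoded is substituted instead of raising, so the report is always printed.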