رفع مجموعهای از باگهای درستی و بهبود پایداری در ماژولهای اصلی#200
Open
Milomilo777 wants to merge 1 commit into
Open
رفع مجموعهای از باگهای درستی و بهبود پایداری در ماژولهای اصلی#200Milomilo777 wants to merge 1 commit into
Milomilo777 wants to merge 1 commit into
Conversation
A comprehensive audit fixing real, verified defects across metrics, trainer, tokenizers, datasets, models, preprocessors and utils. No public API breaks (legacy enum spellings kept as backward-compatible aliases). Trainer: - SequenceLabeling metrics crashed on eval batch size 1 (bare np.squeeze collapsed the batch dim); use argmax(-1) without squeeze. - AttributeError on resume from a named checkpoint dir (referenced self.state before assignment). - Objective resolution failed for metrics whose output key differs from the metric name (e.g. seqeval -> f1). - OptimizerType "sdg"->"sgd" and LRSchedulerType "cosine_anealing"-> "cosine_annealing" typos; CONSTANT scheduler now mapped; unknown optimizer/scheduler now raise clear errors (old values still accepted). - Final partial accumulation group now divides by the real micro-batch count. - Distributed eval loss reduced with .mean() before .item(). Tokenizers: - Wrong "input_ids" key -> "token_ids" (overflow mapping and OCR processor). - User-supplied `stride` was ignored. - decode() converts tensors/arrays before the int check; empty-offsets guard. - pad_encoded_batch no longer mutates the caller's exclude_keys list. - Whisper ASR: decode()[0] for language tokens, offset text as string. Metrics: - n_decimals=0 / normalize=False now honored (None-sentinel instead of `or`). - BLEU wraps string references into a list-of-references for NLTK. - ROUGE honors use_aggregator and n_decimals; seqeval drops dead format(). Models / preprocessors: - Whisper: don't overwrite user decoder_input_ids; keep mono numpy audio; fix generation_config override (dataclass has no .update()). - GPT2: forward attention_mask and stop polluting the persistent gen config. - CRAFT: invert the forward resize ratio for box coordinates; sync get_ratio overrides; honor 0.0 thresholds. - TextNormalizer: per-call nfkd/nfkc/replace_patterns overrides now work. - CharLevelOCR collator: no in-place mutation of sample label lists. - AudioFeatureExtractorConfig gains the return_attention_mask field. Utils: - spectrogram() honors the dtype contract for non-log spectrograms. - normalize_image() promotes to float (no uint8 truncation / div-by-zero). - list_repo_files() filters subfolders by path prefix, not substring. - get_state_dict_from_hub() uses map_location="cpu" (torch>=2.6 safe). - Dataset.load() no longer leaks per-call cache_dir onto the class. Validated with ruff (project config) and full byte-compilation; pure-python logic covered by standalone tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
این PR حاصل یک ممیزی عمیق و سیستماتیک روی کل کدبیس است. در مجموع ۵۶ نقص واقعی شناسایی، راستیآزمایی و برطرف شدهاند. تمرکز بر درستی منطق، پایداری و رفع باگهای نهفته بوده است.
سازگاری رو به عقب حفظ شده و هیچ شکست API عمومیای وجود ندارد (املای قدیمیِ مقادیر شمارشی بهصورت ناممستعار پذیرفته میشود).
باگهای بحرانی
squeezeself.stateپیش از مقداردهی)input_idsبهجایtoken_idsdecoder_input_idsکاربر در ویسپر بهخاطرorبهجایandCRAFT(نسبت تغییر اندازه وارون نمیشد)ماژولهای بهینهشده
trainersgdوcosine_annealingconstantو خطای روشن برای مقدار نامعتبرseqevalوf1)metricsn_decimals=0وnormalize=False(الگویorاین مقادیر را بیاثر میکرد)bleuبرای ورودی رشتهایuse_aggregatorوn_decimalsدرrougepreprocessors/tokenizersstride؛ تبدیل تنسور پیش از بررسی نوع درdecodeexclude_keysnfkdوnfkcوreplace_patternsدر نرمالساز متنmodelsupdateروی دیتاکلاس)gpt2: انتقالِattention_maskو توقفِ آلودگیِ پایدارِ پیکربندیِ تولیدmask-fillingو مقدار بازگشتیِsavedataRangedSampler.__len__اکنون تعداد واقعیِ نمونههای پیمایششده را برمیگرداندcache_dirروی صفتِ کلاس و پشتیبانی از مسیرهای محلیِ حاوی دونقطه (ویندوز)labels_max_lengthmax_length=Noneدر دیتاستِOCRو رفعِ جهشِ درجایِ برچسبها در کلکتورutilsdtypeدرspectrogramnormalize_imageبرای تصاویرِ صحیح (uint8)map_locationو سازگاری با نسخههای جدیدِ تورچاعتبارسنجی
ruffبا پیکربندی پروژه: بدون خطا🤖 Generated with Claude Code