Do you have updated guidance for handling large datasets (>100 samples and/or >500,000 cells)?
A previous issue addressed this topic nearly 4 years ago (#108), but I'm hoping you have more insights now. The clearest recommendation from that discussion was: "for very large datasets with many samples, use large k~[50, 100] and small prop~[0.01, 0.1] to reduce neighborhood redundancy."
The main concerns are:
- Sample representation: How do I ensure each neighborhood captures enough cells from each sample? Should k scale with sample size?
- Computational constraints: Does adjusting prop address memory/computation limits, or are other strategies needed?
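
To make the two concerns concrete, here is a rough back-of-envelope sketch. It is not taken from the package: `nhood_estimates` is a hypothetical helper, and it assumes cells are spread evenly across samples, that a neighborhood holds roughly k cells (in practice it can be larger), and that prop is the fraction of cells initially sampled as neighborhood index cells.

```python
def nhood_estimates(n_cells, n_samples, k, prop):
    """Back-of-envelope numbers under the assumptions stated above."""
    return {
        # Number of index cells sampled before any refinement step (assumption).
        "sampled_index_cells": round(n_cells * prop),
        # Average cells contributed by each sample to a neighborhood of ~k cells.
        "mean_cells_per_sample_per_nhood": k / n_samples,
    }


# The two ends of the recommended ranges, at 500,000 cells and 100 samples.
for k, prop in [(50, 0.1), (100, 0.01)]:
    print(k, prop, nhood_estimates(500_000, 100, k, prop))
```

Under these assumptions, even k = 100 gives only about one cell per sample per neighborhood with 100 samples, which is why I am asking whether k should scale with the number of samples.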