Signal #93002POSITIVE

Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

70

arXiv:2605.00837v1 Announce Type: new Abstract: Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log-domain, enabling robust computation for regularization parameters as small as epsilon = 10^{-4} where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves 12x speedup over the widely-used POT library and 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image...

arXiv ML Latestabout 3 hours ago
Read Full Article

Explore with AI-Powered Tools

View All Signals

Explore more AI intelligence

Want to discover more AI signals like this?

Explore Steek
Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions | Steek AI Signal | Steek