| 1 | +<!DOCTYPE html> |
| 2 | +<html lang="zh-CN"> |
| 3 | +<head> |
| 4 | + <title>TIL: Speeding up rsync migration of huge numbers of small files with parallel - 暗无天日</title> |
| 5 | + <meta charset="utf-8" /> |
| 6 | + <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| 7 | + <meta name="viewport" |
| 8 | + content="width=device-width, initial-scale=1, maximum-scale=5"> |
| 9 | + <meta name="author" content="lujun9972,Claude Code" /> |
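| 10 | + <meta name="description" content="When a single rsync run handles a huge number of small files, the bottleneck is the file count rather than the data volume. Using GNU parallel to split the directories across several rsync processes can noticeably shorten the migration." /> |
| 11 | + <meta name="keywords" content="rsync,parallel,small files,storage migration,TIL" /> |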
| 12 | + <meta name="theme-color" content="#FAFAF7"> |
| 13 | + <link rel="preconnect" href="https://fonts.googleapis.com"> |
| 14 | + <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> |
| 15 | + <link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,400;0,600;0,700;1,400&family=Source+Sans+3:ital,wght@0,400;0,500;0,600;1,400&family=JetBrains+Mono:wght@400&display=swap" rel="stylesheet"> |
| 16 | + <link rel="stylesheet" href="../../../../../media/css/org-src-fontify.css" type="text/css"> |
| 17 | + <link rel="stylesheet" href="../../../../../media/css/kdComment.css" type="text/css"> |
| 18 | + <link rel="stylesheet" href="../../../../../media/css/main.css" type="text/css"> |
| 19 | + <script async type="text/javascript" |
| 20 | + src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> |
| 21 | + </script> |
| 22 | + <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/gitalk@1/dist/gitalk.css"> |
| 23 | + <script defer src="https://cdn.jsdelivr.net/npm/gitalk@1/dist/gitalk.min.js"></script> |
| 24 | +</head> |
| 25 | + |
| 26 | + <body class="container"> |
| 27 | +<header class="masthead"> |
| 28 | + <h1 class="masthead-title"><a href="../../../../../">暗无天日</a></h1> |
| 29 | + <p>=============>DarkSun's personal blog</p> |
| 30 | + <nav class="site-nav"> |
| 31 | + <ul class="trigger"> |
| 32 | + <li><a href="../../../../../years/">Years</a></li> |
| 33 | + <li><a href="../../../../../tags/">Tags</a></li> |
| 34 | + <li><a href="../../../../../authors/">Authors</a></li> |
| 35 | + <li><a href="../../../../../about/">About</a></li> |
| 36 | + <li><a href="https://github.com/lujun9972/lujun9972.github.com">Github</a></li> |
| 37 | + <li><a href="../../../../../rss.xml">RSS</a></li> |
| 38 | + </ul> |
| 39 | + </nav> |
| 40 | + <form method="get" id="searchform" action="http://www.google.com/search"> |
| 41 | + <input type="text" class="field" name="q" id="s" placeholder="Search..."> |
| 42 | + <input type="hidden" name="as_sitesearch" value="lujun9972.github.io"> |
| 43 | + </form> |
| 44 | +</header> |
| 45 | + |
| 46 | +<div> |
| 47 | +<article class="post"> |
| 48 | +<h1 class="title">TIL: Speeding up rsync migration of huge numbers of small files with parallel</h1> |
| 49 | +<div id="table-of-contents" role="doc-toc"> |
| 50 | +<h2>Contents</h2> |
| 51 | +<div id="text-table-of-contents" role="doc-toc"> |
| 52 | +<ul> |
| 53 | +<li><a href="#orga6de942">What it is</a></li> |
| 54 | +<li><a href="#org0924b5a">Why it works</a></li> |
| 55 | +<li><a href="#org2f8c165">Caveats</a></li> |
| 56 | +</ul> |
| 57 | +</div> |
| 58 | +</div> |
| 59 | +<p> |
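| 60 | +During a storage migration, the real headache is not a handful of large files but a directory tree packed with tens or hundreds of thousands of small ones. <code>rsync</code> can do the migration, but a single run works through the files one at a time, finishing one before it starts the next. The trouble with small files is that each one still has to be <code>stat</code>ed, compared, and transferred, and that fixed per-file cost does not depend on file size. The smaller the file, the faster the transfer itself, but the stat and comparison time does not shrink, so most of the wall-clock time goes to per-file waiting while the disk sits idle. A TecMint <a href="https://www.tecmint.com/copy-large-files-faster-linux/">article</a> describes how to parallelize rsync with GNU parallel. |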
| 61 | +</p> |
| 62 | + |
| 63 | +<div id="outline-container-orga6de942" class="outline-2"> |
| 64 | +<h2 id="orga6de942">What it is</h2> |
| 65 | +<div class="outline-text-2" id="text-orga6de942"> |
| 66 | +<p> |
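| 67 | +The core idea: hand the top-level subdirectories of the source directory to several <code>rsync</code> processes that run at the same time, so the disk I/O queue is never left waiting. |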
| 68 | +</p> |
| 69 | + |
| 70 | +<p> |
| 71 | +Local copy: |
| 72 | +</p> |
| 73 | + |
| 74 | +<div class="org-src-container"> |
| 75 | +<pre class="src src-shell">find /source/directory -mindepth 1 -maxdepth 1 -type d | <span class="org-sh-escaped-newline">\</span> |
| 76 | + parallel -j 4 rsync -a {} /destination/directory/ |
| 77 | +</pre> |
| 78 | +</div> |
| 79 | + |
| 80 | +<p> |
| 81 | +Syncing over the network (the more common case in a storage migration): |
| 82 | +</p> |
| 83 | + |
| 84 | +<div class="org-src-container"> |
| 85 | +<pre class="src src-shell">find /source/directory -mindepth 1 -maxdepth 1 -type d | <span class="org-sh-escaped-newline">\</span> |
| 86 | + parallel -j 4 rsync -az {} user@remote:/destination/directory/ |
| 87 | +</pre> |
| 88 | +</div> |
| 89 | + |
| 90 | +<p> |
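| 91 | +The pipeline has three steps: |
| 92 | +</p> |
| 93 | +<ol class="org-ol"> |
| 94 | +<li><code>find -mindepth 1 -maxdepth 1 -type d</code> lists the top-level subdirectories of the source directory</li> |
| 95 | +<li><code>parallel -j 4</code> runs up to 4 <code>rsync</code> processes at a time, replacing <code>{}</code> with each subdirectory path</li> |
| 96 | +<li><code>rsync -a</code> syncs each subdirectory to the destination in archive mode (<code>-z</code> additionally enables compression, which helps for remote transfers)</li> |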
| 97 | +</ol> |
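<p>
The three steps above can also be wrapped in a small script. What follows is only a sketch under a few assumptions that are not in the original post: the function names <code>list_top_dirs</code> and <code>sync_top_dirs</code> and the <code>DRY_RUN</code> variable are hypothetical, and it switches to NUL-delimited output (<code>find -print0</code> with <code>parallel -0</code>) so that directory names containing newlines are passed through intact.
</p>

```shell
#!/usr/bin/env bash
# Sketch of the directory-sharding pipeline as reusable functions.
# Function names, DRY_RUN and the default of 4 jobs are illustrative
# assumptions, not part of the original post.

# Print the immediate subdirectories of $1, NUL-terminated.
list_top_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -print0
}

# sync_top_dirs SRC DEST [JOBS]
# Run one rsync per top-level subdirectory, at most JOBS at a time.
# With DRY_RUN=1, rsync gets -n and -v: it lists what it would copy
# and changes nothing.
sync_top_dirs() {
    local src=$1 dest=$2 jobs=${3:-4}
    local flags=-a
    if [ "${DRY_RUN:-0}" = 1 ]; then
        flags=-anv
    fi
    list_top_dirs "$src" | parallel -0 -j "$jobs" rsync "$flags" {} "$dest"/
}
```

<p>
For example, <code>DRY_RUN=1 sync_top_dirs /source/directory /destination/directory 4</code> would list what each <code>rsync</code> intends to copy without touching the destination. Files that sit directly in the source directory are not covered by this sketch, because only subdirectories are listed.
</p>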
| 98 | +</div> |
| 99 | +</div> |
| 100 | + |
| 101 | +<div id="outline-container-org0924b5a" class="outline-2"> |
| 102 | +<h2 id="org0924b5a">Why it works</h2> |
| 103 | +<div class="outline-text-2" id="text-org0924b5a"> |
| 104 | +<p> |
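| 105 | +When a single rsync handles large files, the bottleneck is the disk's read/write throughput, and most of the time is spent actually moving data. With small files, most of the time goes to filesystem metadata operations (stat, open, close) and rsync's own comparison logic, and very little to moving data. Running several rsync processes over different subdirectories overlaps those waits: while one rsync is waiting on a stat, another is already transferring data, so the disk stays busy. |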
| 106 | +</p> |
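<p>
A rough back-of-the-envelope model makes the difference concrete. The sketch below assumes a serial copy costs a fixed overhead per file plus the time to stream the data; the function name <code>estimate_seconds</code> and all of the numbers are illustrative assumptions, not measurements.
</p>

```shell
#!/usr/bin/env bash
# Toy cost model for a serial copy (an assumption for illustration,
# not something rsync reports):
#   seconds = files * per_file_ms / 1000 + total_mb / mb_per_s

# estimate_seconds FILES PER_FILE_MS TOTAL_MB MB_PER_S
estimate_seconds() {
    awk -v n="$1" -v ms="$2" -v mb="$3" -v rate="$4" \
        'BEGIN { printf "%.0f\n", n * ms / 1000 + mb / rate }'
}

# Same 2000 MB of data, hypothetical 2 ms per-file overhead, 100 MB/s:
estimate_seconds 1 2 2000 100        # one large file
estimate_seconds 500000 2 2000 100   # 500,000 small files
```

<p>
With these made-up numbers the single large file comes out at about 20 seconds and the small files at about 1020, of which 1000 are per-file overhead. That overhead is mostly waiting, which is the part that several concurrent <code>rsync</code> processes can overlap.
</p>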
| 107 | +</div> |
| 108 | +</div> |
| 109 | + |
| 110 | +<div id="outline-container-org2f8c165" class="outline-2"> |
| 111 | +<h2 id="org2f8c165">Caveats</h2> |
| 112 | +<div class="outline-text-2" id="text-org2f8c165"> |
| 113 | +<ol class="org-ol"> |
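| 114 | +<li>There is no universal formula for <code>-j</code>. Start at 4 and watch <code>%util</code> and <code>await</code> in <code>iostat</code> while the job runs: raise the value while the disk is not saturated, and lower it once <code>%util</code> stays near 100% or <code>await</code> keeps climbing. Spinning disks (HDD) usually cannot take much concurrency; SSDs and network storage can go higher</li> |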
| 115 | +<li><p> |
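| 116 | +This scheme shards by top-level subdirectory. If all the files sit flat in a single directory with no subdirectories, <code>find</code> returns nothing, so no <code>rsync</code> is started at all; more generally, files that sit directly in the source directory are never matched by <code>-type d</code>. In that case, shard by file list instead. The list below holds paths relative to <code>/source</code> (GNU <code>find</code>'s <code>%P</code>), so the tree is recreated directly under the destination: |
| 117 | +</p> |
| 118 | +<div class="org-src-container"> |
| 119 | +<pre class="src src-shell">find /source -type f -printf '%P\n' | split -l 1000 - /tmp/chunk. |
| 120 | +parallel -j 4 <span class="org-string">'rsync -a --files-from={} /source/ /destination/'</span> ::: /tmp/chunk.* |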
| 121 | +</pre> |
| 122 | +</div></li> |
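| 123 | +<li>After the migration, run <code>rsync -avnc</code> once (dry-run plus checksum mode): it compares every file by checksum without modifying anything and reports only the differences. This is usually less work than generating and diffing <code>sha256sum</code> lists by hand, but it still reads every file in full on both sides, so on very large trees spot-checking the key directories is a cheaper option</li> |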
| 124 | +</ol> |
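<p>
The check in the last item can be scripted. This is a minimal sketch with a hypothetical helper name, <code>count_mismatches</code>; it adds <code>--itemize-changes</code> to the dry run and counts only the itemized lines for files that rsync would transfer, so files that differ only in timestamp are not counted.
</p>

```shell
#!/usr/bin/env bash
# count_mismatches SRC DEST
# Dry-run checksum comparison (-n -c): prints how many regular files
# rsync would have to transfer, i.e. files missing in DEST or with
# different content. Nothing is modified on either side.
# The helper name is an illustrative assumption.
count_mismatches() {
    # Itemized lines for transferred files start with ">f" or "<f".
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    rsync -anc --itemize-changes "$1"/ "$2"/ | grep -c '^[<>]f' || true
}
```

<p>
For example, <code>count_mismatches /source/directory /destination/directory</code> prints <code>0</code> when every file in the source has an identical copy in the destination. Extra files that exist only in the destination are not reported by this sketch.
</p>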
| 125 | +</div> |
| 126 | +</div> |
| 127 | + |
| 128 | +</article> |
| 129 | +</div> |
| 130 | + |
| 131 | +<div> |
| 132 | + <div class="post-meta"> |
| 133 | + <span class="post-info">2026-05-14</span> |
| 134 | + <span class="post-info">2026-05-14</span> |
| 135 | + <a href="../../../../../tags/rsync">rsync</a> : <a href="../../../../../tags/parallel">parallel</a> : <a href="../../../../../tags/小文件">small files</a> : <a href="../../../../../tags/存储迁移">storage migration</a> : <a href="../../../../../tags/til">TIL</a> |
| 136 | + <span class="post-info">lujun9972,Claude Code</span> |
| 137 | + </div> |
| 138 | + <script src="../../../../../media/js/jquery-2.1.3.min.js"></script> |
| 139 | + <script src="../../../../../media/js/md5.min.js"></script> |
| 140 | + <link href="https://yiyechat.com/open-source/build/content-static/css/main.css" rel="stylesheet"> |
| 141 | + <script src="https://yiyechat.com/open-source/build/content-static/js/main.js"></script> |
| 142 | + <section> |
| 143 | + <div id="gitalk-container"></div> |
| 144 | + <script type="text/javascript"> |
| 145 | + var gitalk = new Gitalk({ |
| 146 | + clientID: 'fdcb5d9da3f4acb4862c', |
| 147 | + clientSecret: 'dd0a16312a206782cb7669dcec5e96874ab48170', |
| 148 | + repo: 'lujun9972.github.com', |
| 149 | + owner: 'lujun9972', |
| 150 | + admin: ['lujun9972'], |
| 151 | + id: md5(location.pathname), |
| 152 | + distractionFreeMode: false |
| 153 | + }) |
| 154 | + gitalk.render('gitalk-container') |
| 155 | + </script> |
| 156 | + </section> |
| 157 | + <script src="../../../../../media/js/kdComment.js"></script> |
| 158 | + <script> |
| 159 | + var _hmt = _hmt || []; |
| 160 | + (function() { |
| 161 | + var hm = document.createElement("script"); |
| 162 | + hm.src = "https://hm.baidu.com/hm.js?7bac4fd0247f69c27887e0d4e3aee41e"; |
| 163 | + var s = document.getElementsByTagName("script")[0]; |
| 164 | + s.parentNode.insertBefore(hm, s); |
| 165 | + })(); |
| 166 | + </script> |
| 167 | + <footer class="footer"> |
| 168 | + <p>Generated by <a href="http://www.gnu.org/software/emacs/">Emacs</a> 29.x(<a href="http://orgmode.org">Org mode</a> 9.x)</p> |
| 169 | + <p> |
| 170 | + Copyright © 2014 - <span id="footerYear"></span> <a href="mailto:runner <at> runnervmrw5os <dot> 5im3poowmk4e3it2xdho1gde0e <dot> cx <dot> internal <dot> cloudapp <dot> net">lujun9972,Claude Code</a> |
| 171 | + · |
| 172 | + Powered by <a href="https://github.com/emacs-china/EGO" target="_blank">EGO</a> |
| 173 | + · |
| 174 | + Themed with <a href="https://github.com/kuangdash/emacs_love" target="_blank">emacs_love</a> |
| 175 | + </p> |
| 176 | + <p> |
| 177 | + <a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license"><img src="https://licensebuttons.net/l/by-sa/3.0/88x31.png" style="border-width:0" alt="Creative Commons License" class="center"></a> |
| 178 | + </p> |
| 179 | + <script type="text/javascript">document.getElementById("footerYear").innerHTML = (new Date()).getFullYear();</script> |
| 180 | + </footer> |
| 181 | +</div> |
| 182 | + |
| 183 | + </body> |
| 184 | +</html> |