Commit a35c364

build: add --no-streaming-save flag for in-RAM save path

Currently the build CLI picks between two paths:

  -o without --check
      -> dictionary::build_streaming_save (spilled components are stitched
         into the output via the streaming saver; `dict` is not query-ready
         afterward)
  -o with --check (or no -o)
      -> dictionary::build (spilled components are materialized back into
         `dict`, then optionally essentials::save)

For users with plenty of RAM who don't want the streaming-save tmp-file
concatenation (and don't need --check), expose the in-RAM save path
explicitly via --no-streaming-save. When set, the build does build() +
essentials::save: peak RSS at save time briefly equals the in-RAM index
size, but the save is a single pass over `dict` rather than a stitched
concatenation. Useful when the user already pays the memory cost (e.g., to
query the dict immediately afterward in another tool, or just prefers the
simpler save path).

Both flows produce byte-identical output files; the flag only affects the
save path.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK

1 parent b3e49c9 commit a35c364

1 file changed

Lines changed: 19 additions & 3 deletions

tools/build.cpp
@@ -46,6 +46,13 @@ int build(int argc, char** argv) {
                true);
     parser.add("check", "Check correctness after construction.", "--check", false, true);
     parser.add("verbose", "Verbose output during construction.", "--verbose", false, true);
+    parser.add("no_streaming_save",
+               "Force the in-RAM save path even with -o: build, materialize the dictionary in RAM, "
+               "then write it via essentials::save. Peak RSS at save time briefly equals the "
+               "in-RAM index size; useful when the user has plenty of memory and wants a single "
+               "save call rather than the streaming-save tmp-file concatenation. Implied by "
+               "--check (which always materializes for query).",
+               "--no-streaming-save", false, true);
 
     if (!parser.parse()) return 0;
 
@@ -74,19 +81,28 @@ int build(int argc, char** argv) {
     // build_config.print();
 
     bool check = parser.get<bool>("check");
+    bool no_streaming_save = parser.get<bool>("no_streaming_save");
     bool has_output = parser.parsed("output_filename");
 
     dictionary_type dict;
 
-    if (has_output && !check) {
+    if (has_output && !check && !no_streaming_save) {
         /* Streaming-save path: keeps peak RAM bounded by the build phase
-           (the strings bit-vector is never fully in RAM). After this returns
-           `dict` is not query-ready; reload from disk to query. */
+           (the strings bit-vector and the spilled compact_vectors / MPHFs
+           are never fully in RAM). After this returns `dict` is not
+           query-ready; reload from disk to query. */
         auto output_filename = parser.get<std::string>("output_filename");
         essentials::logger("building data structure (streaming save)...");
         dict.build_streaming_save(input_filename, build_config, output_filename);
         essentials::logger("DONE");
     } else {
+        /* In-RAM save path. The build still spills internally for
+           bounded-RAM construction, but at the end every spilled
+           component is materialized back into `dict` so it's
+           query-ready. Used whenever --check is requested (queries need
+           `dict` populated) or when the user explicitly opts in via
+           --no-streaming-save. Peak RSS briefly hits the full index
+           size at save time. */
         essentials::logger("building data structure...");
         dict.build(input_filename, build_config);
 
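
The branch condition in the second hunk can be read as a small decision table over the three flags. The following is a minimal sketch of that table, not code from the repository: `SavePath` and `choose_save_path` are hypothetical names introduced only for illustration, and the three booleans mirror `has_output`, `check`, and `no_streaming_save` from the diff.

```cpp
// Illustrative only: SavePath and choose_save_path are hypothetical names,
// not part of tools/build.cpp. The predicate mirrors the condition
// `has_output && !check && !no_streaming_save` from the diff.
enum class SavePath { streaming, in_ram };

constexpr SavePath choose_save_path(bool has_output, bool check, bool no_streaming_save) {
    return (has_output && !check && !no_streaming_save) ? SavePath::streaming
                                                        : SavePath::in_ram;
}

// -o alone: the streaming-save path is taken.
static_assert(choose_save_path(true, false, false) == SavePath::streaming,
              "-o alone should use the streaming saver");
// -o with --check: the dictionary must be materialized for queries.
static_assert(choose_save_path(true, true, false) == SavePath::in_ram,
              "--check implies the in-RAM path");
// -o with --no-streaming-save: the user opts into the in-RAM path.
static_assert(choose_save_path(true, false, true) == SavePath::in_ram,
              "--no-streaming-save forces the in-RAM path");
// No -o: nothing to stream to, so the build stays in RAM regardless of flags.
static_assert(choose_save_path(false, false, false) == SavePath::in_ram,
              "without -o the in-RAM path is used");
```

Under this reading, the streaming path is taken only for the single combination "-o given, neither --check nor --no-streaming-save"; every other combination falls through to `dict.build(...)`.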