Skip to content

feat: add StringSequenceToEmbedding transformer and layer#47

Open
mruiyangyou wants to merge 2 commits into
ExpediaGroup:mainfrom
mruiyangyou:add-string-sequence-to-embedding
Open

feat: add StringSequenceToEmbedding transformer and layer#47
mruiyangyou wants to merge 2 commits into
ExpediaGroup:mainfrom
mruiyangyou:add-string-sequence-to-embedding

Conversation

@mruiyangyou

Copy link
Copy Markdown

Summary

Adds a new StringSequenceToEmbedding transformer/layer pair that parses a delimited string of pre-computed embedding vectors into a dense (seq_len, embedding_dim) float tensor.

  • Spark transformer (StringSequenceToEmbeddingTransformer) and Keras layer (StringSequenceToEmbeddingLayer) with parity behaviour.
  • Input format: vectors separated by sequence_separator (default ,), floats within a vector by separator (default |). Example: "1|2|3,4|5|6" with seq_len=2, embedding_dim=3[[1,2,3],[4,5,6]].
  • Pads short sequences with pad_value (default "0"); truncates long ones.
  • Optional reverse mode reverses only the non-pad portion of each sequence (useful for chronological → recency-first ordering).
  • Layer drops a trailing size-1 input axis to match the StringToStringListLayer convention, so (None, 1, 1) inputs produce (None, 1, seq_len, embedding_dim) outputs without a downstream squeeze.
  • Empty tokens (from leading/trailing/repeated separators or fully empty cells) are replaced with pad_value before tf.strings.to_number to avoid StringToNumberOp failures at graph execution.
  • README updated with a row in the supported layers table.

Test plan

  • tests/kamae/tensorflow/layers/test_string_sequence_to_embedding.py — 9 unit tests covering default/custom separators, padding, truncation, reverse, trailing-1 squeeze behaviour, empty/malformed inputs, config round-trip, and invalid args.
  • tests/kamae/spark/transformers/test_string_sequence_to_embedding.py — 6 tests including Spark/TF parity across separators, padding, and reverse modes.
  • tests/kamae/tensorflow/test_layer_serialisation.py — added the new layer to the serialisation matrix; passes test_all_layers_tested_for_serialisation.
  • No regressions in existing string_to_string_list tests.

🤖 Generated with Claude Code

@mruiyangyou mruiyangyou requested a review from a team as a code owner May 18, 2026 09:58
@georyetti

Copy link
Copy Markdown
Contributor

Hey @mruiyangyou - this looks good at a glance, but we are gearing up to do a big new kamae v3 release in the next few weeks or so. Can this wait till after these? I can aid migrating this to keras 3.

Parses a delimited string of pre-computed embedding vectors into a
(seq_len, embedding_dim) float tensor, with optional reversal of the
non-pad portion of each sequence. Includes the Spark transformer, the
TensorFlow-only Keras 3 layer, unit tests, Spark/TF parity tests, and
serialisation + JIT-compatibility registry entries.

Authored on top of v3.0.0 (Keras 3 multi-backend migration): the layer
lives in kamae.keras.tensorflow.layers, subclasses
kamae.keras.core.base.BaseLayer, declares supported_backends =
TENSORFLOW_ONLY and jit_compatible = False, and the transformer wraps it
via get_keras_layer().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mruiyangyou mruiyangyou force-pushed the add-string-sequence-to-embedding branch from 7bc4ed9 to 2bc89ff Compare June 16, 2026 13:36

@georyetti georyetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One main comment on padding value being assumed to be zero.

Comment on lines +150 to +152
# A row is considered padding iff all of its components are 0.
row_norms = tf.reduce_sum(tf.abs(result), axis=-1)
seq_lengths = tf.reduce_sum(tf.cast(row_norms > 0, tf.int32), axis=-1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pad value is configurable, so this only works if pad value is 0 no? If pad value is -1, then row_norms > 0 and its considered non padded?

Comment on lines +258 to +266
# Count the number of non-pad vectors (a vector is pad iff all
# of its components are zero). Reverse only that prefix.
abs_sums = F.transform(
vectors,
lambda v: F.aggregate(
v,
F.lit(0.0),
lambda acc, value: acc + F.abs(value),
),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, pad value = -1 breaks this I think

Comment on lines +106 to +110
layer = StringSequenceToEmbeddingLayer(
name="reverse",
seq_len=4,
embedding_dim=2,
reverse=True,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for a non zero pad value

@georyetti

Copy link
Copy Markdown
Contributor

Also since pad value is a string, what happens if the user specific pad_value = "hello"? We should validate that its a numeric string on setting in Spark and in constructor of keras layer

…_value

Addresses PR ExpediaGroup#47 review (georyetti):
- reverse no longer infers padding from values (norm > 0), which broke
  for non-zero pad values (e.g. -1). It now counts the supplied vectors
  positionally (non-empty groups in the original input, capped at
  seq_len) and reverses only that prefix, leaving appended padding at
  the tail. Applied to both the Keras layer and the Spark transformer.
- validate that pad_value is a numeric string in the Keras layer
  constructor and the Spark setPadValue/_transform; a non-numeric value
  like "hello" now raises ValueError instead of producing NaNs.
- add tests for a non-zero pad value and for non-numeric pad_value
  rejection; rewrite the reverse tests to use genuinely padded inputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants