
Support for <img> in SMIL #2919

Draft
HadrienGardeur wants to merge 2 commits into main from img-smil

@HadrienGardeur

@HadrienGardeur HadrienGardeur commented Feb 5, 2026

Member

This PR closes #2883.

It's currently a draft PR and not ready to be reviewed yet.

Changes:

  • Added a new section for the img element
  • Added img to the content model for par (set to 0 or 1)
  • Added Media Fragment URI to normative references
  • Added a new paragraph in "Referencing document fragments"
  • Added new example for comics in "Structural semantics in overlays"
  • Added new entry in changelog

Links:


Preview | Diff

Comment on lines 8319 to 8328
<ul class="nomark">
<li>
<p> [^text^] <code>[exactly 1]</code>
</p>
</li>
<li>
<p> [^audio^] <code>[0 or 1]</code>
</p>
</li>
</ul>

Member

The new content model for par needs to be specified. Is img + audio a separate combination, since relying on TTS for images likely shouldn't be an option, or is img optionally allowed and text remains required so all three elements can be used together? (But is showing both an image and a text fragment realistic if the viewport is occupied by the FXL image?)

Member

I guess the latter would be a new case of img and audio required and text optional, not img as an optional element in the current model. Probably doesn't make sense to always require text with img.

Member

(Apologies if you're already working on this, but I assumed you were moving on to the RS aspects by opening the draft.)

Member Author

No I'm not done with anything at this point, I just prefer to open a draft PR as early as possible.

For me, all combinations are valid:

  • img + text (textual alternatives for regions, for example description of a panel)
  • img + audio (audio narration for regions)
  • img + audio + text (this would allow someone to either listen to the pre-recorded audio or consume the textual content using, for example, a Braille tablet)

Even img on its own could be a valid use case and result in a panel-by-panel navigation for example in comics.

Member Author

I didn't go through the section that you highlighted yet, but IMO text, img or audio would all become 0 or 1 with at least one of them present.

@HadrienGardeur HadrienGardeur Feb 6, 2026

Member Author

> I don't believe that's how smil expects it to work by default, though, if I understand you correctly (that you pick the applicable format to synchronize). I'll preface by saying I'm not the expert on this, but my understanding is that if you specify all three in a single par then all three are expected to be synchronized.

This is definitely meant to synchronize all three media together.

You can open the following Google Slides in presenter mode to see a demo of what this could feel like: https://docs.google.com/presentation/d/1LGHRIN_vHl-H-bgXsHkhqrL0owMCy8qx854x9YU3d8w/edit?usp=sharing

That said, even with Media Overlays today you're free to use what you want:

  • just text
  • just audio
  • or text and audio together

Multiple apps already offer the ability to consume EPUB with Media Overlays like an audiobook with just a player interface on screen.

In the specific case of comics and highly illustrated content, the ideal scenario would be to customize things to your needs:

  • for example, a dyslexic user could enable audio on captions and speech bubbles (either using <audio> or with TTS on the content of <text>), skip descriptions using skippability (this would need a specific role that could be identified), and display text below the image fragment (this is what my example in Google Slides illustrates)
  • a blind user could go full audio (once again using either <audio>, with TTS on the content of <text> or using a textual view with a screen reader) without skipping anything at all
  • a user reading on a small screen could just use these image fragments to read more easily on their device and turn audio on just for speech bubbles

> This is potentially problematic as it means you could validly have only an image listed, but what happens then?

That's the last use case that I've described above, this would give you region by region navigation.

> With text, the reading system is supposed to tts the content before moving to the next par, but if someone only lists an image does it load and unload instantaneously?

Frankly, there's more of a use case for <img> or <audio> on their own than having just <text>.

What's the point of a SMIL with just <text> when you can just use TTS? Skippability and escapability can be achieved without SMIL; the only use case I can think of is to guide you through places in the publication.

Member

> This is definitely meant to synchronize all three media together.

Okay, I'll have to see the new model for rendering the content before I comment any more on this. Having text content synced with a roll image, or an image placed into reflowable text, seems complex to spec out.

> That said, even with Media Overlays today you're free to use what you want:

I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

> What's the point of a SMIL with just <text> when you can just use TTS?

The only advantage is that you don't have to change playback modes. If the body is professionally narrated you could sync the backmatter for TTS without having to prerecord it.

But that's not the issue. There's still a timing sequence for text if you push the rendering out to TTS. The reading system will present/highlight the text for as long as it takes to render it as speech.

If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

It's a little weird to use a synchronization markup language to not synchronize anything. An audio-only par element is equally strange from a synchronization standpoint, but at least the timing of the clip gives it a duration to play.

But, these are just my immediate concerns. I can wait until the draft is in a more complete state before commenting any more so we don't add a lot of noise to the pull request.

Member Author

> I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

I was mostly talking about the UX from a RS perspective, of course the current spec requires text.

Our current spec is really written in a way where we assume that:

  • the main way this will be handled is by displaying text on the screen with audio playback in the background
  • and the use of authored CSS is very much skewed towards FXL

In a reflowable EPUB where users are free to select different themes, using authored CSS for highlighting is a potential accessibility hazard, since you could end up with major contrast issues.

In practice, a user could just listen to an EPUB with Media Overlays without displaying anything on a screen, even if the SMIL includes text and/or img. This would be indistinguishable from an audiobook from a UX perspective.

> If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud). That's also why the dur attribute exists in SMIL but we don't have it in EPUB.

From a UX perspective, it's important to keep in mind that even with <text> and <audio> the playback can either be:

  • continuous, where you go through the entire publication
  • per page or per spread, where the current page/spread is read and the RS waits for the user to switch to another one before playback continues
  • or handled element by element

With this last one, you have something that makes perfect sense for img on its own. For example:

  • whole page displayed
  • then just the first panel
  • then the second panel
  • then a zoom into part of the second panel to showcase a character and a bubble
  • etc.

Once you sync text or audio to each of these image fragments, then playback can also become continuous or based on page/spread.
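
The three playback modes described above differ only in where the reading system pauses and waits for the user. A rough Python sketch, assuming each par is tagged with the page it belongs to; `playback_batches` and the mode names are hypothetical and not taken from any reading system:

```python
_UNSET = object()  # sentinel so that any real page value starts a new batch


def playback_batches(pars, mode):
    """Group par ids into runs that play without user input.

    pars: list of (page, par_id) tuples in reading order.
    mode: "continuous", "page" or "element" (hypothetical names).
    """
    if mode == "continuous":
        # One uninterrupted run through the whole publication.
        return [[par_id for _, par_id in pars]] if pars else []
    if mode == "element":
        # Stop after every single par (e.g. panel-by-panel navigation).
        return [[par_id] for _, par_id in pars]
    if mode == "page":
        # Play everything on the current page, then wait for a page turn.
        batches, current_page = [], _UNSET
        for page, par_id in pars:
            if page != current_page:
                batches.append([])
                current_page = page
            batches[-1].append(par_id)
        return batches
    raise ValueError(f"unknown mode: {mode}")
```

In the "element" mode a par that only carries an image fragment still forms a meaningful step, since the user decides when to move on.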

@mattgarrish mattgarrish Feb 10, 2026

Member

> I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud)

Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present. TTS provides the duration for text content. We took out the section on embedded media last revision and advise people that referencing that kind of content from text will have unpredictable results, but text was never meant to link to content for which no duration could be established.

If you go back to 3.2, before we took that section out, the embedded media that text referred to had to have an audio component that could be played back:

> When a text element references embedded media that contains audio, the audio sibling element is OPTIONAL.

Referring to images was always problematic because the text content to TTS wasn't as straightforward as getting the text content of a typical html element, but there was still the possibility of using alt.

That's why having img reference images outside of an xhtml wrapper as the only element of a par contradicts all expectations we've ever had for media overlays to synchronize content with audio.

Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable:

  • img + text = duration through tts.
  • img + audio = duration through audio playback.
  • img + text + audio = duration through audio playback

But img on its own has no duration and no audio, so why do we even need it? It's like region-based nav, except that you take away playback control from the user and expect the reading system to meaningfully automate it. If you drop that one case, the internal conflicts with what we have are greatly reduced.
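
The combinations listed above reduce to a simple rule for where a par gets its duration. A minimal Python sketch, assuming audio takes precedence over TTS when both are present; `TimingSource` and `timing_source` are hypothetical names, not part of any specification:

```python
from enum import Enum


class TimingSource(Enum):
    AUDIO_CLIP = "audio playback"
    TTS = "text-to-speech"
    NONE = "no intrinsic duration"


def timing_source(has_text, has_audio, has_img=False):
    """Return what gives a par its duration.

    has_img is accepted only to show that an image never
    contributes a duration of its own.
    """
    if has_audio:
        return TimingSource.AUDIO_CLIP
    if has_text:
        return TimingSource.TTS
    return TimingSource.NONE
```

Only the img-only case ends up with `TimingSource.NONE`, which is the case under dispute here.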

Member Author

I think that there's a use case for <img> on its own, but as you've correctly pointed out it can also be implemented using the lesser known region-based navigation.

Some resources related to this use case in the Kindle ecosystem include:

> Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present.

It feels very vague quite frankly, because it's impossible to estimate the duration of a SMIL that's <text> only. Based on the voice and speed that I use, the duration will be very different.

> Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable

I completely understand your point here and for the sake of maximizing compatibility, I'm willing to focus strictly on two use cases right now:

  • img + text
  • and img + text + audio

With this approach, text (1 exactly) and audio (0 or 1) would keep their current content model. img would become optional (0 or 1) just like audio.

I think that this somewhat raises the bar for what we require from content creators, but given the focus on accessibility and specialized libraries, it's a trade-off that we can work with.

If this works out, we could always relax our approach in a future revision to allow img + audio or img on its own.
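
The content model settled on above (text exactly 1, audio 0 or 1, img 0 or 1) can be checked mechanically. A minimal Python sketch, assuming `<img>` lives in the SMIL namespace like `<text>` and `<audio>`; the sample file names, ids and `xywh` fragments are invented for illustration, and `par_errors` / `invalid_pars` are hypothetical helpers, not EPUBCheck code:

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Hypothetical sample: <img> in SMIL is what this PR proposes,
# it is not part of the published specification.
SAMPLE = f"""<smil xmlns="{SMIL_NS}" version="3.0">
  <body>
    <par id="ok-text-only"><text src="script.xhtml#p1"/></par>
    <par id="ok-img-text">
      <text src="script.xhtml#p2"/>
      <img src="page1.jpg#xywh=0,0,400,300"/>
    </par>
    <par id="ok-all-three">
      <text src="script.xhtml#p3"/>
      <audio src="page1.mp3" clipBegin="0s" clipEnd="4s"/>
      <img src="page1.jpg#xywh=400,0,400,300"/>
    </par>
    <par id="bad-img-only"><img src="page1.jpg"/></par>
    <par id="bad-img-audio">
      <audio src="page1.mp3"/>
      <img src="page1.jpg"/>
    </par>
  </body>
</smil>"""


def par_errors(par):
    """Return the content-model violations for one <par> element."""
    counts = {name: len(par.findall(f"{{{SMIL_NS}}}{name}"))
              for name in ("text", "audio", "img")}
    errors = []
    if counts["text"] != 1:
        errors.append("text: expected exactly 1")
    if counts["audio"] > 1:
        errors.append("audio: expected 0 or 1")
    if counts["img"] > 1:
        errors.append("img: expected 0 or 1")
    return errors


def invalid_pars(smil_source):
    """Map the id of each invalid <par> to its list of violations."""
    root = ET.fromstring(smil_source)
    result = {}
    for par in root.iter(f"{{{SMIL_NS}}}par"):
        errors = par_errors(par)
        if errors:
            result[par.get("id")] = errors
    return result
```

Relaxing the model later to allow img + audio or img on its own would only require changing the `text` check.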

@HadrienGardeur

HadrienGardeur commented Feb 6, 2026

Member Author

I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at: https://github.com/readium/guided-navigation/tree/main/examples/comics

This is a CC-licensed comic so it's a pretty good example to work with.

Here's what I have in mind:

@iherman

iherman commented Feb 6, 2026

Member

> I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at:

Alternatively, or in addition, it would be great to have this example added to the test suite. I am happy to help convert it into a bona fide test when the time comes (there are some metadata requirements).

@iherman iherman added the Spec-EPUB3 (The issue affects the core EPUB 3.X Recommendation) and Spec-ReadingSystems (The issue affects the EPUB Reading Systems 3.X Recommendation) labels Feb 6, 2026
@HadrienGardeur

Member Author

@iherman I'd like to create a full example (an entire chapter) but it could be easily shortened to a page or two for the test suite.

@HadrienGardeur

HadrienGardeur commented Feb 7, 2026

Member Author

Here's the WIP for this example: https://github.com/HadrienGardeur/accessible-epub-comics

[UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

@iherman

iherman commented Feb 9, 2026

Member

> [UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

Thanks. I will look at this at some point, but I would prefer to wait until this PR reaches consensus, i.e., gets merged, before doing this.

@HadrienGardeur

HadrienGardeur commented Feb 9, 2026

Member Author

I'm done with a first version of a full publication with <img> in SMIL.

Of course, epubcheck is unhappy with this example:

  • it doesn't like that an image in the spine has a Media Overlay
  • it doesn't like that I'm using <img> in SMIL
  • it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
  • and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>

By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

@iherman

iherman commented Feb 10, 2026

Member

> I'm done with a first version of a full publication with <img> in SMIL.
>
> Of course, epubcheck is unhappy with this example:
>
>   • it doesn't like that an image in the spine has a Media Overlay
>   • it doesn't like that I'm using <img> in SMIL
>   • it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
>   • and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>
>
> By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

I can understand the first two entries, obviously. The third sounds like something that must be specified in the spec if we go ahead with images. But the fourth entry is weird. Is it an epubcheck error?

cc @mattgarrish @rdeltour

@rdeltour

Member

> But the fourth entry is weird. Is it an epubcheck error?

I suppose it comes from Media overlay document requirements (section 9.3.2.1 in the current draft) stating

> more than one media overlay document MUST NOT reference the same EPUB content document.

@rdeltour

Member

For clarification, EPUBCheck is likely wrong here. script.xhtml is not referred from the spine so it may not be considered a content document and the statement above would not apply.

But from memory, we do not verify that a document is referenced from the spine before applying checks, so basically any XHTML document found in the container is considered an XHTML Content Document by EPUBCheck, and the constraints on content documents are applied.
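
The distinction drawn above can be illustrated with a small Python sketch. It assumes the reading that a document not referenced from the spine is not a content document for the purposes of this rule; `shared_content_documents` and the file names are hypothetical, and this is not how EPUBCheck is actually implemented:

```python
from collections import defaultdict


def shared_content_documents(overlay_refs, spine):
    """Find spine documents referenced by more than one media overlay.

    overlay_refs: dict mapping an overlay path to the set of documents
                  its <text> elements point to (fragments stripped).
    spine: set of documents referenced from the spine.
    """
    overlays_by_doc = defaultdict(set)
    for overlay, docs in overlay_refs.items():
        for doc in docs:
            overlays_by_doc[doc].add(overlay)
    # Only spine documents are subject to the "one overlay" rule here.
    return {doc: sorted(overlays)
            for doc, overlays in overlays_by_doc.items()
            if len(overlays) > 1 and doc in spine}
```

With an out-of-spine `script.xhtml` shared by several SMIL files, the function reports nothing; the same file placed in the spine would be flagged.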

@mattgarrish

Member

Ya, this is a pretty radical departure from current media overlays where the xhtml content document is the driver. You can have multiple content documents that refer to the same media overlay, but a single content document can't be referred to from multiple media overlay documents because you can only specify one in the media-overlay attribute. (It could become a real headache to figure out when media overlays are valid moving forward.)

I'm assuming the media overlays section will need a pretty radical rewrite to take focus off it being largely bound to xhtml with audio sync capabilities. I'm not even sure how syncing images and content documents works, beyond even the display in the viewport issue. It presumably makes the text content a top-level content document which will require these text documents to be in the spine, but how does that work with rolls?

I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore. You might have to pull it back out and maybe make aural rendering informative with a trimmed down explanation of how media overlays work for audio sync.

(But I'm trying to focus on getting the accessibility metadata guide wrapped up, as we need to be able to reference it from the techniques document, so I haven't had the time to keep up.)

@iherman

iherman commented Feb 10, 2026

Member

Hm. I am feeling more and more uncomfortable with the amount of change triggered by the introduction of the <img> element at this point. Honestly, I thought it would be a simple change, and I was obviously wrong (and Matt, who warned us on the call, was right...).

I wonder whether it is indeed a good idea to do this at this point in the game. I would prefer to re-discuss it on the call to be sure we are still o.k. with this.

Sorry @HadrienGardeur

cc @w3c/w3c-group-145018-members

@HadrienGardeur

HadrienGardeur commented Feb 10, 2026

Member Author

Some additional thoughts:

  • I've used a single out-of-spine HTML for the whole script because this felt like the right thing to do, but I could write a script per page instead
  • one limitation that I can think of is that, in some cases, you need to go back and forth between the document positioned on the left of the spread and the one on the right (that's an issue with our current approach for Media Overlays that extends beyond this PR)
  • we could keep the restriction of having a single text element, but this would require a textual alternative/script for such files, which might be fine
  • IMO, we should allow SMIL on images in the spine, I don't see a good reason to limit this to XHTML/SVG with the addition of <img>
  • I could drop the <seq> elements with no epub:textref but we would lose the possibility of navigating panel by panel (which is quite useful)

As you can see, I can potentially work around some of these epubcheck issues at the cost of some features.

@HadrienGardeur

HadrienGardeur commented Feb 10, 2026

Member Author

> I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore.

That's a different matter, unrelated to this PR.

With our current model, we only require <text> while <audio> is optional, which means that labeling the whole section "Aural rendering" feels like a misnomer.

If we also allow 0 or 1 <img> element, this won't change the fact that you could have a SMIL with just <par> and <text>.

@HadrienGardeur

HadrienGardeur commented Feb 10, 2026

Member Author

Working on this example, seeing the current limitations with epubcheck and having these discussions all feel like a very fruitful exercise to me.

Based on my recent comment (#2919 (comment)), I'll update my example to use a script per page instead of a global one.

I'm not changing yet how I use <seq> because it's worth discussing IMO, but if it's too much of a hassle, it's easy enough to just flatten the structure.

For the other epubcheck errors, I think that they should be amended eventually:

  • having Media Overlays on images is IMO a good thing for compatibility: RSs that can't work with images in spine will fall back to the HTML variant and won't be exposed to <img> in SMIL
  • errors related to <img> would naturally vanish if we update epubcheck to match the spec

[UPDATE]: and that's done. As expected I'm still receiving the three errors pointed out above, but I could easily get rid of the one related to <seq>.
[UPDATE 2]: Instead of flattening the structure of the SMIL, I've added IDs for each panel in the script of each page and added references to them using epub:textref. We're down to two errors in epubcheck.

@HadrienGardeur

Member Author

@iherman give me another week to continue working on this before we discuss it in a call again.

I'm done with the example for now, which means that I can go back to the PR.

@HadrienGardeur

Member Author

Just a heads up to say that we'll start implementing this feature in the Readium Swift toolkit next week:

  • we don't work with SMIL as-is, it gets converted to an internal model where we already support images and videos
  • which means that parsing <img> in SMIL is very easy for us
  • we'll use my example to support img + audio on images in spine
  • I'll most likely create a variant that drops audio references to also test our fallback to TTS using text
  • for now, we won't use image fragments, this will come later this year
  • we expect to have this ready by the end of March and available in our beta for Thorium Mobile
  • I'll most likely do a demo at an accessibility conference organized in Oslo in June
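
To illustrate the idea of converting SMIL into an internal model before playback, here is a minimal Python sketch. It is not Readium's actual model (the toolkit mentioned above is written in Swift); `SyncPoint`, `parse_overlay` and the sample file names are all hypothetical:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

SMIL_NS = "http://www.w3.org/ns/SMIL"


@dataclass
class SyncPoint:
    """One <par>, reduced to the resources it synchronizes."""
    text: Optional[str] = None
    audio: Optional[str] = None
    image: Optional[str] = None

    @property
    def needs_tts(self) -> bool:
        # Without pre-recorded audio, the text has to be spoken by TTS.
        return self.audio is None and self.text is not None


def _src(par, name):
    """Return the src of the first child with this name, or None."""
    child = par.find(f"{{{SMIL_NS}}}{name}")
    return None if child is None else child.get("src")


def parse_overlay(smil_source):
    """Flatten every <par> of a SMIL document into a SyncPoint."""
    root = ET.fromstring(smil_source)
    return [SyncPoint(_src(par, "text"), _src(par, "audio"), _src(par, "img"))
            for par in root.iter(f"{{{SMIL_NS}}}par")]


# Hypothetical overlay: the second <par> drops <audio> to exercise TTS fallback.
SAMPLE = f"""<smil xmlns="{SMIL_NS}" version="3.0">
  <body>
    <par><text src="script.xhtml#panel1"/><audio src="page1.mp3"/><img src="page1.jpg"/></par>
    <par><text src="script.xhtml#panel2"/><img src="page1.jpg"/></par>
  </body>
</smil>"""
```

Because `<img>` is read the same way as `<text>` and `<audio>`, supporting it in a parser of this shape is a one-line change.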

In terms of UX, this initial support will offer two options for users:

  • when you decide to "read", it will display the full page with audio playback in the background (either continuous or page by page)
  • or when you decide to "listen", it will provide an audiobook-like experience

For screen reader users, we might also open the script instead of images but I'm not 100% sure about this one yet.

@HadrienGardeur

HadrienGardeur commented Feb 19, 2026

Member Author

I'm back to this task again.

As expected, adding a SMIL parser that supports <img> was trivial and we've already done so in the Readium Swift toolkit. We'll even add support for <video> in case it's ever used in the future (for sign language for example).

Working with the Media Overlays section of the spec and on an example for this PR, I've identified a number of core issues that I've reported separately from this PR:

The first two are directly relevant for this PR, the last one not so much.

Looking at the rest of the work necessary in the main specification:

@HadrienGardeur

HadrienGardeur commented Feb 26, 2026

Member Author

To reflect the content model of this PR, I've created a variant of my previous example that doesn't use any <audio> element.

Both examples are still in the same repo:

I've also created a variant of my full example with a single script and SMIL.

@iherman

iherman commented Feb 26, 2026

Member

This was discussed during the pmwg meeting on 26 February 2026.


<img> in SMIL

#2883 w3c/epub-specs#2919

Hadrien: Summary: provided examples we need something like this or based on this. move fwd with "At Risk"
… working on a PR
… need full examples
… created a CC comic 8 pages full script full audio we have panels and textual / audio descriptions speech bubbles etc.
… accessible comic at this point.
… epubCheck or in Spec. the more I dig into the spec the more I found issues in MO in general. images in spine I am not worried about the rest is what is concerning the content model.
… image in SMIL, at EDRLab iOS implementation for MO so I see this also to implement this in general. I am worried, about the current spec than the PR. ref. to the issues filed.
… PR 1 text element with 0,1 audio element.
… 2 combos text or text/audio now you can have text/audio, text/image, text/audio/image so 2 new additions.
… in Comic you can have descriptions of a panel or speech bubble.
… image/audio/text so you can have pre-recorded audio.
… braille tablet could use the text. or you can use TTS using the text.
… 2 issues from epubCheck image is not authorized which is expected. in my example would work best to keep script "text" eq. as a separate file and it shouts at me cause I am using MO on an image.
… one is expected the other less so.
… maybe without images in spine but that would be weird then script in same page and try to hide it. what I have done is more accessible than trying to hide things using css
… I opened them in multiple RS, thorium Apple books, very few that support MO anyways.
… Reading Systems I am not worried about that spec.
… now I have 4 examples, image text/audio, image with text equivalent. single script in a single file, and Media overlays in a spread.
… general impression we wrote the spec with FXL in mind but we forgot about spreads or content drawn across spreads, Reflowable with MO.
… matt you were worried about this in Reading System spec, I am not worried myself.

mgarrish: I haven't gotten a chance to look at this again. sure we can put an element, but what additional authoring req. say for split screen, not sure how it will work yet. roles needed where you have things in the spine etc.
… I don't have strong opinions on your other issues. Daniel and Marisa were the ones at DAISY taking charge on this work.
… very focused on XHTML documents, and the prose needs to be reworked, along with the implications for SVGs etc.

Hadrien: timing issue unrelated to this PR. bigger issue with timing if you don't have audio what does it mean. introducing images doesn't affect timing issues.
… I haven't changed the content model for text.
… I don't think we can include duration and text and I don't think this PR makes it worse its already there.
… the spec is very open, doesn't limit fragment ID any fragment if you use any other fragment it is unclear.
… compatibility the current RS are broken, spreads are broken, desktop gets into a loop.

mgarrish: without TTS, the time to show the text is problematic. if we prerecord the TTS and have the RS do the TTS instead of the pre-recorded. We need some timing base. What happens when there is no audio? user needs to hit fwd to make the content move.
… images on their own can be problematic here.

Hadrien: in the PR you can't just have images alone.
… timing I don't see the usefulness of the duration of each SMIL file as a RS will throw away. TTS on Text this PR makes a case for it.
… comic with textual equivalent can be displayed or sent to SR or Braille Tablet, using TTS with the text element this PR is a much stronger use case.

wendyreid: all good points.
… we need to reach out to Daniel and Marisa for their take on this, as our MO experts.
… she has enhanced synchronized content. I will reach out to her.

ivan: My question / worry: are the problems you found in MO problems in the original SMIL spec or in how we took SMIL into EPUB?
… that spec is mainly used in EPUB. we don't own that SMIL spec. if there is a real problem with SMIL then we need to break the ties and implement independently.

Hadrien: the latter. there is only one case, content across the spread, where we may be limited by the SMIL syntax; all the other issues are in our spec.
… we may be able to tackle that another way, submitted an issue for that.
… there is a lot of interest in this space. this is part of our spec we haven't touched for a while. Serving the community I think doing additional work for that part of the spec will help everyone.

sueneu: I know CSS is pushing into audio, can we use that in addition to SMIL?

Hadrien: not really, that's more for TTS and is a very complex discussion.

wendyreid: we need more review on this.

Hadrien: last question, what should I do next?
… I can work on the PR to handle the remaining things or documenting things I am finding. open question.
… I wasn't expecting to open all these issues initially.

wendyreid: we will go through the issues. so document things as you run into them. we need review from Marisa on the PR / issues filed.
… and Daniel.

ivan: twice-yearly issue with the time change: 3 weeks of timezone changes.
… Europe this means 1 hour earlier. Japan 1 hour earlier as well.
… March 12 1 hour earlier, 2nd April goes back to normal.
… emails in the agenda will flag it.


@marisademeglio

Contributor

I have read the PR linked above, and I have some questions about this proposal.

  1. I see that each par will always have a text element, and optionally can have an img. How does the reading system know where to place the image?

  2. What is the image's duration? Equal to the text? If every par does not have an image, then does the image hang around until the next par with an image comes along the timeline? Or does it disappear as soon as its parent par is over?

  3. Is there anywhere to put accessibility information about the image? Is it assumed that the text of the parent par contains the accessible image description? This would be a departure from the land of HTML accessibility, and could be an interesting requirement on EPUB accessibility checkers.

I had a look at #2936 and I wonder about inserting accessibility information in a playback timeline. It's different from an accessible UX where the user chooses to dive into the description or not, and would require a lot of specialization of SMIL playback rules.

  4. How would one apply CSS to the image? I'm thinking of scaling, effects, clip-path.

  5. What does this approach do that referencing an image element in an HTML file could not?

@HadrienGardeur

Member Author

> I see that each par will always have a text element, and optionally can have an img. How does the reading system know where to place the image?

I see two main use cases:

  • when the <spine> contains images (accessible comics for example, like in my example)
  • or when the <spine> contains HTML documents

For the first one:

  • without using Media Overlays, the RS would display a page, spread or roll
  • when Media Overlays are used, there would be some kind of focus on the region documented in <img>
  • <text> would either be read aloud, displayed on the screen or ignored
  • and <audio> would be played, if present

Users would be free to use what works best for them in terms of image presentation (full page or focus on regions), audio (pre-recorded or TTS) and text (always displayed, only displayed when it's not a description).

The second use case could be useful for example in a reflowable book showing a map, where you could focus on specific parts of the map and provide narration related to this image fragment.

This means that:

  • without using Media Overlays you would see a screen-worth of content with reflowable EPUB or a page/spread in FXL
  • while using Media Overlays, this would be the typical experience most of the time with text highlighted and audio playback
  • but when you encounter an image, there would be a temporary focus on one or more regions of that image before going back to the usual MO experience

Focus on a region could be done using a variety of techniques:

  • showing the entire page with a rectangle or some overlay for that region
  • zooming on the region
  • cropping the rest of the image and only displaying the region
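
All three techniques start from the same step: turning the region referenced by `<img>` into a rectangle on the page. The following is a minimal sketch of that step, assuming regions are expressed as Media Fragment URI spatial fragments (`#xywh=`), which the PR adds to the normative references. The `Region` type and the `parse_xywh` and `zoom_factor` helpers are hypothetical names used for illustration; they do not come from the PR or from any reading system.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urldefrag


@dataclass(frozen=True)
class Region:
    """A rectangle in image pixels (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float


def parse_xywh(src: str, image_width: int, image_height: int) -> Optional[Region]:
    """Resolve a `#xywh=` fragment against the image size.

    Returns None when the URL carries no spatial fragment. This sketch
    assumes the fragment holds a single `xywh` dimension and nothing else.
    """
    _, fragment = urldefrag(src)
    if not fragment.startswith("xywh="):
        return None
    value = fragment[len("xywh="):]
    unit = "pixel"  # the unit prefix is optional and defaults to pixels
    if ":" in value:
        unit, value = value.split(":", 1)
    x, y, w, h = (float(part) for part in value.split(","))
    if unit == "percent":
        # Percentages are relative to the full image dimensions.
        return Region(x * image_width / 100, y * image_height / 100,
                      w * image_width / 100, h * image_height / 100)
    if unit == "pixel":
        return Region(x, y, w, h)
    raise ValueError(f"unsupported xywh unit: {unit}")


def zoom_factor(region: Region, viewport_width: int, viewport_height: int) -> float:
    """Largest scale at which the whole region still fits in the viewport."""
    return min(viewport_width / region.width, viewport_height / region.height)
```

With the rectangle in hand, an overlay draws its outline on the full page, a zoom applies `zoom_factor` and centers on it, and a crop discards everything outside it. Which one to use can remain a user setting.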

What is the image's duration? Equal to the text?

First of all, I think that we tend to focus a little too much on continuous playback but various RS out there offer other options as well.
In the case of Apple Books, there's also a reading mode that plays all parallels for what's currently in the view (page or spread) and then stops until the user turns the page.
In the case of Thorium Desktop, there's also a step by step reading mode that plays a single element and then waits for the user to move to the next one.

Ideally, we should consider all three because they all make sense.

To answer your question:

  • in continuous playback mode, this would be equal to the text (TTS) or audio (if present and preferred by the user)
  • in resource-based playback mode, this would work the same way but it would stop at the end of the resource
  • but in the step by step reading mode, it would display the image fragment, play the audio and/or display the text and then wait for the user (this could be handled with a tap, swipe, click on a button and/or keyboard interaction)
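
The difference between the three modes comes down to when playback pauses. Here is a rough sketch of that decision, not a description of how Apple Books or Thorium actually implement it. The `PlaybackMode`, `TimedPar` and `pause_points` names are hypothetical, and the sketch simplifies "what's currently in the view" to a single spine resource (a spread would need a set of resources instead).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PlaybackMode(Enum):
    CONTINUOUS = "continuous"      # never pause until the overlay ends
    RESOURCE = "resource"          # pause at the end of each resource
    STEP_BY_STEP = "step-by-step"  # pause after every <par>


@dataclass(frozen=True)
class TimedPar:
    """One <par>, reduced to the fields this sketch needs (hypothetical)."""
    resource: str  # spine resource the <par> belongs to, e.g. "page1.jpg"
    label: str


def should_pause_after(mode: PlaybackMode, current: TimedPar,
                       upcoming: Optional[TimedPar]) -> bool:
    """Decide whether playback waits for the user once `current` is done."""
    if upcoming is None:
        return True  # nothing left to play
    if mode is PlaybackMode.STEP_BY_STEP:
        return True
    if mode is PlaybackMode.RESOURCE:
        return upcoming.resource != current.resource
    return False


def pause_points(mode: PlaybackMode, pars: List[TimedPar]) -> List[str]:
    """Labels of the <par> elements after which playback would pause."""
    points = []
    for index, current in enumerate(pars):
        upcoming = pars[index + 1] if index + 1 < len(pars) else None
        if should_pause_after(mode, current, upcoming):
            points.append(current.label)
    return points
```

In this sketch the image fragment, text and audio of a `<par>` share the same lifetime, so only the moment of the pause changes from one mode to the next.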

If every par does not have an image, then does the image hang around until the next par with an image comes along the timeline? Or does it disappear as soon as its parent par is over?

It would go back to what is normally displayed while following the <spine> in this case:

  • for images in spine, this means a page, spread or roll.
  • for HTML, this means a screen-worth of content with reflowable EPUB or a page/spread in FXL

Is there anywhere to put accessibility information about the image? Is it assumed that the text of the parent par contains the accessible image description?

This would be based on epub:type. I've opened a separate issue (#2936) to add description to epub:type since this could be useful with or without this PR.

In my example, you can see that some <par> elements contain a description, while others contain the text of a speech bubble (balloon in epub:type).

I had a look at #2936 and I wonder about inserting accessibility information in a playback timeline. It's different from an accessible UX where the user chooses to dive into the description or not, and would require a lot of specialization of SMIL playback rules.

We definitely require more settings for SMIL playback rules in general, this one is pretty easy to handle on the RS side and IMO makes quite a bit of sense (do I need image descriptions or not?).

How would one apply CSS to the image? I'm thinking of scaling, effects, clip-path.

I haven't covered this one and I'm not a fan of CSS authored highlights anyway. I think that they can be harmful in reflowable EPUB (#2933) and the RS should let the user decide what works best for them (for example zooming might be problematic for people with a vestibular disorder but work great for others).

What does this approach do that referencing an image element in an HTML file could not?

It allows you to point to a region and also works with images in spine, which is the preferred way of authoring comics in many markets.

Most RS with specialized support for comics use native API for images rather than a webview.

It also has the benefit of keeping <text> available for text (as it should be), unlike an approach where <text> is used to reference the image itself.

It also allows specialized libraries to create a SMIL without changing the HTML, which is increasingly becoming a requirement with tools that are used post-production.

@HadrienGardeur

HadrienGardeur commented May 19, 2026

Member Author

With the on-going discussions about a recharter and a focus on Media Overlays in EPUB 3.5, I expect support for <img> in SMIL to be one of the experimental features that we'll tackle in this next revision cycle.

In previous comments, @mattgarrish said that he's not sure how this would affect the reading experience and I wanted to go back to this question to prepare for 3.5.

I'll focus specifically on comics/manga for now.

With this addition, in terms of access modes, a user could combine things in the following ways:

  • visual only (classic reading mode for comics/manga)
  • text only (useful for a screen reader)
  • audio only (audiobook-style, this is possible either with TTS using text or pre-recorded audio)
  • visual and audio
  • visual and text
  • visual, text and audio

Previously, without <img> in SMIL for EPUB Media Overlays, only the following combinations were available:

  • visual only
  • audio only (limited to pre-recorded audio)
  • visual and audio (limited to pre-recorded audio and full page for visual)

In addition to these access modes, the following options could be made available for a user:

  • automated or step-by-step playback
  • region magnification (using a zoom or by excluding the rest of the page somehow)
  • settings for region magnification (if I suffer from a vestibular disorder, it's best for me to avoid transitions/movements)
  • skippability (useful for descriptions)
  • display preferences for text
  • pre-recorded or TTS for audio

Let's illustrate what this means with an advanced use case. A dyslexic user could for example use:

  • visual, text and audio together
  • with region magnification
  • step by step playback
  • skippability on descriptions
  • pre-recorded audio
  • and Atkinson Hyperlegible with good letter/word spacing for text

This would provide the kind of user experience that we cannot offer today:

  • the playback would wait for a tap, swipe or keyboard event to move to the next element in SMIL
  • with text displayed as well, region magnification would only display the current region and display text below it
  • descriptions would not be read aloud, it would just show the visual (each panel in my example)
  • but for each speech bubble, it would play the pre-recorded audio and display the equivalent text below the region using my display preferences for text
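
To make the use case above concrete, here is a small sketch of how a reading system might turn such preferences into a per-`<par>` rendering plan. Everything in it is an assumption made for illustration: the `Preferences` fields, the `ParInfo` type and the `plan` function are hypothetical and not part of the PR. Skippability is modeled the way the use case describes it: a skipped `<par>` still shows its region but is neither displayed as text nor read aloud.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass(frozen=True)
class Preferences:
    """User settings (hypothetical field names)."""
    show_region: bool = True
    show_text: bool = True
    play_audio: bool = True
    prefer_recorded_audio: bool = True
    skip_types: FrozenSet[str] = frozenset()  # epub:type values to skip


@dataclass(frozen=True)
class ParInfo:
    """The parts of a <par> that matter for this decision (hypothetical)."""
    epub_type: str
    has_audio: bool  # True when the <par> contains an <audio> element


def plan(par: ParInfo, prefs: Preferences) -> Dict[str, Union[bool, str]]:
    """Return what to show and play for a single <par>."""
    skipped = par.epub_type in prefs.skip_types
    if not prefs.play_audio or skipped:
        audio = "none"
    elif par.has_audio and prefs.prefer_recorded_audio:
        audio = "recorded"
    else:
        audio = "tts"  # fall back to synthesizing speech from <text>
    return {
        "region": prefs.show_region,
        "text": prefs.show_text and not skipped,
        "audio": audio,
    }
```

For the dyslexic reader described above, `Preferences(skip_types=frozenset({"description"}))` would show only the panel for descriptions, and show the panel, the text and the pre-recorded audio for each speech bubble.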

With this ability to combine access modes and the reading options that I've described above, I believe that we could offer very compelling and flexible user experiences. This goes beyond anything possible in EPUB or DAISY today.

I expect that we'll be able to illustrate some of these new capabilities before the end of the year and iterate from there to provide more examples and implementations throughout 2027.

@mattgarrish

Member

My concern is still with the playback complexity that this pushes onto users. Instead of getting 'read aloud' playback as they've become accustomed to, they have to know all the possible permutations of manual/automatic playback that could be dumped on them, as well as how to configure them all.

I'm also concerned that for all the complexity we may not be budging the needle on accessibility. This leads to a single playback mode that is sort of accessible but may not appeal to all users; it's not universally accessible content. Just the idea of smil driving playback is likely a regression that leaves synthetic speech playback offering a different inaccessible experience (e.g., cross-spread).

But have any major publishers expressed interest in producing this kind of content? The technical details of making it work are only one consideration. I get the sinking feeling we're going to end up on an IDPF-like fishing expedition to create things that will never lure anyone in.

And then there's my concern that the deeper we dig into smil to try and make these experiences work the more we diverge from where the web is going. We joined with W3C to try and put an end to epub-only features. If webvtt is the future, for example, whatever we do should consider future alignment with it.

@HadrienGardeur

HadrienGardeur commented May 19, 2026

Member Author

My concern is still with the playback complexity that this pushes onto users. Instead of getting 'read aloud' playback as they've become accustomed to, they have to know all the possible permutations of manual/automatic playback that could be dumped on them, as well as how to configure them all.

I've listed all the options for the sake of providing a complete picture of what's possible, but of course RS will handle this in a way that's manageable for users (it's their job after all).

Many RS already provide separate reading modes for:

  • full pages/spreads view
  • and panel/guided view (which we can't truly do in EPUB without this addition or using region-based navigation)

It's not terribly difficult to add a contextual "read aloud" option to these views along with a few options. This would already cover a good range of what's possible, with very little added complexity for end users (a single action).

I have a pretty clear path forward for implementing this step by step in Thorium Reader for example.

I'm also concerned that for all the complexity we may not be budging the needle on accessibility. This leads to a single playback mode that is sort of accessible but may not appeal to all users; it's not universally accessible content.

Given the complete lack of support for accessible comics right now, I have a hard time hearing this argument.

I'd love to hear counter proposals that strike the same balance:

  • minimal addition to the spec
  • and unlock new ways to create and consume accessible comics

Just the idea of smil driving playback is likely a regression that leaves synthetic speech playback offering a different inaccessible experience (e.g., cross-spread).

How would that be a regression compared to what we have today when we have absolutely nothing viable for comics? Also, with what I'm proposing, cross-spread TTS is definitely possible (if you mean going back and forth between left and right pages).

But have any major publishers expressed interest in producing this kind of content? The technical details of making it work are only one consideration. I get the sinking feeling we're going to end up on an IDPF-like fishing expedition to create things that will never lure anyone in.

Specialized libraries have expressed their interest, not trade publishers. They have zero options right now aside from rolling out less capable alternatives to what I've described above (they usually have to create their own format or hack Media Overlays in much more intrusive ways).

Producing accessible comics remains more complex and costly than novels, but it doesn't mean that we shouldn't offer an option for them. Once we do so, I do expect a number of publishers to eventually jump on board (the same way that some publishers started producing natively accessible EPUB files before it became a legal requirement), but this likely won't be a requirement under EAA (it will fall under both "undue burden" and "fundamental alteration").

I get the sinking feeling we're going to end up on an IDPF-like fishing expedition to create things that will never lure anyone in.

Unlike the IDPF, we have pretty clear rules for implementations. With an EPUB 3.5 recharter and a timeline that extends to early 2028, we'll have enough time to see if there's truly enough interest in this feature before we put the final stamp of approval on it.

I know that several organizations are already adopting this feature or planning to test it out, which makes me quite confident about the outcome.

@mattgarrish

Member

Framing the options as accept your proposal or accept inaccessible comics isn't going to sway me. My concerns are still with the knock-on effects that dramatically changing media overlays for all content types will have.

But the sense I keep getting here is that your focus is on the problems images in the spine create, specifically with the introduction of rolls. To try and break the impasse I think we're at, is it reasonable to propose that this new content model be a profile of media overlays that only applies to rolls? That would at least keep media overlays as everyone knows them untouched for the cases they've already been applied to.

It also means that when we have to try and defend these new "accessible comics" to the accessibility people at W3C, which we will have to do, we can explain that this is a solution to a specific problem and not a claim that we've now solved comic/fxl accessibility by breathing life into a mostly forgotten technology.

It also limits the "damage", so to speak, of not having a pathway forward to web friendlier ways of doing things.

@HadrienGardeur

Member Author

Framing the options as accept your proposal or accept inaccessible comics isn't going to sway me.

My latest comment (#2919 (comment)) is not about a technical solution, it's entirely about UX.

This could be powered by anything, this is not about SMIL vs WebVTT vs whatever (I care very little about how this is serialized). So that's definitely not what I've said.

If you're unhappy with the proposed UX and/or technical solution, that's fine, but please provide an alternate solution to make this discussion a bit more constructive.

But the sense I keep getting here is that your focus is on the problems images in the spine create, specifically with the introduction of rolls.

Not true either.

The problem is that we do not have a way to talk about fragments of an image, and IMO we need one to create an accessible UX for comics or complex images in general.

To try and break the impasse I think we're at, is it reasonable to propose that this new content model be a profile of media overlays that only applies to rolls?

I don't think that's a good idea, it would be extremely useful for paginated comics (pre-paginated) as well and even for the kind of images that we've been discussing in the light novel issue (#2985).

I'd be happy to test if this truly breaks anything though, I've already tried it in a few RS that simply ignored <img> when present but this could be expanded to cover a wider range of RS.

@mattgarrish

Member

Again, it's not the theory of how the reading progression might work that concerns me. It's the potential havoc it's going to play with users by overloading media overlays to achieve it.

We're giving users a magic button. How do they know that the reading system is now waiting for them to swipe forward to move to a next image if region navigation is all that's being provided? What happens if a screen reader user drops out of img/text/audio playback to use tts and the text/audio have disappeared? They have to figure out that the text is hidden behind a fallback that they don't have access to, or that they didn't know they had to find a configuration option somewhere to force the text to be the default spine presentation? What happens if someone gets the idea to mix region-based navigation of an image (no audio) into what is otherwise audio/text sync playback? Does the user realize the audio hasn't broken but the reading system is waiting on their input?

We shouldn't be pushing the complexity of playback onto users in order to make life a little simpler for authoring.

@marisademeglio

marisademeglio commented May 20, 2026

Contributor

How would a publication like what is being described here be accessed by a screen reader user? SMIL is not parsed by screen readers, to the best of my knowledge and imagination.

Do very many reading systems parse SMIL for their TTS-based Read Aloud feature?

Is it possible to produce this type of publication in a way that could meet EPUB Accessibility requirements? If it is possible and just not done, then that's the publisher's problem. But if there is not a way to do it, then that's our problem.

I feel like these issues are ultimately FXL problems, not something we will solve without addressing it there.

@HadrienGardeur

Member Author

Again, it's not the theory of how the reading progression might work that concerns me. It's the potential havoc it's going to play with users by overloading media overlays to achieve it.

"Wreaking havoc" feels like an overstatement for a feature that has already been available in both Apple Books and Thorium Reader for a while now:

  • in Apple Books, you can turn off automatic page turns in FXL with Media Overlays, which will pause the playback until the user goes forward or backward in the reading progression
  • in Thorium Reader on desktop, you can "disable continuous play" which will automatically pause the playback after each element (you can hit the next button or keyboard shortcut to resume the playback)

These are useful features that users have been asking for and they haven't wreaked havoc at all, quite the contrary.

I agree that the "default experience" should probably remain automatic playback, but it's perfectly fine to have a setting giving control back to the user as well.

What happens if a screen reader user drops out of img/text/audio playback to use tts and the text/audio have disappeared? They have to figure out that the text is hidden behind a fallback that they don't have access to, or that they didn't know they had to find a configuration option somewhere to force the text to be the default spine presentation? What happens if someone gets the idea to mix region-based navigation of an image (no audio) into what is otherwise audio/text sync playback? Does the user realize the audio hasn't broken but the reading system is waiting on their input?

These are RS concerns and it's up to them to make this as easy as possible for every user.

These questions are not really new either, even with today's implementation of Media Overlays, we need to make sure that users can fluidly move between various reading modes:

  • audiobook-like experience (either powered by Media Overlays or TTS)
  • text only
  • text and audio together

For example a user might start reading, push the so-called "magic button" that gives them text and audio together, then put their phone in their pocket for an audio-only experience, step into their car and continue to listen from their car's entertainment system.
I'm not making this complex on purpose, that's exactly what users expect from a modern RS.

For example, in the upcoming 3.5 version of Thorium Reader for desktop, we're bringing a lot of features that were previously available for TTS only to Media Overlays as well.

I'm not sure how this is relevant for the overall discussion here, that's not the kind of guidance that we provide in our RS specs.

@mattgarrish

Member

Is it possible to produce this type of publication in a way that could meet EPUB Accessibility requirements?

My guess is they wouldn't. It sounds more like the current classification of overlay-dependent publications as optimized publications.

And we know that reading systems generally don't process fallbacks, and not in a way that allows users to pick a preference. That immediately affects anyone who wants to access synthetic speech outside of overlay playback.

There's also a known hole in the fallback process with fallbacks presumably having to be layout-consistent with the primary spine item because you can't specify different properties (i.e., if an image in spine is not supported in a roll sensibly the reading system should assume the fallback is also an fxl document it can use in place). That doubly makes putting descriptions as fallbacks problematic - how to get at them and that they could be rendered the wrong way potentially cutting off visual access to the text.

But fallbacks generally are ill-defined. So what happens in the roll case if a publisher only puts an image in spine with the description as a fallback file? Does that mean that anyone whose reading system doesn't support images in spine can only read the descriptions of the images, even though they may not have visual issues? Or are we now moving to fallbacks of fallbacks, and how likely is that to ever work in general?

Or what if they create different description files for each region that the user will reach. Is there a chain of "fallbacks" that are in fact not fallbacks but all needed by anyone not using media overlays playback?

These are the kinds of bugs with epub that will limit the usability of these books to media overlay playback mode only in reading systems that support the change. If the fallback problems don't get addressed and widely implemented, then presumably you're out of luck, and that's not a good way to introduce a new accessibility feature.

@HadrienGardeur

HadrienGardeur commented May 21, 2026

Member Author

I feel like these issues are ultimately FXL problems, not something we will solve without addressing it there.
Is it possible to produce this type of publication in a way that could meet EPUB Accessibility requirements? If it is possible and just not done, then that's the publisher's problem. But if there is not a way to do it, then that's our problem.

Right now, there isn't a way to produce FXL files that can comply with EAA, and I agree with you, that makes it our problem.

One of the biggest problems with FXL in general is tied to text. The way that text is positioned in FXL makes it impossible for the user to change the font size, use a different font, tweak spacing or use different colors/themes.

Things are even worse in comics because text is very rarely real text (which would make this textOnVisual) and often uses handwritten text that can be even harder to read for a lot of users.

Comics can also be very rich in what they contain with a lot of visual information and conventions that need to be conveyed to the user in textual form to fully understand what's going on.

Let's use an example based on what I've created for this PR.

Here's the first page.

[Image: Page 1 in full]

Let's use the third panel specifically.

[Image: Page 1, third panel]

To understand this panel, we need to:

  • identify the fragment of the page that corresponds to the panel
  • describe it
  • provide a transcription of each speech bubble that indicates which character is speaking and emotions tied to their expression
  • and optionally provide an audio equivalent of this

This is what the proposed SMIL looks like:

<seq epub:type="panel" epub:textref="script.xhtml#page1-panel3">
  <par epub:type="description">
    <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
    <text src="script.xhtml#page1-panel3-description"/>
    <audio src="audio/page1-panel3-description.mp3"/>
  </par>
  <par epub:type="balloon">
    <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
    <text src="script.xhtml#page1-panel3-bubble1"/>
    <audio src="audio/page1-panel3-bubble1.mp3"/>
  </par>
  <par epub:type="balloon">
    <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
    <text src="script.xhtml#page1-panel3-bubble2"/>
    <audio src="audio/page1-panel3-bubble2.mp3"/>
  </par>
</seq>

Here's the corresponding text in HTML:

<section id="page1-panel3" epub:type="panel">
  <p id="page1-panel3-description">Pepper frowns and looks angrily over her shoulder.</p>
  <p id="page1-panel3-bubble1">Pepper shouts: "NO! I'M LEAVING!!"</p>
  <p id="page1-panel3-bubble2">Pepper says: "You don't teach real witchcraft! I'm going - to the witches of Ah!"</p>
</section>

Compared to what we can do today, this solves a lot of issues:

  • we can identify a region of an image and say what it actually is (a panel)
  • this also allows us to provide navigation and/or skippability (go to next panel, go to next page)
  • we have a textual description of the panel, properly identified as such (allowing skippability)
  • we have real text for what would otherwise be text on visual for speech bubbles

This brings us much closer to what we need for an accessible experience, with multiple access modes (visual, text and audio), proper navigation (page and panel level), potential skippability and real text that we can leverage in many different ways.
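
As a rough illustration of how little a reading system needs to do to get at this information, the sketch below reads a reduced copy of the proposed SMIL with Python's standard `xml.etree.ElementTree`. The wrapping `<smil>` and `<body>` elements and the namespace declarations are added here so that the fragment parses on its own. The `read_pars` and `without_descriptions` helpers are hypothetical names, and `<img>` is still only a proposal at this stage.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

SMIL_NS = "http://www.w3.org/ns/SMIL"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Reduced copy of the example above, wrapped so that it is well-formed.
SAMPLE = f"""
<smil xmlns="{SMIL_NS}" xmlns:epub="{EPUB_NS}" version="3.0">
  <body>
    <seq epub:type="panel" epub:textref="script.xhtml#page1-panel3">
      <par epub:type="description">
        <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
        <text src="script.xhtml#page1-panel3-description"/>
        <audio src="audio/page1-panel3-description.mp3"/>
      </par>
      <par epub:type="balloon">
        <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
        <text src="script.xhtml#page1-panel3-bubble1"/>
        <audio src="audio/page1-panel3-bubble1.mp3"/>
      </par>
    </seq>
  </body>
</smil>
"""


def child_src(par: ET.Element, tag: str) -> Optional[str]:
    """Return the src attribute of the first SMIL child named `tag`, if any."""
    child = par.find(f"{{{SMIL_NS}}}{tag}")
    return child.get("src") if child is not None else None


def read_pars(smil_source: str) -> List[Dict[str, Optional[str]]]:
    """Flatten every <par> into a dict, in document (reading) order."""
    root = ET.fromstring(smil_source.strip())
    return [
        {
            "type": par.get(f"{{{EPUB_NS}}}type"),
            "img": child_src(par, "img"),
            "text": child_src(par, "text"),
            "audio": child_src(par, "audio"),
        }
        for par in root.iter(f"{{{SMIL_NS}}}par")
    ]


def without_descriptions(pars: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Skippability: drop every <par> whose epub:type includes 'description'."""
    return [p for p in pars if "description" not in (p["type"] or "").split()]
```

The same flattened list can feed panel-level navigation (via the enclosing `<seq>`), a TTS queue built from the `text` references, or a screen reader view, all without touching the image itself.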

Even better: you don't really need to change anything in these images to make this work, it's a progressive enhancement that you can add next to the image, which is very important when you consider how adaptations are handled by specialized libraries.

How would a publication like what is being described here be accessed by a screen reader user? SMIL is not parsed by screen readers, to the best of my knowledge and imagination.

Reading systems can expose textual content to screen readers in various ways using SMIL as an input.

For example:

As an alternative, screen reader users could also trigger the Read Aloud feature if they rely on audio for their screen reader (rather than braille for instance).

Do very many reading systems parse SMIL for their TTS-based Read Aloud feature?

Not currently, because there isn't a good reason to do so with our current content model which only allows:

  • <text> on its own
  • or <text> plus <audio>

This would be a much more compelling reason to use the content of the SMIL for a TTS-based Read Aloud feature.

@HadrienGardeur

HadrienGardeur commented Jun 3, 2026

Member Author

Ahead of the meeting in Oslo, I've been authorized by Minori IKEDA to share an early implementation of the accessible version of Pepper & Carrot that I've created for this PR.

This is based on the BlinkGTK Project from Fuse Network, Inc.

Here's a video: https://drive.google.com/file/d/1OessKnLi45qZGwvTQ1jaK-CJ0RXQz2xP/view?usp=sharing

Based on the criteria that I've documented in #2919 (comment) this implementation ticks the following boxes:

  • visual and auditory
  • using pre-recorded audio
  • no skippability on descriptions
  • continuous playback
  • region magnification using zoom

I've already exchanged a few messages with him and he confirmed that implementing step by step playback and skippability would be very straightforward with this approach.

@m-abs

m-abs commented Jun 8, 2026

Ahead of the meeting in Oslo, I've been authorized by Minori IKEDA to share an early implementation of the accessible version of Pepper & Carrot that I've created for this PR.

This is great, it is very similar to what I have been thinking for the future of Nota-service's* comics.

Since 2013 we have provided narrated comics mainly for our dyslexic users through our own custom addition to DAISY 2.02**). This utilises SMIL 1.0 and some creative HTML+CSS+JavaScript in our own player.

We are a library, not a publisher, so we process existing comics by scanning the pages, manually marking up the panels and the reading order, and having humans narrate them. We don’t add descriptions of the panels, which makes them less accessible for our visually impaired users.

We are interested in a standard format to distribute comics instead of our custom format and have the ability to provide optional panel descriptions, background sounds etc. to make them truly accessible.
It would also be great to make markup for TTS narration of our comics. This would make it possible to provide weekly comics magazines to our younger users.

*) Royal Danish Library's Nota Service, we provide reading materials for people with visual and reading disabilities.
**) Free example comic here: https://nota.dk/bibliotek/bog/n%C3%A5r-tegneserien-bliver-digital#audio

@HadrienGardeur

Member Author

Here's a new video shared by Minori IKEDA that implements a number of the features that I've described in #2919 (comment) : https://drive.google.com/file/d/195ei6BWz3fWlE5aQ0sNz61h5OeyyzRLr/view?usp=sharing

@HadrienGardeur

Member Author

It would also be great to make markup for TTS narration of our comics. This would make it possible to provide weekly comics magazines to our younger users.

This is fully supported by this proposal. I've created two examples in the dedicated repo:

  • one with visual and text, where text could be used for TTS for example
  • and another one with visual, text and audio

Currently the text always contextualizes who's speaking ("Pepper says", "Cayenne shouts" etc.) which wouldn't make it ideal to generate the kind of multi-voice rendition that I've used for audio, but this is something that we could explore as well.

To handle multi-voice rendition properly, you need to identify the person speaking (consistently, which could prove tricky) and also provide a description that could be used to prompt a modern TTS model and generate a voice.
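
As a first experiment, the speaker could be recovered from the text as it is written today. The sketch below is a naive heuristic, assuming every speech bubble follows the `Name verb: "utterance"` pattern seen in the example and that the verbs are limited to "says" and "shouts". The `SpokenLine` type and the `split_attribution` function are hypothetical names, and dedicated markup would be far more reliable than pattern matching.

```python
import re
from typing import NamedTuple, Optional


class SpokenLine(NamedTuple):
    speaker: Optional[str]  # None when no attribution was found
    verb: Optional[str]
    utterance: str


# Assumed pattern: <speaker> <verb>: "<utterance>"
ATTRIBUTION = re.compile(
    r'^(?P<speaker>\w[\w\s]*?)\s+(?P<verb>says|shouts):\s*"(?P<utterance>.*)"$'
)


def split_attribution(text: str) -> SpokenLine:
    """Separate the attribution from the words actually spoken."""
    cleaned = text.strip()
    match = ATTRIBUTION.match(cleaned)
    if match is None:
        # Descriptions and captions carry no attribution; keep them whole.
        return SpokenLine(None, None, cleaned)
    return SpokenLine(match["speaker"], match["verb"], match["utterance"])
```

A TTS pipeline could then pick a voice per `speaker` and read only the `utterance`, while descriptions keep a neutral narrator voice.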

You also mention background sounds @m-abs, which is another thing worth exploring. Sound effects could be handled linearly during the normal reading progression, but background ones (let's say crickets in the summer or the sound of the waves crashing on the shore) would need to be played in parallel with panel descriptions, captions and speech bubbles.


Labels

Spec-EPUB3 The issue affects the core EPUB 3.X Recommendation Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.X Recommendation Topic-MediaOverlays

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Support for <img> in SMIL

6 participants