Support for <img> in SMIL #2919
| <ul class="nomark"> | ||
| <li> | ||
| <p> [^text^] <code>[exactly 1]</code> | ||
| </p> | ||
| </li> | ||
| <li> | ||
| <p> [^audio^] <code>[0 or 1]</code> | ||
| </p> | ||
| </li> | ||
| </ul> |
The new content model for par needs to be specified. Is img + audio a separate combination, since relying on tts for images likely shouldn't be an option, or is img optionally allowed and text remains required so all three elements can be used together? (But is showing both an image and a text fragment realistic if the viewport is occupied by the fxl image?)
I guess the latter would be a new case of img and audio required and text optional, not img as an optional element in the current model. Probably doesn't make sense to always require text with img.
(Apologies if you're already working on this, but I assumed you were moving on to the RS aspects by opening the draft.)
No I'm not done with anything at this point, I just prefer to open a draft PR as early as possible.
For me, all combinations are valid:
- img+text (textual alternatives for regions, for example the description of a panel)
- img+audio (audio narration for regions)
- img+audio+text (this would allow someone to either listen to the pre-recorded audio or use, for example, a Braille tablet by consuming the textual content)
Even img on its own could be a valid use case and result in a panel-by-panel navigation for example in comics.
I didn't go through the section that you highlighted yet, but IMO text, img or audio would all become 0 or 1 with at least one of them present.
I don't believe that's how smil expects it to work by default, though, if I understand you correctly (that you pick the applicable format to synchronize). I'll preface by saying I'm not the expert on this, but my understanding is that if you specify all three in a single par then all three are expected to be synchronized.
This is definitely meant to synchronize all three media together.
You can open the following Google Slides in presenter mode to see a demo of what this could feel like: https://docs.google.com/presentation/d/1LGHRIN_vHl-H-bgXsHkhqrL0owMCy8qx854x9YU3d8w/edit?usp=sharing
That said, even with Media Overlays today you're free to use what you want:
- just text
- just audio
- or text and audio together
Multiple apps already offer the ability to consume EPUB with Media Overlays like an audiobook with just a player interface on screen.
In the specific case of comics and highly illustrated content, the ideal scenario would be to customize things to your needs:
- for example a dyslexic user could enable audio on captions and speech bubbles (either using <audio> or with TTS on the content of <text>) but skip descriptions using skippability (this would need a specific role that could be identified) but display text below the image fragment (this is what my example in Google Slides illustrates)
- a blind user could go full audio (once again using either <audio>, with TTS on the content of <text> or using a textual view with a screen reader) without skipping anything at all
- a user reading on a small screen could just use these image fragments to read more easily on their device and turn audio on just for speech bubbles
This is potentially problematic as it means you could validly have only an image listed, but what happens then?
That's the last use case that I've described above, this would give you region by region navigation.
With text, the reading system is supposed to tts the content before moving to the next par, but if someone only lists an image does it load and unload instantaneously?
Frankly, there's more of a use case for <img> or <audio> on their own than having just <text>.
What's the point of a SMIL with just <text> when you can just use TTS? Skippability and escapability can be achieved without SMIL; the only use case I can think of is to guide you through places in the publication.
This is definitely meant to synchronize all three media together.
Okay, I'll have to see the new model for rendering the content before I comment any more on this. Having text content synced with a roll image, or an image placed into reflowable text, seems complex to spec out.
That said, even with Media Overlays today you're free to use what you want:
I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.
What's the point of a SMIL with just <text> when you can just use TTS?
The only advantage is that you don't have to change playback modes. If the body is professionally narrated you could sync the backmatter for TTS without having to prerecord it.
But that's not the issue. There's still a timing sequence for text if you push the rendering out to TTS. The reading system will present/highlight the text for as long as it takes to render it as speech.
If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?
It's a little weird to use a synchronization markup language to not synchronize anything. An audio-only par element is equally strange from a synchronization standpoint, but at least the timing of the clip gives it a duration to play.
But, these are just my immediate concerns. I can wait until the draft is in a more complete state before commenting any more so we don't add a lot of noise to the pull request.
I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.
I was mostly talking about the UX from a RS perspective, of course the current spec requires text.
Our current spec is really written in a way where we assume that:
- the main way this will be handled is by displaying text on the screen with audio playback in the background
- and the use of authored CSS is very much skewed towards FXL
In a reflowable EPUB where users are free to select different themes, using authored CSS for highlighting is a potential accessibility hazard, since you could end up with major contrast issues.
In practice, a user could just listen to an EPUB with Media Overlays without displaying anything on a screen, even if the SMIL includes text and/or img. This would be indistinguishable from an audiobook from a UX perspective.
If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?
I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud). That's also why the dur attribute exists in SMIL but we don't have it in EPUB.
From a UX perspective, it's important to keep in mind that even with <text> and <audio> the playback can either be:
- continuous, where you go through the entire publication
- per page or per spread, where the current page/spread is read and the RS waits for the user to switch to another one before playback continues
- or handled element by element
With this last one, you have something that makes perfect sense for img on its own. For example:
- whole page displayed
- then just the first panel
- then the second panel
- then a zoom into part of the second panel to showcase a character and a bubble
- etc.
Once you sync text or audio to each of these image fragments, then playback can also become continuous or based on page/spread.
I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud)
Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present. TTS provides the duration for text content. We took out the section on embedded media last revision and advise people that referencing that kind of content from text will have unpredictable results, but text was never meant to link to content for which no duration could be established.
If you go back to 3.2, before we took that section out, the embedded media that text referred to had to have an audio component that could be played back:
When a text element references embedded media that contains audio, the audio sibling element is OPTIONAL.
Referring to images was always problematic because the text content to TTS wasn't as straightforward as getting the text content of a typical html element, but there was still the possibility of using alt.
That's why having img reference images outside of an xhtml wrapper as the only element of a par contradicts all expectations we've ever had for media overlays to synchronize content with audio.
Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable:
- img+text = duration through tts.
- img+audio = duration through audio playback.
- img+text+audio = duration through audio playback
But img on its own has no duration and no audio, so why do we even need it? It's like region-based nav but if you take away playback control from the user and expect the reading system to meaningfully automate it. If you drop that one case, the internal conflicts with what we have are greatly reduced.
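To make the duration argument above concrete, here is a minimal Python sketch of how a reading system might decide what drives the timing of a `par`. The function name `duration_source` and the idea of passing the child element names as a set are assumptions made for illustration; nothing like this is defined in the spec or in this PR.

```python
def duration_source(children):
    """Return what establishes the duration of a par.

    `children` is the set of element names present in the par,
    for example {"img", "text"}. Hypothetical helper, not spec text.
    """
    if "audio" in children:
        return "audio playback"  # the clip length gives the duration
    if "text" in children:
        return "tts"  # synthetic speech gives the duration
    return None  # an image alone carries no timing information


for combo in ({"img", "text"}, {"img", "audio"}, {"img", "text", "audio"}, {"img"}):
    print(sorted(combo), "->", duration_source(combo))
```

Only the `img`-alone case returns `None`, which is the case where playback would have to wait for user input or fall back to some reading-system default.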
I think that there's a use case for <img> on its own, but as you've correctly pointed out it can also be implemented using the lesser known region-based navigation.
Some resources related to this use case in the Kindle ecosystem include:
- https://kdp.amazon.com/en_US/help/topic/G9GSTY4LTRT39D4Z
- https://kdp.amazon.com/en_US/help/topic/GJMRD9F78MS9F43R
Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present.
It feels very vague quite frankly, because it's impossible to estimate the duration of a SMIL that's <text> only. Based on the voice and speed that I use, the duration will be very different.
Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable
I completely understand your point here and for the sake of maximizing compatibility, I'm willing to focus strictly on two use cases right now:
- img+text
- and img+text+audio
With this approach, text (1 exactly) and audio (0 or 1) would keep their current content model. img would become optional (0 or 1) just like audio.
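As a rough illustration of the content model described here (text exactly 1, audio 0 or 1, img 0 or 1), the following Python sketch checks a single `par` element. The `PAR_MODEL` table and the `check_par` and `parse_par` helpers are hypothetical names used only for this example; this is not how EPUBCheck implements the rule.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Allowed (min, max) occurrences of each child of <par> under the proposal
# in this comment. Illustrative only, not the normative content model.
PAR_MODEL = {"text": (1, 1), "audio": (0, 1), "img": (0, 1)}


def check_par(par):
    """Return a list of content-model problems for one <par> element."""
    counts = {name: 0 for name in PAR_MODEL}
    problems = []
    for child in par:
        local = child.tag.split("}")[-1]  # strip the "{namespace}" prefix
        if local in counts:
            counts[local] += 1
        else:
            problems.append(f"unexpected element <{local}>")
    for name, (low, high) in PAR_MODEL.items():
        if not low <= counts[name] <= high:
            problems.append(
                f"<{name}> occurs {counts[name]} times, expected {low}..{high}"
            )
    return problems


def parse_par(inner):
    """Wrap child markup in a namespaced <par> so it can be parsed alone."""
    return ET.fromstring(f'<par xmlns="{SMIL_NS}">{inner}</par>')


ok = parse_par('<img src="page1.jpg"/><text src="script.xhtml#p1"/>')
bad = parse_par('<img src="page1.jpg"/><audio src="a.mp3"/>')
print(check_par(ok))
print(check_par(bad))
```

Relaxing the model later (for example to allow img + audio) would only mean changing the `text` entry to `(0, 1)` and adding an "at least one of" check.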
I think that this somewhat raises the bar for what we require from content creators, but given the focus on accessibility and specialized libraries, it's a trade-off that we can work with.
If this works out, we could always relax our approach in a future revision to allow img + audio or img on its own.
|
I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at: https://github.com/readium/guided-navigation/tree/main/examples/comics This is a CC-licensed comic so it's a pretty good example to work with. Here's what I have in mind:
|
Alternatively, or in addition to, it would be great to have this example added to the test suite. I am happy to help to convert it into a bona fide test when the time comes (there are some metadata requirements). |
|
@iherman I'd like to create a full example (an entire chapter) but it could be easily shortened to a page or two for the test suite. |
|
Here's the WIP for this example: https://github.com/HadrienGardeur/accessible-epub-comics
[UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now. |
Thanks. I will look at this at some point, but I would prefer to wait until this PR reaches consensus, i.e., gets merged, before doing this. |
|
I'm done with a first version of a full publication with Of course, epubcheck is unhappy with this example:
By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML. |
I can understand the first two entries, obviously. The third sounds like something that must be specified in the spec if we go ahead with images. But the fourth entry is weird. Is it an epubcheck error? |
I suppose it comes from Media overlay document requirements (section 9.3.2.1 in the current draft) stating
|
|
For clarification, EPUBCheck is likely wrong here. But from memory, we do not verify that a document is referenced from the spine before applying checks, so basically any XHTML document found in the container is considered an XHTML Content Document by EPUBCheck, and the constraints of content documents are applied. |
|
Ya, this is a pretty radical departure from current media overlays where the xhtml content document is the driver. You can have multiple content documents that refer to the same media overlay, but a single content document can't be referred to from multiple media overlay documents because you can only specify one in the media-overlay attribute. (It could become a real headache to figure out when media overlays are valid moving forward.) I'm assuming the media overlays section will need a pretty radical rewrite to take focus off it being largely bound to xhtml with audio sync capabilities. I'm not even sure how syncing images and content documents works, beyond even the display in the viewport issue. It presumably makes the text content a top-level content document which will require these text documents to be in the spine, but how does that work with rolls? I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore. You might have to pull it back out and maybe make aural rendering informative with a trimmed down explanation of how media overlays work for audio sync. (But I'm trying to focus on getting the accessibility metadata guide wrapped up, as we need to be able to reference it from the techniques document, so I haven't had the time to keep up.) |
|
Hm. I am feeling more and more uncomfortable with amount of change triggered by the introduction of the I wonder whether it is indeed a good idea to do this at this point in the game. I would prefer to re-discuss it on the call to be sure we are still o.k. with this. Sorry @HadrienGardeur cc @w3c/w3c-group-145018-members |
|
Some additional thoughts:
As you can see, I can potentially work around some of these epubcheck issues at the cost of some features. |
That's a different matter, unrelated to this PR. With our current model, we only require If we also require 0 or 1 |
|
Working on this example, seeing the current limitations with epubcheck and having these discussions all feel like a very fruitful exercise to me. Based on my recent comment (#2919 (comment)), I'll update my example to use a script per page instead of a global one.
For the other epubcheck errors, I think that they should be amended eventually:
[UPDATE]: and that's done. As expected I'm still receiving the three errors pointed out above, but I could easily get rid of the one related to |
|
@iherman give me another week to continue working on this before we discuss it in a call again. I'm done with the example for now, which means that I can go back to the PR. |
|
Just a heads up to say that we'll start implementing this feature in the Readium Swift toolkit next week:
In terms of UX, this initial support will offer two options for users:
For screen reader users, we might also open the script instead of images but I'm not 100% sure about this one yet. |
|
I'm back to this task again. As expected, adding a SMIL parser that supports Working with the Media Overlays section of the spec and on an example for this PR, I've identified a number of core issues that I've reported separately from this PR:
The first two are directly relevant for this PR, the last one not so much. Looking at the rest of the work necessary in the main specification:
|
|
To reflect the content model of this PR, I've created a variant of my previous example that doesn't use any Both examples are still in the same repo: I've also created a variant of my full example with a single script and SMIL. |
|
This was discussed during the pmwg meeting on 26 February 2026.
<img> in SMIL
Hadrien: Summary: provided examples we need something like this or based on this. move fwd with "At Risk"
mgarrish: I haven't gotten a chance to look at this again. sure we can put an element, but what additional authoring req. say for split screen, not sure how it will work yet. roles needed where you have things in the spine etc.
Hadrien: timing issue unrelated to this PR. bigger issue with timing if you don't have audio what does it mean. introducing images doesn't affect timing issues.
mgarrish: without TTS, time to show the text is problematic if there is no TTS. if we prerecord the TTS and have the RS do the TTS instead of the pre-recorded. We need some timing base. What happens when there is no audio. user needs to hit fwd to make the content move.
Hadrien: in the PR you can't just have images alone.
wendyreid: all good points.
ivan: My question / worry the problems you found in MO is it problems in SMIL original spec or how we took SMIL into EPUB?
Hadrien: the latter, there only one case where issue there is content across the spread we may be limited by the SMIL syntax, all the other issues are in our spec.
sueneu: I know CSS is pushing into audio, can we use that in addition to SMIL?
Hadrien: not really that's more for TTS and is a very complex discussion.
wendyreid: we need more review on this.
Hadrien: last question, what should I do next?
wendyreid: we will go through the issues. so document things as you run into them. we need review from Marisa on the PR / issues filed.
ivan: biyearly issues with Time change 3 week timezone changes. |
|
I have read the PR linked above, and I have some questions about this proposal.
I had a look at #2936 and I wonder about inserting accessibility information in a playback timeline. It's different from an accessible UX where the user chooses to dive into the description or not, and would require a lot of specialization of SMIL playback rules.
|
I see two main use cases:
For the first one:
Users would be free to use what works best for them in terms of image presentation (full page or focus on regions), audio (pre-recorded or TTS) and text (always displayed, only displayed when it's not a description). The second use case could be useful for example in a reflowable book showing a map, where you could focus on specific parts of the map and provide narration related to this image fragment. This means that:
Focus on a region could be done using a variety of techniques:
First of all, I think that we tend to focus a little too much on continuous playback but various RS out there offer other options as well. Ideally, we should consider all three because they all make sense. To answer your question:
It would go back to what is normally displayed while following the
This would be based on In my example, you can see that some
We definitely require more settings for SMIL playback rules in general, this one is pretty easy to handle on the RS side and IMO makes quite a bit of sense (do I need image descriptions or not?).
I haven't covered this one and I'm not a fan of CSS authored highlights anyway. I think that they can be harmful in reflowable EPUB (#2933) and the RS should let the user decide what works best for them (for example zooming might be problematic for people with vestibular disorders but work great for others).
It allows you to point to a region and also works with images in spine, which is the preferred way of authoring comics in many markets. Most RS with specialized support for comics use native API for images rather than a webview. It also has the benefit of keeping It also allows specialized libraries to create a SMIL without changing the HTML, which is increasingly becoming a requirement with tools that are used post-production. |
|
With the on-going discussions about a recharter and a focus on Media Overlays in EPUB 3.5, I expect support for In previous comments, @mattgarrish said that he's not sure how this would affect the reading experience and I wanted to go back to this question to prepare for 3.5. I'll focus specifically on comics/manga for now. With this addition, in terms of access modes, a user could combine things in the following ways:
Previously, without
In addition to these access modes, the following options could be made available for a user:
Let's illustrate what this means with an advanced use case. A dyslexic user could for example use:
This would provide the kind of user experience that we cannot offer today:
With this ability to combine access modes and the reading options that I've described above, I believe that we could offer very compelling and flexible user experiences. This goes beyond anything possible in EPUB or DAISY today. I expect that we'll be able to illustrate some of these new capabilities before the end of the year and iterate from there to provide more examples and implementations throughout 2027. |
|
My concern is still with the playback complexity that this pushes onto users. Instead of getting 'read aloud' playback as they've become accustomed to, they have to know all the possible permutations of manual/automatic playback that could be dumped on them, as well as how to configure them all. I'm also concerned that for all the complexity we may not be budging the needle on accessibility. This leads to a single playback mode that is sort of accessible but may not appeal to all users; it's not universally accessible content. Just the idea of smil driving playback is likely a regression that leaves synthetic speech playback offering a different inaccessible experience (e.g., cross-spread). But have any major publishers expressed interest in producing this kind of content? The technical details of making it work are only one consideration. I get the sinking feeling we're going to end up on an IDPF-like fishing expedition to create things that will never lure anyone in. And then there's my concern that the deeper we dig into smil to try and make these experiences work the more we diverge from where the web is going. We joined with W3C to try and put an end to epub-only features. If webvtt is the future, for example, whatever we do should consider future alignment with it. |
I've listed all the options for the sake of providing a complete picture of what's possible, but of course RS will handle this in a way that's manageable for users (it's their job after all). Many RS already provide separate reading modes for:
It's not terribly difficult to add a contextual "read aloud" option to these views along with a few options. This would already cover a good range of what's possible, with very little added complexity for end users (a single action). I have a pretty clear path forward for implementing this step by step in Thorium Reader for example.
Given the complete lack of support for accessible comics right now, I have a hard time hearing this argument. I'd love to hear counter proposals that strike the same balance:
How would that be a regression compared to what we have today when we have absolutely nothing viable for comics? Also, with what I'm proposing, cross-spread TTS is definitely possible (if you mean going back and forth between left and right pages).
Specialized libraries have expressed their interest, not trade publishers. They have zero options right now aside from rolling out less capable alternatives to what I've described above (they usually have to create their own format or hack Media Overlays in much more intrusive ways). Producing accessible comics remains more complex and costly than novels, but it doesn't mean that we shouldn't offer an option for them. Once we do so, I do expect a number of publishers to eventually jump on board (the same way that some publishers started producing natively accessible EPUB files before it became a legal requirement), but this likely won't be a requirement under EAA (it will fall under both "undue burden" and "fundamental alteration").
Unlike the IDPF, we have pretty clear rules for implementations. With an EPUB 3.5 recharter and a timeline that extends to early 2028, we'll have enough time to see if there's truly enough interest in this feature before we put the final stamp of approval on it. I know that several organizations are already adopting this feature or planning to test it out, which makes me quite confident about the outcome. |
|
Framing the options as accept your proposal or accept inaccessible comics isn't going to sway me. My concerns are still with the knock-on effects that dramatically changing media overlays for all content types will have. But the sense I keep getting here is that your focus is on the problems images in the spine create, specifically with the introduction of rolls. To try and break the impasse I think we're at, is it reasonable to propose that this new content model be a profile of media overlays that only applies to rolls? That would at least keep media overlays as everyone knows them untouched for the cases they've already been applied to. It also means that when we have to try and defend these new "accessible comics" to the accessibility people at W3C, which we will have to do, we can explain that this is a solution to a specific problem and not a claim that we've now solved comic/fxl accessibility by breathing life into a mostly forgotten technology. It also limits the "damage", so to speak, of not having a pathway forward to web friendlier ways of doing things. |
My latest comment (#2919 (comment)) is not about a technical solution, it's entirely about UX. This could be powered by anything, this is not about SMIL vs WebVTT vs whatever (I care very little about how this is serialized). So that's definitely not what I've said. If you're unhappy with the proposed UX and/or technical solution, that's fine, but please provide an alternate solution to make this discussion a bit more constructive.
Not true either. The problem is that we do not have a way to talk about fragments of an image, and IMO we need one to create an accessible UX for comics or complex images in general.
I don't think that's a good idea, it would be extremely useful for paginated comics ( I'd be happy to test if this truly breaks anything though, I've already tried it in a few RS that simply ignored |
|
Again, it's not the theory of how the reading progression might work that concerns me. It's the potential havoc it's going to play with users by overloading media overlays to achieve it. We're giving users a magic button. How do they know that the reading system is now waiting for them to swipe forward to move to a next image if region navigation is all that's being provided? What happens if a screen reader user drops out of img/text/audio playback to use tts and the text/audio have disappeared? They have to figure out that the text is hidden behind a fallback that they don't have access to, or that they didn't know they had to find a configuration option somewhere to force the text to be the default spine presentation? What happens if someone gets the idea to mix region-based navigation of an image (no audio) into what is otherwise audio/text sync playback? Does the user realize the audio hasn't broken but the reading system is waiting on their input? We shouldn't be pushing the complexity of playback onto users in order to make life a little simpler for authoring. |
|
How would a publication like what is being described here be accessed by a screen reader user? SMIL is not parsed by screen readers, to the best of my knowledge and imagination. Do very many reading systems parse SMIL for their TTS-based Read Aloud feature? Is it possible to produce this type of publication in a way that could meet EPUB Accessibility requirements? If it is possible and just not done, then that's the publisher's problem. But if there is not a way to do it, then that's our problem. I feel like these issues are ultimately FXL problems, not something we will solve without addressing it there. |
"Wreaking havoc" feels like an overstatement for a feature that has already been available in both Apple Books and Thorium Reader for a while now:
These are useful features that users have been asking for and they haven't wreaked havoc at all, quite the contrary. I agree that the "default experience" should probably remain automatic playback, but it's perfectly fine to have a setting giving control back to the user as well.
These are RS concerns and it's up to them to make this as easy as possible for every user. These questions are not really new either, even with today's implementation of Media Overlays, we need to make sure that users can fluidly move between various reading modes:
For example a user might start reading, push the so-called "magic button" that gives them text and audio together, then put their phone in their pocket for an audio-only experience, step into their car and continue to listen from their car's entertainment system. For example, in the upcoming 3.5 version of Thorium Reader for desktop, we're bringing a lot of features that were previously available for TTS only to Media Overlays as well. I'm not sure how this is relevant for the overall discussion here, that's not the kind of guidance that we provide in our RS specs. |
My guess is they wouldn't. It sounds more like the current classification of overlay-dependent publications as optimized publications. And we know that reading systems generally don't process fallbacks, and not in a way that allows users to pick a preference. That immediately affects anyone who wants to access synthetic speech outside of overlay playback. There's also a known hole in the fallback process with fallbacks presumably having to be layout-consistent with the primary spine item because you can't specify different properties (i.e., if an image in spine is not supported in a roll sensibly the reading system should assume the fallback is also an fxl document it can use in place). That doubly makes putting descriptions as fallbacks problematic - how to get at them and that they could be rendered the wrong way potentially cutting off visual access to the text. But fallbacks generally are ill-defined. So what happens in the roll case if a publisher only puts an image in spine with the description as a fallback file? Does that mean that anyone whose reading system doesn't support images in spine can only read the descriptions of the images, even though they may not have visual issues? Or are we now moving to fallbacks of fallbacks, and how likely is that to ever work in general? Or what if they create different description files for each region that the user will reach. Is there a chain of "fallbacks" that are in fact not fallbacks but all needed by anyone not using media overlays playback? These are the kinds of bugs with epub that will limit the usability of these books to media overlay playback mode only in reading systems that support the change. If the fallback problems don't get addressed and widely implemented, then presumably you're out of luck, and that's not a good way to introduce a new accessibility feature. |
Right now, there isn't a way to produce FXL files that can comply with EAA, and I agree with you, that makes it our problem.

One of the biggest problems in general with FXL is tied to text. Due to the way that text is positioned in FXL, it makes it impossible for the user to change the font size, use a different font, tweak spacing or use different colors/themes. Things are even worse in comics because text is very rarely real text (which would make this

Comics can also be very rich in what they contain, with a lot of visual information and conventions that need to be conveyed to the user in textual form to fully understand what's going on.

Let's use an example based on what I've created for this PR. Here's the first page. Let's use the third panel specifically. To understand this panel, we need to:
This is what the proposed SMIL looks like:
<seq epub:type="panel" epub:textref="script.xhtml#page1-panel3">
<par epub:type="description">
<img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
<text src="script.xhtml#page1-panel3-description"/>
<audio src="audio/page1-panel3-description.mp3"/>
</par>
<par epub:type="balloon">
<img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
<text src="script.xhtml#page1-panel3-bubble1"/>
<audio src="audio/page1-panel3-bubble1.mp3"/>
</par>
<par epub:type="balloon">
<img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
<text src="script.xhtml#page1-panel3-bubble2"/>
<audio src="audio/page1-panel3-bubble2.mp3"/>
</par>
</seq>
Here's the corresponding text in HTML:
<section id="page1-panel3" epub:type="panel">
<p id="page1-panel3-description">Pepper frowns and looks angrily over her shoulder.</p>
<p id="page1-panel3-bubble1">Pepper shouts: "NO! I'M LEAVING!!"</p>
<p id="page1-panel3-bubble2">Pepper says: "You don't teach real witchcraft! I'm going - to the witches of Ah!"</p>
</section>
Compared to what we can do today, this solves a lot of issues:
This brings us much closer to what we need for an accessible experience, with multiple access modes (visual, text and audio), proper navigation (page and panel level), potential skippability and real text that we can leverage in many different ways. Even better: you don't really need to change anything in these images to make this work, it's a progressive enhancement that you can add next to the image, which is very important when you consider how adaptations are handled by specialized libraries.
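To make the role of the `img` regions more concrete, here is a minimal sketch of how a reading system might turn the `#xywh=percent:...` fragment used in the SMIL above into a pixel rectangle that it can highlight or zoom to. The `Region` class and the `region_from_src` function are hypothetical names made up for this illustration, and the sketch accepts decimal percentages only because the example above uses them. It is not taken from any existing implementation.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag


@dataclass(frozen=True)
class Region:
    """Pixel rectangle inside an image (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def region_from_src(src: str, image_width: int, image_height: int) -> Region:
    """Resolve an xywh fragment on `src` to a pixel rectangle.

    Assumption: the fragment looks like `xywh=percent:x,y,w,h`,
    `xywh=pixel:x,y,w,h` or `xywh=x,y,w,h` (pixels when no unit is given).
    """
    _, fragment = urldefrag(src)
    if not fragment.startswith("xywh="):
        # No spatial fragment: treat the whole image as the region.
        return Region(0, 0, image_width, image_height)

    value = fragment[len("xywh="):]
    unit, _, numbers = value.rpartition(":")  # unit is "" when omitted
    x, y, w, h = (float(n) for n in numbers.split(","))

    if unit == "percent":
        # Percentages are relative to the image's own width and height.
        x, w = x * image_width / 100, w * image_width / 100
        y, h = y * image_height / 100, h * image_height / 100
    elif unit not in ("", "pixel"):
        raise ValueError(f"unsupported xywh unit: {unit!r}")

    return Region(round(x), round(y), round(w), round(h))


# Third panel of page 1, assuming a 1000x2000 pixel page image.
panel = region_from_src("page1.jpg#xywh=percent:35.5,50.3,60.4,21.5", 1000, 2000)
print(panel)
```

Using percentages keeps the region valid when the same page image is delivered at a different resolution, which fits the progressive-enhancement approach described above.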
Reading systems can expose textual content to screen readers in various ways using SMIL as an input. For example:
As an alternative, screen reader users could also trigger the Read Aloud feature if they rely on audio for their screen reader (rather than braille, for instance).
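As a rough illustration of using SMIL as an input, the sketch below walks the `par` elements of the panel shown earlier and resolves each `text` reference against the HTML, producing an ordered list of strings that a reading system could hand to a screen reader or a TTS engine. The wrapping `smil`/`body` elements, the namespace declarations and the function names `text_by_id` and `reading_order` are assumptions added to make the example self-contained; real reading systems will be structured differently.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"
EPUB_NS = "http://www.idpf.org/2007/ops"

# The panel from the proposal, wrapped in a root element so it parses alone.
SMIL_DOC = """<smil xmlns="http://www.w3.org/ns/SMIL"
      xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <seq epub:type="panel" epub:textref="script.xhtml#page1-panel3">
      <par epub:type="description">
        <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
        <text src="script.xhtml#page1-panel3-description"/>
        <audio src="audio/page1-panel3-description.mp3"/>
      </par>
      <par epub:type="balloon">
        <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
        <text src="script.xhtml#page1-panel3-bubble1"/>
        <audio src="audio/page1-panel3-bubble1.mp3"/>
      </par>
      <par epub:type="balloon">
        <img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>
        <text src="script.xhtml#page1-panel3-bubble2"/>
        <audio src="audio/page1-panel3-bubble2.mp3"/>
      </par>
    </seq>
  </body>
</smil>"""

# The matching HTML (epub:type omitted here to keep the snippet short).
XHTML_DOC = """<section xmlns="http://www.w3.org/1999/xhtml" id="page1-panel3">
  <p id="page1-panel3-description">Pepper frowns and looks angrily over her shoulder.</p>
  <p id="page1-panel3-bubble1">Pepper shouts: "NO! I'M LEAVING!!"</p>
  <p id="page1-panel3-bubble2">Pepper says: "You don't teach real witchcraft! I'm going - to the witches of Ah!"</p>
</section>"""


def text_by_id(xhtml: str) -> dict[str, str]:
    """Map every id in the document to the text content of its element."""
    root = ET.fromstring(xhtml)
    return {
        el.get("id"): "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get("id")
    }


def reading_order(smil: str, texts: dict[str, str]) -> list[tuple[str, str]]:
    """Return (epub:type, text) pairs in SMIL document order."""
    items = []
    for par in ET.fromstring(smil).iter(f"{{{SMIL_NS}}}par"):
        text = par.find(f"{{{SMIL_NS}}}text")
        if text is None:
            continue  # e.g. an img + audio only par has nothing to expose
        fragment = text.get("src", "").partition("#")[2]
        if fragment in texts:
            items.append((par.get(f"{{{EPUB_NS}}}type", ""), texts[fragment]))
    return items


for role, content in reading_order(SMIL_DOC, text_by_id(XHTML_DOC)):
    print(f"[{role}] {content}")
```

Keeping the `epub:type` next to each string would let a reading system offer skippability, for instance by leaving out every `description` item on request.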
Not currently, because there isn't a good reason to do so with our current content model which only allows:
This would be a much more compelling reason to use the content of the SMIL for a TTS-based Read Aloud feature.
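For the content model question, the variant floated earlier in this thread (each of `text`, `img` and `audio` allowed 0 or 1 times, with at least one of them present) can be written down as a small check. This is only a sketch of that one variant and not what the PR currently specifies; `check_par` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"
MEDIA = ("text", "img", "audio")


def check_par(par: ET.Element) -> list[str]:
    """List content model problems for one <par> element.

    Assumed rule (one option discussed in the thread, not final):
    text, img and audio are each 0 or 1, and at least one is present.
    """
    counts = {name: len(par.findall(f"{{{SMIL_NS}}}{name}")) for name in MEDIA}
    problems = [
        f"more than one <{name}>" for name, count in counts.items() if count > 1
    ]
    if sum(counts.values()) == 0:
        problems.append("no <text>, <img> or <audio> child")
    return problems


# An img + audio par is valid under the assumed rule.
par = ET.fromstring(
    '<par xmlns="http://www.w3.org/ns/SMIL">'
    '<img src="page1.jpg#xywh=percent:35.5,50.3,60.4,21.5"/>'
    '<audio src="audio/page1-panel3-description.mp3"/>'
    "</par>"
)
print(check_par(par))
```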
Ahead of the meeting in Oslo, I've been authorized by Minori IKEDA to share an early implementation of the accessible version of Pepper & Carrot that I've created for this PR. This is based on the BlinkGTK Project from Fuse Network, Inc. Here's a video: https://drive.google.com/file/d/1OessKnLi45qZGwvTQ1jaK-CJ0RXQz2xP/view?usp=sharing Based on the criteria that I've documented in #2919 (comment) this implementation ticks the following boxes:
I've already exchanged a few messages with him and he confirmed that implementing step-by-step playback and skippability would be very straightforward with this approach.
This is great, it is very similar to what I have been thinking for the future of Nota-service's* comics. Since 2013 we have provided narrated comics, mainly for our dyslexic users, through our own custom addition to DAISY 2.02**). This utilises SMIL 1.0 and some creative HTML+CSS+JavaScript in our own player. We are a library, not a publisher, so we process existing comics by scanning the pages, manually marking up the panels and the reading order, and having humans narrate them. We don't add descriptions of the panels, which makes them less accessible for our visually impaired users. We are interested in a standard format to distribute comics instead of our custom format, and in having the ability to provide optional panel descriptions, background sounds etc. to make them truly accessible.

*) Royal Danish Library's Nota Service; we provide reading materials for people with visual and reading disabilities.
Here's a new video shared by Minori IKEDA that implements a number of the features that I've described in #2919 (comment): https://drive.google.com/file/d/195ei6BWz3fWlE5aQ0sNz61h5OeyyzRLr/view?usp=sharing
This is fully supported by this proposal. I've created two examples in the dedicated repo:
Currently the text always contextualizes who's speaking ("Pepper says", "Cayenne shouts" etc.), which makes it less than ideal for generating the kind of multi-voice rendition that I've used for audio, but this is something that we could explore as well. To handle multi-voice rendition properly, you need to identify the person speaking (consistently, which could prove tricky) and also provide a description that could be used to prompt a modern TTS model and generate a voice.

You also mention background sounds @m-abs, which is another thing worth exploring. Sound effects could be handled linearly during the normal reading progression, but background sounds (let's say crickets in the summer or the sound of the waves crashing on the shore) would need to be played in parallel with panel descriptions, captions and speech bubbles.


This PR closes #2883.
It's currently a draft PR and not ready to be reviewed yet.
Changes:
- `img` element
- `img` to the content model for `par` (set to 0 or 1)

Links:
Preview | Diff