Abstract

Pre-existing mechanisms such as {{MediaDevices/getDisplayMedia()}} allow Web applications to initiate screen-capture. If the user chooses to capture a tab, mechanisms such as [[mediacapture-region|Region Capture]] mutate the resulting video track and perform an operation on all subsequent frames produced. (In the example of [[mediacapture-region|Region Capture]], the operation consists of cropping frames to the frame's intersection with the bounding box of a target-element.)

Element Capture introduces a new mutation mechanism which we name "restriction". When an application "restricts" a video track to a given target-element, frames produced on the restricted video track only consist of information from the target-element and its descendants. Phrased differently, the track becomes a capture of the DOM sub-tree rooted at the target-element.

Use Cases

Generic Use-Case

[[mediacapture-region|Region Capture]] allows applications to crop captures. Assume some element TARGET is the restriction-target. What if other elements, which are not DOM-descendants of TARGET, draw in front of TARGET? Using [[mediacapture-region|Region Capture]], these other elements would also get captured, which is not always desirable. A mechanism is sought that would allow cropping to TARGET's bounding box, while also excluding from capture any of the content that is not a DOM-descendant.

Practical Use-Case #1: Recording part of an app

Consider an "editor" Web application (text-editor, image-editor, slides-editor or video-editor). Such applications often include a main content area, surrounded by various toolbars, drop-down menus and widgets which allow the local user to edit the content in the main content area.

Sometimes a Web application wishes to record only the main content area, and then either transmit it "live" to remote participants, or record it to disk. Such an application would not necessarily wish to expend storage, bandwidth, or remote participants' screen real-estate on anything outside of the main content area.

A mechanism such as [[mediacapture-region|Region Capture]] helps with cropping to the bounding box of the target-element, but what happens when drop-down lists temporarily draw over it?

Practical Use-Case #2: Collaborative tools during video-conferencing

Video-conferencing applications often arrange themselves using "tiles" - each remote participant's video is presented in a tile. Assume that a collaborative Web application, like a text editor or an image-editing application, is loaded in another iframe, and that this iframe is also presented as a tile.

Some remote participants would similarly load the same tool in a dedicated tile. But what if some users don't have the necessary permissions to load that tool? Or if they are joining from a platform that does not support the tool?

The video conferencing solution may then choose to have one of the participants who have loaded the tool successfully screen-share that tool's tile to the users who cannot load the tool, allowing them to at least view it, although not interact with it. This can be done using self-capture through {{MediaDevices/getDisplayMedia()}} and [[mediacapture-region|Region Capture]].

But such a solution introduces some problems. What happens if other elements ever draw on top of the tool tile, either briefly or permanently? Examples include:

Solution Overview

The Element Capture mechanism comprises two parts:

  1. [=RestrictionTarget production=]: A mechanism for tagging an {{Element}} as a potential target for the [=restriction mechanism=].
  2. [=Restriction mechanism=]: A mechanism for instructing the user agent to start restricting a video track to the bounding box of a previously [=tagging|tagged=] {{Element}}, or to stop such restriction and revert a track to its [=unrestricted=] state.

We define two restriction-states: restricted and unrestricted. A video track is always in exactly one of these states. Tracks start out [=unrestricted=], and become [=restricted=] when {{BrowserCaptureMediaStreamTrack/restrictTo()}} is successfully called.

RestrictionTarget Production

Motivation for defining RestrictionTarget

The [=restriction mechanism=] presented in this document ({{BrowserCaptureMediaStreamTrack/restrictTo}}) relies on a {{RestrictionTarget}} token rather than on direct node references. This allows restriction by one document to a target element specified in another document.

Because {{BrowserCaptureMediaStreamTrack/cropTo()}} and {{BrowserCaptureMediaStreamTrack/restrictTo()}} use different token types - {{CropTarget}} and {{RestrictionTarget}}, respectively - it is possible for documents to limit the capabilities they bestow on documents that capture them.

RestrictionTarget Definition

RestrictionTarget is an intentionally empty, opaque identifier. Its purpose is to be handed to {{BrowserCaptureMediaStreamTrack/restrictTo}} as input.

          [Exposed=(Window,Worker), Serializable]
          interface RestrictionTarget {
            [Exposed=Window, SecureContext] static Promise<RestrictionTarget> fromElement(Element element);
          };
        
fromElement()

Calling {{RestrictionTarget/fromElement}} with an {{Element}} of a supported type associates that {{Element}} with a {{RestrictionTarget}}. This {{RestrictionTarget}} may be used as input to {{BrowserCaptureMediaStreamTrack/restrictTo}}. We define a valid RestrictionTarget as one returned by a call to {{RestrictionTarget/fromElement}} in a document that is still active.

When {{RestrictionTarget/fromElement}} is called with a given |element|, the user agent [=create a RestrictionTarget|creates a RestrictionTarget=] with |element| as input. The user agent MUST return a {{Promise}} |p|. The user agent MUST resolve |p| only after it has finished all the necessary internal propagation of state associated with the new {{RestrictionTarget}}, at which point the user agent MUST be ready to receive the new {{RestrictionTarget}} as a valid parameter to {{BrowserCaptureMediaStreamTrack/restrictTo}}.
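A minimal sketch of minting a token with feature detection, assuming the page may run in a user agent that does not implement this specification. The helper name `mintRestrictionTarget` and its injectable `api` parameter are illustrative and not part of this specification.

```javascript
// Hypothetical helper. `api` defaults to the global RestrictionTarget
// interface and is injectable only so the sketch can run outside a browser.
function mintRestrictionTarget(element, api = globalThis.RestrictionTarget) {
  if (!api || typeof api.fromElement !== "function") {
    return Promise.reject(
      new Error("RestrictionTarget.fromElement() is unavailable")
    );
  }
  // Per the text above, the promise resolves only once the user agent is
  // ready to accept the new token as a parameter to restrictTo().
  return api.fromElement(element);
}
```

A caller would typically `await` the returned promise before handing the token to the capturing document.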

When cloning an {{Element}} on which {{RestrictionTarget/fromElement}} was previously called, the clone is not associated with any {{RestrictionTarget}}. If {{RestrictionTarget/fromElement}} is later called on the clone, a new {{RestrictionTarget}} will be assigned to it.

To create a RestrictionTarget with |element| as input, run the following steps:

  1. Let |restrictionTarget| be a new object of type {{RestrictionTarget}}.

  2. Set |restrictionTarget|.[[\Element]] to |element|.

{{RestrictionTarget}} objects are serializable. The [=serialization steps=], given |value|, |serialized|, and a boolean |forStorage|, are:

  1. If |forStorage| is true, throw a new {{DOMException}} object whose {{DOMException/name}} attribute has the value {{"DataCloneError"}}.

  2. Set |serialized|.[[\RestrictionTargetElement]] to |value|.{{RestrictionTarget/[[Element]]}}.

The [=deserialization steps=], given |serialized| and |value| are:

  1. Set |value|.{{RestrictionTarget/[[Element]]}} to |serialized|.[[\RestrictionTargetElement]].
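Because {{RestrictionTarget}} is serializable, it can be passed between documents with `postMessage()`. The following is a sketch under stated assumptions: the message envelope, the `"restriction-target"` type string, and the helper name are hypothetical and chosen by the application.

```javascript
// Hypothetical message type; any application-defined envelope would do.
const RESTRICTION_TARGET_MESSAGE = "restriction-target";

// `targetWindow` is the Window of the capturing document (for example,
// window.parent) and `targetOrigin` is that document's expected origin.
function sendRestrictionTarget(targetWindow, restrictionTarget, targetOrigin) {
  // postMessage() serializes the token; the sender keeps its own copy.
  targetWindow.postMessage(
    { type: RESTRICTION_TARGET_MESSAGE, restrictionTarget },
    targetOrigin
  );
}
```

Since the serialization steps throw when |forStorage| is true, the token is expected to fail when written to persistent storage; it is meant to be passed around only while the document is alive.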

Restriction Mechanism

Definitions

Restrictable tracks

We say that a {{MediaStreamTrack}} |T| is a restrictable MediaStreamTrack if and only if it fulfills all of the following conditions:

  • |T|.{{MediaStreamTrack/[[Restrictable]]}} is true.
  • |T| is associated with a browser display surface. (That is, if |T|.{{MediaStreamTrack/getSettings()}} were called, it would have returned a {{MediaTrackSettings}} dictionary containing the key {{MediaTrackSettings/displaySurface}} mapped to the value {{DisplayCaptureSurfaceType/"browser"}}.)
  • |T|.[[\Kind]] is "video".
  • |T|.[[\ReadyState]] is "live".
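The conditions above can be approximated from script, for instance before offering a "share this area" control. This is only a sketch: the `[[Restrictable]]` internal slot is not observable, so the presence of `restrictTo()` stands in for it here, and the helper name `looksRestrictable` is hypothetical.

```javascript
// Approximates the restrictable-track conditions using observable state only.
function looksRestrictable(track) {
  // Stand-in for the unobservable [[Restrictable]] slot (an assumption).
  if (!track || typeof track.restrictTo !== "function") return false;
  // The track must be a live video track.
  if (track.kind !== "video" || track.readyState !== "live") return false;
  // The track must be associated with a browser display surface (a tab).
  const settings =
    typeof track.getSettings === "function" ? track.getSettings() : {};
  return settings.displaySurface === "browser";
}
```

A `true` result does not guarantee that `restrictTo()` will succeed; the user agent performs the authoritative check.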

Elements eligible for restriction

We say that an {{Element}} |E| is eligible for restriction if and only if it fulfills all of the following conditions:

To ensure these conditions hold, developers may use CSS such as the following snippet:

              #target {
                isolation: isolate;     /* Forms a stacking context. */
                transform-style: flat;  /* Flattened. */
              }
            

Valid restriction targets

We say that an {{Element}} |E| is a valid restriction target for a {{MediaStreamTrack}} |T|, if and only if all of the following conditions hold:

Informally, this means that |T| is an active video track associated with tab-capture, and |E| is an Element [=connected=] to the DOM in the captured tab.

Note that whether an Element |E| is a [=valid restriction target=] for a {{MediaStreamTrack}} |T| may change either before or after a capture starts, as well as before or after restriction starts. Examples include:

  • |T| is stopped programmatically.
  • |T| is stopped by the user.
  • |T|.[[\Source]] changes due to user interaction with the user agent and/or operating system.
  • |E|'s CSS properties change such that |E| is no longer [=eligible for restriction=].

While |E| is not a [=valid restriction target=] for |T|, no additional frames are produced on |T|. Frame production resumes once validity is restored.

BrowserCaptureMediaStreamTrack extension

[[mediacapture-region|Region Capture]] introduced the {{BrowserCaptureMediaStreamTrack}} interface. We extend it with a new method, {{BrowserCaptureMediaStreamTrack/restrictTo}}.

          [Exposed=Window]
          partial interface BrowserCaptureMediaStreamTrack {
            Promise<undefined> restrictTo(RestrictionTarget? restrictionTarget);
          };
        

All tasks queued below use the rendering task source associated with the same global object as the {{BrowserCaptureMediaStreamTrack}}.

restrictTo()

Calls to this method instruct the user agent to start or stop restricting a video track.

When invoked with |restrictionTarget| as the first parameter, the user agent MUST execute the following algorithm:

  1. If [=this=] is not a [=restrictable MediaStreamTrack=], return a {{Promise}} [=rejected=] with a new {{NotSupportedError}}.

  2. Let |p| be a new {{Promise}}.
  3. Run the following steps in parallel:

    1. Let |E| be |restrictionTarget|.{{RestrictionTarget/[[Element]]}}.

    2. Update [=this=] video track's crop-state to uncropped.

    3. Update [=this=] video track's [=restriction-state=] according to |restrictionTarget|:

      1. If |restrictionTarget| is NOT null, the user agent MUST set [=this=] video track's [=restriction-state=] to [=restricted=] and start [=applying the restriction transformation=] to all frames delivered to [=this=] video track with |restrictionTarget| as the target.
      2. If |restrictionTarget| is null, the user agent MUST set [=this=] video track's [=restriction-state=] to [=unrestricted=] and stop [=applying the restriction transformation=] to frames delivered to [=this=] video track.
    4. Call the track's state before this method invocation |preState|, and after this method invocation |postState|. The user agent MUST queue a global task to resolve |p| when it is guaranteed that no more frames [=restricted=] (or [=unrestricted=]) according to |preState| will be delivered to the application, and that any additional frames delivered to the application will therefore be [=restricted=] (or [=unrestricted=]) according to either |postState| or a later state.

  4. Return |p|.
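As an illustration of how an application might rely on the promise returned by `restrictTo()`, the sketch below restricts a track for the duration of some action and then reverts it. The helper name `withRestriction` and the `action` callback are hypothetical; passing `null` to revert relies on the nullable parameter in the IDL above.

```javascript
// Restricts `track`, runs `action`, then reverts the track to unrestricted.
async function withRestriction(track, restrictionTarget, action) {
  // Once this promise resolves, no more frames from the previous
  // restriction-state are delivered to the application.
  await track.restrictTo(restrictionTarget);
  try {
    return await action(track);
  } finally {
    // Revert to the unrestricted state, even if `action` threw.
    await track.restrictTo(null);
  }
}
```

Awaiting each call before using the track matters: frames delivered before the promise resolves may still reflect the earlier state.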

Applying the restriction transformation

Whenever the user agent is about to produce a new |frame| for a video track |T| that is [=restricted=] to a given target |restrictionTarget|, the user agent MUST execute the following algorithm:

  1. Let |E| be |restrictionTarget|.{{RestrictionTarget/[[Element]]}}.
  2. If |E| is not a [=valid restriction target=] for |T|, abort without producing a new frame.
  3. Let |intersection| be the intersection of |E|'s bounding box and the captured surface's [=top-level browsing context=]'s viewport.
  4. If |intersection| is empty, abort without producing a new frame.
  5. A corollary of previous steps is that |E| forms a stacking context. Produce and deliver a frame consisting of an independent rendering of that stacking context, clipped to |intersection|.
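Steps 3 and 4 amount to an axis-aligned rectangle intersection. User agents compute this internally; the sketch below only illustrates the geometry, assuming rectangles are plain `{x, y, width, height}` objects in CSS pixels (the same shape as a `DOMRect`). The function name `intersectRects` is hypothetical.

```javascript
// Returns the intersection of two rectangles, or null if it is empty.
function intersectRects(a, b) {
  const left = Math.max(a.x, b.x);
  const top = Math.max(a.y, b.y);
  const right = Math.min(a.x + a.width, b.x + b.width);
  const bottom = Math.min(a.y + a.height, b.y + b.height);
  // An empty intersection corresponds to step 4: no frame is produced.
  if (right <= left || bottom <= top) return null;
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

For example, with an 800x600 viewport at the origin, an element whose bounding box is `{x: 700, y: 500, width: 200, height: 200}` yields a 100x100 intersection, while an element positioned entirely outside the viewport yields `null`.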

The frame produced in the final step is constructed by rendering |E| and its descendants over an infinite transparent canvas, positioned so that the edges of the decorated bounding box are flush with the edges of the frame.

In some implementations, the underlying pixel format for the frame data will not be able to carry alpha channel information. In this case, the implementation can blend the rendered frame with an infinite canvas of black (`rgb(0,0,0)`).
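Blending over black reduces to scaling each color channel by the pixel's alpha. The sketch below shows this for a single pixel, assuming straight (non-premultiplied) 8-bit RGBA values; actual implementations may use a different pixel layout, and the function name `blendOverBlack` is hypothetical.

```javascript
// Composites one straight-alpha RGBA pixel (channels 0..255) over opaque black.
function blendOverBlack([r, g, b, a]) {
  const alpha = a / 255;
  // Over black, each channel becomes c * alpha + 0 * (1 - alpha).
  return [
    Math.round(r * alpha),
    Math.round(g * alpha),
    Math.round(b * alpha),
    255, // The result is fully opaque.
  ];
}
```

Fully transparent regions of the rendered sub-tree therefore appear as `rgb(0,0,0)` in the delivered frame.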

Implementations may either re-use existing bitmap data generated for |E| or regenerate the display of the element to maximize quality at the frame's size (for example, if the implementation detects that the referenced element is an SVG fragment). However, the frame must look identical to |E| as rendered above, modulo rasterization quality.

Sample Code

Code in the capture-target:

          const mainContentArea = document.getElementById('mainContentArea');
          const restrictionTarget = await RestrictionTarget.fromElement(mainContentArea);
          sendRestrictionTarget(restrictionTarget);

          function sendRestrictionTarget(restrictionTarget) {
            // Either send the restriction-target using postMessage(),
            // or pass it on locally within the same document.
          }
        

Code in the capturing-document:

          async function startRestrictedCapture(restrictionTarget) {
            const stream = await navigator.mediaDevices.getDisplayMedia();
            const [track] = stream.getVideoTracks();
            // Bail out if the user agent does not support restrictTo().
            if (!track.restrictTo) {
              handleError(stream);
              return;
            }
            await track.restrictTo(restrictionTarget);
            transmitVideoRemotely(track);
          }
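When the capture-target and the capturing-document are different documents, the capturing-document needs to receive the token before it can call `startRestrictedCapture()`. The following sketch assumes the token arrives in a `message` event whose data has the hypothetical shape `{ type: "restriction-target", restrictionTarget }`; the helper name is also illustrative.

```javascript
// Returns the token carried by a "message" event, or null if the event does
// not come from the expected origin or does not carry a token.
function restrictionTargetFromMessage(event, expectedOrigin) {
  // Ignore messages from unexpected origins.
  if (event.origin !== expectedOrigin) return null;
  const data = event.data;
  if (!data || data.type !== "restriction-target") return null;
  return data.restrictionTarget ?? null;
}
```

A `message` event listener in the capturing-document could call this helper and, when the result is not `null`, pass it to `startRestrictedCapture()`.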
        

Privacy and Security Considerations

Benefits of this API

For non-malicious applications, the APIs introduced by this specification should be a pure positive, as they allow responsible applications to pare down the information they record and transmit.

For example, using pre-existing mechanisms, video-conferencing applications can:

  1. Embed content in an iframe.

  2. Prompt the user to capture the current tab. (Using {{MediaDevices/getDisplayMedia()}}.)

  3. Crop the resulting capture to just the iframe that's intended for capture. (Using {{BrowserCaptureMediaStreamTrack/cropTo()}}.)

  4. Transmit the resulting pixels to remote participants. (Using RTCPeerConnection.)

However, this is risky, because any content that happens to be drawn in front of the content intended for capture will also be transmitted remotely. Even if this happens only briefly, remote users might notice. And such content might be highly private, for example chat notifications, reminders, or speaker notes.

The mechanisms introduced in this specification allow a responsible application to structure itself in a way that would completely guarantee that such issues are impossible. Such an application can more easily make and keep privacy guarantees to its users.

Concerns about this API

Reading cross-origin pixels

The mechanisms introduced by this specification all rely on self-capture being provided by some other means, typically {{MediaDevices/getDisplayMedia()}}. The main concern with such means is that they allow an application read-access to cross-origin content.

When a malicious application tricks the user into approving self-capture, it can then load cross-origin content in an invisible iframe and then bring the content to the forefront, allowing the attacker to read the content before the user can react. Such attacks are already possible without any of the mechanisms introduced by this specification.

Aggravating old attack vectors

The main concern is that the mechanisms we introduce in this specification should not aggravate the old attack vectors described above. One naturally worries that the mechanisms we introduce allow the old attacks to be carried out surreptitiously. We contend that the mechanisms introduced here do not increase an attacker's power to hide the attack; such attack-concealment was always possible using any of the following techniques:

  • Displaying the content briefly. Attackers could always flash content to the screen for a timespan of a single frame. This is long enough to record it, but not long enough for users to understand it.

  • Displaying the content piecemeal. Attackers could always break the content up into multiple small pieces, even one pixel each, and display them in different locations and at different times. Users would not be able to observe this manipulation, but it is trivial for software to collect these pixels and reconstruct a picture from them.

  • Displaying the content at low opacity. Attackers could always display content at an opacity that is imperceptible for a user, but which machines can still read.

Any of these techniques is enough on its own, but through a combination of them, malicious applications were always able to conceal their attacks effectively and still read content efficiently.

Reading occluded content

One might worry that a malicious app would be able to remove occlusions in cross-origin iframes, without opt-in from that content. The shape of the API prevents such attacks: the cross-origin iframe would have to produce a {{RestrictionTarget}} and pass it to the would-be attacker. As {{RestrictionTarget}}s serve no purpose other than as part of the API introduced by this specification, the minting and passing of a {{RestrictionTarget}} proves the cross-origin iframe's permission for its occlusions to be removed.

Interaction with Region Capture

In designing the APIs introduced by this specification, a conscious decision was made to not reuse {{CropTarget}}, and to define a dedicated token instead ({{RestrictionTarget}}). This ensures that any existing Web applications that have previously been designed and implemented with {{BrowserCaptureMediaStreamTrack/cropTo()}} in mind, but not with {{BrowserCaptureMediaStreamTrack/restrictTo()}}, would not be effectively opting into allowing occlusions to be removed, as described in the previous section.