Pre-existing mechanisms such as {{MediaDevices/getDisplayMedia()}} allow Web applications to initiate screen-capture. If the user chooses to capture a tab, mechanisms such as [[mediacapture-region|Region Capture]] mutate the resulting video track and perform an operation on all subsequent frames produced. (In the example of [[mediacapture-region|Region Capture]], the operation consists of cropping frames to the frame's intersection with the bounding box of a target-element.)
Element Capture introduces a new mutation mechanism which we name "restriction". When an application "restricts" a video track to a given target-element, frames produced on the restricted video track only consist of information from the target-element and its descendants. Phrased differently, the track becomes a capture of the DOM sub-tree rooted at the target-element.
[[mediacapture-region|Region Capture]] allows applications to crop captures. Assume some element TARGET is the restriction-target. What if other elements, which are not DOM-descendants of TARGET, draw in front of TARGET? Using [[mediacapture-region|Region Capture]], these other elements would also get captured, which is not always desirable. A mechanism is sought that would allow cropping to TARGET's bounding box, while also excluding from capture any of the content that is not a DOM-descendant.
Consider an "editor" Web application (text-editor, image-editor, slides-editor or video-editor). Such applications often include a main content area, surrounded by various toolbars, drop-down menus and widgets which allow the local user to edit the content in the main content area.
Sometimes a Web application wishes to record only the main content area, and then either transmit it "live" to remote participants, or record it to disk. Such an application would not necessarily wish to expend storage, bandwidth, or remote participants' screen real-estate on anything outside of the main content area.
A mechanism such as [[mediacapture-region|Region Capture]] helps with cropping to the bounding box of the target-element, but what happens when drop-down lists temporarily draw over it?
Video-conferencing applications often arrange themselves using "tiles" - each remote participant's video is presented in a tile. Assume that a collaborative Web application, like a text editor or an image-editing application, is loaded in another iframe, and that this iframe is also presented as a tile.
Some remote participants would similarly load the same tool in a dedicated tile. But what if some users don't have the necessary permissions to load that tool? Or if they are joining from a platform that does not support the tool?
The video conferencing solution may then choose to have one of the participants who have loaded the tool successfully screen-share that tool's tile to the users who cannot load the tool, allowing them to at least view it, although not interact with it. This can be done using self-capture through {{MediaDevices/getDisplayMedia()}} and [[mediacapture-region|Region Capture]].
But such a solution introduces some problems. What happens if other elements ever draw on top of the tool tile, either briefly or permanently? Examples include:
The Element Capture mechanism comprises two parts:
We define two restriction-states: restricted and unrestricted. A video track is always in exactly one of these states. Tracks start out [=unrestricted=], and may become [=restricted=] when {{BrowserCaptureMediaStreamTrack/restrictTo()}} is successfully called.
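The two states can be pictured as a very small state machine. The sketch below is purely illustrative: `RestrictionStateModel` and `applyRestrictTo` are hypothetical names that are not part of this specification, and the behavior for a `null` target is an assumption inferred from the nullable parameter of {{BrowserCaptureMediaStreamTrack/restrictTo()}}.

```javascript
// Illustrative model only; RestrictionStateModel is a hypothetical name
// and is not defined by this specification.
class RestrictionStateModel {
  constructor() {
    // Tracks start out unrestricted.
    this.state = 'unrestricted';
    this.target = null;
  }

  // Models the effect of a *successful* restrictTo() call.
  // Assumption: a null target returns the track to the unrestricted state.
  applyRestrictTo(target) {
    this.target = target;
    this.state = target === null ? 'unrestricted' : 'restricted';
    return this.state;
  }
}
```

A failed call is not modeled here; in this sketch the state only changes when a call succeeds.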
The [=restriction mechanism=] presented in this document ({{BrowserCaptureMediaStreamTrack/restrictTo}}) relies on a {{RestrictionTarget}} token rather than on direct node references. This allows restriction by one document to a target element specified in another document.
Because {{BrowserCaptureMediaStreamTrack/cropTo()}} and {{BrowserCaptureMediaStreamTrack/restrictTo()}} use different token types - {{CropTarget}} and {{RestrictionTarget}}, respectively - it is possible for documents to limit the capabilities they bestow on documents that capture them.
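As a rough sketch of how a captured document might exercise this choice, the hypothetical helper below mints only the tokens the document is willing to hand out. The helper name `mintTokens`, its options object, and the injectable `api` parameter are assumptions made for illustration; only `CropTarget.fromElement()` and `RestrictionTarget.fromElement()` come from the specifications.

```javascript
// Sketch: a captured document decides which tokens to mint for its capturer.
// `api` defaults to the global object; it is injectable so that the sketch
// can also be exercised outside a browser.
async function mintTokens(element, { allowCrop, allowRestrict }, api = globalThis) {
  const tokens = {};
  if (allowCrop && api.CropTarget) {
    tokens.cropTarget = await api.CropTarget.fromElement(element);
  }
  if (allowRestrict && api.RestrictionTarget) {
    tokens.restrictionTarget = await api.RestrictionTarget.fromElement(element);
  }
  return tokens;
}
```

A document that calls this with `allowRestrict: false` never produces a {{RestrictionTarget}}, so a capturer holding only its {{CropTarget}} can crop to the element but cannot restrict to it.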
RestrictionTarget is an intentionally empty, opaque identifier. Its purpose is to be handed to {{BrowserCaptureMediaStreamTrack/restrictTo}} as input.
[Exposed=(Window,Worker), Serializable]
interface RestrictionTarget {
  [Exposed=Window, SecureContext]
  static Promise<RestrictionTarget> fromElement(Element element);
};
Calling {{RestrictionTarget/fromElement}} with an {{Element}} of a supported type associates that {{Element}} with a {{RestrictionTarget}}. This {{RestrictionTarget}} may be used as input to {{BrowserCaptureMediaStreamTrack/restrictTo}}. We define a valid RestrictionTarget as one returned by a call to {{RestrictionTarget/fromElement}} in a document that is still active.
When {{RestrictionTarget/fromElement}} is called with a given |element|, the user agent [=create a RestrictionTarget|creates a RestrictionTarget=] with |element| as input. The user agent MUST return a {{Promise}} |p|. The user agent MUST resolve |p| only after it has finished all the necessary internal propagation of state associated with the new {{RestrictionTarget}}, at which point the user agent MUST be ready to receive the new {{RestrictionTarget}} as a valid parameter to {{BrowserCaptureMediaStreamTrack/restrictTo}}.
When cloning an {{Element}} on which {{RestrictionTarget/fromElement}} was previously called, the clone is not associated with any {{RestrictionTarget}}. If {{RestrictionTarget/fromElement}} is later called on the clone, a new {{RestrictionTarget}} will be assigned to it.
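In practice this means an application that clones a target element has to mint a fresh token for the clone. A minimal sketch follows, assuming a hypothetical helper named `cloneWithOwnTarget`; the `fromElement` parameter is injectable only so the sketch can run outside a browser, and defaults to the real static method.

```javascript
// Sketch: a clone does not inherit the original's RestrictionTarget,
// so a token is minted for the clone explicitly.
// `fromElement` is injectable for testing; in a browser the default
// forwards to RestrictionTarget.fromElement().
async function cloneWithOwnTarget(
  element,
  fromElement = (e) => RestrictionTarget.fromElement(e)
) {
  const clone = element.cloneNode(true); // Deep clone; no token attached.
  const cloneTarget = await fromElement(clone);
  return { clone, cloneTarget };
}
```

The token held for the original element keeps referring to the original; the returned `cloneTarget` refers only to the clone.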
To create a RestrictionTarget with |element| as input, run the following steps:
Let |restrictionTarget| be a new object of type {{RestrictionTarget}}.
Set |restrictionTarget|.[[\Element]] to |element|.
{{RestrictionTarget}} objects are serializable. The [=serialization steps=], given |value|, |serialized|, and a boolean |forStorage|, are:
If |forStorage| is true, throw a new {{DOMException}} object whose {{DOMException/name}} attribute has the value {{"DataCloneError"}}.
Set |serialized|.[[\RestrictionTargetElement]] to |value|.{{RestrictionTarget/[[Element]]}}.
The [=deserialization steps=], given |serialized| and |value| are:
Set |value|.{{RestrictionTarget/[[Element]]}} to |serialized|.[[\RestrictionTargetElement]].
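Because {{RestrictionTarget}} objects are serializable, they can be passed between documents with `postMessage()`. The sketch below shows one possible shape for that exchange; the function names and the message format (`{ type, restrictionTarget }`) are hypothetical conventions chosen for illustration, not something this specification defines.

```javascript
// Sketch: hand a RestrictionTarget to the capturing document.
// The message shape ({ type, restrictionTarget }) is a hypothetical convention.
function sendTargetToCapturer(restrictionTarget, capturerWindow, capturerOrigin) {
  capturerWindow.postMessage(
    { type: 'restriction-target', restrictionTarget },
    capturerOrigin
  );
}

// In the capturing document: accept the token only from the expected origin.
function extractTarget(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) return null;
  if (!event.data || event.data.type !== 'restriction-target') return null;
  return event.data.restrictionTarget;
}
```

Since the serialization steps throw when |forStorage| is true, an attempt to persist a {{RestrictionTarget}} (for example in IndexedDB) is expected to fail with a {{"DataCloneError"}}.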
We say that a {{MediaStreamTrack}} |T| is a restrictable MediaStreamTrack if and only if it fulfills all of the following conditions:
true.
We say that an {{Element}} |E| is eligible for restriction if and only if it fulfills all of the following conditions:
|E| forms a stacking context.
|E| is flattened in 3D.
|E| forms a backdrop root.
|E| has exactly one box fragment.
|E| is rendered.
To ensure these conditions hold, developers may use CSS such as the following snippet:
#target {
  isolation: isolate;     /* Forms a stacking context. */
  transform-style: flat;  /* Flattened. */
}
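The same properties can be applied from script. The following is a minimal sketch, assuming a hypothetical helper named `prepareForRestriction`; its return value is only a rough heuristic (one client rect standing in for "rendered, with exactly one box fragment") and it does not check whether the element forms a backdrop root.

```javascript
// Sketch: apply the CSS above from script and run a rough sanity check.
function prepareForRestriction(element) {
  element.style.isolation = 'isolate';   // Forms a stacking context.
  element.style.transformStyle = 'flat'; // Flattened in 3D.
  // Heuristic only: a single client rect is used as a stand-in for
  // "rendered, with exactly one box fragment". It is not a full
  // eligibility check.
  return element.getClientRects().length === 1;
}
```

A `false` result suggests the element may not be eligible (for instance, it is not rendered or it is fragmented), while a `true` result does not by itself prove eligibility.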
We say that an {{Element}} |E| is a valid restriction target for a {{MediaStreamTrack}} |T|, if and only if all of the following conditions hold:
Informally, this means that |T| is an active video track associated with tab-capture, and |E| is an Element [=connected=] to the DOM in the captured tab.
Note that whether an Element |E| is a [=valid restriction target=] for a {{MediaStreamTrack}} |T| may change either before or after a capture starts, as well as before or after restriction starts. Examples include:
Invalidity will suppress additional frames until validity is restored.
[[mediacapture-region|Region Capture]] introduced the {{BrowserCaptureMediaStreamTrack}} interface. We extend it with a new method, {{BrowserCaptureMediaStreamTrack/restrictTo}}.
[Exposed=Window]
partial interface BrowserCaptureMediaStreamTrack {
  Promise<undefined> restrictTo(RestrictionTarget? restrictionTarget);
};
All tasks queued below use the rendering task source associated with the same global object as the {{BrowserCaptureMediaStreamTrack}}.
Calls to this method instruct the user agent to start or stop restricting a video track.
When invoked with |restrictionTarget| as the first parameter, the user agent MUST execute the following algorithm:
If [=this=] is not a [=restrictable MediaStreamTrack=], return a {{Promise}} [=rejected=] with a new {{NotSupportedError}}.
Run the following steps in parallel:
Let |E| be |restrictionTarget|.{{RestrictionTarget/[[Element]]}}.
Update [=this=] video track's crop-state to uncropped.
Update [=this=] video track's [=restriction-state=] according to |restrictionTarget|:
Call the track's state before this method invocation |preState|, and after this method invocation |postState|. The user agent MUST queue a global task to resolve |p| when it is guaranteed that no more frames [=restricted=] (or [=unrestricted=]) according to |preState| will be delivered to the application, and that any additional frames delivered to the application will therefore be [=restricted=] (or [=unrestricted=]) according to either |postState| or a later state.
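From the application's point of view, the resolution guarantee above means that awaiting the returned promise is enough to know that no frames from the previous state will arrive afterwards. The sketch below uses that to restrict a track for the duration of a callback and then revert. The helper name `withRestriction` is hypothetical, and the use of `null` to stop restriction is an assumption based on the nullable parameter of the method.

```javascript
// Sketch: restrict a track for the duration of a callback, then revert.
async function withRestriction(track, restrictionTarget, callback) {
  if (typeof track.restrictTo !== 'function') {
    throw new Error('restrictTo() is not supported on this track');
  }
  // Once this promise resolves, frames from the previous state are no
  // longer delivered to the application.
  await track.restrictTo(restrictionTarget);
  try {
    return await callback(track);
  } finally {
    // Assumption: passing null stops restriction.
    await track.restrictTo(null);
  }
}
```

If `restrictTo()` rejects (for instance because the track is not a [=restrictable MediaStreamTrack=]), the rejection propagates to the caller and the callback never runs.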
Whenever the user agent is about to produce a new |frame| for a video track |T| that is [=restricted=] to a given target |restrictionTarget|, the user agent MUST execute the following algorithm:
The frame produced in the final step is constructed by rendering |E| and its descendants over an infinite transparent canvas, positioned so that the edges of the decorated bounding box are flush with the edges of the frame.
In some implementations, the underlying pixel format for the frame data will not be able to carry alpha channel information. In this case, the implementation can blend the rendered frame with an infinite canvas of black (`rgb(0,0,0)`).
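To make the blending step concrete: compositing a pixel with color c and opacity α over black yields α·c + (1 − α)·0 = α·c per channel. The sketch below applies that to a single pixel; it assumes straight (non-premultiplied) alpha and 8-bit channels, and the function name `blendOverBlack` is hypothetical. It is an illustration of the arithmetic, not a description of how any user agent implements it.

```javascript
// Sketch: composite one straight-alpha RGBA pixel (channels 0-255)
// over opaque black, producing an opaque RGB pixel.
function blendOverBlack([r, g, b, a]) {
  const alpha = a / 255;
  // The black background contributes 0 to every channel,
  // so only the scaled source color remains.
  return [r, g, b].map((channel) => Math.round(channel * alpha));
}
```

Fully transparent regions of the frame therefore come out as `rgb(0,0,0)`, and fully opaque regions are unchanged.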
Implementations may either re-use existing bitmap data generated for |E| or regenerate the display of the element to maximize quality at the frame's size (for example, if the implementation detects that the referenced element is an SVG fragment). However, the frame must look identical to |E| as rendered above, modulo rasterization quality.
Code in the capture-target:
const mainContentArea = document.getElementById('mainContentArea');
const restrictionTarget = await RestrictionTarget.fromElement(mainContentArea);
sendRestrictionTarget(restrictionTarget);

function sendRestrictionTarget(restrictionTarget) {
  // Either send the restriction-target using postMessage(),
  // or pass it on locally within the same document.
}
Code in the capturing-document:
async function startRestrictedCapture(restrictionTarget) {
  const stream = await navigator.mediaDevices.getDisplayMedia();
  const [track] = stream.getVideoTracks();
  if (!track.restrictTo) {
    // restrictTo() is not available on this track.
    handleError(stream);
    return;
  }
  await track.restrictTo(restrictionTarget);
  transmitVideoRemotely(track);
}
For non-malicious applications, the APIs introduced by this specification should be a pure positive: they allow responsible applications to pare down the information that gets recorded.
For example, using pre-existing mechanisms, video-conferencing applications can:
Embed content in an iframe.
Prompt the user to capture the current tab. (Using {{MediaDevices/getDisplayMedia()}}.)
Crop the resulting capture to just the iframe that's intended for capture. (Using {{BrowserCaptureMediaStreamTrack/cropTo()}}.)
Transmit the resulting pixels to remote participants. (Using RTCPeerConnection.)
However, this is risky, because any content that happens to be drawn in front of the content intended for capture will also be transmitted remotely. Even if this happens only briefly, remote users might notice. And such content might be highly private - for example, chat notifications, reminders, speaker notes...
The mechanisms introduced in this specification allow a responsible application to structure itself in a way that would completely guarantee that such issues are impossible. Such an application can more easily make and keep privacy guarantees to its users.
The mechanisms introduced by this specification all rely on self-capture being provided by some other means - typically {{MediaDevices/getDisplayMedia()}}. The main concern with these is that they allow an application read-access to cross-origin content.
When a malicious application tricks the user into approving self-capture, it can then load cross-origin content in an invisible iframe and then bring the content to the forefront, allowing the attacker to read the content before the user can react. Such attacks are already possible without any of the mechanisms introduced by this specification.
The main concern is that the mechanisms we introduce in this specification should not aggravate the old attack vectors described above. One naturally worries that the mechanisms we introduce allow the old attacks to be carried out surreptitiously. We contend that the mechanisms introduced here do not increase an attacker's power to hide the attack; such attack-concealment was always possible using any of the following techniques:
Displaying the content briefly. Attackers could always flash content to the screen for a timespan of a single frame. This is long enough to record it, but not long enough for users to understand it.
Displaying the content piecemeal. Attackers could always break the content up into multiple small pieces, even one pixel each, and display them in different locations and at different times. Users would not be able to observe this manipulation, but it is trivial for software to collect these pixels and reconstruct a picture from them.
Displaying the content at low opacity. Attackers could always display content at an opacity that is imperceptible for a user, but which machines can still read.
Any of these techniques is enough on its own, but through a combination of them, malicious applications were always able to conceal their attacks effectively and still read content efficiently.
One might worry that a malicious app might be able to remove occlusions in cross-origin iframes, without opt-in from that content. The shape of the API prevents such attacks - the cross-origin iframe would have to produce a {{RestrictionTarget}} and pass it to the would-be attacker. As {{RestrictionTarget}}s serve no purpose other than as part of the API introduced by this specification, the minting and passing of a {{RestrictionTarget}} proves the cross-origin iframe's permission for its occlusions to be removed.
In designing the APIs introduced by this specification, a conscious decision was made to not reuse {{CropTarget}}, and define a dedicated token instead ({{RestrictionTarget}}). This ensures that any existing Web applications that have previously been designed and implemented with {{BrowserCaptureMediaStreamTrack/cropTo()}} in mind, but not with {{BrowserCaptureMediaStreamTrack/restrictTo()}}, would not be effectively opting into allowing occlusions to be removed, as described in the previous section.