Extracting Salient Thumbnails for Real Estate Panoramas

In a previous post we shared how Trulia leverages computer vision and deep learning to select the hero image for listings on our site. Today we'd like to share how we've extended this technique to extract salient thumbnails from 360-degree panoramas of homes.

The evolution of automatic panorama generation in modern cameras has made 360-degree images ubiquitous. They are widely used in multiple industries, in everything from Google Street View to virtual reality (VR) films. Last year, Trulia's parent company Zillow Group released 3D Home, enabling users to upload and share 360-degree panoramas of their listings. Compared to traditional photography, 360-degree panoramas create an immersive experience by providing a complete view of the room.

This capability has added value for home-buyers, who can see a room and home from all angles, not just what the photographer chose to highlight. However, given the wide field of view, it is easy for the viewer to look in the wrong direction. This creates a need for a mechanism to direct viewers' attention to the important parts of the scene. This is particularly true in contexts where the 360-degree panoramas need to be represented as a static 2D image, as shown below in Figure 1. Here we ensure that these 360-degree panoramas are discoverable and engaging to the user by identifying the most salient viewpoints within the panorama.

Figure 1. A screenshot of the 3D Home user experience. The thumbnails in the bottom row help viewers get a sense of the important and salient aspects of each panorama.

Zillow Group 3D Home

With the release of Zillow 3D Home last year, agents and sellers can capture 360-degree panoramas of a home and add them to their listing on Trulia or Zillow. The app guides users through the capture process and then uploads these panoramas to ZG servers. If you are interested in learning more about how we use computer vision techniques to create panoramas and compose a rich 360-degree experience, we encourage you to read our blog.

These panoramas help provide an immersive home viewing experience that isn't possible with standard, limited field of view 2D photos. Panoramas like these help agents and sellers attract more views to a listing.

In Figure 1 we show an example of how these panoramas look on our site. The entry point to each panorama is marked with a salient thumbnail designed to provide an informative view of each panorama. Such a thumbnail can help increase the chances that a user will engage with the content and enter the 360-degree experience.

Figure 2. A 360-degree real estate panorama with a 60-degree vertical field of view.

Characteristics of a Salient Thumbnail

Salient thumbnails are visually informative and aesthetically pleasing viewpoints of a spherical 360-degree panoramic image. We define salient thumbnails as having three key characteristics: they must be representative, attractive, and diverse.

  1. Representative: A salient thumbnail should be informative and representative of the visual scene being captured by the panorama.
  2. Attractive: A salient thumbnail should capture aesthetically pleasing viewpoints within a scene.
  3. Diverse: If multiple salient thumbnails exist, they should capture different viewpoints within the scene.

Figure 3. Extracted 2D viewpoints from the above panorama. Viewpoint C is selected by our algorithm as the salient viewpoint, while Viewpoint D represents the thumbnail generated by the baseline approach. Viewpoint C provides a representative view of the scene while being more aesthetically pleasing and relevant than the other viewpoints.

Generating Thumbnails

An initial approach to generating a thumbnail would be to simply extract a single viewpoint in the hope that it will result in a salient thumbnail. However, this presents certain challenges.

Panorama images represent 360 degrees of horizontal view and 180 degrees of vertical view, resulting in a 2:1 aspect ratio. In some instances, like in Figure 2, an image may have a limited vertical field of view and require padding with zero values, or black pixels, to maintain the aspect ratio.

We could extract the central viewpoint in the panorama, which frequently corresponds to the direction in which the camera was facing when the capture process began. However, this approach relies on the user starting with a representative viewpoint in order for it to be salient.

Furthermore, we cannot simply crop a rectangular section from a panorama image. Unlike 2D images, panoramas are spherical; simply cropping the central viewpoint leads to thumbnails that have spherical artifacts, e.g. straight lines look curved.

To extract a viewpoint from a spherical panoramic image, we project the panorama into a 2D image with a specific field of view, 3D orientation, aspect ratio, and viewport size, as shown in Figure 4.

Figure 4. Extracting perspective images from 360-degree spherical panoramas.
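
To make this projection step concrete, here is a minimal sketch in Python of extracting a perspective view from an equirectangular panorama via a gnomonic projection. The function name, parameters, and nearest-neighbor sampling are our own illustrative choices; the post does not describe Trulia's actual implementation.

```python
import numpy as np

def extract_view(pano, yaw_deg, pitch_deg, h_fov_deg, out_w, out_h):
    """Project an equirectangular panorama (H x W x 3) into a perspective
    image looking in the (yaw, pitch) direction. Illustrative sketch."""
    pano_h, pano_w = pano.shape[:2]
    yaw, pitch, h_fov = np.radians([yaw_deg, pitch_deg, h_fov_deg])

    # Focal length in pixels derived from the desired horizontal FOV.
    focal = (out_w / 2.0) / np.tan(h_fov / 2.0)

    # Ray direction for every output pixel (camera looks down +z, y up).
    xs, ys = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                         out_h / 2.0 - np.arange(out_h))
    dirs = np.stack([xs, ys, np.full_like(xs, focal)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays: pitch about the x-axis, then yaw about the vertical axis.
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rot_x = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (rot_y @ rot_x).T

    # Convert ray directions to longitude/latitude, then to panorama pixels.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * (pano_w - 1)).astype(int)
    v = ((0.5 - lat / np.pi) * (pano_h - 1)).astype(int)
    return pano[v, u]  # nearest-neighbor sampling
```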

While this addresses the spherical artifacts, thumbnails extracted from the central viewpoint can often be unrepresentative of the scene the panorama is trying to capture, as shown in Figure 3 above.

Generating Salient Thumbnails

To resolve this we needed an algorithm that ranks all the potential perspective thumbnails extracted from the panorama based on our defined saliency criteria. Below we list some of the key components of our approach to extracting salient thumbnails.

Optimal Field of View (FOV) Estimation

Dedicated 360 cameras often offer a full 180-degree vertical field of view (FOV). However, capture devices like smartphones often have a limited field of view. For example, a standard iPhone camera has a 60-degree vertical FOV. In such cases where the whole 180-degree vertical FOV is not present, it's common practice to pad panorama images with zero-value pixels, as shown in Figure 2.

Due to these variations in vertical FOV, correctly estimating the vertical FOV before extracting the thumbnail images is paramount. Otherwise, we may be left with uncaptured regions surrounding a thumbnail image when the estimated FOV is too large, or a significant loss of view in the vertical direction when the estimated FOV is too small.

To address this issue, for each panorama we identify the exact location of the horizontal edge lines that separate the black regions from the image regions, and then use this location to obtain the optimal FOV.
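
A minimal sketch of this FOV estimation, assuming the padding consists of near-black rows; the `black_threshold` value and function name are our own:

```python
import numpy as np

def estimate_vertical_fov(pano, black_threshold=5):
    """Estimate the vertical FOV (degrees) of an equirectangular panorama
    (H x W x 3, uint8) that may be zero-padded with black rows top and bottom."""
    row_max = pano.reshape(pano.shape[0], -1).max(axis=1)
    content_rows = np.where(row_max > black_threshold)[0]  # non-black rows
    if content_rows.size == 0:
        return 0.0
    top, bottom = content_rows[0], content_rows[-1]
    # A full-height equirectangular image spans 180 degrees vertically, so
    # the FOV is proportional to the fraction of rows with real content.
    return 180.0 * (bottom - top + 1) / pano.shape[0]
```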

Calculating Saliency Scores

Once the optimal FOV has been estimated, the system extracts 90 thumbnail images sampled uniformly from the panorama at a horizontal view interval of 4 degrees. The 4-degree intervals were empirically chosen to achieve a tradeoff between computational overhead and the ability to capture all diverse salient viewpoints within the panorama. These images, or viewpoints, are then ranked based on a saliency score. The viewpoint with the highest score is then chosen as the salient thumbnail.
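
A sketch of this sampling-and-ranking step, reusing the `extract_view` sketch above and deferring scoring to a `saliency_score` callable (the weighted model ensemble described in the next sections); the output resolution and FOV here are illustrative:

```python
import numpy as np

def best_viewpoint(pano, saliency_score, out_w=640, out_h=480, fov_deg=90):
    """Score 90 candidate views at 4-degree yaw intervals; return the best."""
    yaws = np.arange(0, 360, 4)                   # 90 candidate viewpoints
    views = [extract_view(pano, yaw, 0, fov_deg, out_w, out_h)
             for yaw in yaws]
    scores = np.array([saliency_score(view) for view in views])
    best = int(scores.argmax())
    return int(yaws[best]), views[best]
```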

Saliency Models

In order to compute a saliency score, we rely on three different deep convolutional neural network (CNN) models that we briefly covered in a previous post. Each of these models is trained on a different dataset and helps capture the various characteristics of a salient thumbnail.

Scene Model: This model helps capture the representativeness of a viewpoint to the panorama. Since we are capturing real estate panoramas, viewpoints that are categorized as relevant scenes like the kitchen or living room are considered more representative than viewpoints that focus on objects or non-scene related views like blank walls, windows, or chandeliers. This model is trained to categorize over 60 different scene classes commonly found in real estate photos.

Attractiveness Model: This model captures the visual attractiveness of a viewpoint. Viewpoints that point to unattractive regions of the panorama or have low visual quality (blurry, dark, etc.) are penalized, while viewpoints that are attractive and aesthetically pleasing earn a higher score. Training such deep learning models often requires large amounts of data. In general, in our datasets we have found that luxury homes often have high quality, professionally staged, and aesthetically pleasing photos, while fixer-uppers sometimes have low quality, poorly captured images. Hence we apply the predictions from another ML classification model that uses property features like price, location, property description, etc. to categorize properties as Luxury or Fixer, and then use it to label images from those properties as either Attractive (if it belongs to a Luxury property) or Non-Attractive (if it belongs to a Fixer Upper). This allows us to train a deep learning model for image attractiveness on millions of images without the need for expensive data collection.
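
A minimal sketch of this weak-labeling idea; the names and data shapes here are illustrative assumptions, not Trulia's actual pipeline:

```python
def label_attractiveness(properties, classify_property):
    """Weakly label listing photos using a property-level classifier.

    properties: iterable of (property_features, image_urls) pairs.
    classify_property: maps features (price, location, description, ...)
    to 'luxury' or 'fixer'. Returns (image_url, label) pairs with
    label 1 = Attractive, 0 = Non-Attractive."""
    labeled = []
    for features, image_urls in properties:
        label = 1 if classify_property(features) == 'luxury' else 0
        labeled.extend((url, label) for url in image_urls)
    return labeled
```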

Appropriateness Model: This binary classification model helps differentiate relevant viewpoints, like views of a bedroom, from irrelevant views like walls, macro objects, pets, humans, text, etc. Along with the scene model, this model is used to quantify the representativeness and relevance of a viewpoint to the panorama.

Each of the candidate thumbnails/viewpoints is processed by the above deep learning models to generate three distinct scores. The saliency score is then defined as the weighted sum of the three scores obtained from the scene, attractiveness, and appropriateness models.
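
A minimal sketch of this weighted combination; the post does not publish the actual weights, so the values below are placeholders:

```python
def make_saliency_score(scene_fn, attractiveness_fn, appropriateness_fn,
                        weights=(0.4, 0.3, 0.3)):
    """Build a saliency scorer from the three model wrappers. Each *_fn maps
    a view image to a scalar score; the weights here are assumed values."""
    def saliency_score(view):
        scores = (scene_fn(view), attractiveness_fn(view),
                  appropriateness_fn(view))
        return sum(w * s for w, s in zip(weights, scores))
    return saliency_score
```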

Smoothing Model Outputs

Since our deep CNN models are trained on 2D images that are sampled from real estate listings, they can be susceptible to errors when applied to 2D image frames extracted from continuous streams like video or 360-degree panoramas. One potential reason for this is the bias introduced by training data, where 2D images do not fully represent the distribution of all potential viewpoints that can be extracted from a 360-degree panorama.

This sampling bias between training and testing can cause the above deep CNN models to generate non-smooth outputs across adjacent frames. This is a general problem with deep networks, and we plan to address it in future versions by tuning these networks to produce smoother outputs across adjacent viewpoints.

To make our algorithm more robust and accurate against the aforementioned artifacts, we apply a smoothing function, in this case Gaussian kernels, to the probability outputs of individual classes for these deep models along the horizontally sampled viewpoints. This smoothing ensures that the model predictions do not change drastically across adjacent viewpoints.
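
A minimal sketch of this smoothing step using SciPy's 1D Gaussian filter; treating the yaw axis as circular via `mode='wrap'` and the `sigma` value are our own assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_probabilities(probs, sigma=2.0):
    """probs: (num_viewpoints, num_classes) array of per-class model outputs,
    one row per sampled yaw. Smooth along the viewpoint axis; mode='wrap'
    treats the axis as circular since the panorama spans a full 360 degrees."""
    return gaussian_filter1d(np.asarray(probs, dtype=float),
                             sigma=sigma, axis=0, mode='wrap')
```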

Scene Type Dependent Weighting

As mentioned earlier, the scene model helps us quantify the representativeness of a viewpoint. For instance, a thumbnail image with a scene type of living room should be preferable to a thumbnail with a scene type of wall. To achieve this we associate a preference weight with each scene type. In general, scene types such as living room, bedroom, dining room, and kitchen should have higher weights than scene types such as a window, door, or wall. The representativeness score of a thumbnail is then computed as the product of the predicted probability of that scene type and the preference weight of that scene class.
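
A minimal sketch of this weighting scheme; the weight values below are illustrative assumptions, since the post only states that room scenes should outweigh views like walls, doors, and windows:

```python
# Illustrative preference weights; room-level scenes outrank structural views.
SCENE_WEIGHTS = {
    'living_room': 1.0, 'kitchen': 1.0, 'bedroom': 0.9, 'dining_room': 0.9,
    'window': 0.2, 'door': 0.2, 'wall': 0.1,
}

def representativeness(scene_probs):
    """scene_probs: dict mapping scene class -> predicted probability.
    Returns the probability of the top scene times its preference weight."""
    scene, prob = max(scene_probs.items(), key=lambda kv: kv[1])
    return prob * SCENE_WEIGHTS.get(scene, 0.5)  # default weight assumed
```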

Introducing Variety

While the use case we showed above (Figure 1) requires one salient thumbnail per panorama, the algorithm supports returning multiple salient thumbnails. This is relevant in scenarios where the application might want to provide multiple salient viewpoints, like with a panorama summary that uses an image collage, or where more than one salient viewpoint exists, for instance a panorama capturing an open floor plan with viewpoints of both the living room and kitchen.

When choosing multiple salient thumbnails, it's important to consider diversity in the top N salient thumbnails that are returned to the user. In this case, we introduce diversity by ensuring that the selected horizontal viewpoints are well separated from each other. To accomplish this, we first score all the viewpoints using the algorithm described above and then rank them by their saliency scores. After selecting the top salient thumbnail, the next salient thumbnail is chosen among thumbnails that have at least 30 degrees of separation in horizontal view from the already selected thumbnails. The same principle continues until either the top N salient thumbnails have been selected or no more thumbnails are left that satisfy the above condition.
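
A minimal sketch of this greedy selection; the function names are our own:

```python
def angular_separation(a, b):
    """Smallest separation in degrees between two yaw angles on a circle."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def select_diverse(yaws, scores, top_n, min_sep_deg=30):
    """Greedily pick up to top_n viewpoint indices by score, enforcing at
    least min_sep_deg of horizontal separation between selections."""
    order = sorted(range(len(yaws)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if all(angular_separation(yaws[i], yaws[j]) >= min_sep_deg
               for j in selected):
            selected.append(i)
            if len(selected) == top_n:
                break
    return selected
```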

Evaluation through Human Judgement

Now that we have a model, how do we know if it actually works? It is one thing to test a model against a labeled ground truth test set, and an entirely different thing to learn how humans perceive the results. In order to test with humans we performed a series of qualitative and quantitative experiments.

We used a total of 137 panoramas to evaluate the performance of the new algorithm compared to the baseline algorithm, which just extracts the central viewpoint. A total of 50 Amazon MTurk workers participated in the evaluation.

For each panorama, we show users three images: the original panorama, a thumbnail image using the baseline algorithm, and a thumbnail image using the new algorithm. Each worker is then asked to choose which of the two candidate thumbnail images he/she prefers, as well as whether they strongly or slightly prefer their selected thumbnail image.

To ensure order doesn't bias results, we randomly shuffle the order in which the thumbnails from the two algorithms are displayed for a panorama. To ensure crowdsourcing quality, we also randomly inject benchmark, or gold standard, images into the experiments to track and monitor worker quality.

Overall results of the experiments showed a statistically significant improvement for our new algorithm when compared to the baseline algorithm (Table 1).

Table 1: Results from the human evaluation. Each panorama was shown to 50 users. For ~70% of the panoramas, a majority of the users preferred the salient thumbnail over the default one. In terms of overall votes, ~66% preferred the salient thumbnail over the default thumbnail, of which ~66% were strong preferences.
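
For illustration, here is one way the per-panorama majority-preference figure could be computed from raw worker votes; the vote record format is hypothetical, not the actual experiment schema:

```python
from collections import Counter, defaultdict

def majority_preference_rate(votes):
    """votes: iterable of (panorama_id, preferred) tuples, where preferred
    is 'salient' or 'default'. Returns the fraction of panoramas for which
    a majority of workers preferred the salient thumbnail."""
    by_pano = defaultdict(Counter)
    for pano_id, preferred in votes:
        by_pano[pano_id][preferred] += 1
    wins = sum(1 for c in by_pano.values() if c['salient'] > c['default'])
    return wins / len(by_pano)
```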

Deployment to Production

To deploy these models we leverage our ML framework for deploying deep learning models at Trulia, which we described in our previous post.

We expose the thumbnail generation service as a REST API on top of our ML platform. The API allows client applications to query the service with a panorama image and optional parameters like the desired aspect ratio, height, width, number of diverse salient thumbnails to be extracted, etc. The service then asynchronously computes the desired number of diverse salient thumbnails and returns them to the client via S3 or as a base64 encoded image via SQS.
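
A hypothetical client call illustrating these parameters; the endpoint, field names, and schema below are our own assumptions, as the post does not document the actual API:

```python
import requests

# All field names and the endpoint path here are illustrative assumptions.
payload = {
    "panorama_url": "s3://example-bucket/panos/listing-123.jpg",  # input image
    "aspect_ratio": "4:3",
    "height": 480,
    "width": 640,
    "num_thumbnails": 3,   # number of diverse salient thumbnails requested
}
resp = requests.post("https://ml-platform.example.com/v1/salient-thumbnails",
                     json=payload, timeout=10)
resp.raise_for_status()
# Per the post, results are computed asynchronously and delivered to the
# client via S3, or as base64-encoded images via SQS, rather than inline.
```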

Note that each panorama can have a large number of potential viewpoints. For example, sampling per degree of horizontal view, while keeping other parameters constant, can lead to 360 distinct viewpoints to score. Even though the service is deployed on GPU instances and our CNN models can compute saliency scores per viewpoint within milliseconds, sampling and scoring all potential viewpoints can quickly become computationally expensive given our low latency requirement for the thumbnail generation service. Furthermore, many of these viewpoints will have significant overlap and do not provide diverse enough scores to justify the additional overhead. Through our experiments we have found extracting 90 thumbnail images, uniformly at a horizontal view interval of 4 degrees each, to be a good balance between recall (capturing a salient viewpoint) of the algorithm and the overall latency of the service.

Going Forward

The above results demonstrate the value of leveraging deep learning to extract salient thumbnails from 360-degree panoramas. Going forward we will improve upon this approach. First, we'd like to develop a better understanding of saliency based on user feedback. We can capture user preferences for candidate thumbnails suggested by the model, and use that as a training signal for our models. Second, instead of training three different models on disjoint datasets to define saliency, we would like to move toward a single end-to-end CNN architecture to model the saliency of an extracted thumbnail.

Stay tuned for future posts on how Trulia is leveraging AI and ML to help consumers discover a place they'll love to live.
