October 2021


3D asset creation: current methods, challenges and future trends



Traditionally, people have consumed visual content primarily in 2D form, namely images and videos. Although this differs from the way we experience the world through our visual system, humans have adapted well to consuming 2D visual content, a limitation imposed by the technologies available for capturing, transmitting and displaying visual content. As a result, up until a few years ago, consumer 3D content was limited to a few specific scenarios such as animated movies and video games, where the content creation process is carried out by professional studios.

However, in recent years there has been growing demand for 3D content, driven by technological advances in computing power, bandwidth, hardware and software. These advances have given rise to new fields such as 3D printing, augmented and virtual reality (AR/VR) and 3D web-based experiences. In addition, the appetite of the movie and gaming segments for 3D models grows exponentially over time. Figure 1 shows the increase in the number of artists involved in successive releases of Grand Theft Auto (GTA), used here as a proxy for the increase in the number of 3D models required.



Figure 1: Growth in number of technical artists through the different releases of GTA.


A particular emerging use case of 3D content is online shopping, where customers can observe and interact with products in 3D, and optionally experience them virtually in their own personal settings using AR (see Figure 2 for an illustration). Examples include Houzz, Wayfair, Target and IKEA, to name a few.



Figure 2: Example of Augmented Reality use in retail


In fact, 3D-based web experiences are growing at an exponential rate. This is evident from the number of downloads over time of three.js, a very popular open-source JavaScript library that lets web developers easily integrate 3D viewing and interaction capabilities into websites (Figure 3).


Figure 3: Three.js downloads per year


This phenomenon can be regarded as the latest link in the natural evolution of web-delivered content towards media types that are richer and more compelling to humans, analogous to the transition from the mostly textual content of the early internet to today's images, audio and video of ever-increasing resolution.

The limiting factor: 3D content creation

The realm of 3D-based consumer experiences has seen some major advances in the past years, mostly related to content delivery technologies such as 3D printers, AR/VR goggles and SW platforms such as ARCore and ARKit. In contrast, the methods for creating the 3D models themselves have stagnated: despite improvements in 3D content authoring SW and HW tools, we have not seen a paradigm shift in this field. It is important to note that acquiring 3D visual content is an inherently more complex process than capturing 2D images, since something has to be done to obtain the additional dimension in the data.

Current methods for 3D asset creation
The existing methods for 3D model creation can be roughly divided into three categories. Let’s briefly review them:

  • Photogrammetry
    The term photogrammetry pertains to the field of making measurements based on 2D images as input. Photogrammetric 3D reconstruction is the process of reconstructing the 3D geometry of an object from a relatively large number of images of the object, taken under different conditions (typically different angles and viewpoints), using primarily geometric considerations. A typical photogrammetric reconstruction pipeline comprises the following stages:
  1. Detecting matching feature points across different input images: This process usually begins with the detection of interest points within the input frames. Interest (or feature) points are locations within the image which have a characteristic structure (e.g. corners). The search for interest points is typically performed over different image locations and scales. For each detected point, a feature vector (also termed a descriptor) is extracted from the image patch surrounding the point. Finally, the calculated descriptors are used to search for point matches between different input images. This stage is largely where the requirements on the input data stem from. Although feature point detection algorithms are designed to be robust to variations between input images, such as changes in viewpoint, lighting and scale, in practice the input images need a significant amount of overlap and some level of photometric consistency for this stage to perform well.
  2. Sparse reconstruction: The matched points are fed into an optimization process which simultaneously estimates the 3D coordinates of the points on the imaged object corresponding to the detected feature points and the camera parameters. The optimized quantity is typically some form of the reprojection error of the 3D points with respect to the image plane of each camera. The output of this stage is a sparse set of 3D coordinates as well as the intrinsic and extrinsic camera parameters for each input frame. This stage is commonly termed structure from motion (SfM), and the joint optimization at its core is known as bundle adjustment.
  3. Dense reconstruction: The estimated camera parameters are used to densely match points across different input images. This is possible since the acquisition geometry is already known, which dramatically reduces the possible matches for a given image pixel amongst the other input images. Once a point match between different images is established, the known acquisition geometry is used to triangulate the 3D location of the corresponding point on the imaged object by calculating intersections between camera rays. This stage is commonly termed multi-view stereo (MVS), and its output is a dense point cloud.
  4. Meshing and texture mapping: This is the final stage of the 3D reconstruction process, where the model, which up to this point is represented as a 3D point cloud, is converted to a format which is readily amenable to rendering. First, a mesh model is created from the 3D point cloud. Meshing is the process of determining the connectivity between different 3D points, making it possible to define a continuous surface passing through the points. This interpolating surface is typically piecewise planar or polynomial. Texture mapping is the action of projecting image pixels back onto the reconstructed surface in order to define the color at each surface vertex.
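
To make the geometry in stages 2 and 3 more concrete, here is a minimal sketch in Python using only the standard library. It is not the implementation of any particular photogrammetry package. The function names are our own, the camera is assumed to be an idealized pinhole with no rotation and no lens distortion, and the triangulation uses the simple midpoint method (the point closest to two camera rays), which is only one of several ways to "intersect" rays that never meet exactly in the presence of noise.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def along(origin: Vec3, direction: Vec3, s: float) -> Vec3:
    """Point at parameter s along the ray origin + s * direction."""
    return (origin[0] + s * direction[0],
            origin[1] + s * direction[1],
            origin[2] + s * direction[2])


def triangulate_midpoint(o1: Vec3, d1: Vec3, o2: Vec3, d2: Vec3) -> Vec3:
    """Stage 3 idea: midpoint of the shortest segment joining two camera rays."""
    w = sub(o1, o2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1, p2 = along(o1, d1, s), along(o2, d2, t)
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2, (p1[2] + p2[2]) / 2)


def reprojection_error(point: Vec3, cam_center: Vec3,
                       observed_px: Tuple[float, float],
                       focal_px: float,
                       principal_px: Tuple[float, float]) -> float:
    """Stage 2 idea: pixel distance between an observed feature and the
    projection of `point`. Assumes a pinhole camera looking along +z with
    identity rotation (a simplification made for this sketch)."""
    x, y, z = sub(point, cam_center)
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = focal_px * x / z + principal_px[0]
    v = focal_px * y / z + principal_px[1]
    return math.hypot(u - observed_px[0], v - observed_px[1])


if __name__ == "__main__":
    # Hypothetical setup: two cameras one unit apart observing the same point.
    cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    target = (0.5, 0.2, 2.0)
    point = triangulate_midpoint(cam1, sub(target, cam1), cam2, sub(target, cam2))
    print("triangulated point:", point)
    print("reprojection error (px):",
          reprojection_error(point, cam1, (521.0, 320.0), 800.0, (320.0, 240.0)))
```

In a real pipeline the reprojection error is summed over all points and cameras and minimized jointly, and the camera rotation and lens distortion are part of the model. The sketch only shows the quantities involved.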


Photogrammetry is a relatively popular method for 3D model creation, as it is capable of producing high quality models under certain conditions. Its main limitation is the need for a relatively large number of images (from a few dozen to thousands, depending on the model size and geometry), acquired under controlled conditions: full coverage of the reconstructed object, overlap between the different views and consistent lighting.

The main advances in photogrammetry in recent years have centered on making the technology accessible to end users, mostly via dedicated mobile applications (see examples here and here).


  • 3D scanners
    3D scanners are dedicated devices designed to capture geometric information. These devices typically contain an active illumination component, a photon sensor, some electronics and computational SW to reconstruct the 3D geometry from raw measurements. Several imaging methods may be employed to implement a 3D scanner, most notably structured illumination and time of flight. A detailed discussion of 3D scanning technologies is beyond the scope of this post. However, it is important to mention that besides requiring dedicated HW, 3D scanning also typically poses limitations on the imaging conditions (e.g. ambient lighting), object reflectance properties (e.g. specularity), achievable resolution and/or imaged object size. These limitations are inherent to the specific design and geometry of the scanner itself. In addition, obtaining a high quality, 360-degree model using a commercial 3D scanner is usually a time-consuming and cumbersome task, as the user is required to scan the object from all directions.

    Similar to photogrammetry, 3D scanning has advanced in the past years from an accessibility standpoint. Namely, technological advances in microelectronics have enabled integrating 3D sensors into mobile devices.
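
As a rough illustration of the time-of-flight principle mentioned above, the following Python sketch converts a measured round-trip time of a light pulse into a distance, using the standard relation distance = (speed of light × round-trip time) / 2. The function names are ours, and the sketch ignores everything a real sensor has to deal with (noise, multi-path reflections, ambient light). It is meant only to show why depth resolution is tied to timing precision.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to the surface, given the round-trip time of a light pulse."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the object and back, hence the division by two.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def timing_resolution_s(depth_resolution_m: float) -> float:
    """Timing precision needed to distinguish two depths this far apart."""
    return 2.0 * depth_resolution_m / SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    # A 10 ns round trip corresponds to an object roughly 1.5 m away.
    print(f"{tof_distance_m(10e-9):.4f} m")
    # Resolving 1 mm of depth requires timing precision of a few picoseconds.
    print(f"{timing_resolution_s(1e-3) * 1e12:.2f} ps")
```

The second function hints at one source of the resolution limits noted above: millimetre-level depth differences correspond to picosecond-level timing differences.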

  • Manual creation
    Manual creation of 3D content using dedicated SW (e.g. 3ds Max, Maya) is the most common method at the moment, and the vast majority of online 3D models are created this way. In this process, a professional 3D technical artist uses a few 2D images as guiding data and builds the model by hand in specialized SW. Due to the expertise and amount of labor required, this kind of process is relatively expensive and time-consuming.


The caveats of current 3D asset creation methods
Besides the technical limitations of the different asset creation methods discussed above, they all bear inherent drawbacks. Photogrammetry and 3D scanning both have the trivial, yet often overlooked, limitation that they require the physical presence of the imaged object. This may be a non-issue for a consumer creating a one-off model for recreational use, or for other small scale use cases. However, for larger scale uses, such as a retailer with thousands of items to model, or a game developer requiring many 3D models just to create a realistic scene, those approaches are infeasible. In addition, both of those methods require a relatively long and involved acquisition process.

For those reasons, most at-scale uses of 3D content rely on manual modeling, where existing content in the form of 2D images depicting the object to be modelled suffices. Naturally, the main limitation here is the large amount of professional human labor required, which dominates the time and cost of creating the content. Creating one simple 3D model of an object typically costs at least tens of dollars and takes a few hours, and may reach hundreds of dollars and a few days of work for complex objects.
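
To get a feel for what these figures mean at scale, here is a back-of-the-envelope estimate in Python. All inputs are hypothetical and chosen only for illustration: a catalog of 5,000 items, 50 dollars and 3 hours per simple model, and an 8-hour working day. They are assumptions, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelingEstimate:
    total_cost_usd: float
    total_hours: float
    artist_days: float


def estimate_manual_modeling(num_models: int,
                             cost_per_model_usd: float,
                             hours_per_model: float,
                             hours_per_day: float = 8.0) -> ModelingEstimate:
    """Scale per-model cost and time linearly to a whole catalog."""
    if num_models < 0 or hours_per_day <= 0:
        raise ValueError("num_models must be >= 0 and hours_per_day > 0")
    total_hours = num_models * hours_per_model
    return ModelingEstimate(
        total_cost_usd=num_models * cost_per_model_usd,
        total_hours=total_hours,
        artist_days=total_hours / hours_per_day,
    )


if __name__ == "__main__":
    # Hypothetical retailer catalog; every number here is an assumption.
    estimate = estimate_manual_modeling(5_000, 50.0, 3.0)
    print(f"cost: ${estimate.total_cost_usd:,.0f}")
    print(f"labor: {estimate.total_hours:,.0f} hours "
          f"(~{estimate.artist_days:,.0f} artist-days)")
```

Even with these modest per-model assumptions, the totals reach hundreds of thousands of dollars and thousands of artist-days, which is why labor dominates at-scale manual modeling.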

So, what’s next?

Looking forward, we can identify some clear trends forming around 3D asset creation solutions. We choose to separate those into small scale solutions, i.e. consumer grade solutions for creating a handful of 3D models, vs. large scale solutions suitable for businesses and enterprises looking to generate at least thousands of 3D models.

Small scale solutions

For consumer-level 3D model creation, we expect the current trend of enabling 3D capture on mobile devices to continue and become more and more common. This will involve either dedicated 3D sensing hardware embedded within the mobile devices or on-device, consumer-facing photogrammetry SW that creates 3D models from many 2D images or from a video.

Judging by current academic publications in the field, it is likely that in the near future these mobile-based solutions will improve in computational performance, features and quality through AI technologies such as Neural Radiance Fields.

Scalable solutions 

Following the analysis we presented in the previous sections, it is clear that any solution for 3D model creation at scale cannot rely upon 360-degree acquisition of the object to be reconstructed due to the lengthy and cumbersome nature of such an acquisition process. We also remind the reader that there are many applications where the object itself is not physically available. Therefore, we believe that any scalable solution must be based on existing and ubiquitous data – namely available 2D images of the objects to be reconstructed.

Once again, examining state-of-the-art academic publications, we see that AI-based 3D reconstruction from a single image is an active research topic (see e.g. the papers here and here). It is noteworthy that very few works use several 2D images as input to the 3D reconstruction process, and those typically require additional information about the acquisition setup (e.g. camera parameters), which is unknown in practice.

Our vision and mission

Our vision for the future of 3D asset creation is that the capability of creating 3D assets from a handful of existing images will become commercially available in the very near future. At first, this technology will be utilized for mass creation of 3D models of common items such as electrical appliances, cars, furniture, buildings, etc. by large enterprises looking to convert large repositories of existing images to 3D. Further ahead, this technology is likely to be adopted for consumer applications, unlocking additional capabilities on mobile devices.

Our mission around this is very simple: to make it happen. To that end, we teamed up with 3D technical artists to learn how they approach 3D modelling. We also researched the latest computer vision and AI algorithms and, through trial and error, integrated this knowledge to build a working pipeline. During this process we realized that even the best algorithms won’t do without good datasets and the compute power required to train them. To address those issues, we complemented our algorithmic pipeline with a holistic infrastructure capable of handling our data and training our Deep Learning models in a quick and cost-effective way.

You are welcome to learn more about our solution on our technology page, and even try the technology yourself on our demo platform.