Commercialization in bioinformatics research: why and how
While developing Independent Data Lab (IDL), I often found bioinformatics group leaders skeptical and reluctant to work with IDL for the mere fact that it was a commercial organization and not an academic group. This rivalry between academic groups and bioinformatics companies appears to go both ways. While I understand the motivations of both sides, I believe it is time for commercial and academic bioinformatics to come together as a necessary next step in the development of the field. It is time to incorporate commercialization as part of the research relay race in moving bioinformatics forward. This is why I decided to write this post, in which I aim to:
- examine the current state of commercialization in scientific research,
- provide some context of the role of commercialization in life sciences (namely biology),
- examine the reasons for its perceived lack in bioinformatics (or whether there is a lack at all),
- and hopefully convince the skeptics that commercialization plays a key role in the progression of research by allowing scientists to focus on novel solutions rather than the maintenance of existing ones.
Commercialization in scientific research
Broadly, research covers two main categories: 1) the development of novel methods, tools and techniques, and 2) the generation of knowledge.
In life sciences, novel methods arise from academic institutes, but are quickly protected by patents and commercialized into private products. The example closest to our heart is DNA sequencing: a lot of early development was done within academic research [1], but it was at a commercial company, namely Serono, that the rise of massive DNA sequencing methods took place [2,3]. Currently, Illumina has a global market share of about 70% [4], and in 2019 it spent roughly 18% of its revenue (~US $650 million) on R&D [5]. That is more than the NIH spent the same year on autism, cystic fibrosis, MS, arthritis, infertility and 220 out of 292 other research areas individually [6]. And this is not an isolated example. According to a 2017 report from the Organisation for Economic Co-operation and Development (OECD), businesses account for around 70% of all R&D expenditures in OECD countries [7]. According to that same report, over 60% of all R&D spending goes to methods development (or, as the OECD classifies it, experimental development). Therefore, it is reasonable to assume that the majority of business R&D funding goes into specifically this area.
Theoretical bioinformatics vs applied bioinformatics
Similarly, bioinformatics research splits into the same two categories: 1) bioinformatics as methods development (what I call “theoretical bioinformatics”), and 2) what is called “applied bioinformatics”, where existing methods are applied to extract knowledge from available data. However, in bioinformatics, unlike in other life sciences, the split in commercialization between the two categories is different. Generally, if you are a bioinformatics Principal Investigator (PI), you are expected to work primarily on methods development, applying these methods to biological problems merely as a proof of concept. Theoretical bioinformatics, while more likely to be funded as independent research, does not benefit from the same commercialization as other areas of biological research: the most commonly used tools continue to be maintained by academic labs and individual researchers and distributed under use-at-your-own-risk licenses.
Why commercialization
Bioinformatics has been around since the 1970s, but next-generation sequencing (NGS) techniques have catalyzed the need for systematic approaches. Suddenly, bioinformatics became a necessity instead of just a possibility. Nevertheless, it is still dominated by academic development. Some of the most widely used tools in bioinformatics are developed and maintained by academic labs, with all of the corresponding drawbacks (described in detail in Philipp Zentner’s post [8]). And while there are a lot of talented developers in the field who promote best practices in scientific software development [9], these issues will most probably not be resolved as long as academic dominance over software persists.
The case for privatization and commercialization in bioinformatics is strong and rests on several necessary features that software has to provide:
- Liability. As NGS moves into healthcare, closer to influencing the lives of people, there has to be accountability. Mistakes can have a dramatic impact on people’s lives: from increased insurance costs due to genetic predisposition [10] to implications of prenatal diagnostics on the unborn life [11,12].
- Robustness and continuous improvement. Rapid development of computing technologies requires continuous adaptation of the software. Some of the tools were created before these technologies even emerged. Cloud computing became popular around 2010, Docker containers emerged in 2013, Kubernetes appeared in 2014, operating systems update every half a year, and the software is expected to be adapted accordingly.
These important features are really hard to achieve in academic settings, for many reasons, but I will name the two that I find most influential:
- The software is often developed by early-career researchers, but academic contracts for post-docs and PhD students are limited to a few years. As a result, the authors of the software often change institutes, countries and fields (or move to industry).
- A researcher’s career depends on publications, and releasing a software patch is not going to get you one.
As a result, a plethora of good bioinformatics software becomes unmaintained and loses its relevance.
Bioinformatics core facilities — academic or industrial?
A bioinformatics core facility is a department that provides sequencing services on a commercial basis or as part of academic collaborations. Such a facility generally emerges from a research group at an academic institution that purchases a sequencer; since the capacity of a sequencer is usually higher than the needs of a single group, it is filled up with collaborative projects.
Core facilities have been some of the most aggressive opponents of commercial bioinformatics, which is understandable if we consider that bioinformatics analysis is one of the key services they provide. They are also key skeptics of the commercialization of bioinformatics, because (and I quote) “Why would someone use a commercial service if we can do it for free?”
Core facilities indeed have an economic advantage: they are generally cheaper than commercial partners. The main reason for this is that their services often come with “bioinformatics analyses included for free”. Another advantage comes from software that academics release under “free-for-academics-but-not-commercial-use” licenses.
However, as is often the case, a lower price tag comes at a price (pun intended). The hiring strategies of universities do not allow building a team of senior researchers to support the necessary bioinformatics workload, and the work is often done by students with limited capacity for follow-ups or customized analysis. We see the result of this in the growing number of bioinformatics service companies founded by PhD students who, while working for those facilities, saw the increasing demand from within (Ecseq, Lifebit, Omiqa and many others).
In my view, such shielding of core facilities from commercial partners is unnecessary. Core facilities play an important role in scientific research, but potential clients need to understand the appropriate use cases: when to use a core facility and when to turn to commercial partners instead. But that’s a topic for a different post.
Commercialization of applied and theoretical bioinformatics
Attempts to commercialize in bioinformatics generally turn companies into consultancies rather than R&D software developers. The reason for this, in my opinion, is the need for high-quality customized bioinformatics, which clients are craving and core facilities struggle to provide. Founders of Bioinformatics-as-a-Service (BaaS) companies, who often base their experience on working in core facilities, realize that it was not the pipeline that the client was paying for, it was the interpretation. As a result, they spend most of their time consulting (once the pipeline is set up and running). Such a transition is not bad, and it is to be expected: it indicates a market need and shows that there is clearly no lack of commercialization when it comes to applied bioinformatics. Applied bioinformatics is heavily commercialized as platforms and contract research, and this market is expected to grow [13].
I am more interested in the reasons for the failure to commercialize within the theoretical bioinformatics domain. And I believe the answer lies inside the bioinformatics community itself.
Importance of community and open-source
As is often the case with academic research areas, bioinformatics has a strong, close-knit and amicable community; it is a wonderful group to be a part of: encouraging, supportive, and responsive. But commercial initiatives are generally not well received here. The skepticism of the bioinformatics community towards commercial players is well deserved: bioinformatics companies leverage open-source development without contributing to it, close their source code from public view and introduce licensing fees.
Such behavior by companies, in my opinion, is a consequence of the lack of exposure of mostly academic bioinformatics founders to the world of scientific software engineering. Many carry the misconception that “commercial” means closed and with a license fee. And while licensing is a valid business model, it does not fit well in the field of theoretical bioinformatics, where our clients have a deep knowledge of the field.
I am deeply convinced that a different strategy is needed here: the code has to remain open-source, it has to gain the trust of the community, and the company has to contribute to the community before trying to leverage it. If we examine cases of open-source, community-contributed software, we will see that (a) it emerged from a commercial company, (b) it is maintained by the commercial company, and (c) it has the trust of the community. Just a few of the most popular examples: the machine-learning library TensorFlow [14] (developed by Google), the Linux-based operating system CentOS [15] (developed by Red Hat), and the C++ library VTK [16] (developed by Kitware).
A business model based on open-source software is not the most obvious one, but there are plenty of successful examples. My favorite in the field of scientific computing is Kitware. This is a company that provides key software libraries in the field of scientific visualization, and its software remains open-source. Such a business model still has a consultancy component, as a lot of the revenue comes from contract research. But it is an example of commercialization that does not stand in the way of open science and instead provides the means to maintain important contributions.
It is my deep conviction that the best results can be achieved if academic and commercial bioinformatics recognize each other’s strengths and start working together towards bringing bioinformatics into people’s lives. I hope my post, if it has not convinced you of the necessity of commercialization in bioinformatics, has at least de-demonized it and given you a good overview of the role of commercialization in software progression. For me, moving from academic research into an industrial setting was mainly motivated by a need for standardization and a strong desire to clean up and organize the code and the analysis, which I never had time to do under the pressure to generate knowledge and publish papers. I am still a big fan of academic research, and I would like to stay as close to it as possible, but I believe that it is time for bioinformatics to start passing the baton to industrial partners, so that you, the academics, can work on novel methodologies while we take the maintenance off your hands.
Thank you for reading.
I thank IDL’s community of freelance bioinformaticians (Patrick Grossmann, Matthias Stahl, Gregor Sturm, Manuel Belmadani, Philipp Zentner, Joel Sevinsky and Alvaro Sanchez) who helped me write this post.
References
(14) TensorFlow https://www.tensorflow.org/ (accessed Jul 21, 2020).
(15) The CentOS Project https://www.centos.org/ (accessed Jul 21, 2020).