Why we publish our algo design
Computer science as a practice requires you to define precisely what you mean. You cannot tell a computer: “trade stocks for me, and give me best execution!” You must patiently and painstakingly instruct it to copy some portions of bits into other portions of bits, shuffle some third set of bits around, read in some other bits from somewhere else, and so on. Well, usually not actually you, but someone. Well, actually lots of someones. Someones who designed the operating system you’re working on, someones who designed the programming language you’re using, someones who designed the network protocols your trading system uses to communicate with other people’s systems, and so on.
These details are fundamentally knowable in nature, but not in scale by individual human beings. As computer systems evolve, we layer abstractions between ourselves and the lowest level operations of bits. We delegate to hierarchies of teams, tools, and vendors. We allow knowledge to pool into silos of narrow specialization, because otherwise, we could not keep up and get anything done.
The notion of an “algorithm” sits both atop this hierarchy and outside it. If you consult a computer science textbook, the definition of algorithm will likely use words like “procedure” or “recipe” that are not intrinsically tied to the realm of computers. It’s the instructions you give for accomplishing a task, it might say, or a sequence of steps to be followed. The language becomes awkward and vague, but not because the concept is new. Rather because the concept is old. So old, that we probably learned it before we gave it a name. We learned algorithms for tying our shoes, for adding two integers, for brushing our teeth. We learned these things as answers to the question of “how,” and we learned that answers could have varying degrees of specificity. At first we needed very specific instructions, but as we learned and got older, we could follow higher level instructions, subsuming the lowest details as familiar, predictable pieces that did not need to be said explicitly anymore. We also learned to subsume certain tools as given, and restrain ourselves from falling too deeply down every rabbit hole of “why.”
This subsumption has obvious benefits, and less obvious costs. Sometimes we feel the costs when we try to teach someone else something we “know” but find ourselves unable to explain. Sometimes we feel the costs when something changes, and we don’t know how to adapt our procedures effectively. In some cases, the lower level knowledge that informed our higher level understanding has evaporated from our minds, leaving only a derived residue that is brittle in its relation to context, and perhaps invisibly so. How long might it take us to recognize when we are operating on assumptions that are no longer true?
There is something more that can be lost in layers of abstraction if we are not careful: the value of forcing ourselves to be explicit. Anyone who has ever taught a young child or programmed a computer is well aware of the phenomenon - we think we know what we mean, until we see our own words parroted back at us in a literal translation with horrifying consequences. Sometimes this is born of our failure to properly define a sub-concept we are referencing (“ok Tommy, I admit I did not really mean it when I said you could color ‘anywhere’ that wasn’t on the wall”), and sometimes it is born of our failure to anticipate the circumstances under which our instructions will be applied (“oops! I thought I issued that delete instruction in the subfolder containing those old files, not the main folder containing the new ones!”). In navigating our lives and careers, we humans are quick to grasp that there is safety in distance from such specific commands. “But honey, I clearly told the babysitter to keep him out of trouble!” and “But I only instructed the intern to delete the unnecessary files!” We often pretend that all we gain from our distance is convenience and efficiency, but we are often seeking less accountability as well.
Specificity is hard and risky work. But somebody does it, even if we don’t. Between every layer, somebody has to translate higher level instructions into lower level ones until we get all the way down to the shuffling of bits or the changing of diapers. The supposedly shared context that “goes without saying” is meant to make this seamless, but this foundation can crumble rather easily and dramatically at times.
In the domain of algorithmic trading, shared context may be under quite a bit of strain. It is a nearly comical game of telephone that begins when someone puts money into a retirement account. “Give me growth or something,” the future retiree says. “Give them growth or something,” the fund manager tells his subordinates. “Give me exposure to these three factors that roughly mean growth or something,” the next person says. Some number of iterations later, someone says “Give me 100,000 shares of MSFT.” By this point, an important translation has occurred. The original customer goal, to the extent that it was ever formulated in the first place, has been pooled with other customers’ goals, melded through other intermediaries’ interpretations and blended with their own separate goals, and one or more specific orders to buy/sell stocks has emerged from this process. The benefit to the end customer so far is supposed to be twofold. One benefit is the supposed superiority of this chain of experts as compared to the customer’s own haphazard guess at translating high level goals into concrete orders. The other benefit is diversification: since stocks are ultimately bought and sold in indivisible units called shares, it isn’t possible for a single customer with a more limited amount of money and time to purchase and actively maintain the same diversity of assets that a fund manager can purchase and maintain with the pool of all their customers’ money. One can certainly debate the true extent of these benefits, and compare them to more automated solutions like indexing and various artificial notions of fractional shares. But it’s at least somewhat clear what problems these services are supposed to be solving. It would be quite unreasonable and inefficient for each individual retiree to build up the body of knowledge required to passably translate “growth or something” into a suitable portfolio of financial assets and continually re-balance it over time.
Conversely, there is danger lurking in the end customer’s ability to vaguely say “growth or something.” The portfolio that emerges may not serve the customer’s needs well at all. Or the customer may have wildly unrealistic expectations of performance or risk. Such things may be the result of honest miscommunication, deliberate subterfuge, or misaligned incentives (or all of the above).
This is not a problem that is particular to finance, or to computer driven systems. It is a fundamental tension inherent in all task delegation: when you save yourself the work of making your instructions explicit all the way down to the lowest detail, you create opportunities for an agent who does not fully understand or share your goals to deviate, in certain circumstances, from what you would have wanted them to do had you taken the time to fully specify the details.
A popular mechanism for navigating this is competition and choice. Agents will compete for your business, and you can reward the ones who do a good job by continuing to use them, and punish the ones who do a bad job by terminating their services. This mechanism works well when two conditions are satisfied: 1. there is a healthy range of options for service providers, and 2. distinguishing between a good job and a bad job can be done in a reasonable amount of time, and is a much easier problem than doing a good job in the first place.
In the case of a customer contributing to a retirement account, that second condition is problematic. The funds are supposed to perform well over a long term time horizon, and judging them on a short term basis is likely to be dominated by market noise and yield little insight. As a result, there are vast sub-industries of finance organized around addressing these tensions, and vast regulatory regimes in place to try to protect end customers and enforce at least a reasonable zone of interpretation at each layer of translation. Pretty much everyone agrees that this is necessary. Customers need to be able to delegate specialized financial tasks to professionals and trust there are bounds on how professionals should behave. Competition alone is not a sufficient mechanism, as customers aren’t readily equipped to evaluate sophisticated products without putting in an unreasonable amount of work. As new layers emerge and old layers evolve, it’s a constantly moving and delicate dance.
And in fact, we’ve only started. The game of telephone keeps going. The next person says “Give me 25,000 shares of MSFT today and probably 25,000 more tomorrow, we’ll see how it goes.” The next person says “allocate today’s 25,000 share order to one of our brokers and ensure best execution.”
Let’s pause again for a moment. Something weird happened there. Things were still getting more concrete, but then a new source of vagueness slipped in: the notion of best execution. It sounds pretty innocent: who wouldn’t want “best” execution? But what does it actually mean?
If you consult FINRA rule 5310 on Best Execution and Interpositioning, you find that a broker must “use reasonable diligence to ascertain the best market for the subject security and buy or sell in such market so that the resultant price to the customer is as favorable as possible under prevailing market conditions.” This language rules out some obviously bad and lazy things, like routing all customer orders to a particular dark pool without ever comparing the results to other possibilities. But it leaves a lot of wiggle room. There are two gaping holes in this guidance. The first is lurking in the phrase “under prevailing market conditions.” Since the execution of trades is an interactive process between the many brokers submitting orders and the multiple venues matching orders, the timing of trades is highly variable. The timing of individual trades is not completely within a broker’s control (they can’t control when willing counter-parties arrive), but it is influenced heavily by the choices the broker makes in how to distribute a large order into many small orders over time and over trading venues, and in how the broker communicates orders to trading venues (use of order types and order parameters). Since “prevailing market conditions” change rapidly in time, the influence a broker exerts over timing is also an influence on the “prevailing market conditions” under which the trade will be executed. In this way, brokers affect both the grade and the grading rubric for best execution at the same time.
The second gaping hole is that the best execution guidance doesn’t really grapple with the nature of large orders, which are unlikely to be traded in their entirety at once. When a broker designs an algorithm to break up a large order into smaller pieces and seek to trade the pieces gradually throughout the trading day, does the best execution responsibility apply to just the pieces individually or to the large order as a whole? Clearly in spirit, it should apply to the large order as a whole. But what does a “price to the customer... as favorable as possible under prevailing market conditions” even mean when you are looking at several individual prices over the course of a day where market conditions were changing dynamically? How can you know what would have been possible if the order had been chopped up in a different way? What is the space of “reasonable” alternatives that one should compare to and how does one do so while general market noise is likely to drown out small differences in outcomes due to the broker’s behavior? And if you give up on this harder problem and just evaluate each small trade in its temporally local context where things are clearer, surely you might be blind to important failures to choose the “best” local times and order sizes.
So what happens after this troublesome notion of “best execution” is introduced into the game of telephone? It’s not too hard to guess. It gets parroted down the line for a bit, then disappears into the black box of a secret “algo”. When the telephone game turns around and each person reports back to their superior, the “best execution” straw man re-emerges at the same point and gets passed back up. “Buy me 25,000 shares of MSFT today using your VWAP algorithm that provides best execution,” the next person tells the broker. The algorithm makes its choices of how to slice up the 25,000 shares, and spits out a dynamic sequence of much more specific commands: “place a midpoint peg buy order for MSFT on NASDAQ at 10:01:02 am for 100 shares,” the algo says. These commands get transmitted through multiple network layers (and often multiple vendors and intermediaries) and finally land at a trading venue, where perhaps they result in a trade. Their fate gets passed back up to the algo, which may adjust its state and issue new orders. The algo passes its results back to the broker who is running it. The broker periodically runs some paltry and horribly noisy tests on these results to make sure they seem reasonable. Then the broker passes back up to the next person: “here’s your volume-weighted average price, achieved with best execution.” This continues to percolate up the levels to the originator of the 100,000 share mandate, who ultimately receives their 100,000 shares of MSFT, their bill, and an assertion of “best execution.” Here the “best execution” notion evaporates again, and the message morphs back into “here’s your growth or something” to the retiree, who can check the behavior of their account and try to keep it consistent with their goals at various time horizons.
There are two questions that arise when we critically examine this workflow. First: are algo designers really the best people to translate this vague notion of “best execution” into specific sequences of orders in a dynamic, distributed market? Second: how does the commonly black-box nature of algos contribute positively and negatively to the overall process? How much visibility should an algo provide, and to whom?
We firmly believe the answer to the first question is yes. The problem of algorithm design for electronic trading is a complex scientific problem. It involves the delicate dynamics of distributed systems, the complex economics of continuous trading and batched auctions, the fraught task of modeling market forces as randomized processes, and the herculean statistical challenge of evaluating alternative choices in a meaningful way when the degrees of freedom combined with the inherent variance conspire to overwhelm the sample size of a single firm’s trading activity. This is a problem that deserves to be tackled by scientists. Retirees, regulators, and even financial professionals with other specialties should not be expected to solve these kinds of problems for themselves and then tie the algo designers’ hands.
But is competition a sufficient mechanism to ensure that algo designers will do a “good” job on behalf of the end clients? While there is a healthy number of agency brokers and algo products available for trading US equities, it is not at all clear that those who choose between the products can evaluate their effectiveness with sufficient accuracy within a reasonable expenditure of time and energy.
Let’s do a thought experiment (informed by real market data) to help gauge the extent of the challenge to evaluation. Each trading day, the official opening price of a stock is set through an auction at 9:30 am, and the price fluctuates continuously throughout the day until the official closing price is set in an auction at 4:00 pm [1].
If we look at the sequence of prices obtained for trades of a given stock on a given trading day, there is a significant amount of fluctuation. To get a rough sense of how much, we can look at the relative change from the opening price to the closing price. If we let Op denote the opening price and Cp denote the closing price, then this quantity is defined as:

$$D_p = \frac{C_p - O_p}{O_p}$$
We have made this relative to the opening price so that it is meaningful to compare this quantity across different stocks that have vastly different prices. On a given day, we will have over 10,000 values of Dp, one for each symbol. If we collect these Dp values over symbols and over trading days, we can view them as samples from a single probability distribution, weighted by notional value. In other words, we can build up an empirical estimate of the underlying distribution by placing probability mass on each observed Dp value that is proportional to the notional value traded in that symbol on that day.
We did this for all symbols and all trading days over the month of July 2021. Once we have this probability distribution in hand, we can sample from it N times and compute the average value of Dp (evenly weighting over our N samples). In some sense, N represents the number of orders we might use in a sample to try to measure an algo’s performance. This is not a perfect analogy, as each sample here is drawn according to notional value, and real institutional trading flow will probably be distributed differently over symbols than the general notional value distribution over the market. Nonetheless, this should give us some intuition for how much variance there might be in our performance metrics. We can do many experiments of drawing N samples, and look at how much the resulting averages vary. In particular, we’ll look at the interquartile range of our resulting averages, which is the difference between the 75th and 25th percentiles.
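To make this resampling procedure concrete, here is a minimal Python sketch of the experiment described above. The input arrays (`dp_values` and `notionals`) are hypothetical placeholders for the per-symbol, per-day values built from the market data; the data pipeline itself is not shown.

```python
import numpy as np

def iqr_of_sample_means(dp_values, notionals, n, n_experiments=1000, rng=None):
    """Interquartile range of the mean of n notional-weighted draws of Dp.

    dp_values  : 1-D array of observed Dp values (one per symbol per day)
    notionals  : notional value traded for each observation, used as the
                 sampling weights of the empirical distribution
    n          : sample size per experiment (the number of "orders")
    """
    rng = rng or np.random.default_rng()
    probs = np.asarray(notionals, dtype=float)
    probs /= probs.sum()
    # Each experiment: draw n observations and average them evenly.
    means = [rng.choice(dp_values, size=n, p=probs).mean()
             for _ in range(n_experiments)]
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

# Example: how the spread of the averages shrinks as the sample size grows.
# dp_values, notionals = ...   # built from a month of daily open/close data
# iqrs = [iqr_of_sample_means(dp_values, notionals, n) for n in range(1, 151)]
```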
For each N from 1 to 150, we did 1000 experiments, and took the difference between the 750th and 250th resulting values after sorting. Below is a graph of those interquartile ranges as N, the sample size for each experiment, grows from 1 to 150:
Unsurprisingly, as the sample size of each experiment increases, the variation between the resulting averages decreases. At sample sizes near 150, the interquartile ranges are a bit greater than 0.005 wide, and the improvement in precision as a function of growing sample size has slowed.
So what does this mean? For one thing, it suggests that it’s quite difficult to see meaningful performance differences between algos on metrics like slippage vs. arrival at these sample sizes, unless the performance differences are considerably larger than 50 bps. For A-B testing different algos, especially when we are limited to the flow of a single client over a period of a month or two, this is pretty sobering news. Without further correction and careful normalization (a fraught process itself), extraneous market forces are likely to pull the results around noisily enough to obscure even meaningful and consistent differences in algo performance. In fact, measuring algo performance may be as hard as (or harder than!) designing algos in the first place. And the people with the skill set to tackle this hard problem? You guessed it - the same people with the skill set to design algos.
This creates a very uncomfortable position for the first person in the game of telephone who is directed to ensure “best execution.” To truly embody the full spirit of this directive seems to be a full time job equivalent to designing algos, and that’s supposed to be the thing that’s delegated to the black box! What to do?
There are a few common approaches to try to wriggle out of this conundrum. One is to hire in-house scientists to grade the outputs of the algo black boxes. Another is to outsource this job to an independent third party (a TCA provider). A third option is to hollow out the vague directive of “best execution” and replace it with a checklist that boils down to something more like “not obviously terrible execution.”
All of these options have major drawbacks. The use of in-house scientists is likely the best, but it is also costly, and if everyone did it there would be some comical effects: the overall population of quantitative scientists would spend far more resources on grading algos than designing algos, and that feels like an inefficient state for the market as a whole. Also, finding, training, and, crucially, listening to good data scientists is a much harder problem than proliferating data science boot camps would like you to believe. The outsourced TCA provider at least allows a single set of scientists to serve as algo evaluators for a large population of firms who need to evaluate algos, but the incentives are a little weird. While it is true that the third party evaluators should have no incentive to cherry-pick the stats in favor of any particular algo, they also have no strong incentive to do a good job, nor any clear mandate on what a good job is. Human beings are creatures of inertia, after all, and most of what clients want from TCA providers is a stamp of approval that what they are doing already is basically ok. Providing that stamp is much easier to do if one combines TCA with the third option: fixing a minimally defensible definition of “best execution” rather than formulating a more satisfying but more complex one and having to teach your clients that this is what they should want.
There is a fourth option that, as far as we know, has not really been tried before: open the black box. What if we didn’t limit ourselves to grading algos solely on noisy performance metrics? Naturally we’d always want to measure those to learn anything we reasonably can, but what if we could also cut through the noise and examine the raw source: the algo designs themselves, and the processes that drove their development?
As an illustrative comparison, consider the task of deciding where to send your child to school. You might look at test scores for each contender and these are likely to reveal any huge differences, but small differences are unlikely to be particularly meaningful. You could stop there and say, “I’ll send my child to this school that has reasonable test scores,” but wouldn’t you also want to know how the various schools approach their mission of education? Ideally you would want to visit the various schools, you would want to talk to the teachers. You would want to know what they think is important, what they think is unimportant. You would want to gauge how much thought they have put into their approach, and how aligned their values are with your values. You would want to see what’s underneath the test scores. Why settle for a noisy outcome evaluation when you can also directly assess the mechanisms that drive the outcomes?
It’s true that human beings are not great at this. Our minds are subconsciously manipulated by many heuristic habits that bias our assessments. We believe far too strongly in first impressions. We give undue weight to recent experiences, we are prone to falsely equate what is familiar with what is desirable, etc. But it is a fantasy to think that “data” on its own can save us from these cognitive traps. Our minds are instinctual and persuasive storytellers, and we can spin a story around ambiguous data about as easily as we can in a vacuum. For this reason, we should not wholly replace the challenging process of subjective assessment with blind reliance on noisy metrics.
So what can a non-algo designer reasonably hope to extract from a disclosed algo design and an account of the research behind it? Hopefully at least a few things like: 1. a sense of what kind of scientific processes the designers employ, 2. an understanding of what goals the designers are prioritizing, 3. an awareness of the assumptions the designers are making, 4. a rough idea of the extent and level of competency of the research, and last but certainly not least: 5. an opportunity to collaborate more directly with the designers and aim their expertise more effectively at achieving particular goals.
Many would argue that the potential downsides of publicly disclosing algo designs outweigh the value of these kinds of assessments and collaborations. The most common arguments given are 1. competitors can copy a disclosed algo design and 2. a disclosed algo design is more vulnerable to being “gamed” by other traders. In a direct sense, 1. is only a problem for the company providing the algo, not its customers. But indirectly, one might worry that copied designs will remove the incentive for innovation. This concern is circular though, because the incentive for innovation is already weak in the absence of a broadly accessible and reliable mechanism for gauging algo quality.
Concern 2. above is directly relevant to the clients of an algo, and it is certainly worth taking seriously. Let’s think about what it means for a design to become “gameable” due to public information about its development. The process would be: someone reads the newly public information, combines it with their own current knowledge, and comes up with an idea to behave differently in their own trading algorithms and potentially improve their own outcomes at the expense of the disclosed algo’s customers. If this worked to a significant extent, it would have to either 1. essentially work against a large portion of agency algos or 2. involve a step of approximately identifying the disclosed algo (or something very like it) in the wild. We must keep in mind here that someone looking to exploit the disclosed algorithm will not know what side/stocks/amounts the algo is actively trading on any given day. This is private information that comes from the customers and is never disclosed.
If 1. is true, then the role of the algo design disclosures is likely coincidental. General knowledge about how agency algos typically work is available already, and the set of people across the industry who have direct experience working on or around agency algos is not small. If 2. is true, then the disclosed algo is doing something unusually noticeable, either in its general behavior or in its response to conditions that a would-be gamer manufactures. In this case, the design has a problem, and should be fixed. Not disclosing the design is a flimsy protection in this case. Whatever the noticeable and exploitable behavior is, it could also be discovered by someone searching for such a thing, even if that person didn’t know ahead of time what exactly to look for. In our age of big data and fast technology, such an unguided search could take longer than a targeted one, but perhaps not that much longer.
If we assume that any serious exploit will be eventually discovered, then our goal should be to discover and patch it ourselves as quickly as possible. Publication of our algo design and research supports this goal, as it enables us to collaborate with others more freely, and to vet our design through a larger audience. This is the same approach that is used to produce strong encryption algorithms like AES (the advanced encryption standard) that we all rely on to secure our sensitive communications (e.g. using our credit cards for online transactions). The design of AES is fully public and has been subject to extensive public vetting from the cryptologic research community for decades. The sensitive information encrypted via AES is protected by secret key values which are unknown to would-be attackers, but everything about how the secret keys and the sensitive information is combined to form an inscrutable ciphertext is known.
We believe that everyone up the telephone chain from the algo black box would be better served by a translucent box - and that’s why we commit to publishing the research that goes into the design of our algorithms, as well as the design of the ways we evaluate performance. The rest of this paper will detail the process we used to design the scheduling component of our new trading algorithm, as well as the twists and bumps we encountered along the way.
Our design process is heavily driven by the desire to learn as much as we can from historical market data, which is available to us in a quantity that is orders of magnitude greater than our own live trading data will represent for a long time. By evaluating potential features of the design extensively on historical market data, rather than solely relying on noisy A-B tests in live trading, we can improve our design much more quickly and more robustly. Historical market data can tell us a lot about how the market is likely to react to common situations. Once we’ve developed such information, we can start to model how the market may react to potential choices that our algo may make. Finally, we can derive the choices that our algo will make by comparing the modeled market reactions to the available choices and choosing the path that our modeling predicts will be most favorable for our execution goal.
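In very rough terms, the final selection step described above amounts to something like the following sketch. Every name here is a hypothetical placeholder (the real candidate space and market-reaction model are developed in later sections); this is meant only to show the shape of the decision, not its implementation.

```python
def choose_schedule(candidate_schedules, market_model, order_quantity):
    """Pick the candidate trading schedule with the lowest modeled cost.

    candidate_schedules : alternative ways to spread the order over time
    market_model        : predicts how prices react to our own trading
    order_quantity      : total quantity to complete over the period
    """
    def modeled_cost(schedule):
        # Expected premium paid due to our own price pressure, according
        # to the market-reaction model (a placeholder method name here).
        return market_model.expected_self_impact_cost(schedule, order_quantity)

    return min(candidate_schedules, key=modeled_cost)
```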
Our execution goal is formulated mathematically in a later section, but its simplified version is basically: “don’t shit where you eat.” In this context, it means: try not to pay more as a buyer because you have pushed the price up. This is close in spirit to minimizing “impact,” but lots of people use that term without converging on a single mathematical meaning, so we want to be a little more specific. Its most direct meaning is also not quite what we care about - we may not care if our activity moved the price after we were mostly done trading. We care more about how our activity so far drives up the prices we will incur in our remaining activity. In other words, we care about prices over time proportionately to how much we trade at those times. So we will seek to model how our behavior affects prices at a forward marching sequence of times, and we will define a cost function that ultimately calculates: according to our model of market reactions, what’s the additional premium we expect to pay as a buyer due to our own actions driving up the stock over our sequence of trades? Naturally, we design the algo to choose the actions that minimize our estimate of this cost function, subject to accomplishing the desired total amount of trading over the time period.
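As a loose illustration of the shape of this cost function (the notation below is our own placeholder shorthand, not the formal definition given later): if $q_t$ denotes the shares we trade at time $t$, $\tilde{P}_t$ the modeled price at time $t$ given our own trading so far, $P_t$ the modeled price had we not traded at all, and $Q$ the total quantity we have been asked to complete, then for a buy order we are roughly trying to minimize

$$\mathbb{E}\left[\sum_{t} q_t \left(\tilde{P}_t - P_t\right)\right] \quad \text{subject to} \quad \sum_{t} q_t = Q.$$

The trade-size weights $q_t$ capture the point above: price moves matter to us in proportion to how much we still trade at those prices.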
There is one big question this paper will not answer. It is a question we get asked a lot: by would-be investors, colleagues in the industry, potential clients, and even our families on occasion. “Just give me a ballpark,” they start, “how much money do you think you can ultimately save your clients?” We sigh. It’s obvious why everyone wants an answer. It would certainly make our lives easier if we just gave an answer. We could hedge it in all the typical ways: “This is just a projection, but...” and “If you assume...” But frankly, there’s currently no scientifically responsible way to answer this question. We could point to the paper “Trading Costs” by Frazzini, Israel, and Moskowitz [2], which estimates AQR’s average market impact over many years of trading data to be roughly 9 bps, with about 1.26 bps of that being “transitory” impact that reverses soon after AQR’s trading activity completes. This seems to suggest at least that trading costs overall do represent a significant term in the overall costs of institutional investing. But how much of this term is inherent, and how much is attributable to differences between algos? We don’t know. It’s very hard to know! We will work diligently to combat the confusion of market noise in our own iterative research process, and we are optimistic that we will be able to achieve reasonable and compelling estimates of how much better each version of our algo is compared to the last. But we won’t be able to compare ourselves to other algos because, well, *cough*, those algos are hiding in their black boxes.
So we can’t tell you how much money we would save for a potential client. Because we don’t know how much money their current brokers are really saving/costing them. And they don’t know either. And isn’t that unsettling? If we know that this cost term may be big enough to matter, and we know that we don’t know how to control it with a noisy competition between black boxes, isn’t that a good enough reason to force the boxes open?
It’s great that a person doesn’t need to become an expert in portfolio management, algorithmic trading, settlement and clearing, market microstructure, and more in order to accomplish an investment goal of “growth or something.” And it’s true that most investors have no interest in going down the telephone chain and understanding how their goal gets translated into something more concrete at each layer. But shouldn’t that translation be knowable in principle? Shouldn’t someone be empowered to check that each lower layer is doing a reasonable job of embodying the higher layer’s wishes?
We think so. But we don’t expect that other algo designers will shed their black boxes anytime soon. We’ll just be here in the meantime, tinkering away in public, and happy to hear your thoughts on what we’re building.