8 posts categorized "Web/Tech"

14 July 2014

NoSQL Document Database - Manhattan MarkLogic

Bit late in posting this up, but given I did something about RainStor I thought I should write up my attendance at a MarkLogic event day in downtown Manhattan from several weeks back - for some context, their NoSQL database is used to serve up content on the BBC website. They are unusual in the NoSQL “movement” in that they are a proprietary vendor in a space dominated by open source databases and the companies that offer support for them. The database they most seem to compete with in the NoSQL space is MongoDB, since both have origins as “document databases” - managing millions of documents is one of the most popular uses for big data technology at the moment, though not so publicized as more fashionable things like swallowing a Twitter feed for sentiment analysis.

In order to cope with the workloads now being applied to data, MarkLogic argue that data has escaped from the data centre, with separate data warehouses and ETL processes aligned with each silo of the business. Their marketing message is that MarkLogic allows the data to come back into the data centre, since it can be a single platform where all data lives and all workloads are applied to it. As such it is easier to apply proper data governance if the data is in one place rather than distributed across different databases, systems and tools.

Apparently MarkLogic started out with the aims of offering enterprise search of corporate data content but has evolved much beyond just document management. Gary Bloom, their CEO, described the MarkLogic platform as the combination of:

• Database
• Search Engine
• Application Services

He said that the platform is not just the database but particularly search and database together, aligned with the aim of not just storing data and documents but with the aim of getting insights out of the data. Gary also mentioned the increasing importance of elastic compute and MarkLogic has been designed to offer this capability to spin up and down with usage, integrating with and using the latest in cloud, Hadoop and Intel processors.

Apparently one of the large European investment banks is trying to integrate all of their systems for post-trade analysis and regulatory reporting. The bank apparently tried doing this by adopting a standard relational data model but faced two problems: 1) the relational databases were not standard and 2) it was difficult to arrive at and manage an overarching relational schema. On the schema side of things, the main problem they were alluding to seemed to be one schema changing and having to propagate that change through the whole architecture. The bank seems to be having more success now that they have switched to MarkLogic for this post-trade analysis - from a later presentation it seems that things like trades are taken directly from the Enterprise Service Bus, saving the data in the message as-is (schema-less).

One thing that came up time and time again was their pitch that MarkLogic is “the only Enterprise NoSQL database” with high availability, transactional support (ACID) and security built in. Gary criticized other NoSQL databases for offering “eventual consistency” and said that MarkLogic aspires to something better than that (to put it mildly). I thought it was interesting that over a lunch chat one of the MarkLogic guys said that "MongoDB does a lot of great pre-sales for MarkLogic" - meaning, I guess, that MongoDB is the marketing "poster child" of NoSQL document databases so they get the early leads, but as the client widens the search they find that only MarkLogic is "enterprise" capable. You can bet that the MongoDB team disagree (and indeed they do...).

On the consistency side, Gary talked about “ObamaCare” aka HealthCare.gov, which MarkLogic were involved in. First came some performance figures: the system was handling 50,000 transactions/sec with 4-5ms response times for 150,000 concurrent users. The project suffered from a lot of technical problems which really came down to running the system on a fragile infrastructure, with weaknesses in network, servers and storage. Gary said that the government technologists were expecting data consistency problems when things like the network went down, but the MarkLogic database is ACID and all that was needed was to restart the servers once the infrastructure was ready. Gary also mentioned that he spent 14 years working at Oracle (as a lot of the MarkLogic folks seem to have) but that it was not really until Oracle 7 that they could say they offered data consistency.

On security, again there was more criticism of other NoSQL databases for offering access to either all of the data or none of it. The analogy used was going to an ATM and being offered access to everyone’s money, trusting each customer to take only their own. Continuing the NoSQL criticism, Gary said that he did not like the premise put around that “NoSQL is defined by Open Source” – his argument was that MarkLogic generates more revenue than all the other NoSQL databases on the market. Gary said that one client hosted a “lake of data” in Hadoop but found that while Hadoop is a great distributed file system, it still needs a database to go with it.

Gary then talked about some of the features of MarkLogic 7, their current release. In particular, MarkLogic 7 offers scale-out elasticity but with full ACID support (conventional wisdom being that achieving one makes the other impossible), high performance and a flexible schema-less architecture. Gary implied that the marketing emphasis had changed recently from the “big data” pitch of a few years back to covering both unstructured and structured data within one platform - dealing with heterogeneous data being a core capability of MarkLogic. Other features mentioned were support for XML, JSON and access through a REST API, plus usage of MarkLogic as a semantic database (a triple store) with support for the semantic query language SPARQL. Gary mentioned that semantic technology was a big area of growth for them. He also mentioned support for tiered storage on HDFS.

The conversation then moved on to what’s next with version 8 of MarkLogic. The main theme for the next release is “Ease of Use”, with the following features:

• MarkLogic Developer – freely downloadable version
• MarkLogic Essential Enterprise – try it for 99c/hour on AWS
• MarkLogic Global Enterprise – 33% less (decided to spend less time on the sales cycle)
• Training for free – all classes sold out – instructor led online

Along this ease of use theme, MarkLogic acknowledged that using their systems needs to be easier, and that in addition to XML/XQuery programming they will be adding native support for JavaScript, greatly expanding the number of people who could program with MarkLogic. In terms of storage formats, in addition to XML they will be adding full JSON support. On the semantics side they will offer full support for RDF, SPARQL 1.1 and inferencing. Bi-temporal support will also be added, with a view to answering the kind of regulation-driven questions such as “what did they know and when did they know it?”.
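As a rough sketch of the bi-temporal idea (all names, dates and helper functions here are hypothetical illustration, not MarkLogic code): each version of a record carries both a valid-time interval (when the fact was true in the real world) and a transaction-time interval (when the database knew it), and the regulatory question becomes a query against both time axes:

```python
from datetime import date

END = date.max  # "still current"

# Hypothetical bitemporal records for a trade, with a later correction:
# (trade_id, notional, valid_from, valid_to, tx_from, tx_to)
records = [
    ("T1", 100, date(2014, 1, 1), END, date(2014, 1, 1), date(2014, 2, 1)),
    ("T1", 150, date(2014, 1, 1), END, date(2014, 2, 1), END),  # correction
]

def as_of(trade_id, valid_at, known_at):
    """What did we believe about trade_id at valid_at, as known at known_at?"""
    for tid, notional, vf, vt, tf, tt in records:
        if tid == trade_id and vf <= valid_at < vt and tf <= known_at < tt:
            return notional
    return None

# In mid-January the bank still believed the notional was 100...
print(as_of("T1", date(2014, 1, 10), date(2014, 1, 15)))  # -> 100
# ...but by March the correction was known.
print(as_of("T1", date(2014, 1, 10), date(2014, 3, 1)))   # -> 150
```

The two answers differ only in the “when did they know it” axis, which is exactly the distinction a regulator asking that question cares about.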

Joe Pasqua, SVP of Product Strategy, then took over from Gary for a more technical introduction to the MarkLogic platform. He started by saying that MarkLogic is a schema-less database with a hierarchical, very document-centric data model, and can be used for both structured and unstructured data. Data is stored in compressed trees within the system. Joe then explained how the system is indexed, describing the “Universal Index” which, as in most good search engines, records where to find the following kinds of data:

• Words
• Phrases
• Stemmed words and phrases
• Structure (this is indexed too as new documents come in)
• Words and phrases in the context of structure
• Values
• Collections
• Security Permissions

Joe also mentioned that a “range index” is used to speed up comparisons, apparently in a similar way to a column store. Geospatial indices are like 2D range indices for how near things are to a point. The system also supports semantic indices, indexing triples of subject-predicate-object.
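For context, the general idea of a range index can be sketched as a sorted column of (value, document) pairs, so that a range comparison becomes a binary search rather than a scan over every document - a simplification of the concept only, not MarkLogic's actual implementation:

```python
import bisect

# Toy "range index": (value, doc_id) pairs kept sorted by value,
# as a column store keeps a column sorted. Data is made up.
pairs = sorted([(37.5, "doc1"), (12.0, "doc2"), (99.9, "doc3"), (42.0, "doc4")])
values = [v for v, _ in pairs]  # just the sorted values, for binary search

def docs_where_value_between(lo, hi):
    """Find documents whose value lies in [lo, hi] via two binary searches."""
    i = bisect.bisect_left(values, lo)
    j = bisect.bisect_right(values, hi)
    return [doc for _, doc in pairs[i:j]]

print(docs_where_value_between(20, 50))  # -> ['doc1', 'doc4']
```

The point is the cost model: the comparison touches O(log n) index entries plus the matches, instead of every document.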

He showed how the system has failover replication within a database cluster for high availability, but also full replication for disaster recovery purposes. There were continual side references to Oracle as a “legacy database”.

On database consistency and ACID capability, Joe talked about MVCC (Multi-Version Concurrency Control). Each “document” record in MarkLogic seems to have a start and end time for how current it is, and these values are used when updating data to avoid any reduction in read availability. When a document is updated a copy of it is taken but kept hidden until ready – the existing document remains available until the update is complete, at which point the “end time” is marked on the old record and the “start time” on the new record. So the system is effectively always appending serially rather than seeking on disk, and the start and end times on each record enable bitemporal functionality to be implemented. Whilst the new record is being created it is already being indexed, so there is zero-latency searching once the new document is live.
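A minimal sketch of that append-only MVCC scheme (using a simple counter as the timestamp, purely for illustration - the store, record shape and function names are all hypothetical):

```python
import itertools

clock = itertools.count(1)     # monotonically increasing "timestamp"
INFINITY = float("inf")
store = []                     # append-only list of versions

def insert(uri, body):
    store.append({"uri": uri, "body": body, "start": next(clock), "end": INFINITY})

def update(uri, body):
    t = next(clock)
    for rec in store:
        if rec["uri"] == uri and rec["end"] == INFINITY:
            rec["end"] = t     # close out the old version; it stays readable
    store.append({"uri": uri, "body": body, "start": t, "end": INFINITY})

def read(uri, at=None):
    """Read the version current at timestamp `at` (default: now)."""
    at = at if at is not None else next(clock)
    for rec in store:
        if rec["uri"] == uri and rec["start"] <= at < rec["end"]:
            return rec["body"]

insert("doc1", "v1")           # version 1 starts at t=1
update("doc1", "v2")           # v1 ends at t=2, v2 starts at t=2
print(read("doc1"))            # -> v2 (the current version)
print(read("doc1", at=1))      # -> v1 (readers at t=1 never saw a gap)
```

Readers pinned to an earlier timestamp keep seeing the old version until the new one is live, which is the “no reduction in read availability” property described above.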

One of the index types mentioned by Joe was a “Reverse Index”, where queries are indexed and each new document coming in is passed over these queries (sounds like the same story as the complex event processing folks) and can trigger alerts based on which queries the document fits.
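The reverse index idea - store the queries, then run each incoming document over them - can be sketched as follows (queries here are just sets of required words, a big simplification of what a real search engine would store):

```python
# Hypothetical saved queries: alert name -> set of words that must all appear.
saved_queries = {
    "fx-alert": {"eur", "usd"},
    "credit-alert": {"default", "swap"},
}

def alerts_for(document_text):
    """Return the saved queries that an incoming document matches."""
    words = set(document_text.lower().split())
    return [name for name, required in saved_queries.items()
            if required <= words]  # all required words present in the document

print(alerts_for("EUR rallied against USD in early trading"))
# -> ['fx-alert']
```

The inversion is the interesting part: instead of evaluating one query against many stored documents, each arriving document is evaluated against many stored queries - the same shape of problem the CEP vendors describe.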

In summary, the event was a good one, and MarkLogic seems like interesting technology with a variety of folks using it in financial markets: the post-trade analysis example (a bit like RainStor, I think, as an archive) and others using it more in the reference data space. Not sure how real-time capable MarkLogic is – there seems to be a lot of emphasis on post-trade. The event also brought home to me the importance of search and database together, which seems to be a big strength of their technology.

06 December 2013

F# in Finance New York Style

Quick plug for the New York version of F# in Finance event taking place next Wednesday December 11th, following on from the recent event in London. Don Syme of Microsoft Research will be demonstrating access to market data using F# and TimeScape. Hope to see you there!

27 March 2012

Data Visualisation from the FT

Data visualisation has always been an interesting subject in financial markets, one that seems always to have been talked about as the next big thing in finance, but one that always seems to fail to meet expectations (of visualisation software vendors mostly...). I went along to an event put on by the FT today about what they term "infographics", set in the Vanderbilt Hall at Grand Central Station New York:


One of my first experiences of data visualisation was showing a partner company, Visual Numerics (VNI), around the Bankers Trust London trading floor in 1995. The VNI folks were talking grandly about visualising a "golden corn field of trading opportunities, with the wind of market change forcing the blades of corn to change in size and orientation" - whilst maybe they had been under the influence of illegal substances when dreaming up this description, their disappointment was palpable at trading screen after trading screen full of spreadsheets containing "numbers". Sure, there was some charting being used, but mostly and understandably the traders were very focussed on the numbers of the deal that they were about to do (or had just done).

I guess this theme largely continues today, although given the (media-hyped) "explosion of data", visualisation is a useful technique for filtering down a large (er, can I use the word "big"?) data problem to get at the data you really want to work with (quick plug - the next version of our TimeScape product includes graphical heatmaps for looking for data exceptions, statistical anomalies and trading opportunities, which confirms Xenomorph buys into at least this aspect of the "filtering" benefits of visualisation).

Coming back to the presentation, Gillian Tett of the FT said at the event today that "infographics" is cutting-edge technology - not sure I would agree, although some of the images were very good given the location, like this one representing the stock pile of cash that major corporations have been hoarding (i.e. not spending) over recent years:


There were also some "interactive" aspects to the display, whereby stepping on part of the hall floor changed the graphic displayed. The biggest problem the FT had with this was persuading anyone to step into the middle of the floor to use it (I would expect more of an English reaction to such a request, so the reticence from New Yorkers surprised me):


Videos from the presentation can be found at http://ftgraphicworld.ft.com/ and the journalist involved, David McCandless is worth a listen to for the different ways he looks at data both on the FT site but also in a TED presentation.

17 May 2010

Cloudy definitions

Given that I am English and can tend to start many personal introductions with a short conversation about the weather (generally either "awful" or "not bad for this time of year"...), then maybe I should be very receptive to the use of weather-related expressions in technology such as the "cloud". Maybe not however, since the "cloud" and "cloud computing" have reached that zenith of marketing hype when everyone is talking about a new technology regardless of whether they are sure what it actually is (or might be, or could become...).

Anyway, I finally swallowed my cynicism and on Thursday morning went along to "Migrating Business to the Cloud", an event by Microsoft hosted at Bafta (small venue where the UK deals out its equivalent (?) of the Oscars). The master of ceremonies was Mark Taylor of Microsoft, who gave a general introduction to what Microsoft are doing in the "cloud", and of particular note he described the four types of computing scenarios where cloud computing can optimally be applied:

  • Predictable Bursting - where computing needs come and go in predictable waves of usage/demand
  • Growing Fast - where computing needs are rising exponentially like in a successful internet start-up
  • Unpredictable Bursting - where computing demand comes in unpredictable bursts, such as that associated with say usage of a backup computer centre in disaster recovery
  • On and Off - where you might run a process once a month or at an interval you decide

The above definitions seem ok to me but there is (probably understandably) some overlap in usage cases. The "Growing Fast" case for start-ups is interesting and more of that later.

Mark handed over to David Chappell who gave his perspective on cloud platforms as they are today in the market. David was a very entertaining and knowledgeable speaker, despite wearing a dodgy suit (what happened to those trousers?!) and having a peculiar wide foot stance when speaking. Anyway I digress, on to what he said. David started by saying what the "Cloud" is comprised of:

  • Cloud Applications - basically this is Software as a Service (SaaS) and some current examples of this would be Salesforce.com CRM, Microsoft Exchange Online and Google Apps.
  • Cloud Platforms - a platform for developing cloud applications, with the following characteristics that it:
    • is aimed at developers for creating and running cloud applications, not end consumers
    • provides self-service access to computing resources
    • allows very granular, on-demand allocation of computing resources
    • charges for the consumption of computing resources in a very granular manner

David then explained that due to its ambiguity he disliked the usage of the term "Private Cloud" in the ongoing debate about publicly available cloud services (such as those provided by Amazon, Microsoft and Google) vs. private clouds deployed within private institutions. David said the main difference was that private clouds do not have the economics of public clouds (i.e. pay for what you use only when you need it). That point seemed straightforward, however I would have thought that for a large global organisation with many different departmental computing demands, the economics of a private cloud would be similar to a public one.

David then went on to explain that there are two kinds of Cloud Platform:

  • Infrastructure as a Service (IaaS) - this is a cloud platform that provides a developer with a virtual machine (VM) within which they have (almost) full access; put another way, the development environment gives the developer total control, but with that control comes responsibility.
  • Platform as a Service (PaaS) - this is a cloud platform that runs an application that a developer has created; it is easy to use but has limited control for the developer.

David put forward that there have been only five major software technology platforms over the past 50 years:

  • Mainframe
  • Mini-Computer
  • PC
  • PC-based Server
  • Mobile

He perceives that the Cloud is the 6th major software technology platform, and as such he is extremely enthusiastic about the opportunity and benefits that this presents to the whole of the software industry and its consumers.

David categorised Microsoft's cloud platform as (mostly) PaaS, which had three main components:

  • Windows Azure - the environment for running cloud applications within the platform
  • SQL Azure - relational storage within the platform
  • Windows Azure Platform AppFabric – (David noted the long name and sympathised with trying to name things sensibly) this provides and manages the infrastructure within the platform

He then moved on to describe the main usage scenarios for Windows Azure, for applications that:

  • need massive scale, such as Web 2.0 applications
  • need high reliability
  • have highly variable loading
  • have short or unpredictable lifetimes
  • need parallel processing
  • will either fail fast or scale fast
  • do not fit easily in a single organisation's data centre, such as a joint venture
  • need external storage

David said that in the fail quickly or scale quickly scenario, this was squarely aimed at technology start-ups where using Cloud technologies would effectively increase the frequency at which new ideas could be tried out at less economic cost if they go wrong, but are ready to scale massively if they become the new "Facebook" - so much so that many of the VCs in Silicon Valley are now insisting that start-ups use cloud technology as a condition of funding.

Amazon's Elastic Compute Cloud (Amazon EC2) was the first major commercial cloud platform, and David categorised this as IaaS, where effectively you get a Virtual Machine (VM) environment that provides a lot of control but requires more effort to manage than a PaaS such as Azure.

David said that he was surprised that the Google App Engine, which has Python and now Java as its programming languages, did not come with any traditional relational storage (unlike most other cloud platforms) but on speaking with Google he found that the storage engine and the whole platform is again designed primarily for Web 2.0 apps and as such storage usage was more about retrieving photos, video etc and less about querying across many records.

David was very complimentary about the cloud platform from Salesforce.com called Force.com. He said that the sales pitch from Salesforce.com would be straight to business users, effectively saying that they could build scaleable, resilient applications without involving the IT department and without needing programming expertise. He asked the audience if anyone had used these tools and a few folks confirmed that they were extremely impressed by what the platform offered.

Bob Muglia (President, Server and Business Tools, Microsoft) then gave a quick talk on Microsoft's plans for Azure. He mentioned how Microsoft's new search engine, Bing, was based on several hundred thousand servers running in Azure, but only had a handful of operating staff in contrast with the usual economics (taken from Gartner) that usually 1 operations person was needed for every 50 servers. He emphasised that Microsoft was committed to the further development of "on premises" operating systems but that Microsoft was totally committed to cloud computing, its development and its support.

He said that some of the tools found in the Microsoft technology suite, such as SQL Reporting Services, are not yet available in the cloud on Azure/SQL Azure (due end of year though) - he said that he hoped people understood that re-engineering an existing application for the cloud sometimes took time, to ensure the scalability and reliability demanded when providing the functionality through the cloud. The vision put forward by Bob for development of cloud applications seemed very compelling, with Microsoft aiming to make things such as enabling resilience for a globally available cloud application as simple as ticking a check-box in Microsoft Visual Studio. He put forward that the major barrier to cloud adoption was the human aspect of trust in moving applications "off premises". He said that he saw a fundamental shift across all industries to cloud development and deployment, but added there may be some areas such as government and finance where this process takes a lot longer.

The event then switched to presentations by EasyJet, RiskMetrics and SeeTheDifference. The head of IT at EasyJet gave his pitch first. His department gets an annual budget of 0.75% (small?) of a turnover of £2.5bn (larger, translating to £18.75m) and has around 60 people. He presented how EasyJet has taken an incremental approach to the adoption of cloud computing, utilising both "on-premises" and cloud ("off-premises") technology together (at first exposing end points of applications into the cloud). He advised this approach since it:

  • was a smaller step than full-blown adoption
  • was lower risk
  • demonstrated big value in a short time-frame
  • leveraged the rich functionality available in Azure
  • accelerated acceptance of cloud technology

Dr Rob Fraser of RiskMetrics was next up. He explained that whilst Moore's Law says that computing power doubles every 18 months, the calculations needed for risk management have doubled every six months. This has driven the need for parallel computing to meet this calculation need: RiskMetrics' RiskBurst service uses around 2,500 64-bit Opteron cores in their data centre but combines this with use of Azure to meet the peaks in calculation needed during each day (the similarities with power consumption management were pretty apparent). He said that average CPU consumption was around 18% of peak, hence a combination of both on- and off-premises compute power was a good solution for them. He mentioned that the management of this hybrid combination of technologies, and in particular being able to show real-time billing for it, was a key area of investment for RiskMetrics.
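A quick back-of-the-envelope illustration of why those two doubling rates force parallelism: the gap between workload and single-machine power itself doubles every nine months, since 2^(t/6) / 2^(t/18) = 2^(t/9).

```python
# Workload doubles every 6 months, hardware every 18 months (per the talk).
# After three years the workload is up 64x but one machine only 4x,
# leaving a 16x gap to be filled by more cores.
months = 36
workload = 2 ** (months / 6)   # 2^6  = 64.0
hardware = 2 ** (months / 18)  # 2^2  = 4.0
print(workload / hardware)     # -> 16.0
```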

The final presentation was by SeeTheDifference. The main point of this presentation was that this charitable organisation had zero permanent staff involved in IT, but regardless was able to deliver a very professional, reliable and scaleable website using external consultants to build on Azure.

Final section of the morning was a roundtable discussion with questions from the audience. The EasyJet guy said that the human mindset was key to the adoption of cloud computing. What keeps him awake at night is the thought of what would happen, and how attitudes would change, if any of the cloud infrastructure failed - so far it has experienced 100% up time. Rob of RiskMetrics was concerned about the stability of the platform, trying to ensure that any changes introduced do not damage reliability. He added that he disagreed with Bob Muglia and thought that financial institutions would adopt public clouds quickly – he cited RiskMetrics' experience of their revenues now being 90% derived from service provision rather than on-premises applications. David said that he took some of the comments from Bob to indicate that Microsoft would also soon offer more of a pure VM (IaaS) option in addition to the PaaS approach of Azure. David said that trust was the major issue in cloud adoption and he advised an incremental approach - "get your feet wet" then build from there.

On the whole the presentations were good and my knowledge of cloud technology has improved a bit - certainly it is fantastically appealing to develop globally available applications with no scaling, no resilience or data replication issues - it sounds too good to be true which generally means it is, so I guess there is much more work to do in gaining trust and acceptance for this technology. So my (pragmatic?) cynicism remains - but cloudy days are certainly coming and for a change maybe this is something to very much look forward to.


29 March 2010

CEP - Part of the technology furniture?

The CEP market is apparently maturing - don't miss this post "CEP: LaserDisc or DVD?" by Adam Honoré at Aite Group with an interesting view of the future of CEP technology.

11 March 2010

How not to do marketing #1

I ran into this very funny post on the rebranding of Fortis into "ageas". Worth reading (and learning from it)! Also don't miss some of the comments posted for how other banks in the news could be renamed - join the debate and enter your suggestions too!  

26 June 2009

Which email have you hidden behind?

Pet subject (partly because I have been guilty of it), but good reference article by Luke Johnson of the FT on email and how many of us hide behind it rather than speak face to face to colleagues and clients.

25 June 2009

Twittering the Wisdom of Crowds

Deserving an award for title alliteration, an article on Finextra has announced that Streambase Systems have connected their system to Twitter, the fashionable microblogging site. Regardless of the intent, it is an excellent marketing exercise by Streambase (er, maybe one that I should remember for the future!...).

Reasonable comments from Finextra at the end of the article, saying that Twitter is a notoriously bad source of information, very open to (designed for?) rumour, and as such it would be difficult to see what real information traders could extract from the noise. At one level, rumour and counter-rumour are the basis of markets, although the recent financial crisis has illustrated how powerful rumours can be. I would suggest it raises the question of when rumour and counter-rumour is part of the price formation process, and when it becomes market manipulation.

On a related note, the Efficient Market Hypothesis (EMH), the financial theory that all information (including rumours) is reflected in current prices, has been coming under some attack in the press recently. With a fund-management and Monty-Pythonesque slant, James Montier of Société Générale takes EMH to task in his recent article in the FT (see Pablo Triana for an alternative view).

My opinion is that EMH has still got some legs in it as a model, but behavioural finance probably has a lot more to explain (or rationalise?) about this theory and others in light of recent events. Anyone got a different opinion, or do I need to open a Twitter account to find out?...

Xenomorph: analytics and data management

About Xenomorph

Xenomorph is the leading provider of analytics and data management solutions to the financial markets. Risk, trading, quant research and IT staff use Xenomorph’s TimeScape analytics and data management solution at investment banks, hedge funds and asset management institutions across the world’s main financial centres.

