2017 Outlook: pandas, Arrow, Feather, Parquet, Spark, Ibis
Key Insight: The Python data ecosystem’s next leap depends on interoperable, high‑performance columnar standards—led by Arrow—to keep pandas and related tools scalable and sustainable.
McKinney argues that 2017 will be pivotal for Python data tooling as pandas, Arrow, Parquet, Feather, and PySpark converge on a shared, high‑performance columnar foundation. He frames his new role at Two Sigma as aligned with long‑term open source development and stresses that companies must engage with open source to stay competitive and attract top engineers. The post lays out pandas 2.0 goals focused on fixing technical debt, improving memory efficiency, and enabling true multithreading to keep pandas relevant at larger data scales. Apache Arrow is positioned as the interoperability layer that will make cross‑language, high‑performance IO practical, including for Spark and pandas. He also highlights ongoing work on Parquet, consolidation of Feather into Arrow, and plans to accelerate PySpark and deepen Ibis. The conclusion is an outlook of coordinated ecosystem work that improves performance, composability, and sustainability across the Python data stack.
"Many of the best software engineers won't work for a company that forbids them from working on open source projects (I certainly would not)."
"Companies not participating in open source (as users and/or developers) are getting left behind."
"My goal is to deliver the same quality pandas user experience on 10x as much data."
Tags: Data Infrastructure, Apache Arrow