🎆

Running Streamlit + Snowpark for Python the 2023 way

Published on 2023/01/05

Python

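🎍 The 2023 way?

I wrote this article for the Snowflake Advent Calendar 2022.

In it, I wrote:

Streamlit takes a Pandas DataFrame as its argument

However, as of v1.16.0, released at the end of 2022, Streamlit accepts not only Pandas DataFrames but also Snowpark DataFrames and PySpark DataFrames as arguments 🎉

Some APIs apparently supported this in slightly earlier versions, but this seems to be the first version where support can be called complete.

So the article I had just written turned into an ancient technique almost immediately. Still, as developers we want to make full use of convenient features.
This article therefore refactors the code the 2023 way, passing Snowpark DataFrames directly to the Streamlit API.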

🎯 What we will refactor

As in last year's article, I use the content from this Quickstart.

I will refactor this code.

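🎈 Upgrading Streamlit

This article summarizes how to do it.

The procedure differs depending on the package manager.
I was using Conda, so the steps in the Conda section worked for me. One caveat: if you created the environment with Conda but then installed Streamlit with pip install streamlit, conda update cannot upgrade it. If you followed the Quickstart setup steps, the upgrade should go through without problems.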
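After upgrading, it can be worth confirming that the environment really has v1.16.0 or later before relying on direct Snowpark DataFrame support. The following is a minimal sketch of such a check; the helper names parse_version and accepts_snowpark_dataframes are made up for this example, and the parsing only looks at the leading numeric components of the version string.

```python
import re
from importlib import metadata

# Streamlit release that, per this article, accepts Snowpark DataFrames directly.
MIN_VERSION = (1, 16, 0)


def parse_version(text):
    """Turn a version string such as '1.16.0' into a tuple of three ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. '1.16') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def accepts_snowpark_dataframes(version_text):
    """Return True when the given Streamlit version is 1.16.0 or newer."""
    return parse_version(version_text) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("streamlit")
    except metadata.PackageNotFoundError:
        print("streamlit is not installed in this environment")
    else:
        print(installed, accepts_snowpark_dataframes(installed))
```

Comparing tuples of integers rather than raw strings avoids the trap where "1.9.0" sorts after "1.16.0" as text.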

🖌️ Refactoring

Let's look at the refactoring in two parts.

Displaying a DataFrame as-is

This simply means replacing the Pandas DataFrame with a Snowpark DataFrame.
When converting to a Pandas DataFrame for display, the code looked like this:

# Convert Snowpark DataFrames to Pandas DataFrames for Streamlit
pd_df_co2 = snow_df_co2.to_pandas()
st.dataframe(pd_df_co2)

Because the DataFrame can now be passed directly, this can be written simply as:

st.dataframe(snow_df_co2)

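Using a DataFrame as a chart argument

This can also be written more simply, although the style changes a little.
When passing a Pandas DataFrame, the original code looked like the following.
It calls to_pandas on the Snowpark DataFrame, then set_index on the resulting Pandas DataFrame, and passes the result to Streamlit's bar_chart.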

pd_df_co2_top_n = snow_df_co2.filter(col('Total CO2 Emissions') > emissions_threshold).to_pandas()
st.bar_chart(data=pd_df_co2_top_n.set_index('Location Name'), width=850, height=500, use_container_width=True)

With a Snowpark DataFrame, it looks like the following. A line break is added for readability, but it is a single statement.

st.bar_chart(data=snow_df_co2.filter(col('Total CO2 Emissions') > emissions_threshold),
             x='Location Name', width=850, height=500, use_container_width=True)

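Here the Snowpark DataFrame is passed directly. Instead of calling set_index, the column is given through the x argument of bar_chart.
The reasons for this implementation are as follows.

A Snowpark DataFrame has no concept of an index
Streamlit's bar_chart implicitly treats the index of a Pandas DataFrame as the x axis

That is why the x axis is specified explicitly for a Snowpark DataFrame.
Strictly speaking, the two are not identical: specifying the x axis causes an axis label to be rendered. There seem to be ways to work around this with Altair, but that would take us off topic, so please bear with it for now.

So, from the DataFrame perspective, the code becomes simpler. Beyond that, a little extra work may be needed depending on the specification of the chart API.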
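The index-versus-x distinction above can be captured in a small helper. This is only a sketch under a duck-typing assumption: the function bar_chart_arguments is hypothetical, and it treats any object with a set_index method as Pandas-like and everything else as index-less (such as a Snowpark DataFrame).

```python
def bar_chart_arguments(data, x_column):
    """Return (data, extra_kwargs) suitable for st.bar_chart(data=data, **extra_kwargs).

    Hypothetical helper for illustration; not part of Streamlit or Snowpark.
    """
    if hasattr(data, "set_index"):
        # Pandas-like object: move the column into the index, which
        # bar_chart implicitly uses as the x axis.
        return data.set_index(x_column), {}
    # No index concept: name the x-axis column explicitly instead.
    return data, {"x": x_column}


# Intended usage inside a Streamlit app (names taken from the article's code):
#   chart_data, extra = bar_chart_arguments(snow_df_co2_top_n, "Location Name")
#   st.bar_chart(data=chart_data, width=850, height=500, **extra)
```

With a helper like this, the call site stays the same regardless of which kind of DataFrame is passed in.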

✨ The refactored code

It does not even import Pandas. When you run it yourself, remember to fill in the connection information appropriately.

# Snowpark
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import avg, sum, col, lit
import streamlit as st

st.set_page_config(
    page_title="Environment Data Atlas",
    page_icon="🧊",
    layout="wide",
    initial_sidebar_state="expanded",
    menu_items={
        'Get Help': 'https://developers.snowflake.com',
        'About': "This is an *extremely* cool app powered by Snowpark for Python, Streamlit, and Snowflake Data Marketplace"
    }
)

# Create Session object
def create_session_object():
    connection_parameters = {
        "account"   : "",
        "user"      : "",
        "password"  : "",
        "role"      : "",
        "warehouse" : "",
        "database"  : "KNOEMA_ENVIRONMENT_DATA_ATLAS",
        "schema"    : "ENVIRONMENT"
    }
    session = Session.builder.configs(connection_parameters).create()
    return session
  
# Create Snowpark DataFrames that load data from Knoema: Environmental Data Atlas
def load_data(session):
    # CO2 Emissions by Country
    snow_df_co2 = session.table("ENVIRONMENT.EDGARED2019").filter(col('Indicator Name') == 'Fossil CO2 Emissions').filter(col('Type Name') == 'All Type')
    snow_df_co2 = snow_df_co2.group_by('Location Name').agg(sum('$16').alias("Total CO2 Emissions")).filter(col('Location Name') != 'World').sort('Location Name')

    # Forest Occupied Land Area by Country
    snow_df_land = session.table("ENVIRONMENT.\"WBWDI2019Jan\"").filter(col('Series Name') == 'Forest area (% of land area)')
    snow_df_land = snow_df_land.group_by('Country Name').agg(sum('$61').alias("Total Share of Forest Land")).sort('Country Name')

    # Total Municipal Waste by Country
    snow_df_waste = session.table("ENVIRONMENT.UNENVDB2018").filter(col('Variable Name') == 'Municipal waste collected')
    snow_df_waste = snow_df_waste.group_by('Location Name').agg(sum('$12').alias("Total Municipal Waste")).sort('Location Name')

    # Add header and a subheader
    st.header("Knoema: Environment Data Atlas")
    st.subheader("Powered by Snowpark for Python and Snowflake Data Marketplace | Made with Streamlit")

    # Use columns to display the three dataframes side-by-side along with their headers
    col1, col2, col3 = st.columns(3)
    with st.container():
        with col1:
            st.subheader('CO2 Emissions by Country')
            st.dataframe(snow_df_co2)
        with col2:
            st.subheader('Forest Occupied Land Area by Country')
            st.dataframe(snow_df_land)
        with col3:
            st.subheader('Total Municipal Waste by Country')
            st.dataframe(snow_df_waste)

    # Display an interactive chart to visualize CO2 Emissions by Top N Countries
    with st.container():
        st.subheader('CO2 Emissions by Top N Countries')
        with st.expander(""):
            emissions_threshold = st.slider(label='Emissions Threshold', min_value=5000, value=20000, step=5000)
            st.bar_chart(data=snow_df_co2.filter(col('Total CO2 Emissions') > emissions_threshold),
                         x='Location Name', width=850, height=500, use_container_width=True)

if __name__ == "__main__":
    session = create_session_object()
    load_data(session)
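The connection values are left blank in the code above. One way to fill them in without hard-coding credentials is to read them from environment variables. The sketch below assumes variable names such as SNOWFLAKE_ACCOUNT, which are chosen for this example only and are not defined by the Quickstart; the function connection_parameters_from_env is likewise hypothetical.

```python
import os

# Hypothetical environment variable names, chosen for illustration only.
ENV_KEYS = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
    "role": "SNOWFLAKE_ROLE",
    "warehouse": "SNOWFLAKE_WAREHOUSE",
}


def connection_parameters_from_env(env=None):
    """Build the connection_parameters dict from environment variables."""
    env = os.environ if env is None else env
    params = {key: env.get(var, "") for key, var in ENV_KEYS.items()}
    # Fail early with a clear message instead of at connection time.
    missing = [ENV_KEYS[key] for key, value in params.items() if not value]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    # Database and schema are fixed values taken from the article's code.
    params["database"] = "KNOEMA_ENVIRONMENT_DATA_ATLAS"
    params["schema"] = "ENVIRONMENT"
    return params
```

The returned dict has the same keys as connection_parameters in create_session_object, so it could be passed to Session.builder.configs in its place.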

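✅ Closing

I hope this gave you a sense that Streamlit and Snowpark for Python are growing closer together.
Incidentally, v1.16.0 comes with a small surprise once you upgrade and run it.

Keep an eye on the top right of the screen while an app with a longer load time starts up.

I hope you will make this your first coding of 2023!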
Happy New Year 🎆 & Happy Streamlit-ing! 🎈
