{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XfmMtuxKYRoO" }, "source": [ "# 【確認課題】Pandas\n", "\n", "このPandas確認課題は、データサイエンス100本ノックの問題で最低限必要な問題を抜粋したものですが、[Introduction_to_Pandas](./11_Introduction_to_Pandas.ipynb) に掲載されていない機能が必要な問題もあります。\n", "初めて触るライブラリを調べながら使うというのはよくある光景です。この課題では皆さんにもそれに挑戦していただきます。 \n", "ヒントとして検索キーワードなどを載せておくので、自力で調べながら解いてみましょう。 \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9ktBdoeha1jL" }, "source": [ "## 必要モジュールのインポート\n", "\n", "この問題で使うモジュールをインポートします." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "2IhABxEEaq19" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "llLJCZTEa5Rb" }, "source": [ "## データの読み込み" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "uN-SvpE_a50E" }, "outputs": [], "source": [ "df_customer = pd.read_csv('https://raw.githubusercontent.com/The-Japan-DataScientist-Society/100knocks-preprocess/master/docker/work/data/customer.csv')\n", "df_product = pd.read_csv('https://raw.githubusercontent.com/The-Japan-DataScientist-Society/100knocks-preprocess/master/docker/work/data/product.csv')\n", "df_receipt = pd.read_csv('https://raw.githubusercontent.com/The-Japan-DataScientist-Society/100knocks-preprocess/master/docker/work/data/receipt.csv')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xm5FMZJobU3Y" }, "source": [ "---\n", "## 問1. 条件抽出\n", "> P-006: レシート明細データフレーム「df_receipt」から売上日(sales_ymd)、顧客ID(customer_id)、商品コード(product_cd)、売上数量(quantity)、売上金額(amount)の順に列を指定し、以下の条件を満たすデータを抽出せよ。\n", "> - 顧客ID(customer_id)が\"CS018205000001\"\n", "> - 売上金額(amount)が1,000以上または売上数量(quantity)が5以上" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 198 }, "colab_type": "code", "id": "KSOu6AWabVaD", "outputId": "91faaad5-57ca-4a67-c046-195afa4d21df" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sales_ymdcustomer_idproduct_cdamount
3620180911CS018205000001P0714010122200
984320180414CS018205000001P060104007600
2111020170614CS018205000001P050206001990
6811720190226CS018205000001P0714010202200
7225420180911CS018205000001P0714010051100
\n", "
" ], "text/plain": [ " sales_ymd customer_id product_cd amount\n", "36 20180911 CS018205000001 P071401012 2200\n", "9843 20180414 CS018205000001 P060104007 600\n", "21110 20170614 CS018205000001 P050206001 990\n", "68117 20190226 CS018205000001 P071401020 2200\n", "72254 20180911 CS018205000001 P071401005 1100" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "7WsKq3Voj0LF" }, "source": [ "---\n", "## 問2. ソート\n", "> P-18: 顧客データフレーム(df_customer)を生年月日(birth_day)で若い順にソートし、先頭5件を全項目表示せよ。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 694 }, "colab_type": "code", "id": "uLdYmXgdjxaw", "outputId": "7a064e8c-4db3-4350-a212-688cb9b49980" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idcustomer_namegender_cdgenderbirth_dayagepostal_cdaddressapplication_store_cdapplication_datestatus_cd
15639CS035114000004大村 美里1女性2007-11-2511156-0053東京都世田谷区桜**********S13035201506196-20091205-6
7468CS022103000002福山 はじめ9不明2007-10-0211249-0006神奈川県逗子市逗子**********S14022201609090-00000000-0
10745CS002113000009柴田 真悠子1女性2007-09-1711184-0014東京都小金井市貫井南町**********S13002201603040-00000000-0
19811CS004115000014松井 京子1女性2007-08-0911165-0031東京都中野区上鷺宮**********S13004201611201-20081231-1
7039CS002114000010山内 遥1女性2007-06-0311184-0015東京都小金井市貫井北町**********S13002201609206-20100510-1
\n", "
" ], "text/plain": [ " customer_id customer_name gender_cd gender birth_day age \\\n", "15639 CS035114000004 大村 美里 1 女性 2007-11-25 11 \n", "7468 CS022103000002 福山 はじめ 9 不明 2007-10-02 11 \n", "10745 CS002113000009 柴田 真悠子 1 女性 2007-09-17 11 \n", "19811 CS004115000014 松井 京子 1 女性 2007-08-09 11 \n", "7039 CS002114000010 山内 遥 1 女性 2007-06-03 11 \n", "\n", " postal_cd address application_store_cd application_date \\\n", "15639 156-0053 東京都世田谷区桜********** S13035 20150619 \n", "7468 249-0006 神奈川県逗子市逗子********** S14022 20160909 \n", "10745 184-0014 東京都小金井市貫井南町********** S13002 20160304 \n", "19811 165-0031 東京都中野区上鷺宮********** S13004 20161120 \n", "7039 184-0015 東京都小金井市貫井北町********** S13002 20160920 \n", "\n", " status_cd \n", "15639 6-20091205-6 \n", "7468 0-00000000-0 \n", "10745 0-00000000-0 \n", "19811 1-20081231-1 \n", "7039 6-20100510-1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1lRGtDSphhyQ" }, "source": [ "---\n", "## 問3. 全件数\n", "> P-021: レシート明細データフレーム(df_receipt)に対し、件数をカウントせよ。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "m-ihL_fVhhyQ", "outputId": "da0669b7-2c34-4dc6-becb-4da002103202" }, "outputs": [ { "data": { "text/plain": [ "104681" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "MkjDW-oyhhyS" }, "source": [ "## 問4. ユニーク件数\n", "> P-022: レシート明細データフレーム(df_receipt)の顧客ID(customer_id)に対し、ユニーク件数をカウントせよ。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "0xSI9r8UhhyS", "outputId": "0c5a57f4-2088-4109-a40e-eb76c4845af0" }, "outputs": [ { "data": { "text/plain": [ "8307" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント\n", "「ユニーク」というのはそのまま検索に使える単語です。 \n", "
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "h9E9b_yUhhyq" }, "source": [ "---\n", "## 問5. 〇〇ごとに集計\n", "> P-035: レシート明細データフレーム(df_receipt)に対し、顧客ID(customer_id)ごとに売上金額(amount)を合計して全顧客の平均を求め、平均以上に買い物をしている顧客を抽出せよ。ただし、顧客IDが\"Z\"から始まるのものは非会員を表すため、除外して計算すること。なお、データは先頭5件だけ表示せよ。\n", "\n", "会員のみを抽出する方法は、例えば以下の2通りの方法があります。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df_receipt_only_member = df_receipt[~df_receipt[\"customer_id\"].str.startswith(\"Z\")]\n", "df_receipt_only_member = df_receipt.query(\"not customer_id.str.startswith('Z')\", engine=\"python\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 225 }, "colab_type": "code", "id": "7lYKkmsohhyq", "outputId": "06bbc1ea-d6d0-4841-a6dd-1598974714b6" }, "outputs": [ { "data": { "text/plain": [ "customer_id\n", "CS001115000010 3044\n", "CS001205000006 3337\n", "CS001214000009 4685\n", "CS001214000017 4132\n", "CS001214000052 5639\n", "Name: amount, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント1\n", "「pandas 要素ごと 集計」 などで今回使える機能に関する記事が見つかります。\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント2\n", "メソッド名は \"groupby\" です。\n", "
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iNO7ESvWhhyw" }, "source": [ "---\n", "## 問6. DataFrameの結合\n", "> P-038: 顧客データフレーム(df_customer)とレシート明細データフレーム(df_receipt)から、各顧客ごとの売上金額合計を求めよ。ただし、買い物の実績がない顧客については売上金額を0として表示させること。また、顧客は性別コード(gender_cd)が女性(1)であるものを対象とし、非会員(顧客IDが'Z'から始まるもの)は除外すること。なお、結果は先頭5件だけ表示せよ。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_customer_only_member = df_customer[~df_customer[\"customer_id\"].str.startswith(\"Z\")]\n", "df_customer_only_member = df_customer.query(\"not customer_id.str.startswith('Z')\", engine=\"python\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 728 }, "colab_type": "code", "id": "hmc6LUaEhhyw", "outputId": "f1b1dc56-af65-4fbf-9d8a-5c0490a2ad17" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idcustomer_namegender_cdgenderbirth_dayagepostal_cdaddressapplication_store_cdapplication_datestatus_cdamount
0CS021313000114大野 あや子1.0女性1981-04-2937.0259-1113神奈川県伊勢原市粟窪**********S1402120150905.00-00000000-00.0
1CS031415000172宇多田 貴美子1.0女性1976-10-0442.0151-0053東京都渋谷区代々木**********S1303120150529.0D-20100325-C5088.0
2CS028811000001堀井 かおり1.0女性1933-03-2786.0245-0016神奈川県横浜市泉区和泉町**********S1402820160115.00-00000000-00.0
3CS001215000145田崎 美紀1.0女性1995-03-2924.0144-0055東京都大田区仲六郷**********S1300120170605.06-20090929-2875.0
4CS015414000103奥野 陽子1.0女性1977-08-0941.0136-0073東京都江東区北砂**********S1301520150722.0B-20100609-B3122.0
\n", "
" ], "text/plain": [ " customer_id customer_name gender_cd gender birth_day age postal_cd \\\n", "0 CS021313000114 大野 あや子 1.0 女性 1981-04-29 37.0 259-1113 \n", "1 CS031415000172 宇多田 貴美子 1.0 女性 1976-10-04 42.0 151-0053 \n", "2 CS028811000001 堀井 かおり 1.0 女性 1933-03-27 86.0 245-0016 \n", "3 CS001215000145 田崎 美紀 1.0 女性 1995-03-29 24.0 144-0055 \n", "4 CS015414000103 奥野 陽子 1.0 女性 1977-08-09 41.0 136-0073 \n", "\n", " address application_store_cd application_date \\\n", "0 神奈川県伊勢原市粟窪********** S14021 20150905.0 \n", "1 東京都渋谷区代々木********** S13031 20150529.0 \n", "2 神奈川県横浜市泉区和泉町********** S14028 20160115.0 \n", "3 東京都大田区仲六郷********** S13001 20170605.0 \n", "4 東京都江東区北砂********** S13015 20150722.0 \n", "\n", " status_cd amount \n", "0 0-00000000-0 0.0 \n", "1 D-20100325-C 5088.0 \n", "2 0-00000000-0 0.0 \n", "3 6-20090929-2 875.0 \n", "4 B-20100609-B 3122.0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント1\n", "タイトル通り 「pandas DataFrame 結合」などと調べれば必要な機能に関する記事が見つかります。 \n", "
\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント2\n", "\"merge\", \"join\"という似たメソッドがあります。 \n", "今回の場合\"merge\"が便利でしょう。\n", "
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "umDmd8kohhzA" }, "source": [ "---\n", "## 問7. 時系列データ\n", "> P-046: 顧客データフレーム(df_customer)の申し込み日(application_date)はYYYYMMD形式の文字列型でデータを保有している。これを日付型(dateやdatetime)に変換し、顧客ID(customer_id)とともに抽出せよ。なお、データは先頭5件を表示せよ。" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 348 }, "colab_type": "code", "id": "pVAxV-TWhhzA", "outputId": "2d2e2281-7181-41b4-81e4-a9e834b93927" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idapplication_date
0CS0213130001142015-09-05
1CS0376130000712015-04-14
2CS0314150001722015-05-29
3CS0288110000012016-01-15
4CS0012150001452017-06-05
\n", "
" ], "text/plain": [ " customer_id application_date\n", "0 CS021313000114 2015-09-05\n", "1 CS037613000071 2015-04-14\n", "2 CS031415000172 2015-05-29\n", "3 CS028811000001 2016-01-15\n", "4 CS001215000145 2017-06-05" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント1\n", "「pandas datetime」などで該当の機能が見つかるかと思います。\n", "
\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "ヒント2\n", "\"pd.to_datetime\"というメソッドが使えるでしょう。このメソッドを適用する際ですが、for文を使わずに実装しましょう。" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9v_q6BLjhhzU" }, "source": [ "---\n", "## 問8. 関数\n", "> P-061: レシート明細データフレーム(df_receipt)の売上金額(amount)を顧客ID(customer_id)ごとに合計し、合計した売上金額を常用対数化(底=10)して顧客ID、売上金額合計とともに表示せよ。ただし、顧客IDが\"Z\"から始まるのものは非会員を表すため、除外して計算すること。なお、結果は先頭5件を表示せよ。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 437 }, "colab_type": "code", "id": "d5_2HQ-2hhzU", "outputId": "56ba3b92-2071-4a8d-c555-d9007bb43316" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amountlog_amount
customer_id
CS00111300000412983.113275
CS0011140000056262.796574
CS00111500001030443.483445
CS00120500000419883.298416
CS00120500000633373.523356
\n", "
" ], "text/plain": [ " amount log_amount\n", "customer_id \n", "CS001113000004 1298 3.113275\n", "CS001114000005 626 2.796574\n", "CS001115000010 3044 3.483445\n", "CS001205000004 1988 3.298416\n", "CS001205000006 3337 3.523356" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "boe923CMhhzq" }, "source": [ "---\n", "## 問9. 欠損数\n", "> P-079: 商品データフレーム(df_product)の各項目に対し、欠損数を確認せよ。" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 169 }, "colab_type": "code", "id": "bxl__vC5hhzq", "outputId": "d8bb408a-6897-4e5a-8416-ce67a8a5fce4" }, "outputs": [ { "data": { "text/plain": [ "product_cd 0\n", "category_major_cd 0\n", "category_medium_cd 0\n", "category_small_cd 0\n", "unit_price 7\n", "unit_cost 7\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "afprSqIvhhzs" }, "source": [ "---\n", "## 問10. 欠損値の除去\n", "> P-080: 商品データフレーム(df_product)のいずれかの項目に欠損が発生しているレコードを全て削除した新たなdf_product_1を作成せよ。なお、削除前後の件数を表示させ、前設問で確認した件数だけ減少していることも確認すること。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": {}, "colab_type": "code", "id": "q3_9sLdHhhzt" }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": {}, "colab_type": "code", "id": "qFNnYstw1vTG" }, "outputs": [ { "data": { "text/plain": [ "(10030, 10023)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_product), len(df_product_1)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GtiO20ZKhhzu" }, "source": [ "---\n", "## 問11. 欠損値の穴埋め\n", "> P-081: 単価(unit_price)と原価(unit_cost)の欠損値について、それぞれの平均値で補完した新たなdf_product_2を作成せよ。なお、平均値について1円未満は四捨五入とせよ。補完実施後、各項目について欠損が生じていないことも確認すること。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": {}, "colab_type": "code", "id": "puFf-7Ewhhzu" }, "outputs": [ { "data": { "text/plain": [ "product_cd 0\n", "category_major_cd 0\n", "category_medium_cd 0\n", "category_small_cd 0\n", "unit_price 0\n", "unit_cost 0\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 余談\n", "ChatGPTやBing AIに聞けば大抵のことは教えてくれます。 \n", "何回か入力文章を吟味しないといけないこともありますが、知らないことを調べる場合は自分で検索するよりも早いです。 \n", "ただ、ChatGPTなどは嘘をつく場合があるので、自分でソースを参照する姿勢は必要です。 \n", "\n", "これはBingAIの回答例です。\n", "\n", "![BingAIの回答例](./images/BingAI.png)" ] } ], "metadata": { "colab": { "name": "pandas_quiz.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }